How to download Flash 10.2 video streams in Linux.

10 02 2011

Hey, people!

Just thought I’d post this little nugget of information, as it’s taken me a little while to work out how to do it. Before I start, I’ll mention that downloading copyrighted material may well be extremely illegal where you live, so make sure you only use this technique to download videos that are entirely your own work. This can be useful if you have, for example, lost your user details for a video upload site of some variety, and there is no facility to retrieve your videos without them.

Anyway, enough with the disclaimer, on with the hack…

As you may know, Flash Player used to store the temporary stream files in /tmp. They switched from this at the end of last year to storing them in the specific browser’s cache folder for reasons unbeknownst to the masses. This still made it pretty easy to locate any file you may have wished to download. After a recent update, I found that I was unable to locate the temporary caching folder anywhere.

My first step was to load the video in a browser, and check the output of the following command:
lsof | grep -i flash

This came out with a predictable and very useful single line:
plugin-co 25646 n00b 17u REG 8,2 31286337 787220 /tmp/FlashXXepl6qa (deleted)

This showed me that there was a file descriptor open to a “deleted” file, /tmp/FlashXXepl6qa. I’m no programmer, so I have no idea how this works, but it seems that it’s adding chunks of data to this file descriptor (I imagine stored in RAM), while the file itself is technically nonexistent.

UPDATE (24/04/11):

Thanks to reader Raven Morris, I have some more information about what exactly is happening. Unlike Windows, Linux does not enforce mandatory locking of files that a program has open; Windows programs will typically lock a file when they open it, so no other program can touch or remove it. On Linux, deleting a file only removes its directory entry, and the operating system keeps the underlying data around for as long as any program still has the file open. Once the last program closes the last file descriptor to the file in question, the data becomes unrecoverable without the relevant forensic tools, but until then, there is a record of it within the /proc/*/fd directory tree.
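You can see this behaviour for yourself with any long-running program. Here’s a quick throwaway demonstration (the file name and the use of tail are just for illustration):

echo "some data" > /tmp/demo.txt
tail -f /tmp/demo.txt &          # keep a file descriptor open to the file
rm /tmp/demo.txt                 # the directory entry is now gone...
ls -l /proc/$!/fd | grep demo    # ...but the descriptor still points at it, marked "(deleted)"
kill $!                          # tidy up the background tail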

Anyway, the second field of the output tells us which process currently has the file descriptor open, and the fourth tells us which number the file descriptor has taken. This is all we need to access the file itself.
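In other words, you can pull those two values straight out of lsof with a little awk. A quick sketch (the pattern simply matches the kind of line shown above):

lsof | awk '/Flash/ && /deleted/ {print "PID:", $2, " FD:", $4}'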

If you navigate to your /proc folder, you will see a bunch of folders all named numerically, including one which matches the number in the second field. Navigate to this folder, then to its subfolder “fd”. In this folder, you will see a whole selection of numbers; these are the file descriptors themselves. Run “ls -l” in this folder, and you will see that each of these numbers is linked to a pipe, a socket or a file. Among them, the number from the fourth field will be symbolically linked to the /tmp/Flash* file we found before. To test that this is the right file, you can run it through mplayer or vlc (“mplayer filedescriptornumber”/”vlc filedescriptornumber”). If you’re having trouble finding the filename, try “ls -l | grep Flash”, as pointed out by reader Nobi.
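Putting that together with the example output above (PID 25646 and file descriptor 17; yours will differ), a typical session looks something like this:

cd /proc/25646/fd
ls -l | grep Flash    # the descriptor symlinked to the deleted /tmp/Flash* file
mplayer 17            # or "vlc 17" -- check it really is the video you want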

Once the video is fully streamed, you can use a simple “cp” to copy the file from the file descriptor to a real location on your hard disk. (“cp filedescriptornumber ~/Videos/filename.flv”).

Another way to locate these files is to use the following command:
stat -c %N /proc/*/fd/* 2>&1|awk -F[\`\'] '/Flash/{print$2}'

I encourage you to play with the various sections of it to see how it works. If you’re having trouble getting it to work, make sure you have the apostrophes, backticks and spaces in the correct location.
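If it helps, here is the same pipeline pulled apart with comments. It is purely a reading aid, and it assumes stat quotes names with backticks and apostrophes, as the coreutils builds of the time do:

# stat -c %N prints every open descriptor as `link' -> `target';
# 2>&1 folds any permission errors into the stream, and the /Flash/
# filter below throws them away again.
# awk splits each line on the backtick/apostrophe quote marks, keeps
# only lines mentioning Flash, and prints field 2: the
# /proc/<pid>/fd/<number> path on the left-hand side of the arrow.
stat -c %N /proc/*/fd/* 2>&1 | awk -F"[\`']" '/Flash/{print $2}'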

Reader Robert submitted the following BASH function, to automate the whole process. Here’s the snippet to insert into your .bashrc:

cpflashvideo() { cd /proc/`lsof | awk '/Flash/&&/plugin-co/' | awk '//{printf "%s", $2}'`/fd/ && cp `ls -al | grep '\(deleted\)' | awk '//{printf "%s", $8}'` $* && cd - > /dev/null; }
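Once that’s in your ~/.bashrc (and you’ve opened a new terminal or run “source ~/.bashrc”), usage is simply the destination path. The filename below is just an example:

cpflashvideo ~/Videos/myvideo.flv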

UPDATE (13/07/11):

A couple of my friends were pondering this, and came up with the following for those of you who have a few too many tabs open at one time… I’ve included a couple of versions, to show how this could be done using awk alone, or a mixture of awk and sed.

This first one uses regexp matching in awk to strip the access-mode letter (the “u” in “17u”) from the end of the fourth field:
for FILE in $(lsof -n | grep "Flash.*deleted" | awk '{printf "/proc/" $2 "/fd/"; sub(/[a-z]+/,"",$4); print $4}'); do
cp $FILE $(mktemp -u --suffix .flv $HOME/Videos/Video-XXX)
done

The next uses sed to strip off the last character of the fourth field:
for FILE in $(lsof -n|grep .tmp.Flash | awk '{print "/proc/" $2 "/fd/" $4}' | sed 's/.$//'); do
cp -v $FILE $(mktemp -u --suffix=.flv --tmpdir=$HOME/Videos/)
done

I hope others find this post as interesting as I did, and as always, direct any questions to me in the comments!

If you are a Spanish speaker, and have had trouble understanding this, Taringa.net user racsoprieto has translated the basics of the article here.

n00b





Wget an entire FTP folder from its index (RegExp Introduction)

4 12 2009

Hi, folks!

Just a couple of hours ago I was trying to download all the files in a folder on the OSUOSL FTP Slackware mirror with wget, and all I kept getting was the index.html file from the page, so I decided to write a little script to download any file linked in the index. I’m sure there are tools which can do this far more succinctly, but I thought this would be a good way to begin to explain the incredibly useful nature of regular expressions. Here’s how my script turned out…

for i in $(wget ftp://ftp.mirrorservice.org/sites/ftp.slackware.com/pub/slackware/slackware-13.0/source/a/tar/ -O - | grep "ftp://" | sed 's/^.*href=\"//g' | sed 's/\".*$//g'); do wget $i; done

Now let me break it down… The first command I wanted was one to download the index.html file and extract the necessary link data from its content. To download a file and then stream its contents to another command, use the wget syntax:

wget {URL} -O - | {COMMAND}

First of all I piped the file’s contents to grep, which ignores any line which does not contain the phrase “ftp://”. This ensures that we are only working with lines which contain a hyperlink to a file, ignoring all the extraneous HTML tags. The next step was to remove any of the surrounding HTML from the links. A link in an HTML document will always be preceded by <a href=". To remove this part, I used sed. There are other tools which would work in a similar manner, but I find sed to be a great way of learning regular expressions, and I find its syntax very easy to understand. The command to remove anything up to and including the href=" is as follows:

sed 's/^.*href=\"//g'

To anybody who doesn’t understand regular expression syntax, this looks like a jumble of characters. I’ll explain briefly how it works… Sed’s syntax for basic search and replace is as follows:

sed 's/{REG EXP OR TEXT TO SEARCH FOR}/{TEXT TO REPLACE WITH}/g'
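As a quick, made-up illustration of that syntax before we go back to the real command, this replaces every occurrence of “cat” with “dog”:

echo "the cat sat on the mat" | sed 's/cat/dog/g'
# prints: the dog sat on the mat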

The regexp to match in our example is ^.*href=\"

^ means “From the beginning of the line”.
.* is a wildcard, denoting absolutely any sequence of characters.
href=\" describes the exact text string we want to have as the final characters in the match. The \ is an escape character to force the " to be treated as a literal character.

In our command, the second part of the sed command is empty. This means that any text which matches the regexp will just be removed. The regexp will match from the beginning of any line which contains href=" up to, and including, the ".

Now each line processed will read something along the lines of:

ftp://ftp.mirrorservice.org:21/sites/ftp.slackware.com/pub/slackware/slackware-13.0/source/a/tar/rmt.8.gz">rmt.8.gz (2429 bytes)

The second use of sed will remove anything occurring after the filename. It reads like so:

sed 's/\".*$//g'

This is used in the same way as the previous use of sed. The regexp to match is \".*$.

$ means “End of line”, so this matches everything from the first occurrence of " up to the end of each line. The output should now be nothing but a list of links. The final part is to wrap the output in a loop, and hand each line to wget.
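If the one-liner is hard on the eyes, here is exactly the same logic laid out as a short script with comments; nothing new, just spread over a few lines:

#!/bin/bash
# the FTP directory index we want every file from
URL="ftp://ftp.mirrorservice.org/sites/ftp.slackware.com/pub/slackware/slackware-13.0/source/a/tar/"

# download the index to stdout, keep only lines containing links, strip
# everything up to and including href=", then everything from the
# closing " onwards, leaving a bare list of URLs
for i in $(wget "$URL" -O - | grep "ftp://" | sed 's/^.*href=\"//g' | sed 's/\".*$//g'); do
    wget "$i"    # fetch each file in turn
done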

Anyway, I hope this has been informative, and I’ll no doubt post some more soon!

n00b