Wget an entire FTP folder from its index (RegExp Introduction)

4 12 2009

Hi, folks!

Just a couple of hours ago I was trying to download all the files in a folder on the OSUOSL FTP Slackware mirror with wget, and all I kept getting was the index.html file from the page, so I decided to write a little script to download any file linked in the index. I’m sure there are tools which can do this far more succinctly, but I thought this would be a good way to begin to explain the incredibly useful nature of regular expressions. Here’s how my script turned out…

for i in $(wget ftp://ftp.mirrorservice.org/sites/ftp.slackware.com/pub/slackware/slackware-13.0/source/a/tar/ -O - | grep "ftp://" | sed 's/^.*href=\"//g' | sed 's/\".*$//g'); do wget $i; done

Now let me break it down… The first command I wanted was one to download the index.html file and extract the necessary link data from its content. To download a file and then stream its contents to another command, use the wget syntax:

wget {URL} -O - | {COMMAND}
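As a minimal sketch of that pattern, here is a small wrapper function. The name fetch_index is my own invention for illustration, and the -q flag is an addition that is not in my original one-liner: it silences wget's progress messages, which go to stderr and would otherwise clutter the terminal while the page itself goes down the pipe.

```shell
# fetch_index is a hypothetical helper name, not part of wget.
# -q    : quiet, suppress wget's progress/log messages
# -O -  : write the downloaded document to standard output
fetch_index() {
    wget -q -O - "$1"
}

# Usage (needs network access, so it is left commented out here):
# fetch_index "ftp://ftp.mirrorservice.org/sites/ftp.slackware.com/pub/slackware/slackware-13.0/source/a/tar/" | grep "ftp://"
```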

First of all I piped the file’s contents to grep, which discards any line that does not contain the string “ftp://”. This ensures that we are only working with lines which contain a hyperlink to a file, ignoring all the extraneous HTML tags. The next step was to remove the surrounding HTML from the links. In this index, each link is preceded by <a href=". To remove this part, I used sed. There are other tools which would work in a similar manner, but I find sed to be a great way of learning regular expressions, and I find its syntax to be very easy to understand. The command to remove anything up to and including the href=" is as follows:
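To see the grep step on its own without touching the network, here is a rough sketch that runs it against a made-up index. The sample_index text and the example.org URLs are assumptions I invented to resemble the page wget produced; the real listing will differ in its details.

```shell
# Hypothetical sample of an FTP index page (invented for illustration).
sample_index='<html><head><title>Index</title></head>
<body><pre>
<a href="ftp://example.org/pub/tar/rmt.8.gz">rmt.8.gz</a> (2429 bytes)
<a href="ftp://example.org/pub/tar/tar.SlackBuild">tar.SlackBuild</a> (4417 bytes)
</pre></body></html>'

# Keep only the lines that contain "ftp://"; the plain HTML lines are dropped.
link_lines=$(printf '%s\n' "$sample_index" | grep "ftp://")
printf '%s\n' "$link_lines"
```

Only the two lines holding links should survive the filter.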

sed 's/^.*href=\"//g'

To anybody who doesn’t understand regular expression syntax, this looks like a jumble of characters. I’ll explain briefly how it works… Sed’s syntax for basic search and replace is as follows:

sed 's/{REG EXP OR TEXT TO SEARCH FOR}/{TEXT TO REPLACE WITH}/g'

The regexp to match in our example is ^.*href=\"

^ means “From the beginning of the line”.
.* is a wildcard: . matches any single character and * repeats it zero or more times, so together they denote absolutely any sequence of characters.
href=\" describes the exact text string we want to have as the final characters in the match. The \ is an escape character to force the " to be treated as a literal character.

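In our command, the second part of the sed command is empty. This means that any text which matches the regexp will just be removed. On any line which contains href=", the regexp will match from the beginning of the line up to and including the href=". Because .* is greedy, if a line held more than one href=", the match would run up to the last one.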
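Here is the first sed command applied to a single made-up line, as a quick sketch. The line itself (date, example.org URL, and all) is an assumption invented for illustration.

```shell
# Hypothetical index line (invented for illustration).
line='2009 Aug 01 File <a href="ftp://example.org/pub/tar/rmt.8.gz">rmt.8.gz</a> (2429 bytes)'

# Delete everything from the start of the line up to and including href="
stripped=$(printf '%s\n' "$line" | sed 's/^.*href=\"//g')
printf '%s\n' "$stripped"
# prints: ftp://example.org/pub/tar/rmt.8.gz">rmt.8.gz</a> (2429 bytes)
```

Since the pattern is anchored with ^, it can only match once per line, so the trailing g makes no difference here.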

Now each line processed will read something along the lines of:

ftp://ftp.mirrorservice.org:21/sites/ftp.slackware.com/pub/slackware/slackware-13.0/source/a/tar/rmt.8.gz">rmt.8.gz (2429 bytes)

The second use of sed will remove anything occurring after the filename. It reads like so:

sed 's/\".*$//g'

This is used in the same way as the previous use of sed. The regexp to match is \".*$.

$ means “End of line”, so this matches everything from the first occurrence of " up to the end of each line. The output should now be nothing but a list of links. The final part is to wrap the output in a loop, and hand each line to wget.
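The last two steps can be sketched like this, again on invented data. The after_first_sed lines and example.org URLs are assumptions, and the loop is a dry run: it echoes the wget command rather than running it, so nothing is downloaded. I have also used a while read loop with a quoted variable instead of the for loop from my one-liner, which is a swapped-in variation that copes better if a link ever contains odd characters.

```shell
# Hypothetical lines as they might look after the first sed step.
after_first_sed='ftp://example.org/pub/tar/rmt.8.gz">rmt.8.gz</a> (2429 bytes)
ftp://example.org/pub/tar/tar.SlackBuild">tar.SlackBuild</a> (4417 bytes)'

# Delete everything from the first " to the end of each line.
urls=$(printf '%s\n' "$after_first_sed" | sed 's/\".*$//g')

# Dry run: print each wget command instead of executing it.
# Remove the "echo" to actually download the files.
printf '%s\n' "$urls" | while IFS= read -r url; do
    echo wget "$url"
done
```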

Anyway, I hope this has been informative, and I’ll no doubt post some more soon!

n00b
