How to Use the wget Linux Command to Download Web Pages

How To Download A Website Using wget

In this guide, you will learn how to download this Linux blog.

wget www.everydaylinuxuser.com

It is worth creating your own folder on your machine using the mkdir command and then moving into the folder using the cd command.

For example:

mkdir everydaylinuxuser
cd everydaylinuxuser
wget www.everydaylinuxuser.com

The result is a single index.html file. On its own, this file is of limited use because the images and stylesheets it refers to are not downloaded; they are still held on Google's servers and fetched from there.

To download the full site and all the pages you can use the following command:

wget -r www.everydaylinuxuser.com

This downloads the pages recursively up to a maximum of 5 levels deep.

Five levels might not be enough to get everything from the site. You can use the -l switch to set the number of levels you wish to go to as follows:

wget -r -l10 www.everydaylinuxuser.com

If you want infinite recursion you can use the following:

wget -r -l inf www.everydaylinuxuser.com

You can also replace the inf with 0 which means the same thing.

There is still one more problem. You might have all the pages locally, but the links in those pages still point to their original location, so you cannot click from one downloaded page to another.

You can get around this problem by using the -k switch which converts all the links on the pages to point to their locally downloaded equivalent as follows:

wget -r -k www.everydaylinuxuser.com
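The steps so far (a dedicated folder, recursion to a chosen depth, and link conversion) can be combined into one small shell function. This is only a sketch: the function name fetch_offline_copy and its arguments are made up for illustration and are not part of wget.

```shell
# Hypothetical helper (not part of wget): download a site into its own
# folder with recursion and local link conversion.
#   $1 = site to download, $2 = folder to create, $3 = depth (optional)
fetch_offline_copy() {
    site=$1
    folder=$2
    depth=${3:-5}   # 5 is wget's default recursion depth

    mkdir -p "$folder" || return 1
    # Subshell: the cd does not change the caller's working directory.
    ( cd "$folder" && wget -r -l "$depth" -k "$site" )
}

# Example call (needs network access):
#   fetch_offline_copy www.everydaylinuxuser.com everydaylinuxuser 10
```

Running the download in a subshell means you stay in the folder you started in once the function returns.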

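If you want a complete mirror of a website, you can use the -m switch. It turns on recursion with infinite depth (the equivalent of -r -l inf) and timestamping, so you no longer need the -r and -l switches. It does not convert links, so add -k as well if you want to browse the copy locally.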

wget -m www.everydaylinuxuser.com

Therefore if you have your own website you can make a complete backup using this one simple command.

Run wget As A Background Command

You can get wget to run as a background command leaving you able to get on with your work in the terminal window whilst the files download.

Simply use the following command:

wget -b www.everydaylinuxuser.com

You can of course combine switches. To run the wget command in the background whilst mirroring the site you would use the following command:

wget -b -m www.everydaylinuxuser.com

You can simplify this further as follows:

wget -bm www.everydaylinuxuser.com

Logging

If you are running the wget command in the background you won't see any of the normal messages that it sends to the screen.

You can get all of those messages sent to a log file so that you can check on progress at any time using the tail command.

To output information from the wget command to a log file use the following command:

wget -o /path/to/mylogfile www.everydaylinuxuser.com
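As a rough sketch, a background download and its log file can be paired in a small function so that the log is easy to find afterwards. The name start_logged_download is hypothetical; -b and -o are the switches described above.

```shell
# Hypothetical helper: start wget in the background and log to a file.
#   $1 = URL to download, $2 = log file path
start_logged_download() {
    url=$1
    logfile=$2
    wget -b -o "$logfile" "$url" || return 1
    echo "Follow progress with: tail -f $logfile"
}

# Example call (needs network access):
#   start_logged_download www.everydaylinuxuser.com /path/to/mylogfile
```

If you use -b without -o, wget writes its messages to a file called wget-log in the current folder instead.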

The reverse, of course, is to require no logging at all and no output to the screen. To omit all output use the following command:

wget -q www.everydaylinuxuser.com

Download From Multiple Sites

You can set up an input file to download from many different sites.

Open up a file using your favorite editor or even the cat command and simply start listing the sites or links to download from on each line of the file.

Save the file and then run the following wget command:

wget -i /path/to/inputfile

Apart from backing up your own website or maybe finding something to download to read on the train, it is unlikely that you will want to download an entire website.

You are more likely to download a single URL with images or perhaps download files such as zip files, ISO files or image files.

With that in mind you don't want to have to type the following into the input file as it is time consuming:

    http://www.myfileserver.com/file1.zip
    http://www.myfileserver.com/file2.zip
    http://www.myfileserver.com/file3.zip

If you know the base URL is always going to be the same you can just specify the following in the input file:

    file1.zip
    file2.zip
    file3.zip

You can then provide the base URL as part of the wget command as follows:

wget -B http://www.myfileserver.com -i /path/to/inputfile
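When the file names follow a pattern, the input file itself can be generated rather than typed. The sketch below writes the three example names to a temporary file; the wget call is left as a comment because www.myfileserver.com is only a placeholder host.

```shell
# Build an input file containing file1.zip ... file3.zip, one per line.
inputfile=$(mktemp)
for n in 1 2 3; do
    printf 'file%s.zip\n' "$n"
done > "$inputfile"

# Show what was written.
cat "$inputfile"

# Then download the files relative to the base URL (needs network access):
#   wget -B http://www.myfileserver.com -i "$inputfile"
```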

Retry Options

If you have set up a queue of files to download within an input file and you leave your computer running all night to download the files you will be fairly annoyed when you come down in the morning to find that it got stuck on the first file and has been retrying all night.

You can specify the number of retries using the following switch:

wget -t 10 -i /path/to/inputfile

You might wish to use the above command in conjunction with the -T switch which allows you to specify a timeout in seconds as follows:

wget -t 10 -T 10 -i /path/to/inputfile

The above command will try each link in the file up to 10 times, and will give up on an attempt if looking up, connecting to, or reading from the server stalls for more than 10 seconds.

It is also fairly annoying when you have downloaded 75% of a 4 gigabyte file on a slow broadband connection only for your connection to drop out.

You can get wget to resume from where it stopped downloading by using the following command:

wget -c www.myfileserver.com/file1.zip
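For an unattended overnight run, the retry, timeout and resume switches above can be used together. A minimal sketch, in which the function name download_queue is made up for illustration:

```shell
# Hypothetical helper: work through an input file with bounded retries,
# a timeout, and resumption of partially downloaded files.
#   $1 = input file with one URL per line
download_queue() {
    inputfile=$1
    if [ ! -r "$inputfile" ]; then
        echo "cannot read input file: $inputfile" >&2
        return 1
    fi
    # -t 10: at most 10 tries per file   -T 10: 10 second timeout
    # -c: continue partial downloads instead of starting again
    wget -t 10 -T 10 -c -i "$inputfile"
}

# Example call (needs network access):
#   download_queue /path/to/inputfile
```

Checking that the input file is readable first means a mistyped path fails straight away rather than in the middle of the night.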

If you are hammering a server the host might not like it too much and might either block or just kill your requests.

You can specify how long to wait, in seconds, between each retrieval as follows:

wget -w 60 -i /path/to/inputfile

The above command will wait 60 seconds between each download. This is useful if you are downloading lots of files from a single source.

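Some web hosts might spot the regular frequency, however, and block you anyway. You can make the wait period random, so that it looks less like a program is making the requests, by adding --random-wait to the -w switch. Each wait then varies between 0.5 and 1.5 times the value given to -w:

wget -w 60 --random-wait -i /path/to/inputfile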

Protecting Download Limits

Many internet service providers still apply download limits for your broadband usage, especially if you live outside of a city.

You may want to add a quota so that you don't blow that download limit. You can do that in the following way:

wget -Q 100m -i /path/to/inputfile

Note that the -Q switch (an uppercase Q; the lowercase -q switch turns off output, as described earlier) has no effect on a single file. So if you download a file that is 2 gigabytes in size, using -Q 1000m will not stop the file downloading.

The quota is only applied when recursively downloading from a site or when using an input file.

Getting Through Security

Some sites require you to log in to be able to access the content you wish to download.

You can use the following switches to specify the username and password.

wget --user=yourusername --password=yourpassword

Note that on a multi-user system, anybody who runs the ps command will be able to see your username and password.
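One way to keep the password out of the process list is to let wget prompt for it instead. The sketch below assumes your version of wget supports the --ask-password option (check the manual page if in doubt); the function name fetch_with_login is hypothetical.

```shell
# Hypothetical helper: log in without putting the password on the
# command line, so that it does not show up in ps output.
#   $1 = username, $2 = URL to download
fetch_with_login() {
    username=$1
    url=$2
    # --ask-password makes wget prompt for the password interactively.
    wget --user="$username" --ask-password "$url"
}

# Example call (needs network access and a real login):
#   fetch_with_login yourusername http://www.myfileserver.com/file1.zip
```

The username is still visible to other users; only the password is kept off the command line.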

Other Download Options

By default the -r switch will recursively download the content and will create directories as it goes.

You can get all the files to download to a single folder using the following switch:

wget -nd -r

The opposite of this is to force the creation of directories which can be achieved using the following command:

wget -x -r

How To Download Certain File Types

If you want to download recursively from a site but you only want to download a specific file type such as an mp3 or an image such as a png you can use the following syntax:

wget -A "*.mp3" -r

The reverse of this is to ignore certain files. Perhaps you don't want to download executables. In this case, you would use the following syntax:

wget -R "*.exe" -r
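The -A and -R switches accept a comma-separated list, so several file types can be handled in one run. As a sketch, the hypothetical function below (fetch_by_type is not part of wget) builds the list from the extensions you pass in:

```shell
# Hypothetical helper: recursively download only the given file types.
#   $1 = site to download, remaining arguments = extensions (no dot)
fetch_by_type() {
    if [ $# -lt 2 ]; then
        echo "usage: fetch_by_type SITE EXTENSION..." >&2
        return 1
    fi
    url=$1
    shift

    # Turn "png jpg" into the accept list "*.png,*.jpg".
    accept=
    for ext in "$@"; do
        accept="${accept:+$accept,}*.$ext"
    done

    wget -r -A "$accept" "$url"
}

# Example call (needs network access):
#   fetch_by_type www.everydaylinuxuser.com png jpg
```

Swapping -A for -R in the last line would turn the same list into a set of file types to skip.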

Cliget

There is a Firefox add-on called cliget. You can add this to Firefox in the following way.

Visit https://addons.mozilla.org/en-US/firefox/addon/cliget/ and click the "add to Firefox" button.

Click the install button when it appears. You will be required to restart Firefox.

To use cliget visit a page or file you wish to download and right click. A context menu will appear called cliget and there will be options to "copy to wget" and "copy to curl".

Click the "copy to wget" option and open a terminal window and then right click and paste. The appropriate wget command will be pasted into the window.

Basically, this saves you having to type the command yourself.

Summary

The wget command has a huge number of options and switches.

It is therefore worth reading the manual page for wget by typing the following into a terminal window:

man wget
