Introduction
While wget is often used for basic file downloads, it also provides several advanced features and lesser-known options that can help you handle more complex tasks. In this second part, we’ll cover some of the more advanced functionality that can further enhance your usage of wget in Linux.
TL;DR
You can find a shorter cheat sheet version of this article here.
Table of contents
- Downloading Files via FTP
- Downloading Only Specific File Types
- Adjusting Download Timings and Retries
- Logging and Debugging Downloads
- Mirroring a Website with Timestamping
- Rate-Limiting Requests to a Server
- Using a Proxy with wget
- Post Data to a Web Form
- Setting Custom HTTP Headers
- Recursive Retrieval with Quotas
- Download Over SSL/TLS
- Automatic Retries with Linear Backoff
- Conclusion
Downloading Files via FTP
In addition to HTTP and HTTPS, wget supports FTP (File Transfer Protocol), which is useful for accessing files stored on FTP servers. If the server requires authentication, you can specify a username and password directly in the URL:
wget ftp://username:password@ftp.learntheshell.com/path/to/file.zip
For anonymous FTP access, simply omit the username and password:
wget ftp://ftp.learntheshell.com/path/to/file.zip
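Embedding credentials in the URL exposes them in your shell history and in the output of ps, so wget also accepts them as separate options. A minimal sketch using the same placeholder credentials:
wget --ftp-user=username --ftp-password=password ftp://ftp.learntheshell.com/path/to/file.zip
Note that --ftp-password still shows up in the process list; for sensitive credentials, a .wgetrc or .netrc file is the safer place.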
Downloading Only Specific File Types
When mirroring websites or directories, you might only want to download specific types of files (e.g., PDFs, images, or documents). You can filter the download using the --accept or --reject option to specify file extensions:
wget -r --accept jpg,png https://learntheshell.com/
This example downloads only .jpg and .png files from the website. Conversely, you can reject specific file types:
wget -r --reject mp4,avi https://learntheshell.com/
This command will exclude any .mp4 and .avi files from the download.
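Acceptance filters are often combined with --no-parent (-np), which stops wget from climbing above the starting directory during recursion. A sketch that fetches only PDFs from a hypothetical docs directory:
wget -r --no-parent --accept pdf https://learntheshell.com/docs/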
Adjusting Download Timings and Retries
For more control over how wget handles failed downloads or unstable connections, you can adjust the number of retries and the time between them. The --tries option sets how many times wget will attempt to download a file before giving up:
wget --tries=10 https://learntheshell.com/sample.zip
If the connection is unreliable, you can use --wait to introduce a delay (in seconds) between attempts:
wget --tries=10 --wait=5 https://learntheshell.com/sample.zip
Here, wget will retry up to 10 times with a 5-second delay between each attempt. This can help prevent server overload or avoid issues with rate-limited servers.
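Retries only kick in once an attempt actually fails, so they pair well with timeout options. A sketch using --timeout, which caps DNS, connect, and read times, and --retry-connrefused, which makes wget treat “connection refused” as a transient error instead of giving up:
wget --tries=10 --wait=5 --timeout=30 --retry-connrefused https://learntheshell.com/sample.zip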
Logging and Debugging Downloads
If you’re automating downloads or troubleshooting issues, wget provides logging options to help keep track of what’s happening. Use the -o option to log output to a file:
wget -o download.log https://learntheshell.com/sample.zip
If you want more detailed information, including headers and debugging output, use the -d option to enable debug mode:
wget -d https://learntheshell.com/sample.zip
Debugging mode provides insight into the HTTP requests and responses, which can be useful when diagnosing connectivity issues or troubleshooting failures.
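Keep in mind that -o overwrites the log file on every run. For ongoing automation you may prefer -a, which appends, or -b, which sends wget to the background (logging to wget-log unless -o is given). A minimal sketch:
# Append to a shared log instead of overwriting it
wget -a download.log https://learntheshell.com/sample.zip
# Download in the background; progress goes to wget-log
wget -b https://learntheshell.com/sample.zip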
Mirroring a Website with Timestamping
For long-term projects that require you to repeatedly download the same site, you can use timestamping with --timestamping (-N). This ensures that only files that are new or have changed since the last download are fetched:
wget -r --timestamping https://learntheshell.com/
This is useful when maintaining a local mirror of a website, as it prevents unnecessary downloads of unchanged files.
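wget also rolls the usual mirroring flags into a single option: --mirror is equivalent to -r -N -l inf --no-remove-listing. For a mirror you can browse locally, it is commonly combined with --convert-links, --adjust-extension, and --page-requisites:
wget --mirror --convert-links --adjust-extension --page-requisites https://learntheshell.com/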
Rate-Limiting Requests to a Server
wget downloads files sequentially over a single connection, so the way to go easy on a server is not to cap parallelism but to slow down the request rate. This is especially useful when mirroring websites or when the server enforces rate limits. You can use the --wait and --random-wait options to introduce a delay between requests:
wget -r --wait=1 --random-wait https://learntheshell.com/
The --random-wait option varies the delay between 0.5 and 1.5 times the specified wait time (in seconds), reducing the likelihood of being blocked by the server for making requests at a suspiciously regular interval.
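Delays control how often requests go out; to also cap how fast each transfer runs, wget offers --limit-rate, which accepts k and m suffixes:
wget -r --wait=1 --random-wait --limit-rate=500k https://learntheshell.com/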
Using a Proxy with wget
If your network setup requires a proxy to access the internet, you can point wget at an HTTP, HTTPS, or FTP proxy. (Note that wget has no native SOCKS support; for a SOCKS proxy you would need an external wrapper such as proxychains.) Set the proxy using environment variables:
export http_proxy=http://proxyserver:port/
export https_proxy=http://proxyserver:port/
The https_proxy variable selects the proxy used for HTTPS requests; the proxy address itself is usually still a plain http:// URL.
To bypass the proxy for specific URLs, use the --no-proxy option:
wget --no-proxy https://localhost/sample.zip
This is useful if certain URLs (like local network addresses) should not go through the proxy.
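If you prefer not to export environment variables, the same settings can be passed per invocation with -e, which runs a .wgetrc-style command. A sketch with a placeholder proxy address:
wget -e use_proxy=on -e http_proxy=http://proxyserver:port/ https://learntheshell.com/sample.zip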
Post Data to a Web Form
wget can also be used to interact with web forms by sending POST requests. This is useful for automating data submissions. The --post-data option allows you to pass form data to a URL:
wget --post-data="username=user&password=pass" https://learntheshell.com/login
This will simulate a form submission to a login page. The data is sent as key-value pairs (key=value), with multiple fields separated by an ampersand (&). wget sends the string as-is with a Content-Type of application/x-www-form-urlencoded, so any special characters in the values must already be URL-encoded.
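Logging in is usually only half the job: the server typically returns a session cookie that later requests must present. wget can persist cookies across invocations; a sketch, assuming a hypothetical login form and protected download path:
# Log in and store the session cookie
wget --save-cookies cookies.txt --keep-session-cookies --post-data="username=user&password=pass" https://learntheshell.com/login
# Reuse the cookie for an authenticated download
wget --load-cookies cookies.txt https://learntheshell.com/protected/file.zip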
Setting Custom HTTP Headers
Sometimes you may need to send custom headers along with your request. You can do this using the --header option. This is useful when downloading content from APIs or websites that require specific headers (e.g., authentication tokens, content types):
wget --header="Authorization: Bearer your_token_here" https://api.learntheshell.com/data
You can also send multiple headers by repeating the --header option:
wget --header="Authorization: Bearer your_token_here" --header="Content-Type: application/json" https://api.learntheshell.com/data
Recursive Retrieval with Quotas
When recursively downloading a website, you might want to limit the total download size to avoid consuming too much bandwidth or disk space. You can set a download quota using the --quota option:
wget -r --quota=100m https://learntheshell.com/
This will stop the download once the total size exceeds 100 MB (the quota is checked after each file completes, so wget finishes the file in progress before aborting). You can specify the quota in bytes (no suffix), kilobytes (k suffix), or megabytes (m suffix). Note that the quota is only respected for recursive retrievals or downloads from an input file; it never cuts off a single-file download.
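Because the quota spans batch retrievals, it pairs naturally with -i, which reads URLs from a file. A sketch with a hypothetical urls.txt:
# Stop after roughly 100 MB across all listed URLs
wget --quota=100m -i urls.txt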
Download Over SSL/TLS
wget supports downloading over SSL/TLS (HTTPS), and it can be configured to handle various SSL settings. If you need to download a file from an HTTPS site with a self-signed certificate, use the --no-check-certificate option to bypass certificate validation:
wget --no-check-certificate https://learntheshell.com/sample.zip
This is useful when working with development servers that use self-signed certificates but should be used cautiously for security reasons.
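A safer alternative to disabling validation outright is to tell wget to trust that specific certificate via --ca-certificate. A sketch, assuming you have saved the server’s certificate to a local PEM file:
wget --ca-certificate=server-cert.pem https://learntheshell.com/sample.zip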
Automatic Retries with Linear Backoff
When encountering transient network issues, it’s often beneficial to retry with an increasing delay between attempts. wget has a built-in option for this, --waitretry, which applies linear backoff: it waits 1 second after the first failure on a file, 2 seconds after the second, and so on, up to the maximum you specify:
wget --tries=5 --waitretry=10 https://learntheshell.com/sample.zip
In this example, wget will retry up to five times, waiting 1 second after the first failure, 2 seconds after the second, and so on, capped at 10 seconds between attempts.
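These options compose well into a single robust-download invocation; a sketch adding -c, which resumes a partially downloaded file, alongside --retry-connrefused from earlier:
wget --tries=10 --waitretry=10 --retry-connrefused -c https://learntheshell.com/sample.zip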
Conclusion
The wget command offers a wide range of advanced options that make it suitable for everything from simple file downloads to complex tasks like mirroring websites, interacting with APIs, and handling authentication. By mastering these lesser-known features, you can optimize your use of wget for a variety of situations, making it a highly versatile tool for any Linux user.