Wget is one of the most powerful download managers around, with a long list of switches for everything from recursive downloads to spider mode. That said, in my view Wget is somewhat dated, and sometimes you need to combine it with other tools to get the most out of it.
Enough background about Wget; let’s move on to the vBulletin case study. I wanted to download the contents of a vBulletin forum using Wget, but a few problems had to be addressed first. Here are the steps needed to download the contents of the forum successfully:
- Log in to the forum and save the session cookie.
- Add a user agent string to Wget.
- Add all appropriate switches.
Before getting into the details of each step, let’s talk about the ominous “403 Forbidden” error message in Wget. If you try to download a page from many websites with plain Wget, you will get a “403 Forbidden” error. It is the most common and most annoying error I have run into. To get past it you generally need to give Wget a user agent string and a cookie. The same is true for vBulletin, and it matters even more there because you have to send the session cookie from your login.
Very well, now let’s move to the details.
The first step is to log in and save the session ID cookie. The best approach I have found is to use the cookie.txt extension for Google Chrome. Basically, this extension lets you save the cookies of the current site in a Wget-friendly format, which saves you a lot of trouble and headaches. You can download the extension from this link.
After installing the extension, log in to your vBulletin forum, click on the extension, and save its output to a file called “cookies.txt” (the file name used in the commands below). Awesome, now you can get past the login page easily, but you will still get the 403 Forbidden error.
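Before moving on, it can help to confirm that the export produced a usable file. The sketch below assumes the extension writes the usual Netscape cookie format (seven tab-separated fields per cookie, comment lines starting with #), which is the format Wget's --load-cookies option reads. The name count_cookies is a hypothetical helper, not part of Wget or the extension.

```shell
# Hypothetical helper: count the usable cookie lines in a Netscape-format file.
# A usable line is not a comment and has exactly 7 tab-separated fields:
# domain, subdomain flag, path, secure flag, expiry, name, value.
count_cookies() {
    awk -F '\t' '!/^#/ && NF == 7 { n++ } END { print n + 0 }' "$1"
}
```

Running `count_cookies cookies.txt` should print a number greater than zero. If it prints 0, the export probably did not work and Wget will have no session cookie to send.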
The second step is to add a user agent string. This is very simple in Wget: use the -U switch to set your browser’s agent string. If you don’t know which agent string to use, have a look at the following links.
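To see whether the cookie file and agent string are enough to get past the 403, you can request a single page without saving it. This is only a sketch: probe is a hypothetical helper name, the agent string is simply the one used later in this post, and some servers answer Wget's spider requests differently from a normal page load.

```shell
# Assumed value: substitute your own browser's agent string.
UA='Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0'

# Hypothetical helper: fetch nothing, just print the HTTP status line(s)
# the server returns for the given URL.
#   --spider  check the page without downloading it
#   -S        print the server's response headers (Wget writes them to stderr)
probe() {
    wget --spider -S -U "$UA" --load-cookies cookies.txt "$1" 2>&1 | grep 'HTTP/'
}
```

For example, `probe 'http://myforum.com'` should show a 200 status rather than 403 once both pieces are in place.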
Finally, you need to put all the pieces of the puzzle together to download the contents of the vBulletin forum. The final result looks something like this:
wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U 'Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0' -x --load-cookies cookies.txt 'http://myforum.com'
I won’t explain the switches here because I believe the Wget manpage does it much better than any other resource. If you want to know the details of a switch, take a look at the Wget manpage.
With the above command, you should be able to download the contents of any vBulletin forum.
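The manpage remains the reference, but for quick orientation here is the same command rebuilt one switch at a time with a short comment on each. It is a sketch, not a replacement for the one-liner: mirror_forum is a hypothetical function name, and the comments are one-line summaries rather than the full manpage wording.

```shell
# Assumed value: substitute your own browser's agent string.
UA='Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0'

# Hypothetical wrapper: mirror_forum URL
mirror_forum() {
    url=$1
    set --                                   # start with an empty argument list
    set -- "$@" --limit-rate=200k            # cap the download speed at about 200 KB/s
    set -- "$@" --no-clobber                 # skip files that already exist locally
    set -- "$@" --convert-links              # rewrite links so the copy works offline
    set -- "$@" --random-wait                # vary the pause between requests (scales --wait)
    set -- "$@" -r                           # follow links recursively
    set -- "$@" -p                           # also fetch images, CSS and other page requisites
    set -- "$@" -E                           # add .html to saved HTML pages that lack it
    set -- "$@" -e robots=off                # do not honor robots.txt
    set -- "$@" -U "$UA"                     # send a browser-like user agent string
    set -- "$@" -x                           # always recreate the directory hierarchy
    set -- "$@" --load-cookies cookies.txt   # send the saved session cookie
    wget "$@" "$url"
}
```

Calling `mirror_forum 'http://myforum.com'` runs Wget with the same switches as the one-liner above.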
Now, let’s move on to another interesting topic: downloading the attachments (i.e. pictures) linked from a vBulletin forum page. Conveniently, you can use the same command to download the attachments from a given vBulletin link. For instance, say I want to download all the attached images of a thread in a vBulletin forum. To do so, I simply pass the page that lists all the attachment links to the command.
Consequently, the command should look like this:
wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U 'Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0' -x --load-cookies cookies.txt 'http://myforum.com/xyz.php?do=showattachments&t=1234'
Unfortunately, the downloaded files do not have any extension, so they need to be renamed. To do so, you can refer to this post I have written before. The final script is something like the one below (in this case it adds the .jpg extension to all files).
#!/bin/bash
# Rename every attachment.php* file to a zero-padded, numbered .jpg file.
cnt=1
for name in attachment.php*; do
    new=$(printf "%04d.jpg" "$cnt")
    mv -- "$name" "$new"
    cnt=$((cnt + 1))
done
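The script above assumes every attachment is a JPEG. If a thread mixes image types, one possible variation is to look at the first bytes of each file and pick the extension from them. This is a sketch under that assumption, not the script from my earlier post: guess_ext and rename_attachments are hypothetical names, only JPEG, PNG and GIF signatures are recognized, and anything else gets a .bin extension.

```shell
# Hypothetical helper: guess an image extension from a file's first four bytes.
guess_ext() {
    sig=$(od -An -tx1 -N4 "$1" | tr -d ' \n')   # e.g. "ffd8ffe0"
    case $sig in
        ffd8ff*)  echo jpg ;;   # JPEG files start with FF D8 FF
        89504e47) echo png ;;   # PNG files start with 89 'P' 'N' 'G'
        47494638) echo gif ;;   # GIF files start with "GIF8"
        *)        echo bin ;;   # unknown type
    esac
}

# Rename attachment.php* files in the current directory to 0001.jpg, 0002.png, ...
rename_attachments() {
    cnt=1
    for name in attachment.php*; do
        [ -e "$name" ] || continue               # the glob matched nothing
        new=$(printf '%04d.%s' "$cnt" "$(guess_ext "$name")")
        mv -- "$name" "$new"
        cnt=$((cnt + 1))
    done
}
```

Run `rename_attachments` from inside the directory that holds the downloaded attachment.php* files.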