Extract all images of a webpage with their hyperlinks using Python

Python is one of the most powerful general-purpose scripting languages and is used widely by giant organizations such as Google, Yahoo!, CERN, and NASA. The main objective of the language is to let you write code in fewer lines than other programming languages like C while keeping the code easy to read. Python also supports multiple programming paradigms, which means you can develop your code in a structural, object-oriented, or modular style, among others. Furthermore, it is a multi-platform programming language: you can write your code on Linux, for instance, and then run it on another platform such as MS Windows with no changes or only minor modifications. In other words, Python is a cross-platform programming language.

Personally, I think Python is a really powerful, high-level language with a bright future. When I developed some simple applications with Python, I realized how the simplicity of the language translates into a considerable reduction in lines of code. In contrast to C/C++, where you need to write many functions from scratch or wrestle with libraries, in Python many libraries and classes are already available for almost everything, and you just need to call them. This reduces the line count considerably and makes programming very easy. In this post, I will demonstrate how to fetch and save all images of a single webpage while saving the images' hyperlinks in a separate text file.

If you are a C/C++ or even a Java programmer, it may sound tough to develop an application that extracts and saves all images of a webpage along with their links. But do not worry: in Python, all of the mentioned tasks are a piece of cake. You need to write fewer than 20 lines of code. The code of the application is available in the following section. You can also download the code from GitHub via this link.

 import re
 import urllib.request

 # Fetch the page and decode the HTML (urllib.request is Python 3's
 # replacement for the Python 2 urllib module)
 source = urllib.request.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read().decode('utf-8', 'replace')
 with open('out.txt', 'w') as f:
     for link in re.findall(r'http://sports\.cbsimg\.net/images/nba/logos/30x30/[A-Z]*\.png', source):
         f.write(link + '\n')            # save the image hyperlink
         actually_download = True        # set to False to save links only
         if actually_download:
             filename = link.split('/')[-1]   # e.g. BOS.png
             urllib.request.urlretrieve(link, filename)

In the above application, you first need to import two modules: urllib (urllib.request in Python 3), which provides a high-level interface for fetching data across the World Wide Web, and re, which provides regular expression support. For more information about these modules, refer to the links at the end of the article.
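As a small illustration of how re.findall behaves here, it returns every non-overlapping match of the pattern as a list of strings. The HTML snippet and URLs below are made up for the example, not taken from the CBS Sports page:

 import re

 # A hand-written HTML fragment standing in for a fetched page (hypothetical URLs).
 source = ('<img src="http://example.com/logos/30x30/BOS.png">'
           '<img src="http://example.com/logos/30x30/LAL.png">')

 # findall returns every non-overlapping match as a list of strings.
 links = re.findall(r'http://example\.com/logos/30x30/[A-Z]*\.png', source)
 print(links)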

After importing the needed modules, you open the URL that contains the images to fetch; here it is the CBS Sports website. Then, using a for loop and a regular expression, all pictures (here, *.png files) are extracted from the page source.
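A regular expression works well for this fixed URL pattern, but it breaks if the markup changes. As a more robust alternative, here is a sketch using the standard library's html.parser to collect the src attribute of every img tag; the sample page fed to it below is hypothetical:

 from html.parser import HTMLParser

 class ImgCollector(HTMLParser):
     """Collects the src attribute of every <img> tag it sees."""
     def __init__(self):
         super().__init__()
         self.images = []

     def handle_starttag(self, tag, attrs):
         if tag == 'img':
             for name, value in attrs:
                 if name == 'src' and value:
                     self.images.append(value)

 # Feed it a small hand-written page instead of a live fetch.
 parser = ImgCollector()
 parser.feed('<html><body><img src="a.png"><img alt="no src">'
             '<img src="b.jpg"></body></html>')
 print(parser.images)

Unlike the regex, this picks up every image regardless of where it is hosted, so you would filter the list afterwards if you only want .png files.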

Keep in mind that the second URL in the application (the pattern inside re.findall) is the location where the pictures are hosted. To find the image location in your own case, right-click a picture, choose the 'View Image' option to be redirected to the source of the image, and then copy its URL.
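The application derives the local filename from the last path segment of the URL with link.split('/')[-1]. When a URL carries a query string, urllib.parse gives a slightly safer result; the URL below is made up for illustration:

 from urllib.parse import urlparse
 import posixpath

 link = 'http://sports.cbsimg.net/images/nba/logos/30x30/BOS.png?v=2'

 # Naive split keeps the query string in the filename.
 naive = link.split('/')[-1]

 # Parsing first isolates the path component, so the query string is dropped.
 clean = posixpath.basename(urlparse(link).path)
 print(naive, clean)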

Additionally, the source links of the images are saved in a file named out.txt; writing a link to the file is repeated on each iteration of the loop. Finally, the last section of the loop downloads the image files. If you just want to collect the links without downloading the files, change 'True' to 'False' in the 'actually_download' assignment.
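The link-saving part can be sketched on its own, without any network access. The stand-in link list and the temporary directory below are my own choices for the example:

 import os
 import tempfile

 links = ['http://example.com/a.png', 'http://example.com/b.png']  # stand-in data

 # Write one link per line, mirroring what the loop in the article does.
 path = os.path.join(tempfile.mkdtemp(), 'out.txt')
 with open(path, 'w') as f:
     for link in links:
         f.write(link + '\n')

 # Read the file back to confirm one link was saved per line.
 with open(path) as f:
     saved = f.read().splitlines()
 print(saved)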

As you have seen, it is really easy to fetch and save all images from a URL in Python; now imagine how much effort the same task would take in another language such as C/C++.

For more information, please refer to the following links.
Official Python website.
The re module documentation page.
The urllib module documentation page.

Send your ideas and feedback to kasra.madadipouya@geeksweb.eu.org
