Hey guys, today I’ll be showing you a simple example of how to use Beautiful Soup to scrape data. To be more specific, we will look at an open public forum and extract the titles of all the threads on the first page, along with some metadata like author, post date, etc.
What is Beautiful Soup?
Well, in simple words, it’s a Python module for parsing HTML and XML, which makes it handy for web scraping, data mining, and most of your parsing needs.
Why should I consider using it?
To be honest, there are a lot of parsers out there for Python (lxml, SGML, etc.), but the main feature of Beautiful Soup is that it’s easy to use and developer friendly: you construct a few expressions, and with just a few lines of code you get your job done. Beautiful Soup mainly aims to be developer friendly, so performance-wise it may not be the best. If performance is your main goal, then C-based parsers like lxml might fit the bill.
Anyway let’s proceed with using Beautiful Soup!
The script that I’ve released here is written so that it’s not intrusive: it makes only a single request, just like any other user who is viewing the forum.
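To make the "single request" idea concrete, here’s a minimal sketch of fetching a page exactly once. Note that this is not the script linked below: it assumes Python 3 and the standard library’s urllib.request, and the names build_request, fetch_page and USER_AGENT are made up for illustration.

```python
import urllib.request

# Hypothetical User-Agent string, used only for this illustration.
USER_AGENT = "Mozilla/5.0 (compatible; forum-title-scraper-example)"


def build_request(url):
    """Prepare a plain GET request, much like a browser would send."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_page(url, timeout=10):
    """Download the page exactly once and return its HTML as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (not executed here, so that no request is made by accident):
# html = fetch_page("http://forum.intern0t.net/perl-python/")
```

Everything after this point would work on the returned HTML string, so the forum only sees one page view.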
First of all you will need Python 2.7 or earlier installed along with the appropriate version of “Beautiful Soup”.
View/Download the script from here –> http://pastebin.com/3wrnDw12
Link to the forum we want to scrape –> http://forum.intern0t.net/perl-python/
As you can see, the code is pretty simple and self-explanatory. Beautiful Soup makes it easy to pick out tags by name and by attribute values, and to move between neighbouring nodes until we reach the data we want.
If you run the script this is what you’ll see:
See how simple it is? It only took 5-6 lines of Beautiful Soup related code to achieve that. This is something that made me fall in love with Python; it’s simply too elegant and simple.
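To give a feel for what those few lines can look like, here’s a minimal sketch (not the pastebin script itself). It assumes the newer bs4 package (Beautiful Soup 4) on Python 3 with the built-in "html.parser", and it runs against a small made-up HTML snippet; the tag names, class names and the extract_threads helper are hypothetical, and the real forum’s markup will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real forum's tags and class names will differ.
SAMPLE_HTML = """
<ul id="threads">
  <li class="thread">
    <a class="title" href="/t/1">Parsing HTML in Python</a>
    <span class="author">alice</span>
    <span class="date">2011-03-01</span>
  </li>
  <li class="thread">
    <a class="title" href="/t/2">Perl one-liners</a>
    <span class="author">bob</span>
    <span class="date">2011-03-02</span>
  </li>
</ul>
"""


def extract_threads(html):
    """Return one dict per thread with its title, URL, author and date."""
    soup = BeautifulSoup(html, "html.parser")
    threads = []
    # Select tags by name and by attribute value (here, the CSS class).
    for item in soup.find_all("li", class_="thread"):
        title = item.find("a", class_="title")
        author = item.find("span", class_="author")
        date = item.find("span", class_="date")
        threads.append({
            "title": title.get_text(strip=True),
            "url": title["href"],
            "author": author.get_text(strip=True),
            "date": date.get_text(strip=True),
        })
    return threads


threads = extract_threads(SAMPLE_HTML)
for thread in threads:
    print(thread["title"], "by", thread["author"], "on", thread["date"])
```

The pattern is the same whatever the page looks like: find the repeating container tag, then pull the pieces you want out of each one.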
In the past, before I knew Beautiful Soup, I mainly used regular expressions to get the job done. Regular expressions are great for complex tasks, and you can even use them within Beautiful Soup itself if you feel like doing so.
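Here’s a small sketch of mixing the two, again assuming bs4 (Beautiful Soup 4) on Python 3. A compiled regular expression is passed as an attribute filter, so only links whose href matches the pattern are returned. The HTML snippet and the /threads/ URL scheme are invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links; the real forum's URL scheme will differ.
SAMPLE_HTML = """
<a href="/threads/101-intro-to-python">Intro to Python</a>
<a href="/threads/102-perl-tips">Perl tips</a>
<a href="/members/alice">alice</a>
<a href="/faq">FAQ</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Matches only thread links and captures the numeric thread id.
thread_link = re.compile(r"^/threads/(\d+)-")

thread_ids = []
# A compiled pattern can be used as the value of an attribute filter.
for link in soup.find_all("a", href=thread_link):
    match = thread_link.match(link["href"])
    thread_ids.append((int(match.group(1)), link.get_text(strip=True)))

print(thread_ids)
```

Beautiful Soup does the tag hunting, and the regex only has to deal with one short attribute value instead of the whole page.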
Anyway, for those of you who are new to Beautiful Soup, I would recommend playing around with the code and learning it through your own experience.
Don’t forget to look at the Official Beautiful Soup Documentation as reference –> http://www.crummy.com/software/BeautifulSoup/documentation.html