Python Friday #138: Parsing HTML With Beautiful Soup

Getting HTML pages into our Python application is a straightforward task, thanks to requests. But what do we do now? Let us look at how Beautiful Soup can help us get the data out of the often messy HTML code.

This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.

 

Installation

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It offers us an idiomatic way to navigate, search and modify the HTML tree. We can install Beautiful Soup with this command:
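Beautiful Soup 4 is published on PyPI as beautifulsoup4 and imported as bs4, so, assuming pip is available, the usual command is `pip install beautifulsoup4`. A small sanity check after the installation could look like this (the sample HTML is made up):

```python
# Run this after: pip install beautifulsoup4
import bs4
from bs4 import BeautifulSoup

# html.parser ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")

print(bs4.__version__)  # the installed Beautiful Soup version
print(soup.p.text)      # Hello
```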

 

Fixing HTML

HTML is often not well-formed. That is not a problem for most browsers, but it is a big obstacle for us when we try to access the data in our code. Thanks to Beautiful Soup, we can fix HTML on the fly and turn a fragment of HTML with no closing tags into something correct:
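A minimal sketch of that repair, assuming the parser that ships with Python (html.parser) and a made-up fragment:

```python
from bs4 import BeautifulSoup

# A fragment with two tags that are never closed.
broken = "<p>Some <b>bold text"

soup = BeautifulSoup(broken, "html.parser")
fixed = str(soup)

print(fixed)  # <p>Some <b>bold text</b></p>
```

How the tree gets repaired depends on the parser. lxml and html5lib, for example, also wrap the fragment in html and body tags, so the result can differ from the one above.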

 

From requests to soup

We can combine requests and Beautiful Soup to load a website without saving the HTML to disk first:
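One way to sketch this is a small helper. The name load_soup, the timeout and the choice of html.parser are assumptions for this example:

```python
import requests
from bs4 import BeautifulSoup


def load_soup(url):
    """Download a page and parse it in memory, without touching the disk."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on 4xx/5xx answers
    # Passing the raw bytes lets Beautiful Soup work out the encoding itself.
    return BeautifulSoup(response.content, "html.parser")
```

With this in place, a call like `soup = load_soup("https://example.com")` gives us a soup object we can query right away (the URL is only a placeholder).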

 

Access named parts in an HTML document

We can use special fields to access the main parts of an HTML document. We can get the title tag through the title property:
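A short sketch with a made-up page:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Contact list</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title)         # <title>Contact list</title>
print(soup.title.string)  # Contact list
```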

If we want the header, we can use the header property:
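Attribute access uses the tag name, so it pays to be precise here: soup.head returns the head element with the metadata of the page, while soup.header returns the first header element, if the page has one. A small sketch with a made-up page shows both:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Demo</title></head>"
    "<body><header><h1>Welcome</h1></header><p>Text</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

print(soup.head.title.string)  # Demo (from the <head> element)
print(soup.header.h1.string)   # Welcome (from the <header> element)
```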

If we only care about the body, we can get it with the property of the same name:
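For example, with a made-up page, we can grab the body and reduce it to its text:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Contacts</h1><p>Three entries</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

body = soup.body
# get_text() joins all text nodes; strip=True trims the whitespace around them.
body_text = body.get_text(" ", strip=True)
print(body_text)  # Contacts Three entries
```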

That special access works with all tags, but it will only give us the first one in the HTML document:
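A sketch of that behaviour with a made-up list. It also shows that we get None when the tag does not exist at all:

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.li      # same result as soup.find("li")
missing = soup.table  # there is no <table> in this fragment

print(first.string)  # one
print(missing)       # None
```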

 

Find all links

If we want to find more than the first value, we can use the find_all() method. To get all links, we need to ask for all “a” HTML tags:
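A minimal sketch with made-up links. It uses link.get("href") instead of link["href"], so an anchor without an href gives us None rather than a KeyError:

```python
from bs4 import BeautifulSoup

html = (
    '<p><a href="https://example.com/a">A</a> '
    '<a href="/b">B</a> '
    '<a name="top">no link target</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for link in soup.find_all("a"):
    hrefs.append(link.get("href"))
    print(link.text, "->", link.get("href"))
```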

 

Extract tables

Suppose we have a website with a list of contacts, and the table we want to access looks like this:

name          phone           country      email
Harlan Gibbs  1-351-733-8608  Chile        [email protected]
Marny Ashley  1-760-796-7925  South Korea  [email protected]
Chava Dixon   1-828-824-7717  Switzerland  [email protected]

We can use Beautiful Soup to find the table, iterate through the rows and print the column with the email addresses:
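A sketch of this approach on an inline copy of the table. The markup is an assumption, and the e-mail addresses are invented placeholders, because the real ones are not readable here:

```python
from bs4 import BeautifulSoup

# Assumed markup; the e-mail addresses are placeholders.
html = """
<table>
  <tr><th>name</th><th>phone</th><th>country</th><th>email</th></tr>
  <tr><td>Harlan Gibbs</td><td>1-351-733-8608</td><td>Chile</td><td>harlan@example.com</td></tr>
  <tr><td>Marny Ashley</td><td>1-760-796-7925</td><td>South Korea</td><td>marny@example.com</td></tr>
  <tr><td>Chava Dixon</td><td>1-828-824-7717</td><td>Switzerland</td><td>chava@example.com</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

emails = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # The header row only contains <th> cells, so it is skipped here.
    if len(cells) == 4:
        emails.append(cells[3].text)

for email in emails:
    print(email)
```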

This will give us this output:

[email protected]
[email protected]
[email protected]

If we want to get all columns, we can iterate through them like this:
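A possible reconstruction, assuming a separator line before every row and header cells wrapped in `**`. The markup, the separator width and the e-mail addresses are assumptions:

```python
from bs4 import BeautifulSoup

# Assumed markup; the e-mail addresses are placeholders.
html = """
<table>
  <tr><th>name</th><th>phone</th><th>country</th><th>email</th></tr>
  <tr><td>Harlan Gibbs</td><td>1-351-733-8608</td><td>Chile</td><td>harlan@example.com</td></tr>
  <tr><td>Marny Ashley</td><td>1-760-796-7925</td><td>South Korea</td><td>marny@example.com</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

lines = []
for row in table.find_all("tr"):
    lines.append("-" * 50)  # separator before every row
    # Passing a list matches header cells (<th>) and data cells (<td>) alike.
    for cell in row.find_all(["th", "td"]):
        text = cell.get_text(strip=True)
        lines.append(f"**{text}**" if cell.name == "th" else text)

print("\n".join(lines))
```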

The output now looks like this:

--------------------------------------------------
**name**
**phone**
**country**
**email**
--------------------------------------------------
Harlan Gibbs
1-351-733-8608
Chile
[email protected]
--------------------------------------------------
Marny Ashley
1-760-796-7925
South Korea
[email protected]
--------------------------------------------------
Chava Dixon
1-828-824-7717
Switzerland
[email protected]

 

Next

Beautiful Soup is a great help when we work with HTML. It fixes the syntax errors in HTML for us so that we can focus on the interesting part of extracting data. Next week we will look at a way to use a sitemap to find the interesting pages on a website.
