Getting HTML pages into our Python application is a straightforward task, thanks to requests. But what do we do now? Let us look at how Beautiful Soup can help us get the data out of the often messy HTML code.
This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.
Installation
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It offers us an idiomatic way to navigate, search and modify the HTML tree. We can install Beautiful Soup with this command:
pip install -U beautifulsoup4
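To verify that the installation worked, we can import the package and print its version (a quick sketch, not part of the original post):

```python
# Quick sanity check: import Beautiful Soup and print the installed version.
import bs4

print(bs4.__version__)
```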
Fixing HTML
HTML is often not well-formatted. That is not a problem for most browsers, but it is a big obstacle for us when we try to access the data in our code. Thanks to Beautiful Soup, we can fix HTML on the fly and turn a fragment of HTML with no closing tags into something correct:
>>> from bs4 import BeautifulSoup
>>> html_doc = '<p><i><a href="abc">Hello'
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.prettify())
<p>
 <i>
  <a href="abc">
   Hello
  </a>
 </i>
</p>
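The same fix-up works outside the interactive shell. A small sketch that serialises the repaired tree back into a string:

```python
from bs4 import BeautifulSoup

fragment = '<p><i><a href="abc">Hello'  # no closing tags at all
soup = BeautifulSoup(fragment, 'html.parser')

# The parser closes the dangling tags when the input ends,
# so serialising the tree gives us well-formed markup.
print(str(soup))
```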
From requests to soup
We can combine requests and Beautiful Soup to load a website without saving the HTML to disk first:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('https://ImproveAndRepeat.com/')
>>> soup = BeautifulSoup(r.text, 'html.parser')
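In a script it can pay off to wrap these two steps in a small helper that fails early on HTTP errors instead of parsing an error page. A sketch (the helper name fetch_soup is my own, not part of either library):

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url):
    """Fetch a page and return it as a parsed soup (hypothetical helper)."""
    # raise_for_status() turns HTTP error codes into exceptions
    # before we waste time parsing an error page.
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    return BeautifulSoup(r.text, 'html.parser')
```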
Access named parts in an HTML document
We can use special properties to access the main parts of an HTML document. We get the title tag through the title property:
>>> soup.title
<title>Improve & Repeat</title>
>>> soup.title.text
'Improve & Repeat'
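A nice side effect of the text property is that it decodes HTML entities for us. A small sketch with an entity-encoded title:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<title>Improve &amp; Repeat</title>', 'html.parser')

# .text returns the decoded content, with &amp; turned back into &.
print(soup.title.text)
```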
If we want the header, we can use the header property:
>>> soup.head
<head>
<meta charset="utf-8"/>
<link href="https://gmpg.org/xfn/11" rel="profile"/>
<style id="jetpack-boost-critical-css">@media all{
...
If we only care about the body, we can get it with the property of the same name:
>>> soup.body
<body class="home blog wp-embed-responsive post-image-below-header post-image-aligned-center left-sidebar nav-below-header separate-containers fluid-header active-footer-widgets-0 nav-search-enabled nav-aligned-left header-aligned-left dropdown-hover" data-rsssl="1" itemscope="" itemtype="https://schema.org/Blog">
..
That special access works with all tags, but it only gives us the first matching tag in the HTML document:
>>> soup.body.a
<a class="screen-reader-text skip-link" href="#content" title="Skip to content">Skip to content</a>
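This first-match behaviour is the same as calling find(). A small sketch with two links (the sample HTML is my own):

```python
from bs4 import BeautifulSoup

html = '<body><a href="#one">One</a><a href="#two">Two</a></body>'
soup = BeautifulSoup(html, 'html.parser')

# Attribute access returns only the first matching tag...
print(soup.body.a['href'])
# ...which behaves like find() with the tag name:
print(soup.find('a')['href'])
```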
Find all links
If we want more than the first match, we can use the find_all() method. To get all links, we ask for all "a" HTML tags:
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
#content
https://automattic.com/cookies/
https://improveandrepeat.com/
#
https://ImproveAndRepeat.com/
https://improveandrepeat.com/about/
#
https://improveandrepeat.com/2022/08/the-split-phase-refactoring/
https://improveandrepeat.com/author/jg/
https://improveandrepeat.com/tag/testing/
https://improveandrepeat.com/2022/08/python-friday-136-date-and-time-in-python-part-3-dateutil/
...
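Not every "a" tag has an href attribute, so link.get('href') can return None. With the keyword argument href=True we can tell find_all() to skip those anchors right away (a sketch with my own sample HTML):

```python
from bs4 import BeautifulSoup

html = '''<a href="#top">Top</a>
<a href="https://example.com/">Example</a>
<a name="anchor">no href</a>'''
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only tags that actually carry an href attribute,
# so we never get None back.
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
```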
Extract tables
Suppose we have a website with a list of contacts like this:
>>> contacts = """
<html>
<head>
<title></title>
</head>
<body>
<h1>Contact</h1>
<table>
<tr>
<th>name</th>
<th>phone</th>
<th>country</th>
<th>email</th>
</tr>
<tr>
<td>Harlan Gibbs</td>
<td>1-351-733-8608</td>
<td>Chile</td>
<td>[email protected]</td>
</tr>
<tr>
<td>Marny Ashley</td>
<td>1-760-796-7925</td>
<td>South Korea</td>
<td>[email protected]</td>
</tr>
<tr>
<td>Chava Dixon</td>
<td>1-828-824-7717</td>
<td>Switzerland</td>
<td>[email protected]</td>
</tr>
</table>
</body>
</html>
"""
The table we want to access looks like this:
name | phone | country | email
---|---|---|---
Harlan Gibbs | 1-351-733-8608 | Chile | [email protected]
Marny Ashley | 1-760-796-7925 | South Korea | [email protected]
Chava Dixon | 1-828-824-7717 | Switzerland | [email protected]
We can use Beautiful Soup to find the table, iterate through the rows and print the column with the email addresses:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(contacts, 'html.parser')
>>> table = soup.body.table
>>> for row in table.find_all('tr'):
...     columns = row.find_all('td')
...     if columns:
...         print(columns[3].text)
...
This will give us this output:

[email protected]
[email protected]
[email protected]
If we want to get all columns, we can iterate through them like this:
>>> for row in table.find_all('tr'):
...     print(50*'-')
...     headers = row.find_all('th')
...     for header in headers:
...         print(f'**{header.text}**')
...     columns = row.find_all('td')
...     for column in columns:
...         print(column.text)
...
The output now looks like this:
--------------------------------------------------
**name**
**phone**
**country**
**email**
--------------------------------------------------
Harlan Gibbs
1-351-733-8608
Chile
[email protected]
--------------------------------------------------
Marny Ashley
1-760-796-7925
South Korea
[email protected]
--------------------------------------------------
Chava Dixon
1-828-824-7717
Switzerland
[email protected]
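Instead of printing the cells, we can combine the header row and the data rows into a list of dictionaries, which is easier to work with in later steps. A sketch with a trimmed-down version of the contacts table from above:

```python
from bs4 import BeautifulSoup

# A trimmed-down version of the contacts table from above.
contacts = """<table>
<tr><th>name</th><th>phone</th><th>country</th></tr>
<tr><td>Harlan Gibbs</td><td>1-351-733-8608</td><td>Chile</td></tr>
<tr><td>Marny Ashley</td><td>1-760-796-7925</td><td>South Korea</td></tr>
</table>"""

soup = BeautifulSoup(contacts, 'html.parser')
rows = soup.table.find_all('tr')

# The header row supplies the keys, every other row becomes one record.
headers = [th.text for th in rows[0].find_all('th')]
records = [dict(zip(headers, (td.text for td in row.find_all('td'))))
           for row in rows[1:]]
print(records)
```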
Next
Beautiful Soup is a great help when we work with HTML. It fixes the syntax errors in the HTML for us so that we can focus on the interesting part: extracting the data. Next week we look at how to use a sitemap to find the interesting pages on a website.