Getting HTML pages into our Python application is a straightforward task, thanks to requests. But what do we do now? Let us look at how Beautiful Soup can help us get the data out of the often messy HTML code.
This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.
Installation
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It offers us an idiomatic way to navigate, search and modify the HTML tree. We can install Beautiful Soup with this command:
pip install -U beautifulsoup4
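To verify that the installation worked, we can import the package and print its version (a quick sketch, not part of the original post):

```python
# Quick sanity check: import Beautiful Soup and print the installed version.
import bs4

print(bs4.__version__)
```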
Fixing HTML
HTML is often not well-formatted. That is not a problem for most browsers, but it is a big obstacle for us when we try to access the data in our code. Thanks to Beautiful Soup, we can fix HTML on the fly and turn a fragment of HTML with no closing tags into something correct:
>>> from bs4 import BeautifulSoup
>>> html_doc = '<p><i><a href="abc">Hello'
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.prettify())
<p>
 <i>
  <a href="abc">
   Hello
  </a>
 </i>
</p>
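The same fix-up works outside the interactive shell. A small sketch that serialises the repaired tree back into a string:

```python
from bs4 import BeautifulSoup

fragment = '<p><i><a href="abc">Hello'  # no closing tags at all
soup = BeautifulSoup(fragment, 'html.parser')

# The parser closes the dangling tags when the input ends,
# so serialising the tree gives us well-formed markup.
print(str(soup))
```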
From requests to soup
We can combine requests and Beautiful Soup to load a website without saving the HTML to disk first:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('https://ImproveAndRepeat.com/')
>>> soup = BeautifulSoup(r.text, 'html.parser')
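In a script it can pay off to wrap these two steps in a small helper that fails early on HTTP errors instead of parsing an error page. A sketch (the helper name fetch_soup is my own, not part of either library):

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url):
    """Fetch a page and return it as a parsed soup (hypothetical helper)."""
    # raise_for_status() turns HTTP error codes into exceptions
    # before we waste time parsing an error page.
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    return BeautifulSoup(r.text, 'html.parser')
```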
Access named parts in an HTML document
We can use special properties to access the main parts of an HTML document. We get the title tag through the title property:
>>> soup.title
<title>Improve & Repeat</title>
>>> soup.title.text
'Improve & Repeat'
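A nice side effect of the text property is that it decodes HTML entities for us. A small sketch with an entity-encoded title:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<title>Improve &amp; Repeat</title>', 'html.parser')

# .text returns the decoded content, with &amp; turned back into &.
print(soup.title.text)
```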
If we want the header, we can use the header property:
>>> soup.head
<head>
<meta charset="utf-8"/>
<link href="https://gmpg.org/xfn/11" rel="profile"/>
<style id="jetpack-boost-critical-css">@media all{
...
If we only care about the body, we can get it with the property of the same name:
>>> soup.body
<body class="home blog wp-embed-responsive post-image-below-header post-image-aligned-center left-sidebar nav-below-header separate-containers fluid-header active-footer-widgets-0 nav-search-enabled nav-aligned-left header-aligned-left dropdown-hover" data-rsssl="1" itemscope="" itemtype="https://schema.org/Blog">
..
That special access works with all tags, but it only gives us the first matching tag in the HTML document:
>>> soup.body.a
<a class="screen-reader-text skip-link" href="#content" title="Skip to content">Skip to content</a>
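This first-match behaviour is the same as calling find(). A small sketch with two links (the sample HTML is my own):

```python
from bs4 import BeautifulSoup

html = '<body><a href="#one">One</a><a href="#two">Two</a></body>'
soup = BeautifulSoup(html, 'html.parser')

# Attribute access returns only the first matching tag...
print(soup.body.a['href'])
# ...which behaves like find() with the tag name:
print(soup.find('a')['href'])
```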
Find all links
If we want more than the first match, we can use the find_all() method. To get all links, we ask for all "a" HTML tags:
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
#content
https://automattic.com/cookies/
https://improveandrepeat.com/
#
https://ImproveAndRepeat.com/
https://improveandrepeat.com/about/
#
https://improveandrepeat.com/2022/08/the-split-phase-refactoring/
https://improveandrepeat.com/author/jg/
https://improveandrepeat.com/tag/testing/
https://improveandrepeat.com/2022/08/python-friday-136-date-and-time-in-python-part-3-dateutil/
...
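Not every "a" tag has an href attribute, so link.get('href') can return None. With the keyword argument href=True we can tell find_all() to skip those anchors right away (a sketch with my own sample HTML):

```python
from bs4 import BeautifulSoup

html = '''<a href="#top">Top</a>
<a href="https://example.com/">Example</a>
<a name="anchor">no href</a>'''
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only tags that actually carry an href attribute,
# so we never get None back.
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
```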
Extract tables
Suppose we have a website with a list of contacts like this:
>>> contacts = """
<html>
<head>
<title></title>
</head>
<body>
<h1>Contact</h1>
<table>
<tr>
<th>name</th>
<th>phone</th>
<th>country</th>
<th>email</th>
</tr>
<tr>
<td>Harlan Gibbs</td>
<td>1-351-733-8608</td>
<td>Chile</td>
<td>[email protected]</td>
</tr>
<tr>
<td>Marny Ashley</td>
<td>1-760-796-7925</td>
<td>South Korea</td>
<td>[email protected]</td>
</tr>
<tr>
<td>Chava Dixon</td>
<td>1-828-824-7717</td>
<td>Switzerland</td>
<td>[email protected]</td>
</tr>
</table>
</body>
</html>
"""
The table we want to access looks like this:
name | phone | country | email
---|---|---|---
Harlan Gibbs | 1-351-733-8608 | Chile | [email protected]
Marny Ashley | 1-760-796-7925 | South Korea | [email protected]
Chava Dixon | 1-828-824-7717 | Switzerland | [email protected]
We can use Beautiful Soup to find the table, iterate through the rows and print the column with the email addresses:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(contacts, 'html.parser')
>>> table = soup.body.table
>>> for row in table.find_all('tr'):
...     columns = row.find_all('td')
...     if columns:
...         print(columns[3].text)
...
This will give us this output:

[email protected]
[email protected]
[email protected]
If we want to get all columns, we can iterate through them like this:
>>> for row in table.find_all('tr'):
...     print(50*'-')
...     headers = row.find_all('th')
...     for header in headers:
...         print(f'**{header.text}**')
...     columns = row.find_all('td')
...     for column in columns:
...         print(column.text)
...
The output now looks like this:
--------------------------------------------------
**name**
**phone**
**country**
**email**
--------------------------------------------------
Harlan Gibbs
1-351-733-8608
Chile
[email protected]
--------------------------------------------------
Marny Ashley
1-760-796-7925
South Korea
[email protected]
--------------------------------------------------
Chava Dixon
1-828-824-7717
Switzerland
[email protected]
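Instead of printing the cells, we can combine the header row and the data rows into a list of dictionaries, which is easier to work with in later steps. A sketch with a trimmed-down version of the contacts table from above:

```python
from bs4 import BeautifulSoup

# A trimmed-down version of the contacts table from above.
contacts = """<table>
<tr><th>name</th><th>phone</th><th>country</th></tr>
<tr><td>Harlan Gibbs</td><td>1-351-733-8608</td><td>Chile</td></tr>
<tr><td>Marny Ashley</td><td>1-760-796-7925</td><td>South Korea</td></tr>
</table>"""

soup = BeautifulSoup(contacts, 'html.parser')
rows = soup.table.find_all('tr')

# The header row supplies the keys, every other row becomes one record.
headers = [th.text for th in rows[0].find_all('th')]
records = [dict(zip(headers, (td.text for td in row.find_all('td'))))
           for row in rows[1:]]
print(records)
```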
Next
Beautiful Soup is a great help when we work with HTML. It fixes the syntax errors in the HTML for us so that we can focus on the interesting part: extracting the data. Next week we look at how to use a sitemap to find the interesting pages on a website.