Python Friday #139: Parse Sitemaps With Python

Sitemaps are great starting points if you need to find a list of all pages in a web site. Let us look at a little helper package that allows us to read the different sitemap formats with ease.

This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.

Ultimate sitemap parser

Because of the many ways you can create a sitemap for your web site, extracting the pages is a pain. Thankfully, there are tools like the ultimate sitemap parser that can do it for us. You can install this package with this command:

pip install ultimate_sitemap_parser

Extracting the pages

We can use these four lines to print a list of all pages of a web site:

from usp.tree import sitemap_tree_for_homepage
tree = sitemap_tree_for_homepage('https://requests.readthedocs.io/')
for page in tree.all_pages():
    print(page.url)

This produces a lot of log messages in the console, but at the end it lists all pages:

https://requests.readthedocs.io/en/stable/
https://requests.readthedocs.io/en/latest/
https://requests.readthedocs.io/en/v3.0.0/
https://requests.readthedocs.io/en/v2.9.1/
https://requests.readthedocs.io/en/v2.8.1/
https://requests.readthedocs.io/en/v2.7.0/
https://requests.readthedocs.io/en/v2.6.2/
https://requests.readthedocs.io/en/v2.5.3/
https://requests.readthedocs.io/en/v2.4.3/
https://requests.readthedocs.io/en/v2.3.0/
https://requests.readthedocs.io/en/v2.2.1/
https://requests.readthedocs.io/en/v2.1.0/
https://requests.readthedocs.io/en/v2.0.0/
https://requests.readthedocs.io/en/v1.2.3/
https://requests.readthedocs.io/en/v1.1.0/
https://requests.readthedocs.io/en/v1.0.4/
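
If the log messages get in the way, one option is to raise the level of the package's loggers with Python's standard logging module. The following is only a sketch: it assumes that the package logs through loggers whose names start with usp (the module name used in the import above), which may differ between versions, and quiet_loggers is a hypothetical helper name, not part of the package.

```python
import logging


def quiet_loggers(prefix="usp", level=logging.WARNING):
    """Raise the level of every already-created logger under `prefix`.

    Assumption: the package uses loggers named "usp" or "usp.<module>".
    Returns the sorted names of the loggers that were changed.
    """
    changed = []
    # loggerDict holds every logger created so far (plus placeholder entries).
    for name, logger in list(logging.root.manager.loggerDict.items()):
        if not isinstance(logger, logging.Logger):
            continue  # skip placeholders that are not real loggers
        if name == prefix or name.startswith(prefix + "."):
            logger.setLevel(level)
            changed.append(name)
    return sorted(changed)
```

Call quiet_loggers() after the import from usp.tree and before sitemap_tree_for_homepage(), because only loggers that already exist at that moment are changed.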

Supported formats

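According to its README, the ultimate sitemap parser supports these formats for sitemaps: XML sitemaps, Google News sitemaps, plain text sitemaps, RSS 2.0 and Atom 0.3 / 1.0 sitemaps, and sitemaps linked from robots.txt.

That probably covers everything you may encounter on the web when it comes to sitemaps.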
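
To get a feeling for what the package takes care of, here is a rough sketch that reads only the simplest case, a plain XML sitemap, with the standard library. The sample document and the function name urls_from_xml_sitemap are made up for illustration, and this is not how the ultimate sitemap parser works internally; it ignores sitemap indexes, compression, robots.txt and all the other formats.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample; a real sitemap would be downloaded from the web site.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def urls_from_xml_sitemap(xml_text):
    """Return the <loc> values of all <url> entries in a plain XML sitemap."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


for url in urls_from_xml_sitemap(SAMPLE):
    print(url)
```

Even this small case needs namespace handling, which hints at how much work the other formats would add if we had to cover them ourselves.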

This package hides the complexity of the different sitemap formats nicely from us. As soon as we get the page objects, we can use them for more interesting things than printing them to the console. Next week we build on top of this module to create our own minimalistic link checker.
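
As a small example of doing more with the page objects than printing them, the sketch below collects the URLs into a sorted list without duplicates. It relies only on the url attribute used above; unique_urls and list_site_urls are hypothetical helper names, and whether a site's sitemaps actually contain duplicates depends on the site.

```python
def unique_urls(pages):
    """Return the sorted, de-duplicated URLs of objects that have a .url attribute."""
    return sorted({page.url for page in pages})


def list_site_urls(homepage):
    """Fetch the sitemaps of `homepage` (needs network access) and return its URLs."""
    # Imported here so that unique_urls() can be used without the package installed.
    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage(homepage)
    return unique_urls(tree.all_pages())
```

Calling list_site_urls('https://requests.readthedocs.io/') should then give the same pages as the loop above, but as a list that can be passed on to other code.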
