Python Friday #139: Parse Sitemaps With Python

Sitemaps are great starting points if you need to find a list of all pages in a web site. Let us look at a little helper package that allows us to read the different sitemap formats with ease.

This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.

Ultimate sitemap parser

Because of the many ways you can create a sitemap for your web site, extracting the pages is a pain. Thankfully, there are tools like the ultimate sitemap parser that can do it for us. You can install this package with pip:

pip install ultimate-sitemap-parser

Extracting the pages

We can use these four lines to print a list of all pages of a web site:
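A minimal sketch of such a script, wrapped in a function so that nothing is fetched until it is called. It assumes the package's sitemap_tree_for_homepage function from usp.tree and the all_pages() iterator on the returned tree; the name list_page_urls and the tree_loader parameter are hypothetical and only exist so the sketch can be tried without network access.

```python
def list_page_urls(homepage_url, tree_loader=None):
    """Return the URL of every page found in the sitemaps of homepage_url.

    tree_loader is a hypothetical hook for this sketch: any callable that
    takes the homepage URL and returns an object with an all_pages() method.
    When omitted, the ultimate sitemap parser is used.
    """
    if tree_loader is None:
        # Third-party import (pip install ultimate-sitemap-parser),
        # done lazily so the function can be exercised without the package.
        from usp.tree import sitemap_tree_for_homepage
        tree_loader = sitemap_tree_for_homepage

    # Discover and parse all sitemaps reachable from the homepage.
    tree = tree_loader(homepage_url)

    # Each page object is assumed to expose its address as .url.
    return [page.url for page in tree.all_pages()]
```

With the package installed and network access available, printing the pages could then look like `for url in list_page_urls("https://requests.readthedocs.io/"): print(url)`.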

This produces a lot of log messages in the console, but at the end it lists all pages:

https://requests.readthedocs.io/en/stable/
https://requests.readthedocs.io/en/latest/
https://requests.readthedocs.io/en/v3.0.0/
https://requests.readthedocs.io/en/v2.9.1/
https://requests.readthedocs.io/en/v2.8.1/
https://requests.readthedocs.io/en/v2.7.0/
https://requests.readthedocs.io/en/v2.6.2/
https://requests.readthedocs.io/en/v2.5.3/
https://requests.readthedocs.io/en/v2.4.3/
https://requests.readthedocs.io/en/v2.3.0/
https://requests.readthedocs.io/en/v2.2.1/
https://requests.readthedocs.io/en/v2.1.0/
https://requests.readthedocs.io/en/v2.0.0/
https://requests.readthedocs.io/en/v1.2.3/
https://requests.readthedocs.io/en/v1.1.0/
https://requests.readthedocs.io/en/v1.0.4/
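
If the log messages get in the way, one option is to silence them before parsing. The sketch below assumes the messages go through Python's standard logging module at INFO level or below; it uses logging.disable, which affects the whole process, not only the sitemap parser. The helper names quiet_logging and restore_logging are hypothetical.

```python
import logging


def quiet_logging(max_suppressed_level=logging.INFO):
    """Suppress all log records at or below the given level, process-wide."""
    logging.disable(max_suppressed_level)


def restore_logging():
    """Undo quiet_logging() so that all loggers behave normally again."""
    logging.disable(logging.NOTSET)
```

Calling quiet_logging() before the sitemap is fetched and restore_logging() afterwards keeps warnings and errors visible while hiding the informational chatter.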

Supported formats

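The ultimate sitemap parser supports these formats for sitemaps: XML sitemaps, Google News sitemaps, plain text sitemaps, RSS 2.0 and Atom 0.3 / 1.0 sitemaps, and sitemaps linked from robots.txt.

That probably covers everything you may encounter on the web when it comes to sitemaps.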

Next

This package hides the complexity of the different sitemap formats nicely from us. As soon as we get the page objects, we can use them for more interesting things than printing them to the console. Next week we build on top of this module to create our own minimalistic link checker.
