Sitemaps are great starting points if you need to find a list of all pages in a web site. Let us look at a little helper package that allows us to read the different sitemap formats with ease.
This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.
Ultimate sitemap parser
Because of the many ways you can create a sitemap for your web site, extracting the pages is a pain. Thankfully, there are tools like the ultimate sitemap parser that can do it for us. You can install this package with this command:
pip install ultimate_sitemap_parser
Extracting the pages
We can use these four lines to print a list of all pages of a web site:
from usp.tree import sitemap_tree_for_homepage
tree = sitemap_tree_for_homepage('https://requests.readthedocs.io/')
for page in tree.all_pages():
    print(page.url)
This produces a lot of log messages in the console, but at the end it lists all pages:
https://requests.readthedocs.io/en/stable/
https://requests.readthedocs.io/en/latest/
https://requests.readthedocs.io/en/v3.0.0/
https://requests.readthedocs.io/en/v2.9.1/
https://requests.readthedocs.io/en/v2.8.1/
https://requests.readthedocs.io/en/v2.7.0/
https://requests.readthedocs.io/en/v2.6.2/
https://requests.readthedocs.io/en/v2.5.3/
https://requests.readthedocs.io/en/v2.4.3/
https://requests.readthedocs.io/en/v2.3.0/
https://requests.readthedocs.io/en/v2.2.1/
https://requests.readthedocs.io/en/v2.1.0/
https://requests.readthedocs.io/en/v2.0.0/
https://requests.readthedocs.io/en/v1.2.3/
https://requests.readthedocs.io/en/v1.1.0/
https://requests.readthedocs.io/en/v1.0.4/
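If the log messages get in the way, one option is to mute them while the sitemap is fetched. The following is only a sketch: it assumes the package writes its messages through Python's standard logging module, and the helper name quiet_logging is my own invention, not something the package offers.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_logging(up_to=logging.WARNING):
    """Temporarily suppress all log records at or below `up_to`."""
    # Remember the current global threshold so we can restore it afterwards.
    previous = logging.root.manager.disable
    logging.disable(up_to)
    try:
        yield
    finally:
        logging.disable(previous)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("demo")  # stand-in for a chatty library logger

    with quiet_logging():
        log.info("this message is suppressed")
        log.error("errors still get through")
    log.info("logging works as before")
```

Wrapping the call to sitemap_tree_for_homepage() in a `with quiet_logging():` block should then leave only the printed URLs. Be aware that logging.disable() affects every logger in the process, not just the ones of this package.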
Supported formats
The ultimate sitemap parser supports these formats for sitemaps:
- XML sitemaps
- Google News sitemaps
- plain text sitemaps
- RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps
- Sitemaps linked from robots.txt
That probably covers everything you may encounter on the web when it comes to sitemaps.
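To get a feeling for what the package hides from us, here is a rough sketch that handles three of these formats by hand with nothing but the standard library. This is not how the ultimate sitemap parser works internally. The function names and the example.com URLs are made up for illustration, and the code skips everything a real parser has to deal with (sitemap index files, compression, broken XML).

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by XML sitemaps according to the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_xml_sitemap(xml_text):
    """Return the <loc> value of every <url> entry in an XML sitemap."""
    root = ET.fromstring(xml_text)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def urls_from_text_sitemap(text):
    """Return the URLs of a plain text sitemap (one URL per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def sitemaps_from_robots_txt(robots_text):
    """Return the sitemap URLs announced in a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() needs Python 3.8+ and returns None if nothing is listed.
    return parser.site_maps() or []


if __name__ == "__main__":
    xml_sitemap = """
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://example.com/</loc></url>
      <url><loc>https://example.com/about</loc></url>
    </urlset>
    """.strip()
    robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"

    print(urls_from_xml_sitemap(xml_sitemap))
    print(urls_from_text_sitemap("https://example.com/\nhttps://example.com/about\n"))
    print(sitemaps_from_robots_txt(robots))
```

Even this toy version needs a separate function per format, which is exactly the kind of work I am happy to leave to a package.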
Next
This package hides the complexity of the different sitemap formats nicely from us. As soon as we get the page objects, we can use them for more interesting things than printing them to the console. Next week we build on top of this module to create our own minimalistic link checker.
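As a small taste of working with the page objects, the sketch below collects each URL only once, should a site list the same page in more than one sitemap. The helper unique_urls is hypothetical and only relies on the url attribute we used above. The SimpleNamespace objects are stand-ins so that the example runs without a network connection.

```python
from types import SimpleNamespace


def unique_urls(pages):
    """Yield each page URL once, in the order it was first seen."""
    seen = set()
    for page in pages:
        if page.url not in seen:
            seen.add(page.url)
            yield page.url


if __name__ == "__main__":
    # Stand-ins for the objects returned by tree.all_pages().
    pages = [
        SimpleNamespace(url="https://example.com/"),
        SimpleNamespace(url="https://example.com/about"),
        SimpleNamespace(url="https://example.com/"),
    ]
    for url in unique_urls(pages):
        print(url)
```

With the real package you would pass tree.all_pages() instead of the hand-made list.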