Python Friday #140: Create a Basic Link Checker

With requests, Beautiful Soup, and the ultimate sitemap parser we have everything we need to create our own basic link checker. Let us combine the three parts into something useful and check whether our web site has broken links.

This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.

Preparation

Before we can use the modules, we need to import them into our application:
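
A minimal set of imports that covers the sketches below could look like this:

    from collections import namedtuple

    import requests
    from bs4 import BeautifulSoup
    from usp.tree import sitemap_tree_for_homepage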

I not only want to know which links do not work, I also want to be able to find and fix them on my web site. Therefore, I need to know on which page the broken link appears. For that I use a named tuple like this one:
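
A sketch of such a tuple, with field names of my own choosing, could look like this:

    # Where a link occurs: the page it appears on and the anchor text of the link
    Occurrence = namedtuple('Occurrence', ['page', 'text'])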

(The link URL we want to check will be the key in the dictionary we introduce in a moment.)

Fetch the sitemap

We can use the ultimate sitemap parser to fetch all pages of our web site:
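
The ultimate sitemap parser does the heavy lifting with sitemap_tree_for_homepage(), which discovers the sitemaps of a site and lets us iterate over all its pages. A small helper around it (the function name is my choice) could look like this:

    def get_pages(site_url):
        """Fetch the sitemap tree and return the URLs of all listed pages."""
        tree = sitemap_tree_for_homepage(site_url)
        return [page.url for page in tree.all_pages()]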

Extract the links

We can now iterate through all our initial pages, use requests to load the page and Beautiful Soup to extract all links we want to check:
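
A sketch of this step, reusing the Occurrence tuple from above, could look like this:

    def extract_links(pages):
        """Collect all link URLs together with the pages they appear on."""
        links_to_check = {}
        for page in pages:
            response = requests.get(page, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')
            for anchor in soup.find_all('a', href=True):
                url = anchor['href']
                if url.startswith('#'):
                    continue  # link to another part of the same page
                if 'nofollow' in (anchor.get('rel') or []):
                    continue  # stay away from nofollow links
                occurrence = Occurrence(page=page, text=anchor.get_text(strip=True))
                links_to_check.setdefault(url, []).append(occurrence)
        return links_to_check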

We store the links to check in a dictionary and use a list of our pages as its value. That allows us to check a remote URL only once and still keep enough context around to know all the places where we link to that remote URL.

We skip all links that point to other parts of the same page and stay away from links marked as “nofollow”.

Check the links

We can now iterate through the keys of our dictionary with all the links and use a HEAD request to check if there is something behind that URL. We need a little bit of error handling so that our link checker keeps running if a URL is not reachable:
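
A sketch of this check, with a timeout and a broad catch so that one bad URL cannot stop the whole run, could look like this:

    def check_links(links_to_check):
        """Send a HEAD request to every URL and group the URLs by outcome."""
        results = {}
        for url in links_to_check:
            try:
                # Follow redirects so a moved page does not show up as broken
                response = requests.head(url, timeout=10, allow_redirects=True)
                outcome = response.status_code
            except requests.exceptions.RequestException as error:
                outcome = type(error).__name__
            results.setdefault(outcome, []).append(url)
        return results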

Every request gets us either a status code or an exception name. Whatever we get, we use it as a key in another dictionary and reuse the idea of a list as the value, this time filled with the URLs that produced that outcome.

Report the result

As a final step we need to combine the results of the check with the pages that contain those links. For that we iterate through our dictionaries and their lists before we write everything to a file:
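
One way to write such a report (the file layout is my own invention) could look like this:

    def write_report(results, links_to_check, file_name='report_link_status.txt'):
        """Write every outcome with its URLs and the pages that link to them."""
        with open(file_name, 'w', encoding='utf-8') as report:
            for outcome in sorted(results, key=str):
                report.write(f'Status: {outcome}\n')
                for url in results[outcome]:
                    report.write(f'  {url}\n')
                    for occurrence in links_to_check[url]:
                        report.write(f'    on {occurrence.page} ("{occurrence.text}")\n')
                report.write('\n')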

Orchestrate the different parts

The only thing left to do is to use the main block to orchestrate the different parts in the right order:
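
With the helpers sketched above, the main block could look like this (replace the placeholder domain with your own site):

    if __name__ == '__main__':
        pages = get_pages('https://www.example.com/')
        links_to_check = extract_links(pages)
        results = check_links(links_to_check)
        write_report(results, links_to_check)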

When we run the script, it should create a report like the one below and persist it in the file report_link_status.txt:
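
With the report sketch above, the output could look something like this (all URLs are placeholders):

    Status: 200
      https://www.example.com/about/
        on https://www.example.com/ ("About")

    Status: 404
      https://partner.example/old-offer
        on https://www.example.com/deals/ ("Our partner offer")

    Status: ConnectionError
      https://gone.example/
        on https://www.example.com/links/ ("A useful site")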

Extension points

If you want to continue with this basic link checker, you could address these shortcomings on your own:

  • Change the logger for the ultimate sitemap parser to write to a file instead of the console
  • For bigger sites you may want a progress bar to see that your link checker is still alive
  • Not all web sites support HEAD requests; a fallback to GET may be in order
  • The minimalistic error handling has room for improvement
  • You can turn the domain and the report file into parameters that you set when you call your script from the command line (see the sketch after this list)
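
As a sketch for the last point, argparse from the standard library is enough to turn both values into command line parameters:

    import argparse

    def parse_arguments():
        """Read the domain and the report file from the command line."""
        parser = argparse.ArgumentParser(description='Basic link checker')
        parser.add_argument('domain', help='homepage URL of the site to check')
        parser.add_argument('--report', default='report_link_status.txt',
                            help='file for the link status report')
        return parser.parse_args()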

Next

The basic link checker we created in this post is a great example of how you can leverage different Python libraries to create something useful. You can take this code and add more features as you see fit. You will be surprised how far you can go with that. Next week we look at a much lower level of network traffic as we try to read TLS/SSL certificates to figure out when they reach their end of life.