With requests, Beautiful Soup, and the ultimate sitemap parser, we have everything we need to create our own basic link checker. Let us combine the three parts into something useful and check if our web site has broken links.
This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.
Preparation
Before we can use the modules, we need to import them into our application:
import requests
from bs4 import BeautifulSoup
from usp.tree import sitemap_tree_for_homepage
from typing import NamedTuple
I not only want to know which links do not work, I also want to be able to find and fix them on my web site. Therefore, I need to know on which page the broken link is. For that I use a named tuple like this one:
class Page(NamedTuple):
    url: str
    text: str
(The link URL we want to check will be the key in the dictionary we introduce in a moment.)
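A quick sketch (with made-up URLs) of how these pieces fit together: the link target is the dictionary key, and a list of Page tuples is the value.

```python
from typing import NamedTuple

class Page(NamedTuple):
    url: str
    text: str

all_links = {}

# Two different pages link to the same target URL
target = "https://example.com/docs/"
for source in [Page("https://example.com/post-1/", "the docs"),
               Page("https://example.com/post-2/", "documentation")]:
    if target in all_links:
        all_links[target].append(source)
    else:
        all_links[target] = [source]

print(len(all_links[target]))     # 2
print(all_links[target][0].text)  # the docs
```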
Fetch the sitemap
We can use the ultimate sitemap parser to fetch all pages of our web site:
def read_sitemap(domain):
    tree = sitemap_tree_for_homepage(domain)
    pages = [page.url for page in tree.all_pages()]
    return pages
Extract the links
We can now iterate through all our initial pages, use requests to load the page and Beautiful Soup to extract all links we want to check:
def find_links(pages):
    all_links = {}

    for page in pages:
        content = requests.get(page)
        soup = BeautifulSoup(content.text, "html.parser")
        links = soup.find_all("a")

        for link in links:
            if link.get("href") is None:
                continue  # skip <a> tags without a href attribute
            if link.get("href").startswith("#"):
                continue
            if link.get("rel") is not None and "nofollow" in link.get("rel"):
                continue

            link_text = link.get_text().strip()
            link_target = link.get("href")
            source = Page(page, link_text)

            if not link_target.lower().startswith("http"):
                link_target = page + link_target

            if link_target in all_links:
                all_links[link_target].append(source)
            else:
                all_links[link_target] = [source]

    return all_links
We store the links to check as keys in a dictionary and use a list of our pages as the value. That allows us to check a remote URL only once and keeps enough context around to know all the places where we link to that remote URL.
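As an aside, Python's collections.defaultdict can replace the if/else around the dictionary update; a minimal sketch with made-up values:

```python
from collections import defaultdict

# defaultdict creates the empty list on first access, so we can
# append without checking whether the key already exists
all_links = defaultdict(list)
all_links["https://example.com/docs/"].append("post-1")
all_links["https://example.com/docs/"].append("post-2")

print(all_links["https://example.com/docs/"])  # ['post-1', 'post-2']
```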
We skip all links to other parts on the same page and stay away from links marked as “nofollow”.
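To see both filters in action, here is a small sketch against a made-up HTML fragment; only the regular link survives:

```python
from bs4 import BeautifulSoup

# A tiny, made-up document with one in-page anchor, one nofollow
# link, and one regular link
html = """
<a href="#top">Back to top</a>
<a href="https://example.com/ad" rel="nofollow">Ad</a>
<a href="https://example.com/docs">Docs</a>
"""

soup = BeautifulSoup(html, "html.parser")
kept = []
for link in soup.find_all("a"):
    if link.get("href").startswith("#"):
        continue  # skip same-page anchors
    if link.get("rel") is not None and "nofollow" in link.get("rel"):
        continue  # skip links marked as nofollow
    kept.append(link.get("href"))

print(kept)  # ['https://example.com/docs']
```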
Check the links
We can now iterate through the keys of our dictionary with all the links and use a HEAD request to check if there is something behind that URL. We need a little bit of error handling so that our link checker keeps running when a URL is not reachable:
def check_links(all_links):
    status = {}

    for key in all_links:
        try:
            print(f"working on {key}")
            page = requests.head(key, timeout=5)
            code = page.status_code
        except ConnectionRefusedError:
            code = "ConnectionRefusedError"
        except Exception:
            code = "Exception"

        if code in status:
            status[code].append(key)
        else:
            status[code] = [key]

    return status
Every request will get us a status code or the exception name. Whatever we get, we use it as a key in another dictionary and reuse the idea of a list of URLs as the value part of the dictionary.
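Worth knowing: requests wraps low-level socket errors in its own exception types, so a refused connection usually surfaces as requests.exceptions.ConnectionError rather than the builtin ConnectionRefusedError, which means the broad except branch does most of the work. A sketch against a local port where (presumably) nothing listens:

```python
import requests

try:
    # Port 1 on localhost is assumed to have no listener
    requests.head("http://localhost:1", timeout=5)
    code = "reachable"
except ConnectionRefusedError:
    code = "ConnectionRefusedError"
except requests.exceptions.ConnectionError:
    code = "ConnectionError"
except Exception:
    code = "Exception"

print(code)  # ConnectionError
```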
Report the result
As a final step we need to combine the result of the check with our pages that contain those links. For that we iterate through our dictionaries and their lists before we can write everything to a file:
def create_report(all_links, result):
    with open("report_link_status.txt", "w", encoding="utf-8") as f:
        for code in result:
            f.write(f"- {code}\n")
            for page in result[code]:
                f.write(f"\t - {page}\n")
                for source in all_links[page]:
                    f.write(f"\t\t - {source.url} [{source.text}]\n")
Orchestrate the different parts
The only thing left to do is to use the main block to orchestrate the different parts in the right order:
if __name__ == "__main__":
    pages = read_sitemap("https://requests.readthedocs.io/")
    all_links = find_links(pages)
    result = check_links(all_links)
    create_report(all_links, result)
When we run the script, it should create a report like this one and persist it in the file report_link_status.txt:
- 200
	 - https://requests.readthedocs.io/en/stable/user/install/#install
		 - https://requests.readthedocs.io/en/stable/ [Installation]
	 - https://pepy.tech/project/requests
		 - https://requests.readthedocs.io/en/stable/ []
	 - https://pypi.org/project/requests/
		 - https://requests.readthedocs.io/en/stable/ []
		 - https://requests.readthedocs.io/en/stable/ []
		 - https://requests.readthedocs.io/en/stable/ []
		 - https://requests.readthedocs.io/en/stable/ [Requests @ PyPI]
Extension points
If you want to continue with this basic link checker, you could address these shortcomings on your own:
- Change the logger for the ultimate sitemap parser to write into a file instead of the console
- For bigger sites you want to have a progress bar to see that your link checker is still alive
- Not all web sites support HEAD requests; for that a fallback to GET may be in order
- The minimalistic error handling has room for improvement
- You can turn the domain and the report file into parameters that you can change when you call your script in the command line
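For one of the items above, the fallback to GET, a sketch could look like this; the helper name get_status_code and the local demo server are my own additions, not part of the original script:

```python
import http.server
import threading
import requests

def get_status_code(url):
    # Try a HEAD request first; fall back to GET for servers that
    # reject HEAD with 405 Method Not Allowed or 501 Not Implemented
    response = requests.head(url, timeout=5)
    if response.status_code in (405, 501):
        response = requests.get(url, timeout=5)
    return response.status_code

# Demo against a local server whose handler only implements GET,
# so HEAD comes back as 501 and the fallback kicks in
class GetOnly(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
    def log_message(self, *args):
        pass  # keep the demo output quiet

server = http.server.HTTPServer(("127.0.0.1", 0), GetOnly)
threading.Thread(target=server.serve_forever, daemon=True).start()

code = get_status_code(f"http://127.0.0.1:{server.server_port}/")
server.shutdown()
print(code)  # 200
```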
Next
The basic link checker we created in this post is a great example of how you can leverage different Python libraries to create something useful. You can take this code and add more features as you see fit. You will be surprised how far you can go with that. Next week we look at a much lower level of network traffic as we try to read TLS/SSL certificates to figure out when they reach their end of life.