With requests, Beautiful Soup, and the ultimate sitemap parser, we have everything we need to create our own basic link checker. Let us combine the three parts into something useful and check if our web site has broken links.
This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.
Preparation
Before we can use the modules, we need to import them into our application:
import requests
from bs4 import BeautifulSoup
from usp.tree import sitemap_tree_for_homepage
from typing import NamedTuple
I not only want to know which links do not work, I also want to be able to find and fix them on my web site. Therefore, I need to know on which page the broken link is. For that I use a named tuple like this one:
class Page(NamedTuple):
    url: str
    text: str
(The link URL we want to check will be the key in the dictionary we introduce in a moment.)
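A quick sketch (with made-up URLs) of how these pieces fit together: the link target is the dictionary key, and a list of Page tuples is the value.

```python
from typing import NamedTuple

class Page(NamedTuple):
    url: str
    text: str

all_links = {}

# Two different pages link to the same target URL
target = "https://example.com/docs/"
for source in [Page("https://example.com/post-1/", "the docs"),
               Page("https://example.com/post-2/", "documentation")]:
    if target in all_links:
        all_links[target].append(source)
    else:
        all_links[target] = [source]

print(len(all_links[target]))     # 2
print(all_links[target][0].text)  # the docs
```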
Fetch the sitemap
We can use the ultimate sitemap parser to fetch all pages of our web site:
def read_sitemap(domain):
    tree = sitemap_tree_for_homepage(domain)
    pages = [page.url for page in tree.all_pages()]
    return pages
Extract the links
We can now iterate through all our initial pages, use requests to load the page and Beautiful Soup to extract all links we want to check:
def find_links(pages):
    all_links = {}

    for page in pages:
        content = requests.get(page)
        soup = BeautifulSoup(content.text, "html.parser")
        links = soup.find_all("a")

        for link in links:
            if link.get("href") is None:
                continue  # skip <a> tags without a href attribute
            if link.get("href").startswith("#"):
                continue
            if link.get("rel") is not None and "nofollow" in link.get("rel"):
                continue

            link_text = link.get_text().strip()
            link_target = link.get("href")
            source = Page(page, link_text)

            if not link_target.lower().startswith("http"):
                link_target = page + link_target

            if link_target in all_links:
                all_links[link_target].append(source)
            else:
                all_links[link_target] = [source]

    return all_links
We store the links to check as keys in a dictionary and use a list of our pages as the value. That allows us to check a remote URL only once and keeps enough context around to know all the places where we link to that remote URL.
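As an aside, Python's collections.defaultdict can replace the if/else around the dictionary update; a minimal sketch with made-up values:

```python
from collections import defaultdict

# defaultdict creates the empty list on first access, so we can
# append without checking whether the key already exists
all_links = defaultdict(list)
all_links["https://example.com/docs/"].append("post-1")
all_links["https://example.com/docs/"].append("post-2")

print(all_links["https://example.com/docs/"])  # ['post-1', 'post-2']
```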
We skip all links to other parts on the same page and stay away from links marked as “nofollow”.
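To see both filters in action, here is a small sketch against a made-up HTML fragment; only the regular link survives:

```python
from bs4 import BeautifulSoup

# A tiny, made-up document with one in-page anchor, one nofollow
# link, and one regular link
html = """
<a href="#top">Back to top</a>
<a href="https://example.com/ad" rel="nofollow">Ad</a>
<a href="https://example.com/docs">Docs</a>
"""

soup = BeautifulSoup(html, "html.parser")
kept = []
for link in soup.find_all("a"):
    if link.get("href").startswith("#"):
        continue  # skip same-page anchors
    if link.get("rel") is not None and "nofollow" in link.get("rel"):
        continue  # skip links marked as nofollow
    kept.append(link.get("href"))

print(kept)  # ['https://example.com/docs']
```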
Check the links
We can now iterate through the keys of our dictionary with all the links and use a HEAD request to check if there is something behind that URL. We need a little bit of error handling so that our link checker keeps running when a URL is not reachable:
def check_links(all_links):
    status = {}

    for key in all_links:
        try:
            print(f"working on {key}")
            page = requests.head(key, timeout=5)
            code = page.status_code
        except ConnectionRefusedError:
            code = "ConnectionRefusedError"
        except Exception:
            code = "Exception"

        if code in status:
            status[code].append(key)
        else:
            status[code] = [key]

    return status
Every request will get us a status code or the exception name. Whatever we get, we use it as a key in another dictionary and reuse the idea of a list of URLs as the value part of the dictionary.
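Worth knowing: requests wraps low-level socket errors in its own exception types, so a refused connection usually surfaces as requests.exceptions.ConnectionError rather than the builtin ConnectionRefusedError, which means the broad except branch does most of the work. A sketch against a local port where (presumably) nothing listens:

```python
import requests

try:
    # Port 1 on localhost is assumed to have no listener
    requests.head("http://localhost:1", timeout=5)
    code = "reachable"
except ConnectionRefusedError:
    code = "ConnectionRefusedError"
except requests.exceptions.ConnectionError:
    code = "ConnectionError"
except Exception:
    code = "Exception"

print(code)  # ConnectionError
```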
Report the result
As a final step we need to combine the result of the check with our pages that contain those links. For that we iterate through our dictionaries and their lists before we can write everything to a file:
def create_report(all_links, result):
    with open("report_link_status.txt", "w", encoding="utf-8") as f:
        for code in result:
            f.write(f"- {code}\n")
            for page in result[code]:
                f.write(f"\t - {page}\n")
                for source in all_links[page]:
                    f.write(f"\t\t - {source.url} [{source.text}]\n")
Orchestrate the different parts
The only thing left to do is to use the main block to orchestrate the different parts in the right order:
if __name__ == "__main__":
    pages = read_sitemap("https://requests.readthedocs.io/")
    all_links = find_links(pages)
    result = check_links(all_links)
    create_report(all_links, result)
When we run the script, it should create a report like this one and persist it in the file report_link_status.txt:
- 200
	 - https://requests.readthedocs.io/en/stable/user/install/#install
		 - https://requests.readthedocs.io/en/stable/ [Installation]
	 - https://pepy.tech/project/requests
		 - https://requests.readthedocs.io/en/stable/ []
	 - https://pypi.org/project/requests/
		 - https://requests.readthedocs.io/en/stable/ []
		 - https://requests.readthedocs.io/en/stable/ []
		 - https://requests.readthedocs.io/en/stable/ []
		 - https://requests.readthedocs.io/en/stable/ [Requests @ PyPI]
Extension points
If you want to continue with this basic link checker, you could address these shortcomings on your own:
- Change the logger for the ultimate sitemap parser to write into a file instead of the console
- For bigger sites you want to have a progress bar to see that your link checker is still alive
- Not all web sites support HEAD requests; for that a fallback to GET may be in order
- The minimalistic error handling has room for improvement
- You can turn the domain and the report file into parameters that you can change when you call your script in the command line
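For one of the items above, the fallback to GET, a sketch could look like this; the helper name get_status_code and the local demo server are my own additions, not part of the original script:

```python
import http.server
import threading
import requests

def get_status_code(url):
    # Try a HEAD request first; fall back to GET for servers that
    # reject HEAD with 405 Method Not Allowed or 501 Not Implemented
    response = requests.head(url, timeout=5)
    if response.status_code in (405, 501):
        response = requests.get(url, timeout=5)
    return response.status_code

# Demo against a local server whose handler only implements GET,
# so HEAD comes back as 501 and the fallback kicks in
class GetOnly(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
    def log_message(self, *args):
        pass  # keep the demo output quiet

server = http.server.HTTPServer(("127.0.0.1", 0), GetOnly)
threading.Thread(target=server.serve_forever, daemon=True).start()

code = get_status_code(f"http://127.0.0.1:{server.server_port}/")
server.shutdown()
print(code)  # 200
```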
Next
The basic link checker we created in this post is a great example of how you can leverage different Python libraries to create something useful. You can take this code and add more features as you see fit. You will be surprised how far you can go with that. Next week we look at a much lower level of network traffic as we try to read TLS/SSL certificates to figure out when they reach their end of life.