Python Friday #102: Parse Atom and RSS Feeds With Feedparser

RSS feeds are a nice way to stay informed when your favourite bloggers post new content. Let’s find out how we can work with those feeds in Python.

This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.

Use Feedparser

Feedparser is a parser for all kinds of feeds (Atom, CDF, and nine different versions of RSS). Whatever format your feed is in, feedparser gives you a unified way to work with the data. Take a look at the different date formats feedparser handles to get an idea of how much work it takes off your shoulders.

You can install feedparser with this command:

pip install feedparser

Read the feed metadata

All we need to work with feedparser is the URL to a feed:
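
A minimal sketch of such a script follows. The feed URL is a placeholder I assume for illustration, the two helper names are made up for this example, and the mapping of labels to fields (Site to link, Last updated to updated) is my reading of the output. The feedparser import sits inside the function that needs it, so the formatting helper also works on a plain dict.

```python
# Placeholder URL, assumed for illustration; replace it with your feed.
FEED_URL = "https://example.com/feed/"


def format_feed_metadata(meta):
    """Return the metadata lines for a dict-like feed object.

    Using .get() keeps the sketch from failing on feeds that omit a field.
    """
    return [
        f"Title: {meta.get('title', '')}",
        f"Site: {meta.get('link', '')}",
        f"Generator: {meta.get('generator', '')}",
        f"Last updated: {meta.get('updated', '')}",
    ]


def print_feed_metadata(url):
    """Download and parse the feed, then print its metadata."""
    import feedparser  # third-party package, installed with pip

    parsed = feedparser.parse(url)
    # The feed-level metadata lives in parsed.feed.
    for line in format_feed_metadata(parsed.feed):
        print(line)
```

Calling print_feed_metadata(FEED_URL) runs the sketch against a live feed.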

If we run this code, feedparser shows us the basic metadata that comes with the feed itself:

Title: PythonFriday – Improve & Repeat
Site: https://improveandrepeat.com
Generator: https://wordpress.org/?v=5.8.2
Last updated: Mon, 22 Nov 2021 18:14:13 +0000

This can be useful if you want to display the name or the base URL of the blog. The field for the last update looks helpful, but not all feed generators set it correctly (WordPress sets a date that may be weeks behind the last post).
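
If you need a date you can rely on, one option is to ignore the feed-level field and take the newest publication date from the posts themselves. This is only a sketch: both helper names are hypothetical, and it assumes that the posts carry a published_parsed field (the posts and their date fields are covered in the next section).

```python
import time


def latest_post_date(entries):
    """Return the newest published_parsed value, or None if there is none."""
    dates = [entry.get("published_parsed") for entry in entries]
    dates = [date for date in dates if date]  # skip posts without a usable date
    # Time tuples compare field by field (year, month, day, ...),
    # so max() picks the most recent one.
    return max(dates) if dates else None


def format_latest_post_date(entries):
    """Format the newest post date as YYYY-MM-DD, or 'unknown'."""
    latest = latest_post_date(entries)
    return time.strftime("%Y-%m-%d", latest) if latest else "unknown"
```

With a parsed feed at hand, format_latest_post_date(parsed.entries) gives a value that does not depend on the generator setting the feed date correctly.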

Walk through the posts

Most of the time you will use feedparser to iterate through the posts. You will find them in the entries property:
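
A loop over the entries could look like the sketch below. The helper names are hypothetical, the URL is whatever feed you pass in, and the exact layout of each line is my assumption.

```python
def format_entry(entry):
    """Build the text for one dict-like post: title, date, URL and summary."""
    title = entry.get("title", "")
    published = entry.get("published", "")
    link = entry.get("link", "")
    summary = entry.get("summary", "")
    return f"{title} ({published}) - {link}\n\n{summary}\n"


def print_entries(url):
    """Download and parse the feed, then print every post."""
    import feedparser  # third-party package, installed with pip

    parsed = feedparser.parse(url)
    # parsed.entries is a list with one dict-like object per post.
    for entry in parsed.entries:
        print(format_entry(entry))
```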

This lists the title, the date of publication, the URL of the post, and the summary:

Python Friday #1: Let’s Learn Python (Fri, 03 Jan 2020 19:00:25 +0000) – https://improveandrepeat.com/2020/01/python-friday-1-lets-learn-python/

With the New Year comes a new series of posts in my blog. I want to lean Python and try to document the steps as I go along. I hope this will help me better to remember all the obstacles I hit on the way. Let us see
how this experiment turns out.   Motivation … Read more

We can change the output to fit our needs. For all date fields (like published, updated, created) you will find a *_parsed field that contains the date as a standard Python 9-tuple, as it is used by the Python time module. This allows us to convert the date into a more useful format:
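
One way to do the conversion is time.strftime(), which accepts such a time tuple directly. The sketch below uses a hypothetical helper name and assumes the YYYY-MM-DD format; the fallback text for posts without a date is my own addition.

```python
import time


def format_entry_short(entry):
    """Return 'title (YYYY-MM-DD)' for a dict-like post."""
    parsed_date = entry.get("published_parsed")
    if parsed_date:
        date_text = time.strftime("%Y-%m-%d", parsed_date)
    else:
        date_text = "unknown date"  # the feed gave no usable date
    return f"{entry.get('title', '')} ({date_text})"


def print_entries_short(url):
    """Download and parse the feed, then print one line per post."""
    import feedparser  # third-party package, installed with pip

    parsed = feedparser.parse(url)
    for entry in parsed.entries:
        print(format_entry_short(entry))
```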

This gives us a much more condensed list of posts:

Python Friday #10: Classes (Part 2) (2020-03-06)
Python Friday #9: Classes (Part 1) (2020-02-28)
Python Friday #8: Modules (2020-02-21)
Python Friday #7: Functions (2020-02-14)
Python Friday #6: Control Structures (2020-02-07)
Python Friday #5: Strings (2020-01-31)
Python Friday #4: Lists, Dictionaries, Sets & Tuples (2020-01-24)
Python Friday #3: Numbers, Booleans & None (2020-01-17)
Python Friday #2: Resources to Learn Python (2020-01-10)
Python Friday #1: Let’s Learn Python (2020-01-03)

Find more interesting fields

The documentation for feedparser has the full list of properties you get for a post. While you are coding, you can use this code snippet to list all fields of a post and pick the ones you like:
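
Because a post behaves like a dictionary, its keys are the field names. A small sketch with hypothetical helper names, which looks only at the first post of the feed:

```python
def list_fields(entry):
    """Return the field names of a dict-like post object."""
    return list(entry.keys())


def print_first_post_fields(url):
    """Download and parse the feed, then print the fields of its first post."""
    import feedparser  # third-party package, installed with pip

    parsed = feedparser.parse(url)
    if parsed.entries:  # an empty or unreachable feed has no posts
        for name in list_fields(parsed.entries[0]):
            print(name)
```

Which fields show up depends on the feed, so a different blog may give you a different list.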

The post object should have these fields:

title
title_detail
links
link
comments
authors
author
author_detail
published
published_parsed
tags
id
guidislink
summary
summary_detail
content
wfw_commentrss
slash_comments
post-id

Conclusion

If you want to work with Atom or RSS feeds, do not build a parser yourself. Leverage all the work feedparser does and build your application around the more interesting parts of the problem you want to solve.
