Python Friday #187: Extracting the NDC Talks Data From YouTube

It is time for a hands-on walkthrough to collect data and turn it into useful plots. For this exercise we are going to collect metadata from YouTube and run it through Seaborn to figure out if my hunch is correct or not.

This post is part of my journey to learn Python. You can find the code for this post in my PythonFriday repository on GitHub.

 

Thesis: The talks at NDC Oslo 2023 were shorter than usual

I was not happy with the length of the talks I attended at this year’s NDC Oslo conference. Whenever I checked their YouTube channel, I saw more and more talks that were significantly short of their 60-minute slot. Am I just ignoring all evidence that would prove my gut feeling wrong, letting confirmation bias get the better of me, or can I prove my point with data?

If I am right, I expect to find these 3 measurable results for talks at NDC Oslo 2023 compared with the previous years:

  1. The average duration is lower.
  2. There are more talks below 50 minutes.
  3. The median duration is lower.

 

YouTube as our data source

NDC Conferences publish the talks of their conferences on their YouTube channel. While there is a bit of advertising before and after the recorded talks, this overhead is negligible, and we can use the video length as a proxy measurement for the talk duration.

The YouTube channel of NDC currently has 3301 videos. Besides the talks, we also find promo videos for their conferences and workshops. That should give us enough data points to (dis-)prove the thesis.

 

How to extract the metadata from YouTube?

There are 3 main approaches we can use for scraping YouTube:

  1. We use the official API and iterate through the videos in the channel.
  2. We automate a browser (with Selenium or Playwright), work through the endless scroll, and extract the data with Beautiful Soup.
  3. We use a dedicated scraping tool for YouTube.

The API does not seem to offer the metadata I am interested in, and browser automation was already the topic of many posts in the past. Therefore, it is time for option 3.

 

Install Scrapetube

We can install Scrapetube with this command:
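```
pip install scrapetube
```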

To run the examples, we need the Id of the NDC channel. We find the Id on the About page, where we can copy it via the share icon:

Click on the share icon and then on Copy channel ID.

This should give us an Id like this one:

UCTdw38Cw6jcm0atBPA39a0Q

To check if everything is in place, we can run this example code with the Id we extracted:
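A minimal sketch, assuming scrapetube.get_channel() accepts the channel Id and yields one dictionary per video with a videoId entry:

```python
import itertools

import scrapetube

channel_id = "UCTdw38Cw6jcm0atBPA39a0Q"

# get_channel() returns a generator that yields one dictionary per video
videos = scrapetube.get_channel(channel_id)

# print the Ids of the first few videos to verify that everything works
for video in itertools.islice(videos, 9):
    print(video["videoId"])
```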

This should give us a list of video Ids:

tKnNqftbT4Q
1DuxTlvmaNM
AYU0vw6IyUY
1K1LRzXeO_4
iDWwoz9ZUzw
I2NqMbKc8K4
cae0jXRrNno
vXLz_U4xzkc
Nyj2O8gHJZM

 

Fight with Scrapetube

While Scrapetube has a small API, an example of how to access the return values is nowhere to be found. We can overcome this obstacle with a little helper that prints the keys and values of the result set:
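One way to do that is to pretty-print the raw dictionaries as JSON (a small sketch, using the same get_channel() call as above):

```python
import json

import scrapetube

channel_id = "UCTdw38Cw6jcm0atBPA39a0Q"

# dump every raw dictionary Scrapetube yields so we can inspect
# the available keys and their nested structure
for video in scrapetube.get_channel(channel_id):
    print(json.dumps(video, indent=2))
    print("-" * 40)
```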

Start the script and kill it once a few results have been printed. From this output we can work out the dictionary and list accessors we need to extract the data we are interested in.

 

The YouTube data extractor

After a few rounds of experimenting with the return values of Scrapetube, I ended up with this extractor tool.

First, we need to import the libraries we use:
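A minimal set for the sketches in this post (the original extractor may import a few more):

```python
import pandas as pd
import scrapetube
```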

The duration of the talk is in the format “hours:minutes:seconds”, which is a pain to work with in Pandas. Therefore, we convert the duration into minutes and round the seconds up or down to a full minute:
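A sketch of such a conversion; the helper name to_minutes() is my own, and the input may also come as “minutes:seconds” for shorter videos:

```python
def to_minutes(duration: str) -> int:
    """Convert a duration like '1:02:33' or '58:47' into full minutes."""
    parts = [int(part) for part in duration.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours (and minutes)
    hours, minutes, seconds = parts
    total = hours * 60 + minutes
    if seconds >= 30:  # round 30 seconds or more up to the next minute
        total += 1
    return total
```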

The format of the title changed over time. We can start with “title – speaker – conference” and then try to match partial results to something useful:
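A best-effort sketch of such a parser; the helper name split_title() and the separator " - " are assumptions, and real titles will need more care:

```python
def split_title(title: str) -> tuple[str, str, str]:
    """Split 'title - speaker - conference' into its parts (best effort)."""
    parts = [part.strip() for part in title.split(" - ")]
    talk = parts[0]
    speaker = parts[1] if len(parts) > 1 else ""
    conference = parts[-1] if len(parts) > 2 else ""
    return talk, speaker, conference
```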

With the helper methods in place, we can use Scrapetube to fetch the videos from the NDC channel, extract the data, create a DataFrame in Pandas, and export it as a CSV file:
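A sketch of that loop, using the helpers from above; the nested keys for the title and the duration, as well as the file name ndc_talks.csv, are assumptions:

```python
channel_id = "UCTdw38Cw6jcm0atBPA39a0Q"

rows = []
for video in scrapetube.get_channel(channel_id):
    length = video.get("lengthText")
    if not length:
        continue  # skip entries without a duration (e.g. upcoming premieres)

    title = video["title"]["runs"][0]["text"]
    talk, speaker, conference = split_title(title)

    rows.append({
        "video_id": video["videoId"],
        "title": talk,
        "speaker": speaker,
        "conference": conference,
        "minutes": to_minutes(length["simpleText"]),
    })

df = pd.DataFrame(rows)
df.to_csv("ndc_talks.csv", index=False)
```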

When we run the extractor, we get a CSV file with around 3200 lines. We can check the conferences it could match with this code:
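For example with value_counts() on the conference column (the column and file names follow the sketch above):

```python
import pandas as pd

df = pd.read_csv("ndc_talks.csv")

# how many videos were matched to each conference?
print(df["conference"].value_counts())
```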

As expected, the many changes to the title format will require a lot of manual work to match each video to the right conference. Nevertheless, for the conferences of the last few years the matching worked rather well, and we should be able to fix the data in a reasonable time frame.

 

Next

We started with our thesis of the shorter talks and found a data source that can (dis-)prove our assumption. With the help of Scrapetube we got the data from YouTube into a CSV file that we can turn into the basis of our analysis next week.
