It is time for a hands-on walkthrough to collect data and turn it into useful plots. For this exercise we are going to collect metadata from YouTube and run it through Seaborn to figure out if my hunch is correct or not.
This post is part of my journey to learn Python. You can find the code for this post in my PythonFriday repository on GitHub.
Thesis: The talks at NDC Oslo 2023 were shorter than usual
I was not happy with the length of the talks I attended at this year’s NDC Oslo conference. Whenever I checked their YouTube channel, I saw more and more talks that were significantly short of their 60-minute slot. Am I just ignoring all evidence that would prove my gut feeling wrong and letting confirmation bias get the better of me, or can I prove my point with data?
If I am right, I expect to find these 3 measurable results for talks at NDC Oslo 2023 compared with the previous years:
- The average duration is lower.
- There are more talks below 50 minutes.
- The median duration is lower.
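The three checks boil down to three small calculations per conference. The following is only a minimal sketch in plain Python: the durations are made-up sample values (not real NDC data), and the helper name `summarize` is hypothetical.

```python
from statistics import mean, median

# Made-up durations in minutes, only to show the three calculations.
durations = {
    'NDC Oslo 2022': [58, 61, 55, 60],
    'NDC Oslo 2023': [47, 52, 44, 59],
}


def summarize(minutes):
    """Return average, median and number of talks below 50 minutes."""
    return {
        'average': mean(minutes),
        'median': median(minutes),
        'below_50': sum(1 for m in minutes if m < 50),
    }


summary = {conference: summarize(minutes) for conference, minutes in durations.items()}
for conference, stats in summary.items():
    print(conference, stats)
```

If the thesis holds, the real data for 2023 should show a lower average, a lower median, and a higher count of talks below 50 minutes than the earlier years.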
YouTube as our data source
NDC Conferences publish the talks of their conferences on their YouTube channel. While there is a bit of advertising before and after the recorded talks, this overhead is negligible, and we can use the video length as a proxy measurement for the talk duration.
On the channel we not only find talks, but also promo videos for their conferences and workshops. With currently 3301 videos, we should find enough data points to (dis-)prove the thesis.
How to extract the metadata from YouTube?
There are 3 main approaches we can use for scraping YouTube:
- We use the official API and iterate through the videos in the channel.
- We automate a browser (with Selenium or Playwright), scroll through the endless-scroll page, and extract the data with Beautiful Soup.
- We use a dedicated scraping tool for YouTube.
The API seems not to offer the metadata I am interested in, and the browser automation was already the topic of many posts in the past. Therefore, it is time for option 3.
Install Scrapetube
We can install Scrapetube with this command:
    pip install scrapetube
To run the examples, we need the Id of the NDC channel. We find the Id on the About page of the channel, where we can copy it through the share icon.
This should give us an Id like this one:
UCTdw38Cw6jcm0atBPA39a0Q
To check if everything is in place, we can run this example code with the Id we extracted:
    import scrapetube

    videos = scrapetube.get_channel("UCTdw38Cw6jcm0atBPA39a0Q")

    for video in videos:
        print(video['videoId'])
This should give us a list of video Ids:
tKnNqftbT4Q
1DuxTlvmaNM
AYU0vw6IyUY
1K1LRzXeO_4
iDWwoz9ZUzw
I2NqMbKc8K4
cae0jXRrNno
vXLz_U4xzkc
Nyj2O8gHJZM
Fight with Scrapetube
While Scrapetube has a small API, an example on how to access the return values is nowhere to be found. We can overcome this obstacle with this little helper code to extract the keys and values of the result set:
    videos = scrapetube.get_channel("UCTdw38Cw6jcm0atBPA39a0Q")
    for video in videos:
        for key, value in video.items():
            print(f"{key}: {value}")
Start it and then kill the program when you have a few results like these ones:
videoId: 1DuxTlvmaNM thumbnail: {'thumbnails': [{'url': 'https://i.ytimg.com/vi/1DuxTlvmaNM/hqdefault.jpg?sqp=-oaymwEbCKgBEF5IVfKriqkDDggBFQAAiEIYAXABwAEG&rs=AOn4CLDFygXFvyqgCo3zDNEkjqejIaVRhA', 'width': 168, 'height': 94}, {'url': 'https://i.ytimg.com/vi/1DuxTlvmaNM/hqdefault.jpg?sqp=-oaymwEbCMQBEG5IVfKriqkDDggBFQAAiEIYAXABwAEG&rs=AOn4CLCK6T9WZDQY9Zi1XE9umML2TkXXuw', 'width': 196, 'height': 110}, {'url': 'https://i.ytimg.com/vi/1DuxTlvmaNM/hqdefault.jpg?sqp=-oaymwEcCPYBEIoBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLABQ_abyMnk0SM2K_4AyEwQ57zPyg', 'width': 246, 'height': 138}, {'url': 'https://i.ytimg.com/vi/1DuxTlvmaNM/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLCBR0RZumYJa6w6Lb_fL_DLiTcBOg', 'width': 336, 'height': 188}]} title: {'runs': [{'text': 'Managing Kubernetes the GitOps way with Flux - Jeff French - NDC Oslo 2023'}], 'accessibility': {'accessibilityData': {'label': 'Managing Kubernetes the GitOps way with Flux - Jeff French - NDC Oslo 2023 by NDC Conferences 4 days ago 1 hour 1,003 views'}}} descriptionSnippet: {'runs': [{'text': 'In this session, we will explore the benefits and best practices of using the Flux GitOps framework for managing Kubernetes clusters. 
We will start with a brief overview of GitOps and the Flux...'}]} publishedTimeText: {'simpleText': '4 days ago'} lengthText: {'accessibility': {'accessibilityData': {'label': '1 hour, 38 seconds'}}, 'simpleText': '1:00:38'} viewCountText: {'simpleText': '1,003 views'} navigationEndpoint: {'clickTrackingParams': 'CO0BENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZWhhVQ1RkdzM4Q3c2amNtMGF0QlBBMzlhMFGaAQYQ8jgY4Ac=', 'commandMetadata': {'webCommandMetadata': {'url': '/watch?v=1DuxTlvmaNM', 'webPageType': 'WEB_PAGE_TYPE_WATCH', 'rootVe': 3832}}, 'watchEndpoint': {'videoId': '1DuxTlvmaNM', 'watchEndpointSupportedOnesieConfig': {'html5PlaybackOnesieConfig': {'commonConfig': {'url': 'https://rr3---sn-1gi7znes.googlevideo.com/initplayback?source=youtube&oeis=1&c=WEB&oad=3200&ovd=3200&oaad=11000&oavd=11000&ocs=700&oewis=1&oputc=1&ofpcc=1&msp=1&odepv=1&id=d43bb14e5be668d3&ip=213.142.178.228&initcwndbps=1770000&mt=1689624085&oweuc=&pxtags=Cg4KAnR4Egg0NTE3MDA1OA&rxtags=Cg4KAnR4Egg0NTE3MDA1OA%2CCg4KAnR4Egg0NTE3MDA1OQ'}}}}} trackingParams: CO0BENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZQNPRmd_lqeyd1AE= showActionMenu: False shortViewCountText: {'accessibility': {'accessibilityData': {'label': '1K views'}}, 'simpleText': '1K views'} menu: {'menuRenderer': {'items': [{'menuServiceItemRenderer': {'text': {'runs': [{'text': 'Add to queue'}]}, 'icon': {'iconType': 'ADD_TO_QUEUE_TAIL'}, 'serviceEndpoint': {'clickTrackingParams': 'CPIBEP6YBBgFIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True}}, 'signalServiceEndpoint': {'signal': 'CLIENT_SIGNAL', 'actions': [{'clickTrackingParams': 'CPIBEP6YBBgFIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'addToPlaylistCommand': {'openMiniplayer': True, 'videoId': '1DuxTlvmaNM', 'listType': 'PLAYLIST_EDIT_LIST_TYPE_QUEUE', 'onCreateListCommand': {'clickTrackingParams': 'CPIBEP6YBBgFIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': '/youtubei/v1/playlist/create'}}, 'createPlaylistServiceEndpoint': 
{'videoIds': ['1DuxTlvmaNM'], 'params': 'CAQ%3D'}}, 'videoIds': ['1DuxTlvmaNM']}}]}}, 'trackingParams': 'CPIBEP6YBBgFIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}, {'menuServiceItemDownloadRenderer': {'serviceEndpoint': {'clickTrackingParams': 'CPEBENGqBRgGIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'offlineVideoEndpoint': {'videoId': '1DuxTlvmaNM', 'onAddCommand': {'clickTrackingParams': 'CPEBENGqBRgGIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'getDownloadActionCommand': {'videoId': '1DuxTlvmaNM', 'params': 'CAI%3D'}}}}, 'trackingParams': 'CPEBENGqBRgGIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}, {'menuServiceItemRenderer': {'text': {'runs': [{'text': 'Share'}]}, 'icon': {'iconType': 'SHARE'}, 'serviceEndpoint': {'clickTrackingParams': 'CO0BENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': '/youtubei/v1/share/get_share_panel'}}, 'shareEntityServiceEndpoint': {'serializedShareEntity': 'CgsxRHV4VGx2bWFOTQ%3D%3D', 'commands': [{'clickTrackingParams': 'CO0BENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'openPopupAction': {'popup': {'unifiedSharePanelRenderer': {'trackingParams': 'CPABEI5iIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'showLoadingSpinner': True}}, 'popupType': 'DIALOG', 'beReused': True}}]}}, 'trackingParams': 'CO0BENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}], 'trackingParams': 'CO0BENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'accessibility': {'accessibilityData': {'label': 'Action menu'}}}} thumbnailOverlays: [{'thumbnailOverlayTimeStatusRenderer': {'text': {'accessibility': {'accessibilityData': {'label': '1 hour, 38 seconds'}}, 'simpleText': '1:00:38'}, 'style': 'DEFAULT'}}, {'thumbnailOverlayToggleButtonRenderer': {'isToggled': False, 'untoggledIcon': {'iconType': 'WATCH_LATER'}, 'toggledIcon': {'iconType': 'CHECK'}, 'untoggledTooltip': 'Watch later', 'toggledTooltip': 'Added', 'untoggledServiceEndpoint': {'clickTrackingParams': 'CO8BEPnnAxgBIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': '/youtubei/v1/browse/edit_playlist'}}, 
'playlistEditEndpoint': {'playlistId': 'WL', 'actions': [{'addedVideoId': '1DuxTlvmaNM', 'action': 'ACTION_ADD_VIDEO'}]}}, 'toggledServiceEndpoint': {'clickTrackingParams': 'CO8BEPnnAxgBIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': '/youtubei/v1/browse/edit_playlist'}}, 'playlistEditEndpoint': {'playlistId': 'WL', 'actions': [{'action': 'ACTION_REMOVE_VIDEO_BY_VIDEO_ID', 'removedVideoId': '1DuxTlvmaNM'}]}}, 'untoggledAccessibility': {'accessibilityData': {'label': 'Watch later'}}, 'toggledAccessibility': {'accessibilityData': {'label': 'Added'}}, 'trackingParams': 'CO8BEPnnAxgBIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}, {'thumbnailOverlayToggleButtonRenderer': {'untoggledIcon': {'iconType': 'ADD_TO_QUEUE_TAIL'}, 'toggledIcon': {'iconType': 'PLAYLIST_ADD_CHECK'}, 'untoggledTooltip': 'Add to queue', 'toggledTooltip': 'Added', 'untoggledServiceEndpoint': {'clickTrackingParams': 'CO4BEMfsBBgCIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True}}, 'signalServiceEndpoint': {'signal': 'CLIENT_SIGNAL', 'actions': [{'clickTrackingParams': 'CO4BEMfsBBgCIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'addToPlaylistCommand': {'openMiniplayer': True, 'videoId': '1DuxTlvmaNM', 'listType': 'PLAYLIST_EDIT_LIST_TYPE_QUEUE', 'onCreateListCommand': {'clickTrackingParams': 'CO4BEMfsBBgCIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': '/youtubei/v1/playlist/create'}}, 'createPlaylistServiceEndpoint': {'videoIds': ['1DuxTlvmaNM'], 'params': 'CAQ%3D'}}, 'videoIds': ['1DuxTlvmaNM']}}]}}, 'untoggledAccessibility': {'accessibilityData': {'label': 'Add to queue'}}, 'toggledAccessibility': {'accessibilityData': {'label': 'Added'}}, 'trackingParams': 'CO4BEMfsBBgCIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}, {'thumbnailOverlayNowPlayingRenderer': {'text': {'runs': [{'text': 'Now playing'}]}}}] videoId: AYU0vw6IyUY thumbnail: {'thumbnails': [{'url': 
'https://i.ytimg.com/vi/AYU0vw6IyUY/hqdefault.jpg?sqp=-oaymwEbCKgBEF5IVfKriqkDDggBFQAAiEIYAXABwAEG&rs=AOn4CLCU_3rIVvaQzXp6F0Ef7fcvIZx5xw', 'width': 168, 'height': 94}, {'url': 'https://i.ytimg.com/vi/AYU0vw6IyUY/hqdefault.jpg?sqp=-oaymwEbCMQBEG5IVfKriqkDDggBFQAAiEIYAXABwAEG&rs=AOn4CLA6Q-O5wWBod6W3PXE8JMS1484hcQ', 'width': 196, 'height': 110}, {'url': 'https://i.ytimg.com/vi/AYU0vw6IyUY/hqdefault.jpg?sqp=-oaymwEcCPYBEIoBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLCD08KENYaqx17vmC0K3W2ks2cdHQ', 'width': 246, 'height': 138}, {'url': 'https://i.ytimg.com/vi/AYU0vw6IyUY/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLDMFCdIroiTRm6rjV9bg8Q3kEBcAA', 'width': 336, 'height': 188}]} title: {'runs': [{'text': 'Make a great-looking 3D landscape visualization! - Kristoffer Dyrkorn - NDC Oslo 2023'}], 'accessibility': {'accessibilityData': {'label': 'Make a great-looking 3D landscape visualization! - Kristoffer Dyrkorn - NDC Oslo 2023 by NDC Conferences 4 days ago 59 minutes 1,065 views'}}} descriptionSnippet: {'runs': [{'text': 'In this session you will learn all you need to create your own 3D landscape visualizations in the browser. 
I will go through:\n\n- where to get free, high quality, data\n- how to preprocess and...'}]} publishedTimeText: {'simpleText': '4 days ago'} lengthText: {'accessibility': {'accessibilityData': {'label': '59 minutes, 6 seconds'}}, 'simpleText': '59:06'} viewCountText: {'simpleText': '1,065 views'} navigationEndpoint: {'clickTrackingParams': 'COYBENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZWhhVQ1RkdzM4Q3c2amNtMGF0QlBBMzlhMFGaAQYQ8jgY4Ac=', 'commandMetadata': {'webCommandMetadata': {'url': '/watch?v=AYU0vw6IyUY', 'webPageType': 'WEB_PAGE_TYPE_WATCH', 'rootVe': 3832}}, 'watchEndpoint': {'videoId': 'AYU0vw6IyUY', 'watchEndpointSupportedOnesieConfig': {'html5PlaybackOnesieConfig': {'commonConfig': {'url': 'https://rr1---sn-1gi7znek.googlevideo.com/initplayback?source=youtube&oeis=1&c=WEB&oad=3200&ovd=3200&oaad=11000&oavd=11000&ocs=700&oewis=1&oputc=1&ofpcc=1&msp=1&odepv=1&id=018534bf0e88c946&ip=213.142.178.228&initcwndbps=1770000&mt=1689624085&oweuc=&pxtags=Cg4KAnR4Egg0NTE3MDA1OA&rxtags=Cg4KAnR4Egg0NTE3MDA1OA%2CCg4KAnR4Egg0NTE3MDA1OQ'}}}}} trackingParams: COYBENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZQMaSo_Twl83CAQ== showActionMenu: False shortViewCountText: {'accessibility': {'accessibilityData': {'label': '1K views'}}, 'simpleText': '1K views'} menu: {'menuRenderer': {'items': [{'menuServiceItemRenderer': {'text': {'runs': [{'text': 'Add to queue'}]}, 'icon': {'iconType': 'ADD_TO_QUEUE_TAIL'}, 'serviceEndpoint': {'clickTrackingParams': 'COsBEP6YBBgFIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True}}, 'signalServiceEndpoint': {'signal': 'CLIENT_SIGNAL', 'actions': [{'clickTrackingParams': 'COsBEP6YBBgFIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'addToPlaylistCommand': {'openMiniplayer': True, 'videoId': 'AYU0vw6IyUY', 'listType': 'PLAYLIST_EDIT_LIST_TYPE_QUEUE', 'onCreateListCommand': {'clickTrackingParams': 'COsBEP6YBBgFIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': '/youtubei/v1/playlist/create'}}, 
'createPlaylistServiceEndpoint': {'videoIds': ['AYU0vw6IyUY'], 'params': 'CAQ%3D'}}, 'videoIds': ['AYU0vw6IyUY']}}]}}, 'trackingParams': 'COsBEP6YBBgFIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}, {'menuServiceItemDownloadRenderer': {'serviceEndpoint': {'clickTrackingParams': 'COoBENGqBRgGIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'offlineVideoEndpoint': {'videoId': 'AYU0vw6IyUY', 'onAddCommand': {'clickTrackingParams': 'COoBENGqBRgGIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'getDownloadActionCommand': {'videoId': 'AYU0vw6IyUY', 'params': 'CAI%3D'}}}}, 'trackingParams': 'COoBENGqBRgGIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}, {'menuServiceItemRenderer': {'text': {'runs': [{'text': 'Share'}]}, 'icon': {'iconType': 'SHARE'}, 'serviceEndpoint': {'clickTrackingParams': 'COYBENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': '/youtubei/v1/share/get_share_panel'}}, 'shareEntityServiceEndpoint': {'serializedShareEntity': 'CgtBWVUwdnc2SXlVWQ%3D%3D', 'commands': [{'clickTrackingParams': 'COYBENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'openPopupAction': {'popup': {'unifiedSharePanelRenderer': {'trackingParams': 'COkBEI5iIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'showLoadingSpinner': True}}, 'popupType': 'DIALOG', 'beReused': True}}]}}, 'trackingParams': 'COYBENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}], 'trackingParams': 'COYBENwwIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'accessibility': {'accessibilityData': {'label': 'Action menu'}}}} thumbnailOverlays: [{'thumbnailOverlayTimeStatusRenderer': {'text': {'accessibility': {'accessibilityData': {'label': '59 minutes, 6 seconds'}}, 'simpleText': '59:06'}, 'style': 'DEFAULT'}}, {'thumbnailOverlayToggleButtonRenderer': {'isToggled': False, 'untoggledIcon': {'iconType': 'WATCH_LATER'}, 'toggledIcon': {'iconType': 'CHECK'}, 'untoggledTooltip': 'Watch later', 'toggledTooltip': 'Added', 'untoggledServiceEndpoint': {'clickTrackingParams': 'COgBEPnnAxgBIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': 
'/youtubei/v1/browse/edit_playlist'}}, 'playlistEditEndpoint': {'playlistId': 'WL', 'actions': [{'addedVideoId': 'AYU0vw6IyUY', 'action': 'ACTION_ADD_VIDEO'}]}}, 'toggledServiceEndpoint': {'clickTrackingParams': 'COgBEPnnAxgBIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': '/youtubei/v1/browse/edit_playlist'}}, 'playlistEditEndpoint': {'playlistId': 'WL', 'actions': [{'action': 'ACTION_REMOVE_VIDEO_BY_VIDEO_ID', 'removedVideoId': 'AYU0vw6IyUY'}]}}, 'untoggledAccessibility': {'accessibilityData': {'label': 'Watch later'}}, 'toggledAccessibility': {'accessibilityData': {'label': 'Added'}}, 'trackingParams': 'COgBEPnnAxgBIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}, {'thumbnailOverlayToggleButtonRenderer': {'untoggledIcon': {'iconType': 'ADD_TO_QUEUE_TAIL'}, 'toggledIcon': {'iconType': 'PLAYLIST_ADD_CHECK'}, 'untoggledTooltip': 'Add to queue', 'toggledTooltip': 'Added', 'untoggledServiceEndpoint': {'clickTrackingParams': 'COcBEMfsBBgCIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True}}, 'signalServiceEndpoint': {'signal': 'CLIENT_SIGNAL', 'actions': [{'clickTrackingParams': 'COcBEMfsBBgCIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'addToPlaylistCommand': {'openMiniplayer': True, 'videoId': 'AYU0vw6IyUY', 'listType': 'PLAYLIST_EDIT_LIST_TYPE_QUEUE', 'onCreateListCommand': {'clickTrackingParams': 'COcBEMfsBBgCIhMIp_vlyoyZgAMVbswRCB3VdQcZ', 'commandMetadata': {'webCommandMetadata': {'sendPost': True, 'apiUrl': '/youtubei/v1/playlist/create'}}, 'createPlaylistServiceEndpoint': {'videoIds': ['AYU0vw6IyUY'], 'params': 'CAQ%3D'}}, 'videoIds': ['AYU0vw6IyUY']}}]}}, 'untoggledAccessibility': {'accessibilityData': {'label': 'Add to queue'}}, 'toggledAccessibility': {'accessibilityData': {'label': 'Added'}}, 'trackingParams': 'COcBEMfsBBgCIhMIp_vlyoyZgAMVbswRCB3VdQcZ'}}, {'thumbnailOverlayNowPlayingRenderer': {'text': {'runs': [{'text': 'Now playing'}]}}}] |
From this output we can work with dictionary and list accessors to extract the data in which we are interested.
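To illustrate the accessors, here is a small sketch that works on a trimmed copy of the first entry from the output above. Only the keys we care about are kept, so it runs without Scrapetube or a network connection. The use of `.get()` for the publishing date is my own defensive assumption, not something the output requires.

```python
# Trimmed copy of one entry from the output above; all other keys are left out.
video = {
    'videoId': '1DuxTlvmaNM',
    'title': {'runs': [{'text': 'Managing Kubernetes the GitOps way with Flux - Jeff French - NDC Oslo 2023'}]},
    'publishedTimeText': {'simpleText': '4 days ago'},
    'lengthText': {'simpleText': '1:00:38'},
    'navigationEndpoint': {'commandMetadata': {'webCommandMetadata': {'url': '/watch?v=1DuxTlvmaNM'}}},
}

# Nested dictionaries are accessed key by key, lists by index.
title = video['title']['runs'][0]['text']
length = video['lengthText']['simpleText']
url = video['navigationEndpoint']['commandMetadata']['webCommandMetadata']['url']

# .get() returns a fallback instead of raising a KeyError when a key is missing
# (an assumption for robustness; the entries shown above all have this key).
published = video.get('publishedTimeText', {}).get('simpleText', 'Unknown')

print(title, length, published, url)
```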
The YouTube data extractor
After a few rounds of experimenting with the return values of Scrapetube, I ended up with this extractor tool.
First, we need to import the libraries we use:
    import scrapetube
    import datetime as dt
    import pandas as pd
The duration of the talk is in the format “hours:minutes:seconds” (or “minutes:seconds” for videos shorter than an hour), which would be a pain to work with in Pandas. Therefore, we convert the duration into minutes and round the seconds up or down to a full minute:
    def duration_in_minutes(duration):
        # One colon means "minutes:seconds", two colons mean "hours:minutes:seconds".
        duration_parts = duration.count(':')
        if(duration_parts == 1):
            temp_time = dt.datetime.strptime(duration, '%M:%S')
            # Round up to the next minute from 30 seconds on.
            extra_minute = 1 if temp_time.second >= 30 else 0
            duration_minutes = temp_time.minute + extra_minute
            return duration_minutes
        else:
            temp_time = dt.datetime.strptime(duration, '%H:%M:%S')
            extra_minute = 1 if temp_time.second >= 30 else 0
            duration_minutes = temp_time.hour * 60 + temp_time.minute + extra_minute
            return duration_minutes
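The same rounding can also be done without `strptime`. The following is an alternative sketch, not the code used for the extractor: the function name `to_minutes` is hypothetical, and it assumes the duration only consists of numbers separated by colons.

```python
def to_minutes(duration):
    """Convert 'M:SS' or 'H:MM:SS' into minutes, rounding half a minute up."""
    seconds = 0
    # Each part is worth 60 times the part to its right.
    for part in duration.split(':'):
        seconds = seconds * 60 + int(part)
    # Adding 30 seconds before the integer division rounds to the nearest minute.
    return (seconds + 30) // 60


for text in ['1:00:38', '59:06', '59:30']:
    print(text, '->', to_minutes(text))
```

Because it works on the total number of seconds, this variant handles both formats with the same code path.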
The format of the title changed over time. We can start with “title – speaker – conference” and then try to match partial results to something useful:
    def extract_title(title_all, last_conference):
        title = 'Unknown'
        speaker = 'Unknown'
        conference = 'Unknown'

        counts = title_all.count(' - ')
        # "title - speaker"
        if(counts == 1):
            title, speaker = title_all.split(' - ')

        # "title - speaker - conference", or a title that contains a dash
        if(counts == 2):
            title, speaker, conference = title_all.split(' - ')
            if('NDC' not in conference.upper()):
                a, b, speaker = title_all.split(' - ')
                title = f"{a} - {b}"
                conference = 'Unknown'

        # "title part a - title part b - speaker - conference"
        if (counts == 3):
            titlea, titleb, speaker, conference = title_all.split(' - ')
            title = titlea + ' - ' + titleb

        # No separator at all: reuse the conference of the previous video
        if(counts == 0):
            title = title_all
            conference = last_conference
        else:
            last_conference = conference

        return title, speaker, conference, last_conference
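To show how the separator-based matching behaves on different title shapes, here is a simplified alternative sketch. It is not the extractor from this post: the name `split_title` is hypothetical, it splits from the right with `rsplit` so that dashes inside the title stay intact, and it does not track the last conference. Apart from the first title, the sample titles are made up.

```python
def split_title(title_all):
    """Split 'title - speaker - conference' from the right side."""
    parts = title_all.rsplit(' - ', 2)
    # Three parts with a conference-like suffix: everything matched.
    if len(parts) == 3 and 'NDC' in parts[2].upper():
        return parts[0], parts[1], parts[2]
    # No conference suffix: treat the last part as the speaker.
    if len(parts) >= 2:
        title, speaker = title_all.rsplit(' - ', 1)
        return title, speaker, 'Unknown'
    # No separator at all: we only know the title.
    return title_all, 'Unknown', 'Unknown'


samples = [
    'Managing Kubernetes the GitOps way with Flux - Jeff French - NDC Oslo 2023',
    'Part A - Part B - Some Speaker - NDC London 2023',
    'Part A - Part B - Some Speaker',
    'A video without separators',
]
for sample in samples:
    print(split_title(sample))
```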
With the helper methods in place, we can use Scrapetube to fetch the videos from the NDC channel, extract the data, create a DataFrame in Pandas, and export it as a CSV file:
    talks = []
    last_conference = 'Unknown'

    videos = scrapetube.get_channel("UCTdw38Cw6jcm0atBPA39a0Q")
    for video in videos:
        videoId = video['videoId']
        published = video['publishedTimeText']['simpleText']
        title_all = video['title']['runs'][0]['text']
        title, speaker, conference, last_conference = extract_title(title_all, last_conference)
        length_text = video['lengthText']['simpleText']
        length_minutes = duration_in_minutes(length_text)
        url = video['navigationEndpoint']['commandMetadata']['webCommandMetadata']['url']
        talks.append([published, 'talk', conference, title, speaker, length_text, length_minutes, f"https://youtube.com{url}"])
        print(f"{published} = {title} = {speaker} = {conference} = {length_text} = {length_minutes} = https://youtube.com{url}")

    column_names=['Published','Type','Conference','Title','Speaker','Duration', 'DurationInMinutes', 'Link']

    stats_df = pd.DataFrame(talks, columns=column_names)
    stats_df.to_csv('ndc_talks_youtube.csv', index=False)
When we run the extractor, we get a CSV file with around 3200 lines. We can check the conferences it could match with this code:
    import pandas as pd

    df = pd.read_csv('ndc_talks_youtube.csv')
    print(df['Conference'].value_counts())

    Conference
    Unknown            2200
    NDC Oslo 2023       130
    NDC Oslo 2021       112
    NDC Oslo 2022       103
    NDC London 2023      92
As expected, the many changes of the title format will require a lot of manual work to match it to the right conference. Nevertheless, for the conferences in the last few years it matched rather well, and we should be able to fix the data in a reasonable time frame.
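One way to reduce the manual work could be to search the raw video title for a conference name with a regular expression. This is only a sketch under a strong assumption: that the conference always appears as "NDC", a one-word city, and a four-digit year. Older title formats may not follow this pattern. The helper name `find_conference` and the last two sample titles are hypothetical.

```python
import re

# Assumed naming pattern: "NDC", a one-word city, and a four-digit year.
CONFERENCE_PATTERN = re.compile(r'NDC\s+\w+\s+\d{4}')


def find_conference(raw_title):
    """Return the first conference-like match or 'Unknown'."""
    match = CONFERENCE_PATTERN.search(raw_title)
    return match.group(0) if match else 'Unknown'


titles = [
    'Managing Kubernetes the GitOps way with Flux - Jeff French - NDC Oslo 2023',
    'A hypothetical keynote at NDC London 2023',
    'A hypothetical promo video',
]
for raw_title in titles:
    print(find_conference(raw_title), '<-', raw_title)
```

Whatever the pattern cannot match still ends up as 'Unknown' and needs a manual check.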
Next
We started with our thesis of the shorter talks and found a data source that can (dis-)prove our assumption. With the help of Scrapetube we got the data from YouTube into a CSV file that will be the source of our analysis work next week.