Web crawlers are still a hot topic. More and more data is on the web; unfortunately, it is often not in the form we need. Today we look at requests, the elegant HTTP library for Python that gives us a good starting point for interacting with web applications.
This post is part of my journey to learn Python. You can find the other parts of this series here. The code for this post is in my PythonFriday repository on GitHub.
Install requests
We can install the newest version of requests with this command:
```
pip install -U requests
```
There are a lot of special edge cases and tweaks to make requests work the way certain web applications expect. Should you run into some unique requirement, take a look at the extensive official documentation. You will most likely find a code example that handles your edge case.
To explore requests, you need a web application that can respond to the HTTP requests you send. Kenneth Reitz not only created requests, but he also made the demo application httpbin.org. You can use the online version or run the Docker image locally while you test requests. The code for this Flask app is on GitHub and it is a great help when you want to figure out how the different requests get handled by Flask.
GET requests
We can create a GET request with the get() method that will call the web application on our behalf. The request results in a response object, which has the content of the URL in the .text property:
```
>>> import requests
>>> r = requests.get('https://jgraber.ch')
>>> r.text
'<!DOCTYPE html>\r\n<html lang='...
```
We can ask our response object for the status code to check if it was successful:
```
>>> r.status_code
200
>>> r.status_code == requests.codes.ok
True
```
Query strings
Encoding parameters in the URL for query strings is often a pain. With requests we can create a dictionary with the values we need and pass it to the get() method:
```
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get('https://httpbin.org/get', params=payload)
>>> r.url
'https://httpbin.org/get?key1=value1&key2=value2'
```
The application at httpbin.org returns JSON, which we can parse with the json() method:
```
>>> r.json()
{
  'args': {
    'key1': 'value1',
    'key2': 'value2'
  },
  'headers': {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'httpbin.org',
    'User-Agent': 'python-requests/2.27.1',
    'X-Amzn-Trace-Id': 'Root=1-63068b8d-12d628e31969095f64f294be'
  },
  'origin': '213.142.178.228',
  'url': 'https://httpbin.org/get?key1=value1&key2=value2'
}
```
HTTP headers
If we need to send special headers, we can create a dictionary and pass it to the get() method in a similar way as we passed the query strings:
```
>>> headers = {'user-agent': 'my-app/0.0.1'}
>>> r = requests.get('https://httpbin.org/get', headers=headers)
>>> r.json()
{
  'args': {},
  'headers': {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'httpbin.org',
    'User-Agent': 'my-app/0.0.1',
    'X-Amzn-Trace-Id': 'Root=1-63068c4d-069d88c374a297341d678835'
  },
  'origin': '213.142.178.228',
  'url': 'https://httpbin.org/get'
}
```
If we compare this output with the one above, we can see that the “User-Agent” header uses our supplied value.
We can access the headers of the response object to figure out what data we got back:
```
>>> r = requests.get('https://httpbin.org/get')
>>> r.headers
{
  'Date': 'Wed, 24 Aug 2022 20:53:40 GMT',
  'Content-Type': 'application/json',
  'Content-Length': '309',
  'Connection': 'keep-alive',
  'Server': 'gunicorn/19.9.0',
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Credentials': 'true'
}
>>> r.headers['Server']
'gunicorn/19.9.0'
>>> r.headers['Content-Type']
'application/json'
```
Post form data
To send data, we can create a dictionary for the values and pass it as the data argument to the post() method:
```
>>> r = requests.post('https://httpbin.org/post', data={'key': 'value'})
>>> r.json()
{
  'args': {},
  'data': '',
  'files': {},
  'form': {'key': 'value'},
  'headers': {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Length': '9',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'httpbin.org',
    'User-Agent': 'python-requests/2.27.1',
    'X-Amzn-Trace-Id': 'Root=1-63068d0a-7104245d043d455c717e2d99'
  },
  'json': None,
  'origin': '213.142.178.228',
  'url': 'https://httpbin.org/post'
}
```
Follow redirects
Requests follows redirects automatically. If we call my website over HTTP, the server sends a redirect to the HTTPS version. We may miss that redirect because we get the status code of the final response. However, the history property contains the redirect responses:
```
>>> r = requests.get('http://jgraber.ch')
>>> r.status_code == requests.codes.ok
True
>>> r.is_redirect
False
>>> r.status_code
200
>>> r.history
[<Response [301]>]
```
Other HTTP methods
While get() and post() are the most used HTTP methods, requests supports the full range of verbs.
The head() method is great to see if a URL exists without downloading the content:
```
>>> r = requests.head('https://httpbin.org/get')
>>> r.text
''
>>> r.status_code
200
>>> r = requests.head('https://httpbin.org/get5')
>>> r.status_code
404
```
The options() method is useful when you need to find out what methods are supported for a specific URL:
```
>>> r = requests.options('https://httpbin.org/get')
>>> r.text
''
>>> r.headers
{
  'Date': 'Wed, 24 Aug 2022 20:55:07 GMT',
  'Content-Type': 'text/html; charset=utf-8',
  'Content-Length': '0',
  'Connection': 'keep-alive',
  'Server': 'gunicorn/19.9.0',
  'Allow': 'OPTIONS, GET, HEAD',
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Credentials': 'true',
  'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE, PATCH, OPTIONS',
  'Access-Control-Max-Age': '3600'
}
```
If your application supports the PUT and DELETE methods, you can use put() or delete() to modify data:
```
r = requests.put('https://httpbin.org/put', data={'key': 'value'})
r = requests.delete('https://httpbin.org/delete')
```
Next
With requests we can interact with web applications over HTTP(S). Whatever you need to do, if it is part of HTTP, you will find a way to do it with requests. The interesting part of web crawling starts with the data we get back. Next week we find out how we can clean up that HTML and extract what we need.