Last week we covered the basics of automating our browser with Selenium. If you try that on real web pages, you quickly notice that the basics are often not enough. In this post we look at a few helpful tricks to make your Selenium automation more robust.
This post is part of my journey to learn Python. You can find the other parts of this series here. You find the code for this post in my PythonFriday repository on GitHub.
Read out values inside elements
In the last post we called the find_elements() method directly on the browser. However, we can also fetch an element first and then call find_elements() on that specific element:
```python
results_div = driver.find_element(
    by=By.CLASS_NAME, value="results")

results = results_div.find_elements(
    by=By.TAG_NAME, value="h2")

for result in results:
    print(f"* {result.text}")
```
This code sample reads the results div from DuckDuckGo and then selects the H2 elements inside that div. When we run the code, we get something like this:
* Selenium
* WebDriver | Selenium
* Selenium WebDriver Tutorial – javatpoint
* Web elements | Selenium
* What is Selenium WebDriver Architecture? How Does it works? – TOOLSQA
* How to Use Selenium? | Complete Guide to Selenium WebDriver – EDUCBA
* Introduction to Web Scraping using Selenium – Medium
* Web Scraping with Selenium and Python – ScrapFly Blog
* Selenium – Webdriver – tutorialspoint.com
* Complete Selenium WebDriver Tutorial with Examples – LambdaTest
Especially on complex pages, it is an immense help to work in multiple steps and refine the selectors as you go until you reach the desired result. This requires a bit more typing, but it makes your code more maintainable.
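This scoping pattern is not specific to Selenium. As a browser-free illustration, here is the same idea with the standard library's ElementTree and a made-up, well-formed snippet of markup: narrow down to a container first, then query only inside it.

```python
import xml.etree.ElementTree as ET

# Made-up snippet for illustration; a real page would come from the browser.
page = """
<body>
  <h2>Outside the results</h2>
  <div class="results">
    <h2>Selenium</h2>
    <h2>WebDriver | Selenium</h2>
  </div>
</body>
"""

root = ET.fromstring(page)
# First narrow the scope to the container element ...
results_div = root.find(".//div[@class='results']")
# ... then query only inside that container.
titles = [h2.text for h2 in results_div.findall("h2")]
print(titles)  # the h2 outside the div is not included
```

The h2 before the div never shows up in the result, because the second query only looks inside results_div.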
Scroll into position
As long as we only read elements, it does not matter whether they are in the visible part of the browser window. That changes when we try to click them: if the element is not visible, the click will often fail with an error. Should you run into this problem, you can tell the browser to scroll to that element:
```python
more = driver.find_element(
    by=By.CLASS_NAME, value="result--more")

driver.execute_script("arguments[0].scrollIntoView();", more)
more.click()
```
If our element is at the bottom of the page, we can use this approach to scroll all the way down:
```python
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
```
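On pages with infinite scroll, one scroll to the bottom is not enough: new content loads and the page grows. A common approach is to repeat the scroll until the document height stops changing. The helper below is my own sketch of that loop, not part of Selenium; it only assumes the driver has the execute_script() method shown above.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    # Hypothetical helper: scroll repeatedly until the document height
    # stops growing, which suggests no more content is being loaded.
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, we reached the real bottom
        last_height = new_height
    return last_height
```

The max_rounds guard keeps the loop from running forever on pages that never stop loading new content.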
Wait on elements
The implicit wait time is a good general way to tell Selenium to wait a bit whenever we ask it for an element. We can set it directly on the browser instance and specify how many seconds it should wait:
```python
driver.implicitly_wait(1)  # 1 second
```
However, there are elements that we know will take a bit longer to render. To spend the extra time only on those elements and not on everything, we can use an explicit wait:
```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "mySlowElement"))
    )
except TimeoutException as error:
    print(f"{type(error)}: {error}")
```
Our element mySlowElement now has 10 seconds to show up before Selenium throws a TimeoutException:
<class 'selenium.common.exceptions.TimeoutException'>: Message:
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.jsm:12:1
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:192:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.jsm:404:5
element.find/
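Conceptually, an explicit wait is just a polling loop: check a condition, sleep briefly, try again until a timeout expires. Here is a minimal plain-Python sketch of that idea (the names are my own, not Selenium's), which can help when you need a similar wait outside the browser:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    # Minimal sketch of an explicit wait: call `condition` repeatedly
    # until it returns a truthy value or `timeout` seconds have passed.
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll)
```

WebDriverWait works the same way, with the extra twist that it also swallows NoSuchElementException between attempts so a not-yet-rendered element does not abort the wait early.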
Next
These little tricks can help you create much more robust browser automation. Well-structured web pages may not need the extra steps, but more often than not the best data sits on pages that are a mess. Next week we download website statistics from WordPress.com and work around all the JavaScript that the Jetpack extension uses.
Have you tried Playwright? That seems to be the project that will replace Puppeteer.
Hi Nono,
Yes, Playwright is interesting and next week I start a series of blog posts on it.
Regards,
Johnny