Web Scraping Search Results

In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems. Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.

Web scraping is the process of gathering information from the Internet. Even copy-pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation. Some websites don’t like it when automatic scrapers gather their data, while others don’t mind.

You can search and scan Google's search results by hand, but that is time-consuming and tedious. Using a web scraping tool is the easiest and the cheapest way to collect information from Google. Hosted services such as the ScrapeHero Cloud let you scrape Google search result pages for a variety of search terms, for example to gather details from the Google Knowledge Graph, monitor organic and paid search results, or collect news articles. Scraped data is also used to glean actionable intelligence from websites, for instance in equity research. Keep in mind that Google omits results it treats as duplicates, and that it can block the IP address of anyone who attempts to scrape its search results.

Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.

Scraping search results is a specific form of screen scraping or web scraping dedicated to search engines only. Most commonly, larger search engine optimization (SEO) providers depend on regularly scraping keywords from search engines, especially Google, to monitor the competitive position of their customers' websites for relevant keywords or their indexing status.

At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).

Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!

Selenium is useful when you have to perform an action on a website such as:

Clicking on buttons
Filling forms
Scrolling
Taking a screenshot

It is also useful for executing JavaScript code. Let's say that you want to scrape a Single Page Application, and you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.

Installation

We will use Chrome in our example, so make sure you have the following installed on your local machine:

Chrome
Chromedriver
selenium package

To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then run pip install selenium.

Quickstart

Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
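Here is a minimal sketch of that first launch. It assumes Selenium 4 and a Chromedriver that Selenium can locate (on your PATH, or resolved automatically by recent releases). The names open_page and demo, as well as the target URL, are placeholders of ours and not part of Selenium.

```python
def open_page(url):
    """Start a visible Chrome window and load `url`."""
    # Local import: Selenium is only needed once the helper is called.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chromedriver can be located
    driver.get(url)
    return driver


def demo():
    """Hypothetical usage: open a page, print its title, close the browser."""
    driver = open_page("https://news.ycombinator.com/")
    print(driver.title)
    driver.quit()  # always close the browser when you are done
```

Calling demo() opens a real browser window, so run it only on a machine where Chrome is installed.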

This will launch Chrome in headful mode (like regular Chrome, which is controlled by your Python code). You should see a message stating that the browser is controlled by automated software.

You can also run Chrome in headless mode (without any graphical user interface), which is very useful when you run it on a server.
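A headless run could look like the sketch below. The function name fetch_page_headless and the window size are our own choices; the --headless flag is passed to Chrome through the Options object, again assuming Selenium 4.

```python
def fetch_page_headless(url):
    """Load `url` in headless Chrome and return a few facts about the page."""
    # Local imports: Selenium is only needed once the helper is called.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")               # no visible window
    options.add_argument("--window-size=1920,1200")  # assumed viewport size
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return {
            "html": driver.page_source,
            "title": driver.title,
            "final_url": driver.current_url,
        }
    finally:
        driver.quit()  # release the browser even if loading fails
```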

The driver.page_source will return the full page HTML code.

Here are two other interesting WebDriver properties:

driver.title gets the page's title
driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)

Locating Elements

Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).

There are many methods available in the Selenium API to select elements on the page. You can use:

Tag name
Class name
IDs
XPath
CSS selectors

We recently published an article explaining XPath. Don't hesitate to take a look if you aren't familiar with XPath.

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C (or Cmd + Shift + C on macOS) instead of having to right click + inspect each time.

find_element

There are many ways to locate an element in Selenium. Let's say that we want to locate the h1 tag in this HTML:
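The markup below is an illustrative stand-in that we made up for this sketch, and locate_h1 is a hypothetical helper. It uses the find_element(By.X, value) form, because the find_element_by_* shortcuts were removed in Selenium 4.3. The driver is assumed to have already loaded a page containing this markup.

```python
# Illustrative markup, not taken from a real site.
SAMPLE_HTML = """
<html>
  <body>
    <h1 class="someclass" id="greatID">Super title</h1>
  </body>
</html>
"""


def locate_h1(driver):
    """Find the same <h1> through five different locator strategies."""
    # Local import: Selenium is only needed once the helper is called.
    from selenium.webdriver.common.by import By

    return {
        "tag": driver.find_element(By.TAG_NAME, "h1"),
        "class": driver.find_element(By.CLASS_NAME, "someclass"),
        "id": driver.find_element(By.ID, "greatID"),
        "xpath": driver.find_element(By.XPATH, "//h1"),
        "css": driver.find_element(By.CSS_SELECTOR, "h1#greatID"),
    }
```

Each entry in the returned dictionary should point to the same element; which strategy you pick mostly depends on how stable the ID, class or position is on the target site.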

All these methods also have find_elements (note the plural) to return a list of elements.

For example, to get all anchors on a page, use the following:
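A small sketch of that, assuming Selenium 4 (all_links is a name we chose for illustration):

```python
def all_links(driver):
    """Return the href of every anchor on the current page."""
    # Local import: Selenium is only needed once the helper is called.
    from selenium.webdriver.common.by import By

    anchors = driver.find_elements(By.TAG_NAME, "a")
    hrefs = [anchor.get_attribute("href") for anchor in anchors]
    return [href for href in hrefs if href]  # skip anchors without an href
```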

Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).

XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.

WebElement

A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those HTML elements. Here are the most useful:

Accessing the text of the element with the property element.text
Clicking on the element with element.click()
Accessing an attribute with element.get_attribute('class')
Sending text to an input with: element.send_keys('mypassword')

There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.

This can be useful for avoiding honeypots (for example, by not filling hidden inputs).

Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:

This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.

That's a classic honeypot.
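One way to stay clear of such traps is to only type into inputs that is_displayed() reports as visible. The helper below is a sketch under that assumption; fill_visible_inputs is a hypothetical name, and form is expected to be a WebElement for the form you want to fill.

```python
def fill_visible_inputs(form, value):
    """Type `value` into every visible input and return how many were filled."""
    # Local import: Selenium is only needed once the helper is called.
    from selenium.webdriver.common.by import By

    filled = 0
    for field in form.find_elements(By.TAG_NAME, "input"):
        if not field.is_displayed():
            continue  # hidden input: possibly a honeypot, leave it blank
        field.send_keys(value)
        filled += 1
    return filled
```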

Full example

Here is a full example using Selenium API methods we just covered.

We are going to log into Hacker News:

In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.

In order to authenticate we need to:

Go to the login page using driver.get()
Select the username input using driver.find_element() (driver.find_element_by_* in older Selenium releases) and then element.send_keys() to send text to the input
Follow the same process with the password input
Click on the login button using element.click()

Should be easy right? Let's see the code:
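A sketch of those four steps, assuming Selenium 4. The XPath expressions are assumptions about the login page's markup and may need adjusting, and log_in is a name we picked for illustration.

```python
LOGIN_URL = "https://news.ycombinator.com/login"


def log_in(driver, username, password):
    """Fill in the Hacker News login form and submit it."""
    # Local import: Selenium is only needed once the helper is called.
    from selenium.webdriver.common.by import By

    driver.get(LOGIN_URL)
    # The three XPath expressions below are assumptions about the page markup.
    driver.find_element(By.XPATH, "//input[@type='text']").send_keys(username)
    driver.find_element(By.XPATH, "//input[@type='password']").send_keys(password)
    driver.find_element(By.XPATH, "//input[@value='login']").click()
```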

Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?

We could try a couple of things:

Check for an error message (like “Wrong password”)
Check for one element on the page that is only displayed once logged in.

So, we're going to check for the logout button. The logout button has the ID “logout” (easy)!

We can't just check if the element is None because the find_element methods (find_element_by_* in older Selenium releases) raise an exception if the element is not found in the DOM. So we have to use a try/except block and catch the NoSuchElementException exception:
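That check could be sketched as follows, assuming Selenium 4 and the “logout” ID mentioned above (is_logged_in is a hypothetical helper name):

```python
def is_logged_in(driver):
    """Return True if the logout button is present in the DOM."""
    # Local imports: Selenium is only needed once the helper is called.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    try:
        driver.find_element(By.ID, "logout")
    except NoSuchElementException:
        return False  # no logout button, so the login did not succeed
    return True
```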

Taking a screenshot

We could easily take a screenshot using:
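The sketch below uses driver.save_screenshot(); the take_screenshot wrapper, the file name and the window size are our own illustrative choices.

```python
def take_screenshot(driver, path="screenshot.png", width=1920, height=1200):
    """Resize the window, then save a PNG of the current page to `path`."""
    driver.set_window_size(width, height)
    # save_screenshot returns True on success and False if the file
    # could not be written.
    return driver.save_screenshot(path)
```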

Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly. Then, you need to make sure that every asynchronous HTTP call made by the frontend JavaScript code has finished, and that the page is fully rendered.

In our Hacker News case it's simple and we don't have to worry about these issues.

Waiting for an element to be present

Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.

If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:

Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
Use a WebDriverWait object.

If you use a time.sleep() you will probably use an arbitrary value. The problem is, you're either waiting for too long or not enough. Also, the website can load slowly on your local wifi internet connection, but will be 10 times faster on your cloud server. With the WebDriverWait method you will wait only as long as necessary for your element/data to be loaded.
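Here is a sketch of the WebDriverWait approach, assuming Selenium 4. The helper name wait_for_element and its default arguments are placeholders of ours.

```python
def wait_for_element(driver, element_id="mySuperId", timeout=5):
    """Block until the element is present, or until `timeout` seconds pass."""
    # Local imports: Selenium is only needed once the helper is called.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # until() returns the element as soon as the condition holds and
    # raises TimeoutException if it never does within `timeout` seconds.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, element_id))
    )
```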

A WebDriverWait combined with the presence_of_element_located condition can, for example, wait up to five seconds for an element located by the ID “mySuperId” to be loaded. There are many other interesting expected conditions like:

element_to_be_clickable
text_to_be_present_in_element

You can find more information about this in the Selenium documentation.

Executing Javascript

Sometimes, you may need to execute some JavaScript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a bit to see it. You can easily do this with Selenium:
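A minimal sketch using driver.execute_script() (the helper name scroll_to_bottom is ours):

```python
def scroll_to_bottom(driver):
    """Scroll to the bottom of the page and return the new vertical offset."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # A script that uses `return` hands its value back to Python.
    return driver.execute_script("return window.pageYOffset;")
```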

Conclusion

I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don't hesitate to take a look at our general Python web scraping guide.

Selenium is often necessary to extract data from websites using lots of JavaScript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.

Selenium is also an excellent tool to automate almost anything on the web.

If you perform repetitive tasks like filling forms or checking information behind a login form where the website doesn't have an API, it's maybe a good idea to automate it with Selenium, just don't forget this xkcd:

The hardest part about web scraping can be getting to the data you want to scrape.

For example, you might want to scrape data from a search results page for a number of keywords.

You might set up separate scraping projects for each keyword.

However, there are powerful web scrapers that can automate the searching process and scrape the data you want.

Today, we will set up a web scraper to search through a list of keywords and scrape data for each one.

A Free and Powerful Web Scraper

For this project, we will use ParseHub, a free and powerful web scraper that can scrape data from any website. Make sure to download and install ParseHub for free before we get started.

We will also scrape data from Amazon’s search result page for a short list of keywords.

Searching and Scraping Data from a List of Keywords

Now it’s time to set up our project and start scraping data.

Install and Open ParseHub. Click on “New Project” and enter the URL of the website you will be scraping from. In this case, we will scrape data from Amazon.ca. The page will then render inside the app and allow you to start extracting data.
Now, we need to give ParseHub our list of keywords we will be searching through to extract data. To do this, click on the settings icon at the top left and click on “settings”.

Under the “Starting Value” section you can enter your list of keywords either as a CSV file or in JSON format right in the text box below it.

If you’re using a CSV file to upload your keywords, make sure you have a header cell. In this case, it will be the word “keywords”.
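If your keywords already live in a script, you can generate either form with a few lines of Python. This is only a sketch: the helper names and the sample keywords are made up, and the JSON shape (an object with a “keywords” array) is our assumption about what the Starting Value box expects, so check it against your own project.

```python
import csv
import io
import json


def keywords_to_csv(keywords):
    """Return CSV text with a "keywords" header cell and one keyword per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["keywords"])  # the header cell mentioned above
    for keyword in keywords:
        writer.writerow([keyword])
    return buffer.getvalue()


def keywords_to_json(keywords):
    """Return JSON text with the list under a "keywords" key (assumed shape)."""
    return json.dumps({"keywords": list(keywords)})


# Hypothetical sample keywords.
sample = ["laptop stand", "usb-c hub"]
csv_text = keywords_to_csv(sample)
json_text = keywords_to_json(sample)
```

You can save csv_text to a file and upload it, or paste json_text into the text box.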

Once you’ve submitted your list of keywords, click on “Back to Commands” to go back to your project. Click on the PLUS (+) sign next to your “page” selection, click on Advanced and click on the “Loop” command.

By default, your list of keywords will be selected as the list of items to loop through. If not, make sure to select “keywords” from the dropdown.

Click on the PLUS (+) sign next to your “For each item in keywords” selection and choose the “Begin New Entry” command. This command will be named “list1” by default.

Click on the PLUS (+) sign next to your “list1” tool and choose the “Select” command.

With the select command, click directly on the Amazon search bar to select it.

This will create an Input command. Under it, choose “expression” from the dropdown and enter the word “item” in the text box.

Now we will make it so ParseHub adds the keyword for each result next to it. To do this, click on the PLUS (+) sign next to the “list1” command and choose the “extract” command.

Under the extract command, enter the word “item” into the first text box.

Now, let’s tell ParseHub to perform the search for the keywords in the list. Click on the PLUS (+) sign next to your “list1” selection and choose the select command.

Click on the Search Button to select it and rename it to “search_bar”.

Click on the PLUS (+) sign next to your “search_bar” selection and choose the “Click” command.

A pop-up will appear asking you if this is a “next page” button. Click on “No” and rename your new template to “search_results”.

Now, let’s navigate to the search results page of the first keyword on the list and extract some data.
Start by switching over to browse mode on the top left and search for the first keyword on the list.

Once the page renders, make sure you are still working on your new “search_results” template by selecting it with the tabs on the left.

Now, turn off Browse Mode and click on the name of the first result on the page to select it. It will be highlighted in green to indicate that it has been selected.

The rest of the products on the page will be highlighted in yellow. Click on the second one on the list to select them all.

ParseHub is now extracting the product name and URL for each product on the first page of results for each keyword.

Do you want to extract more data from each product? Check out our guide on how to scrape product data from Amazon including prices, ASIN codes and more.

Do you want to extract more pages worth of data? Check out our guide on how to add pagination to your project and extract data from more than one page of results.

Running your Scrape

It is now time to run your project and export it as an Excel file.

To do this, click on the green “Get Data” button on the left sidebar.

Here you will be able to test, run or schedule your scrape job. In this case, we will run it right away.

Closing Thoughts

ParseHub is now off to extract all the data you’ve selected. Once your scrape is complete, you will be able to download it as a CSV or JSON file.

You now know how to search through a list of keywords and extract data from each result.

What website will you scrape first?