Scrape Anything with ScraperAPI

An overview with code samples

Disclosure: I get a small commission from sign-ups via links on this page, at no additional cost to you.

Web scraping is tough. From getting blocked by websites to parsing incomplete data, many roadblocks can slow you down, reroute you, or stop you entirely. ScraperAPI removes the headaches so you can focus on your project's big picture instead of the intricacies of web scraping.

What is ScraperAPI?

You guessed it - it's an API service that makes web scraping a breeze. Rather than requesting data from websites directly, you route your requests through ScraperAPI's endpoints to take advantage of its rich feature set. It acts as a middleman: it makes the request for you in a way that is far less likely to be blocked, then returns the data in the format you asked for.

Here's a quick example using Python's requests package. Let's scrape a page I built with React. Traditional websites render HTML on the server, so a request to scrape them returns the full HTML you're looking for. Many modern websites use frameworks like React to render the page in the browser after the initial response arrives, meaning the HTML from the server is essentially empty at first. ScraperAPI has a feature to render the HTML for you before returning it, making it super simple to access the data you need.

Without ScraperAPI -

import requests

url = 'https://state-management.willbraun.dev'

r = requests.get(url)
print(r.text)

Response

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8" />
        <link rel="icon" type="image/svg+xml" href="/vite.svg" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>State Management Patterns</title>
        <script type="module" crossorigin src="/assets/index-8d0aa6b9.js"></script>
        <link rel="stylesheet" href="/assets/index-cda38e90.css">
    </head>
    <body>
        <div id="root"></div>
        <!-- NOTE - No content is loaded yet! -->
    </body>
</html>

Notice the body of the response has one empty element - not much to work with! Let's try using ScraperAPI with JavaScript rendering enabled. I'll pass a payload dictionary to the params argument of requests.get, which builds the full URL to send to ScraperAPI, and I've stored my API key in an environment variable.

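import requests
from decouple import config

url = 'https://state-management.willbraun.dev'

# Query parameters for ScraperAPI: the key, the page to scrape, and the render flag
payload = {
    'api_key': config('API_KEY'),
    'url': url,
    'render': 'true',
}

# requests encodes the payload into the query string of the ScraperAPI URL
r = requests.get('https://api.scraperapi.com', params=payload)
print(r.text)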

Full URL

https://api.scraperapi.com/?api_key=<your_api_key>&url=https://state-management.willbraun.dev&render=true

Response

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8">
        <link rel="icon" type="image/svg+xml" href="/vite.svg">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>State Management Patterns</title>
        <script type="module" crossorigin="" src="/assets/index-8d0aa6b9.js"></script>
        <link rel="stylesheet" href="/assets/index-cda38e90.css">
    </head>
    <body>
        <div id="root">
            <header>
                <nav>
                    <a href="https://blog.willbraun.dev/demystifying-state-management" target="_blank" rel="noopener noreferrer">Blog Post</a>
                    <a href="https://github.com/willbraun/state-mgmt-patterns" target="_blank" rel="noopener noreferrer">GitHub</a>
                </nav>
            </header>
            <p>OFF</p>
            <button>Toggle with Prop Drilling</button>
            <p>OFF</p>
            <button>Toggle with Context</button>
            <p>OFF</p>
            <button>Toggle with Zustand</button>
            <p>OFF</p>
            <button>Toggle with Redux</button>
        </div>
    </body>
</html>

Look at all that HTML! This problem is a thing of the past, and ScraperAPI's other features are just as easy to set up.
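
With the rendered HTML in hand, the next step is usually to pull out the pieces you care about. Here is a minimal sketch that collects the button labels using html.parser from Python's standard library. The ButtonTextParser class is my own illustration, not part of ScraperAPI, and the HTML string is a trimmed copy of the response above rather than a live request.

```python
from html.parser import HTMLParser


class ButtonTextParser(HTMLParser):
    """Collect the text inside every <button> element."""

    def __init__(self):
        super().__init__()
        self._in_button = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._in_button = True
            self.labels.append("")  # start a new label

    def handle_endtag(self, tag):
        if tag == "button":
            self._in_button = False

    def handle_data(self, data):
        # Text nodes only count while we are inside a <button>
        if self._in_button:
            self.labels[-1] += data


# Trimmed copy of the rendered response shown above
rendered_html = """
<div id="root">
    <p>OFF</p>
    <button>Toggle with Prop Drilling</button>
    <p>OFF</p>
    <button>Toggle with Context</button>
    <p>OFF</p>
    <button>Toggle with Zustand</button>
    <p>OFF</p>
    <button>Toggle with Redux</button>
</div>
"""

parser = ButtonTextParser()
parser.feed(rendered_html)
print(parser.labels)
```

Run against the unrendered response from the first example, the same parser would find no buttons at all, since the root element is empty.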

Features

Rotating IP Addresses

ScraperAPI automatically rotates through millions of IP addresses as it scrapes data, which helps in two main ways.

Many websites have measures in place to prevent scraping. ScraperAPI distributes scraping requests across many IPs and can automatically retry on failed requests, greatly reducing the likelihood of triggering IP blocks.

Some websites may impose rate limits on the number of requests that can be made from a single IP address within a specific time frame. Rotating IP addresses allows you to bypass these limits and scrape more data without being throttled.

JavaScript Rendering

This is the feature from our example. JavaScript running in the browser may be required to render part of the HTML. Pages like this are more complicated to scrape because the initial response from the server is incomplete. You have to wait for the JavaScript to run after the response arrives.

Add &render=true to your API call and ScraperAPI will handle this step for you. It will run the JavaScript on your desired page and return the fully rendered HTML to you.
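
If you want to see exactly what gets sent, here is a small sketch that builds the request URL by hand with the standard library. The build_scraperapi_url helper and the YOUR_API_KEY placeholder are my own names for illustration; the endpoint and the api_key, url, and render parameters are the ones used in the example earlier.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.scraperapi.com/"


def build_scraperapi_url(api_key, target_url, render=False):
    """Build a ScraperAPI request URL, optionally asking for JS rendering."""
    params = {"api_key": api_key, "url": target_url}
    if render:
        # Only add the flag when rendering is wanted
        params["render"] = "true"
    # urlencode percent-encodes the target URL so it survives as a single value
    return f"{API_ENDPOINT}?{urlencode(params)}"


full_url = build_scraperapi_url(
    "YOUR_API_KEY",  # placeholder, not a real key
    "https://state-management.willbraun.dev",
    render=True,
)
print(full_url)
```

Note that the target URL comes out percent-encoded (https%3A%2F%2F...). Passing a params dictionary to requests.get, as in the earlier example, does this encoding for you.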

Auto Parse to JSON

HTML is typically returned from a scrape request as a long string of text. This may be fine for your use case since it is standard, but the next step is usually to transform the response into a more workable format so that you can find what you need.

Add &autoparse=true to your API call and ScraperAPI will format the response as JSON if possible. One less step for you to worry about!
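
Because the response is only JSON "if possible", it may be worth handling both cases in your own code. Below is a rough sketch of one way to do that. The decode_body helper, the example.com target, and the sample bodies are all made up for illustration, and I'm assuming an unparsed page simply comes back as HTML text.

```python
import json


def decode_body(body):
    """Return ("json", data) if body is valid JSON, otherwise ("text", body)."""
    try:
        return "json", json.loads(body)
    except json.JSONDecodeError:
        # Assumption: pages that can't be auto-parsed come back as plain HTML
        return "text", body


# Parameters you would pass to requests.get(..., params=payload)
payload = {
    "api_key": "YOUR_API_KEY",  # placeholder, not a real key
    "url": "https://example.com/some-page",  # hypothetical target page
    "autoparse": "true",
}

# Two made-up response bodies to show both branches
kind, data = decode_body('{"name": "Example product", "price": 89.56}')
print(kind, data["price"])

kind, data = decode_body("<!DOCTYPE html><html><body>Hello</body></html>")
print(kind, len(data))
```

With a real request you would call decode_body(r.text) on the response instead of the sample strings.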

Geolocation

Sometimes websites show different data depending on what part of the world you are requesting data from. To show results for a particular location, you can specify the geolocation of the IP addresses used by adding the &country_code parameter to your API call. The available countries are listed in the ScraperAPI documentation.
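
As a quick sketch, here is how the parameter could be added to the payload to request the same page from two locations. The params_for_country helper is hypothetical, and the "us" and "de" codes and the example.com target are assumptions on my part, so check the documentation for the codes available on your plan.

```python
from urllib.parse import urlencode


def params_for_country(api_key, target_url, country_code):
    """Build ScraperAPI query parameters pinned to one country."""
    return {
        "api_key": api_key,
        "url": target_url,
        "country_code": country_code,
    }


# Assumed country codes; the real list lives in the ScraperAPI docs
for code in ("us", "de"):
    params = params_for_country("YOUR_API_KEY", "https://example.com", code)
    # Each dictionary can be passed to requests.get(..., params=params)
    print(urlencode(params))
```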

Structured Data Endpoints

Trying to scrape Amazon or Google Search? ScraperAPI has you covered with prebuilt endpoints that return relevant data as JSON. This is a powerful feature that can really accelerate you towards your project goals. Here are the services currently available.

  • Amazon Search

  • Amazon Product Page

  • Google Search Engine Result Page (SERP)

Let's check out the Amazon Search structured data endpoint, which returns Amazon search results from any search query. I was recently researching computer monitors, so let's see what we find with this method.

import json

import requests
from decouple import config

# This endpoint only needs the API key and the search query
payload = {
    'api_key': config('API_KEY'),
    'query': 'computer monitors',
}

r = requests.get(
    'https://api.scraperapi.com/structured/amazon/search',
    params=payload
)
# The endpoint returns JSON, so decode it and pretty-print the result
parsed = r.json()
print(json.dumps(parsed, indent=2))

Full URL

https://api.scraperapi.com/structured/amazon/search?api_key=<your_api_key>&query=computer+monitors

Response (shortened, it's a lot of data)

{
  "ads": [
    {
      "name": "SAMSUNG 32\" Odyssey G55A QHD 165Hz 1ms FreeSync Curved Gaming Monitor with HDR 10, Futuristic Design for Any Desktop (LS32AG550ENXZA)",
      "asin": "B09TMJ9LGR",
      "brand": "Samsung Gaming Monitors",
      "image": "https://m.media-amazon.com/images/I/81a+yL6ii9L.jpg",
      "has_prime": false,
      "is_best_seller": false,
      "is_amazon_choice": false,
      "is_limited_deal": false,
      "stars": 4.4,
      "total_reviews": 0,
      "url": "https://aax-us-iad.amazon.com/x/c/RPqPaOmO9TkKF7V6K9QDMy0AAAGLCkRA-gEAAAH2AQBvbm9fdHhuX2JpZDMgICBvbm9fdHhuX2ltcDEgICA-fF7_/https://www.amazon.com/gp/aw/d/B09TMJ9LGR/?_encoding=UTF8&pd_rd_plhdr=t&aaxitk=e43e49e9df54e5c84f02ddf50d96ae4a&hsa_cr_id=0&qid=1696684327&sr=1-1-9e67e56a-6f64-441f-a281-df67fc737124&ref_=sbx_be_s_sparkle_mcd_asin_0_bkgd&pd_rd_w=9Nsbn&content-id=amzn1.sym.cd95889f-432f-43a7-8ec8-833616493f4a%3Aamzn1.sym.cd95889f-432f-43a7-8ec8-833616493f4a&pf_rd_p=cd95889f-432f-43a7-8ec8-833616493f4a&pf_rd_r=9X10K8EY448WDY60KJJV&pd_rd_wg=lpoc4&pd_rd_r=93c1edaf-f358-4104-bbb9-6e608dc2b024",
      "type": "top_stripe_ads"
    },
    <more ads>
  ],
  "results": [
    {
      "type": "search_product",
      "position": 1,
      "asin": "B0773ZY26F",
      "name": "Sceptre 24-inch Professional Thin 1080p LED Monitor 99% sRGB 2x HDMI VGA Build-in Speakers, Machine Black (E248W-19203R Series)",
      "image": "https://m.media-amazon.com/images/I/81zM2vVM+wL.jpg",
      "has_prime": false,
      "is_best_seller": false,
      "is_amazon_choice": false,
      "is_limited_deal": false,
      "stars": 4.6,
      "total_reviews": 30226,
      "url": "https://www.amazon.com/Sceptre-E248W-19203R-Monitor-Speakers-Metallic/dp/B0773ZY26F/ref=sr_1_1?keywords=computer+monitors&qid=1696684327&sr=8-1",
      "availability_quantity": null,
      "spec": {},
      "price_string": "$89.56",
      "price_symbol": "$",
      "price": 89.56
    },
    {
      "type": "search_product",
      "position": 2,
      "asin": "B0148NNKTC",
      "name": "Acer 23.8\u201d Full HD 1920 x 1080 IPS Zero Frame Home Office Computer Monitor - 178\u00b0 Wide View Angle - 16.7M - NTSC 72% Color Gamut - Low Blue Light - Tilt Compatible - VGA HDMI DVI R240HY bidx",
      "image": "https://m.media-amazon.com/images/I/91K9SyGiyzL.jpg",
      "has_prime": true,
      "is_best_seller": false,
      "is_amazon_choice": false,
      "is_limited_deal": false,
      "stars": 4.7,
      "total_reviews": 15156,
      "url": "https://www.amazon.com/Acer-Frame-Office-Computer-Monitor/dp/B0148NNKTC/ref=sr_1_2?keywords=computer+monitors&qid=1696684327&sr=8-2",
      "availability_quantity": null,
      "spec": {},
      "price_string": "$99.99",
      "price_symbol": "$",
      "price": 99.99
    },
    {
      "type": "search_product",
      "position": 3,
      "asin": "B0BS9TDY31",
      "name": "Acer KB272 Hbi 27\" Full HD (1920 x 1080) Zero-Frame Gaming Office Monitor | AMD FreeSync Technology | 100Hz | 1ms (VRB) | Low Blue Light | Tilt | HDMI & VGA Ports,Black",
      "image": "https://m.media-amazon.com/images/I/81FTa3aSdnL.jpg",
      "has_prime": true,
      "is_best_seller": false,
      "is_amazon_choice": false,
      "is_limited_deal": false,
      "stars": 4.6,
      "total_reviews": 2158,
      "url": "https://www.amazon.com/KB272-Hbi-Zero-Frame-FreeSync-Technology/dp/B0BS9TDY31/ref=sr_1_3?keywords=computer+monitors&qid=1696684327&sr=8-3",
      "availability_quantity": null,
      "spec": {},
      "price_string": "$149.99",
      "price_symbol": "$",
      "price": 149.99
    },
    <many more search results>
  ],
  "explore_more_items": [],
  "pagination": [
    "https://www.amazon.com/s?k=computer+monitors&page=2&qid=1696684327&ref=sr_pg_2",
   <more page links>
  ]
}
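
Since the response is already structured, filtering and sorting it takes only a few lines. Here's a small sketch that finds the best-rated monitors under a price limit. The best_rated_under function is my own illustration, the data is a trimmed copy of the three results above, and skipping items with no price is an assumption about how you'd want to treat missing values.

```python
# Trimmed copy of the "results" array from the response above
response = {
    "results": [
        {"asin": "B0773ZY26F", "stars": 4.6, "total_reviews": 30226, "price": 89.56},
        {"asin": "B0148NNKTC", "stars": 4.7, "total_reviews": 15156, "price": 99.99},
        {"asin": "B0BS9TDY31", "stars": 4.6, "total_reviews": 2158, "price": 149.99},
    ]
}


def best_rated_under(results, max_price):
    """Return items priced at or below max_price, highest star rating first."""
    affordable = [
        item
        for item in results
        # Assumption: items without a price are skipped rather than guessed at
        if item.get("price") is not None and item["price"] <= max_price
    ]
    return sorted(affordable, key=lambda item: item.get("stars") or 0, reverse=True)


for item in best_rated_under(response["results"], 100):
    print(item["asin"], item["stars"], item["price"])
```

With a live request, you would pass parsed["results"] from the earlier example instead of the sample data.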

More

See all features, including the following, in the full documentation here.

  • Concurrent threads

  • Automated retries

  • Async requests

  • Proxy ports

  • SDKs

Let's get scraping!

There's no substitute for having the right tool for the job. Scraping without assistance can cost you valuable time searching for a usable site, wading through errors, and formatting data. ScraperAPI is a dead-simple way to level up your scraping game and put those challenges behind you.

ScraperAPI has a generous free tier that offers 5,000 API credits per month and paid plans that scale to your needs. They also offer professional support to assist you with your setup. The sign-up process is painless, and free plans do not require a credit card. I was able to start scraping with it in less than 5 minutes.

If only I had known about this when I built a scraping tool to simulate bets on tennis matches (link here). Since I was scraping websites directly, I was getting blocked and dealing with incomplete data. It would have been a much faster and smoother experience with ScraperAPI.

Support Will Braun by becoming a sponsor. Any amount is appreciated!
