Web Scraping with Python

How do you scrape the web with Python? It's a question many beginners ask. At the entry level the process is quite simple, and anyone can quickly get a project off the ground. To work on such a task successfully, however, you should keep in mind a number of aspects that are not easy to grasp all at once.

In the first section of this article, we describe the simple steps for creating your own scraping bot. You can easily reuse these approaches in your project; only slight adjustments are needed to adapt them to your case. The second section adds further tips to extend the bot's functionality and make scraping as efficient as possible.

What is scraping?

Scraping is the process of extracting data from publicly accessible web pages. A GET request, sent through a browser or a browser emulator, is used to obtain data from the servers of the target resource. The elements that contain the required information are then selected from the resulting HTML page; this step is called parsing. Finally, the extracted data is stored in a database or in the file system.

The following sections briefly walk through all three stages of scraping: requesting the target resource, parsing the data, and storing the results in the file system.
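
For orientation, here is a compact sketch of all three stages using the requests library (installed along with the other requirements in the next section); the rest of the article performs the same steps with Selenium, and the URL here is simply the example site used throughout this guide:

    import requests
    from bs4 import BeautifulSoup

    # 1. Request the target resource with a GET request
    response = requests.get('https://rsocks.net/')

    # 2. Parse the HTML and select the element we need
    soup = BeautifulSoup(response.text, features='html.parser')
    title = soup.find('title').text

    # 3. Store the result in the file system
    with open('result.txt', 'w', encoding='utf-8') as f:
        f.write(title)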

Scraping bot in Python: the beginning

As the first step, we recommend creating a new project in one of the popular IDEs. For Python you can use VS Code, PyCharm, Jupyter, Spyder, or other similar tools. The examples below use the Python 3.8 interpreter. If you plan to work on this project regularly, we recommend creating a git repository for it right away.

Once the project is created, create and activate a virtual environment. This is necessary because we will be working with a number of external libraries. A detailed guide on using virtual environments can be found in the Python documentation: https://docs.python.org/3/tutorial/venv.html
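
For reference, creating and activating the environment usually looks like this (assuming Python 3 is on your PATH and the environment is named venv; the Windows activation command is shown as a comment):

    $ python -m venv venv
    $ source venv/bin/activate
    # on Windows: venv\Scripts\activate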

Installing additional libraries

For the basic examples from the first part of this article, it is enough to install three external libraries. Install them via pip inside the activated virtual environment:

    $ pip install BeautifulSoup4 pandas selenium

Selenium will be used for browser automation and sending HTTP requests. BeautifulSoup is one of the most popular libraries for parsing HTML pages and helps select the required elements. Pandas is used to write the data to a file.

If you want to try all the examples given in the article, we advise installing the whole set of requirements at once. To do this, create a requirements.txt file in the project directory and save the list below into it:

    beautifulsoup4==4.9.3
    blinker==1.4
    certifi==2020.12.5
    cffi==1.14.5
    chardet==4.0.0
    cryptography==3.4.7
    et-xmlfile==1.1.0
    gevent==21.1.2
    greenlet==1.1.0
    grequests==0.6.0
    h11==0.12.0
    h2==4.0.0
    hpack==4.0.0
    hyperframe==6.0.1
    idna==2.10
    kaitaistruct==0.9
    lxml==4.6.3
    MechanicalSoup==1.0.0
    numpy==1.20.3
    openpyxl==3.0.7
    pandas==1.2.4
    pyasn1==0.4.8
    pycparser==2.20
    pydivert==2.1.0
    pyOpenSSL==20.0.1
    pyparsing==2.4.7
    PySocks==1.7.1
    python-dateutil==2.8.1
    pytz==2021.1
    requests==2.25.1
    selenium==3.141.0
    selenium-wire==4.3.0
    six==1.16.0
    soupsieve==2.2.1
    urllib3==1.26.4
    wsproto==1.0.0
    zope.event==4.5.0
    zope.interface==5.4.0

To install them, run the command:

    $ pip install -r requirements.txt

After the installation, you can start working; all the code samples below will run correctly.

WebDriver for a browser

We will use the Selenium library to retrieve data from web pages. It lets you launch a standard browser and control it automatically. Although headless browsers are better suited to large-scale scraping tasks, at the first stage of development it is more convenient to use a regular browser. In our examples we use Chrome; the experience with Firefox is not much different.

To control the browser from Python, you need to download a WebDriver and put it in a local folder on your PC. The driver for Chrome (ChromeDriver) is available on its official downloads page.

Find the driver version that matches your browser and operating system, download it, and unzip it. Then put the WebDriver in a folder that is easy to reach from the Python project.

Python + Chrome: extracting data from a website

Now it's time to use the downloaded WebDriver and launch the browser from Python. Let's start by importing the Selenium webdriver:

    from selenium import webdriver

Then create a driver object for Chrome:

    browser = webdriver.Chrome(
        executable_path='c://path/to/your/chromedriver.exe'
    )

The browser needs to be given a link, or a set of links, to the sites it will open on command from the Python app. Declare a list that will hold the target URLs:

    urls = [
        'https://rsocks.net/',
        'https://rsocks.net/server-proxy',
        'https://rsocks.net/mobile-proxy',
        ...
    ]

To open each of these pages in turn, we loop through the urls list, make a GET request to each page, and close the browser window after the last request. The Python script now looks like this:

    from selenium import webdriver
    browser = webdriver.Chrome(
        executable_path='c://path/to/your/chromedriver.exe'
    )
    urls = [
        'https://rsocks.net/',
        'https://rsocks.net/server-proxy',
        'https://rsocks.net/mobile-proxy',
        ...
    ]
    for url in urls:
        browser.get(url)
    browser.quit()

After startup, a browser window should open and load each page in the list in turn. If it does, the code works as intended, and you can move on to the next step.

Parsing data with BeautifulSoup

The pages load successfully in the browser, but we have not extracted any data from them yet. We will do this with one of the most popular parsing libraries, BeautifulSoup. In this section we use it to extract the titles and headers of web pages.

We start by importing the module and creating empty lists to store the target data:

    from bs4 import BeautifulSoup
    titles = []
    headers = []

Next, let's modify the page request step: store the page source in the content variable and process it with BeautifulSoup's built-in HTML parser:

    browser.get(url)
    content = browser.page_source
    soup = BeautifulSoup(content, features='html.parser')

Now store the individual page elements in variables and add them to the prepared lists if they are not there already:

    title = soup.find('title').text
    header = soup.find('h1').text

    if title not in titles:
        titles.append(str(title))
    if header not in headers:
        headers.append(str(header))

To make sure that the data was extracted correctly, output it to the console:

    print(titles)
    print(headers)

Great! The pages are requested and the required data is extracted, but with a larger amount of information it needs to be saved in a convenient format.

Popular Python modules let you save information to databases or to files in various formats: TXT, CSV, XLSX, RTF, and so on. We will use one of the best-known data analysis libraries, Pandas.

Exporting data with Pandas

The Pandas library offers data structures for working easily and efficiently with large amounts of diverse data. In our case we will use the DataFrame type to store the collected lists of titles and headers.

You can store arrays, lists, dictionaries, and other similar objects in a DataFrame. Import the Pandas module and save the page parsing results as a dictionary:

    import pandas as pd

    df = pd.DataFrame(
        {
            'Title': titles,
            'Headers': headers,
        }
    )

You can use the broad functionality provided by Pandas for further data processing. Here, let's use it to export the data to an Excel file:

    df.to_excel('titles.xlsx', sheet_name="Sheet1")
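
If you prefer a plain-text format, the same DataFrame can just as easily be written to a CSV file (a minimal alternative to the Excel export above, using the same df object):

    df.to_csv('titles.csv', index=False)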

The final code looks like this:

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import pandas as pd

    browser = webdriver.Chrome(
        executable_path='c://path/to/your/chromedriver.exe'
    )

    titles = []
    headers = []

    urls = [
        'https://rsocks.net/',
        'https://rsocks.net/server-proxy',
        'https://rsocks.net/mobile-proxy',
        ...
    ]

    for url in urls:
        browser.get(url)
        content = browser.page_source
       
        soup = BeautifulSoup(content, features='html.parser')
        title = soup.find('title').text
        header = soup.find('h1').text
        
        if title not in titles:
            titles.append(str(title))
        if header not in headers:
            headers.append(str(header))

    df = pd.DataFrame(
        {
            'Title': titles,
            'Headers': headers,
        }
    )
    
    df.to_excel('titles.xlsx', sheet_name="Sheet1")
      
    browser.quit()

When the program finishes, an Excel file with the scraping results will appear in the working directory. Congratulations! We have implemented simple web scraping in Python!

Scraping Bot in Python: Part II

The examples reviewed above are great for understanding how scraping bots work, but to handle real tasks they need continuous tweaking and improvement. Below we add a few more features that make scraping in Python more robust.

Requests through a proxy server

If you make a lot of requests to the same web resource, you may run into access restrictions. Many services do their best to limit scraping bots, most often by tracking your IP. If all requests to the target service come from the same IP, you can be temporarily blocked from particular pages or have your request limit reduced.

Proxy servers help by letting you send requests from different IP addresses. Working through a proxy can get around request limits, since every request comes from a proxy IP that can be located anywhere in the world.

Let's set up a request through a proxy in Selenium. It only takes a few extra parameters on the WebDriver we have already been using.

To connect to a proxy you need its IP address and port. Let's save them, along with the credentials, in a separate dictionary:

    proxy = {
        'address': 'IP:port',
        'username': 'proxy_login',
        'password': 'proxy_password',
    }

Selenium can connect through proxies that support the HTTP and SOCKS protocols with different authorization types. In the examples below we use a SOCKS5 connection with IP authorization and an HTTP connection with username/password access.

The simplest case is a proxy with IP authorization or no authorization at all: the browser does not have to pass a username and password to access it. To work with such a proxy, you only need to add a few options to the previous example.

To make the browser work through a proxy, you set it up in the browser's launch parameters. Selenium provides a separate Options class for this. Let's import it and create an object that holds the parameters of the browser we are launching:

    from selenium.webdriver.chrome.options import Options
    options = Options()

Individual settings are configured with the add_argument method. Let's use it to set up the proxy:

    options.add_argument('--proxy-server=socks5://IP:port')

To check that the proxy works, let's open a site that shows the visitor's IP. The script will look like this:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    proxy = {
        'address': 'IP:port',
        'username': 'proxy_login',
        'password': 'proxy_password',
    }
    
    options = Options()
    options.add_argument(f'--proxy-server=socks5://{proxy["address"]}')
   
    chrome = webdriver.Chrome(
        executable_path='c://path/to/your/chromedriver.exe',
        options=options
    )

    chrome.get('http://ipinfo.io')

The browser should open the IP detection page. If the connection is successful, the address shown on the page will match the external IP of your proxy.

When you use a proxy with username and password authorization, the approach is slightly different: before the browser is launched, the script generates a small Chrome extension that configures the proxy.

The extension for Chrome consists of a manifest.json file, which describes the main authorization parameters, and a JavaScript file with the extension's logic. Both files are then packed into a ZIP archive. These steps can be done manually, but they are easy to automate with Python.

To create the files and archive them, we use the built-in zipfile module. Create a separate script create_plugin.py and put the following code in it:

    import zipfile

    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version": "22.0.0"
    }
    """

    background_js = """
    var config = {
        mode: "fixed_servers",
        rules: {
            singleProxy: {
                scheme: "http",
                host: "%s",
                port: parseInt(%s)
            },
            bypassList: ["localhost"]
        }
    };

    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

    function callbackFn(details) {
        return {
            authCredentials: {
                username: "%s",
                password: "%s"
            }
        };
    }

    chrome.webRequest.onAuthRequired.addListener(
        callbackFn,
        {urls: ["<all_urls>"]},
        ['blocking']
    );
    """ % ('proxy_IP', 'proxy_port', 'username', 'password')

    plugin_file = 'proxy_auth_plugin.zip'

    with zipfile.ZipFile(plugin_file, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

After running this script, the proxy_auth_plugin.zip archive with the extension will appear in the working directory.

To launch the browser with the created extension, the same approach is used: create an Options object, unpack the proxy_auth_plugin.zip archive, and pass the path of the unpacked folder to the --load-extension argument:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument(
        '--load-extension=c://path/to/your/proxy_auth_plugin'
    )

    chrome = webdriver.Chrome(
        executable_path='c://path/to/your/chromedriver.exe',
        options=options
    )
    chrome.get('http://ipinfo.io')

This method lets you use an HTTP proxy with username and password authorization.
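
As an alternative worth mentioning, the selenium-wire package (already pinned in the requirements list above) accepts an authenticated proxy URL directly, so no extension is needed. A minimal sketch with the same placeholder credentials, assuming selenium-wire 4.x:

    from seleniumwire import webdriver  # note: seleniumwire, not selenium

    # selenium-wire handles the proxy credentials itself
    seleniumwire_options = {
        'proxy': {
            'http': 'http://proxy_login:proxy_password@IP:port',
            'https': 'https://proxy_login:proxy_password@IP:port',
            'no_proxy': 'localhost,127.0.0.1',
        }
    }

    chrome = webdriver.Chrome(
        executable_path='c://path/to/your/chromedriver.exe',
        seleniumwire_options=seleniumwire_options
    )
    chrome.get('http://ipinfo.io')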

How do I choose proxies suitable for scraping?

Data collection from a target resource is fast and successful when proxies are chosen separately for each task. The proxy market offers a wide range of package configurations, so to make an informed choice you need to understand the basic characteristics by which the available options differ.

The most natural way to classify proxies is by the type of device they run on, or more precisely, by how that device connects to the Web, since this is what defines its IP. By this criterion, proxies are divided into mobile, residential, and server proxies. All of these types can be used for data scraping, but how convenient each one is depends on the technical details of the task and the planned budget.

Mobile proxies receive their IP addresses from real mobile operators. Mobile network users enjoy a fairly high level of trust, even from demanding services such as Amazon, Facebook, Instagram, or Spotify. Mobile proxies take advantage of this, letting you work with almost any service and collect data with ease.


This type of proxy has its disadvantages. The biggest one is the high cost of mobile traffic, which is why mobile proxies are priced higher than residential or server analogues. Another drawback is the lower speed of mobile internet compared to a wired connection; mobile proxies are far from record holders in data transmission.

On RSocks you can find a huge variety of mobile proxies from 22 countries around the globe. A big advantage of our proxies is unlimited traffic, which makes your work much easier: you no longer have to keep track of how much traffic you have left.

Residential proxies run on devices connected to the Web through ordinary internet service providers, so their IPs do not differ from those of regular home PC users. This also lets them maintain high performance, since home internet connections are quite fast in most countries.


The anonymity and high trust level that these proxies enjoy with target resources are enough for most data scraping tasks. Their advantage over mobile solutions is a relatively low price, which lets you purchase a larger pool for your work.

The biggest drawback of residential proxies is that their characteristics depend heavily on the quality of the package. With free or budget residential proxies, the connection speed, ping, and uptime can be lower than expected; such a package will only handle a small volume of data transferred through each proxy at a time.

A good choice of residential proxies is one of the World Mix packages (https://rsocks.net/socks-https-proxy). With it you receive a list of 9,000-27,000 residential SOCKS5/SOCKS4/HTTP proxies, up to 30% of which rotate every two hours. The geolocation of the IP addresses is mixed, with countries from every continent.

If you do not need unlimited traffic, the best option is Exclusive (https://rsocks.net/exclusive). It offers a huge pool of high-quality proxies with a 15-minute rotation period and a package size of 50,000 IPs. You can choose between Wi-Fi and mobile traffic.

Server proxies are hosted on servers that belong to hosting providers, so their IPs are static, which noticeably limits their usefulness for scraping. However, if the target resource is known to be tolerant of visitors and you do not plan to send too many requests, server proxies can give good results as well.


What favorably distinguishes server packages from the rest is their speed and connection stability: speeds can reach 100 and, in rare cases, even 1000 Mbit/s. Combined with unlimited traffic, this makes them excellent for careful scraping with a low risk of IP blocks.

We have more than 25,000 server proxies from 14 countries, with a minimum package size of 1,000 IPs. For all of our offers we guarantee up to 99% uptime and a stable connection speed. For proxies in the USA, Russia, and Germany you can pick a particular city. Prices for these packages start from $9.

How can I make scraping more effective?

The approaches presented above need further improvements before they are ready for full-fledged data collection from target resources. Below are a few more tips that can help you move to full-scale web scraping:

  • Collect data of the same type in loops so that you end up with arrays of the same length;
  • Request several pages in advance. There are plenty of ways to implement this; the simplest is to collect the URLs in lists and send the requests in a loop;
  • Keep dissimilar data in separate arrays of the same length; this makes it easier to save the results to files. Exporting scraped data in several formats is a common requirement in commercial tasks;
  • Once you are sure your scraping bot works properly, stop using a real browser window for requests. Launching the browser without a graphical interface (headless mode) is far more efficient and saves a lot of time; a sketch follows this list;
  • Use scraping patterns that imitate the behaviour of real visitors, for example by adding artificial pauses between page loads or scrolling down the page, as shown in the sketch below;
  • Add monitoring of the target pages to your bot. Some pages change over time or depending on the visitor's status, and periodic revisits ensure your data stays up to date;
  • Make page requests more efficient with the Requests library, one of the most popular Python tools for working with network protocols: it lets you configure plenty of additional parameters without significantly complicating the bot. For concurrent requests you can use the grequests library;
  • Use proxies from a country or city that suits the target resource best; this helps you avoid many regional network restrictions.
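
For instance, a headless launch with randomized pauses between page loads might look like this (a sketch that reuses the driver path placeholder and the urls list from the earlier examples):

    import time
    import random

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument('--headless')              # run Chrome without a GUI
    options.add_argument('--window-size=1920,1080')

    browser = webdriver.Chrome(
        executable_path='c://path/to/your/chromedriver.exe',
        options=options
    )

    for url in urls:
        browser.get(url)
        # imitate a human visitor: scroll down and pause for a random interval
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(random.uniform(2, 5))

    browser.quit()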

Congratulations, the work on the scraping bot is complete! Keep developing your projects and use high-quality proxies!

If you want a recap of all the information above, take a look at this video guide: https://www.youtube.com/watch?v=497Fy7CIBOk

Still have questions? Look for the answers below!

How do I run a scraping bot in multithreaded mode?

The simplest way, well suited to beginners, is the grequests library. It lets you perform simple operations concurrently without worrying about the technical details of thread management.
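
A minimal sketch of how this might look (grequests is already pinned in the requirements list, and the URL list is the same as in the earlier examples):

    import grequests

    urls = [
        'https://rsocks.net/',
        'https://rsocks.net/server-proxy',
        'https://rsocks.net/mobile-proxy',
    ]

    # prepare the requests lazily and send them concurrently
    pending = (grequests.get(url) for url in urls)
    responses = grequests.map(pending)

    for response in responses:
        if response is not None:
            print(response.url, response.status_code)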

How do I work with forms on the target page?

A convenient and effective tool for interacting with web resources is the MechanicalSoup library. With it you can log into accounts, use search bars, and fill out almost any other form, which opens up more opportunities for parsing.
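
A hedged sketch of filling in a search form with MechanicalSoup (the site, the form selector, and the field name 'q' here are hypothetical and depend on the target page):

    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.open('https://example.com/')

    # select the first form on the page and fill in a field named 'q'
    browser.select_form('form')
    browser['q'] = 'proxy'
    response = browser.submit_selected()

    print(response.url)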

Which Python tools can also be used for parsing?

Many Python tools have been developed for effective parsing. Particularly interesting are full-fledged frameworks that can run the whole cycle of operations needed to collect data from the Web; Scrapy and PySpider stand out among them. Tutorials are a good way to get started with them: https://www.scrapingbee.com/blog/web-scraping-101-with-python/#4-web-crawling-frameworks
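
For orientation, a minimal Scrapy spider could look like this (Scrapy is not in the requirements list above, so it needs to be installed separately with pip install scrapy):

    import scrapy

    class TitleSpider(scrapy.Spider):
        name = 'titles'
        start_urls = ['https://rsocks.net/']

        def parse(self, response):
            # extract the page title and the first h1, as in the earlier examples
            yield {
                'title': response.css('title::text').get(),
                'header': response.css('h1::text').get(),
            }

You can run such a spider with scrapy runspider and export the results to a file, for example: scrapy runspider title_spider.py -o titles.csv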

How to pick the best proxies for my project?

On RSocks you can request a trial of any package you like. Just contact our technical support, and our specialists will help you find the best solution for your tasks.

How much do proxies for scraping cost?

We offer a wide range of packages that are well suited to collecting data from the Web. Prices for residential packages start from $1, mobile from $3, and server from $9.
