Python - Truth About Web Scraping + Scraping Clutch
56: Keeping it real and practical.
EDIT: Apologies for grammatical errors that were present in the “newsletter version” of the article. It happens when you are writing articles after midnight.
Hello ladies, gents, cartoon animals, and others.
For the complete reading experience make sure to open the article in your web browser.
Here is another Python article. By now everyone should know that Python is an amazing language, one with a broader range of use cases than probably the most famous office tool, Excel. Unfortunately, it is not that popular. Yet. Microsoft recently announced they are adding Python to Excel, a positive sign that things are getting interesting and the use cases will increase dramatically…
Back to the main topic.
We start with general information on web scraping and what no one is going to tell you about it. After that? We keep it light and practical at the same time: a basic guide on how to obtain the data you need from Clutch. Something that even those with little or no experience will be able to pull off.
Yet to learn the Python basics?
Truth About Web (Data) Scraping
You will often find us talking about the importance of picking up a WiFi money activity and building your side income. Web scraping is something that has yet to get all the attention it deserves, especially when you compare it with the other, “fancier” types of making money on the internet. Web scraping flies under the radar.
Why, you might ask?
It becomes especially clear when you start considering your options: what you can do with the data you manage to scrape. There are numerous forums and groups that buy scraped data. Maybe you want to use it for cold outreach? Maybe you are interested in building a SaaS solution. Or you have some other reason to scrape data and use it for x or y. Doesn’t matter. What matters is that there is a purpose for the specific data you have gathered.
So what is the purpose of this?
If you are new to web scraping with Python, there are quite a few harsh realities that will hit you along the way. We are not going to go fully into details; experienced scrapers already know how the web scraping field works. If you think programming is a “fast-paced” environment, one where new technologies, updates, and features are released regularly, wait until you start uncovering data scraping. We give anyone two weeks before they realize it’s a game of its own.
“Security providers” are taking extensive measures to prevent scraping. Websites using Cloudflare get updated every month or two in terms of protection from web scraping. No one is saying it’s impossible to scrape site X for one particular reason or another; so far we have managed to scrape quite a few “sources” that were not the easiest targets.
Will you be able to use the same script you used 3 months ago on the same site? Not likely. That is the reality of web scraping with Python: a field of great opportunity that is gaining importance, but one that changes quickly.
For those who don’t have time for constant changes, reading new material, and testing out new approaches: do yourself a favor. Keep it simple and stick to easier things such as API scraping, or focus completely on web-browser-based scrapers. Nothing wrong with that.
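To illustrate what we mean by API scraping, here is a minimal sketch. The endpoint and the JSON field names are hypothetical, so treat this as a shape to adapt, not a working recipe for any particular site. The point is that when a site exposes a JSON API, you skip the browser entirely and just parse the response.

```python
import json
from urllib.request import Request, urlopen

def parse_companies(payload: dict) -> list[dict]:
    """Extract name/rating pairs from a JSON payload (hypothetical schema)."""
    return [
        {"name": c.get("name"), "rating": c.get("rating")}
        for c in payload.get("companies", [])
    ]

def fetch_companies(url: str) -> list[dict]:
    # A plain User-Agent header gets past some trivial bot filters
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=30) as resp:
        return parse_companies(json.load(resp))

# Example with an inline payload instead of a live request:
sample = {"companies": [{"name": "Acme Security", "rating": 4.8}]}
print(parse_companies(sample))
```

Separating the parsing from the fetching, as above, means you can test your extraction logic without hitting the site at all.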
Does this mean it’s not worth learning web scraping?
We are not saying that. It’s an extremely valuable “skill” to have. But the better question is: is it worth it for the average individual? Let’s say, for example, you have someone currently running an SDR agency/lead generation biz, or however you want to call it.
What is the main reason you need scraping?
For simplicity’s sake, let’s say you are scraping pain points (reviews on site X) and, once per month, LinkedIn groups. Nothing complicated. Something that can be done with the simple built-in web browser scraper mentioned below, or with a tool such as PhantomBuster. Would it be worth it for you to constantly research and stay up to date with the latest bypass methods just to scrape that data using Python?
We are focusing on average users, not heavy ones. In our opinion? Mostly it would be a waste of time. Better said, it is a complete “opportunity loss”: time that could be better spent on other activities with more impact on your business.
At the end of the day, it’s on you to decide which approach to use when it comes to data scraping. As we said, it’s not for someone who likes a one-time setup left on autopilot. Determined to make money off data scraping and start your WiFi money journey? It’s a solid starting point.
It’s not that hard once you get the core concepts down, but you will have to update your knowledge regularly and stay in the loop with the latest information. Repeating ourselves: an underrated field that requires a solid chunk of your time, and one with huge potential.
Clutch & How To Scrape Clutch
Clutch.co is a B2B ratings and reviews website in the world of IT services. It is an excellent resource for businesses looking to partner with software development companies, mobile app developers, web designers, digital marketing agencies and other IT specialists. - icraftapps.com
We were unsure whether to include additional information about Clutch. It’s a great website, often used in combination with G2, its bigger competitor. Make sure to check out our guide on how to take care of G2 with Python.
When it comes to scraping Clutch, there are three methods we are going to present in this article, giving you the choice of which one is good enough for your needs. Again, as with our previous “scraping work”, we are trying not to be too technical, focusing instead on the “general” approach so that everyone can follow along.
Before we move forward: best bang for the buck when it comes to scraping Clutch? Method #2, which you will find further down in the article.
Scratching The Surface - Method #1
Similar to what we already covered in the earlier scraping article (which you can find below), the idea here is quite simple. You will only need an instant data scraper to make it work, mixed with Python and a few libraries such as pynput (for mouse control) and Selenium. No reason to overcomplicate.
Requirements Method #1
Additional libraries you will need.
pip install selenium
pip install keyboard
pip install pynput
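Once the installs finish, you can sanity-check that everything is importable before going further. A small convenience sketch using only the standard library:

```python
import importlib.util

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Modules used later in this guide
required = ["selenium", "keyboard", "pynput"]
gaps = missing_modules(required)
if gaps:
    print("Still missing:", ", ".join(gaps))
else:
    print("All set!")
```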
Since we will be using Selenium, you will need a WebDriver for it.
Make sure to save webdriver in the folder where your Python project is located - simplicity reasons.
“A CRX file is a compressed archive file that contains one or more Google Chrome browser extensions.”
Since our method is not as efficient as it used to be back in the day, you will need an additional download, better said, a Chrome extension. We will be searching for a Chrome extension called “Instant Data Scraper”.
To download the CRX file, go to the following website and copy/paste the link you grabbed from the extension store. Hopefully, the same extension is also available on Firefox.
Once you download the CRX file → same principles as in the webdriver example → save it in the folder where your project is located.
There is a reason we call method #1 “scratching the surface”: with an approach like this, you will not be able to dive deeper into review scraping. Maybe it’s possible, but we didn’t find an optimal way to do so.
Instead, we chose a different, more efficient approach for that, depending on your goals and needs. The majority will be satisfied with what this method can do: getting an overview of a certain industry, which you can then enhance with one of the available database tools.
An example of the “scratching the surface” method is visible below.
First step? Target a specific industry - our example is cybersecurity.
Open up the Instant Data Scraper that you should already have installed. The next step is to click “Try another table” until it looks like the example below; it should be a single click of the button. It’s easy to identify the table you are looking for based on the output preview within the Chrome extension.
Once you are done, the next step is to locate the “Next” button. It’s pretty self-explanatory: you must click the “Next” button, which should be found further down on the page, allowing the scraper to continue gathering data automatically.
The last step is to set the minimum delay to 10 seconds – no need to push your luck. There is always a possibility that Clutch will block you. You will see we didn’t modify it in our Python code.
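A fixed 10-second delay works, but if you ever end up scripting the waiting yourself, adding a small random jitter on top of the minimum looks less robotic. A quick sketch:

```python
import random
import time

def polite_delay(minimum: float = 10.0, jitter: float = 5.0) -> float:
    """Sleep at least `minimum` seconds, plus up to `jitter` extra seconds."""
    wait = minimum + random.uniform(0, jitter)
    time.sleep(wait)
    return wait

# Example: polite_delay() sleeps somewhere between 10 and 15 seconds
```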
Click on “Start Crawl” and export the scraped data in the format you desire. It can’t get much easier in technical terms. The majority will be satisfied with method #1, since it gathers a general overview: a list of companies in the desired industries, plus additional details ready to be enriched.
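Once you have the export, a little cleanup goes a long way before enrichment. Paginated scrapes often capture the same company twice, so here is a sketch that deduplicates rows by company name from a CSV export. The column name “Company” is an assumption; adjust it to whatever header your export actually has.

```python
import csv

def dedupe_by_column(rows, column="Company"):
    """Keep only the first row for each distinct value in `column`."""
    seen = set()
    unique = []
    for row in rows:
        # Normalize so "Acme" and "acme " count as the same company
        key = (row.get(column) or "").strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def load_export(path):
    """Read a CSV export and return deduplicated rows as dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return dedupe_by_column(list(csv.DictReader(f)))
```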
The same method can be used within Python in case you are going to scrape certain industries regularly. This is exactly what we did in the previous guide. You are going to need Selenium and a few other libraries to pull it off. First, read the article below to understand the big picture behind the code provided.
You could easily make the code prettier and more efficient. Also, keep in mind Google has changed how they release ChromeDriver; we had to make a few changes in our code to account for it. You will most likely have to update Selenium via pip if you haven’t in a long time.
The “mouse coordinates” in our code below are based on a 1080p resolution with display scaling set to 100%. If any of your settings differ, you will have to change the coordinates to fit your setup. The same goes for running Linux or Mac.
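If your resolution differs from 1080p, you can rescale the coordinates proportionally instead of re-measuring every one by hand. A rough sketch (it ignores display scaling, which you would still need to account for separately):

```python
def scale_coords(x, y, base=(1920, 1080), target=(1920, 1080)):
    """Proportionally map a point from the `base` resolution to `target`."""
    return (round(x * target[0] / base[0]),
            round(y * target[1] / base[1]))

# Example: a 1080p coordinate mapped onto a 1440p display
print(scale_coords(1795, 62, target=(2560, 1440)))
```

This gets you in the right neighborhood; UI elements that don't scale with resolution (like the extension toolbar) may still need manual adjustment.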
Full code ready to run.
from selenium import webdriver
from pynput.mouse import Button, Controller
import keyboard as kb
import time

# One driver instance is enough (creating two opens two browser windows)
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

# Delays in seconds between steps. Adjust based on your needs.
t1 = 4
t2 = 5
t3 = 10
t4 = 150  # how long the crawl runs - EDIT BASED ON YOUR NEEDS

# Industry you want to target on Clutch - in our case the cybersecurity industry
driver.get('https://clutch.co/it-services/cybersecurity')
time.sleep(t1)

# Setting up the mouse controller from pynput
mouse = Controller()

def click_at(x, y, pause):
    """Move the mouse to (x, y), left-click, and wait `pause` seconds."""
    mouse.position = (x, y)
    time.sleep(1)
    mouse.click(Button.left)
    time.sleep(pause)

# Click inside the page, then press End to get to the bottom of the page
click_at(1231, 357, 1)
kb.send('end')
time.sleep(t1)
# Mouse coordinates pointing to the extensions part of the toolbar
click_at(1795, 62, t1)
# Mouse coordinates opening the Instant Data Scraper
click_at(1600, 214, t2)
# Click on the next-button option in the extension
click_at(235, 160, t2)
# Locate the Next button on the page
click_at(1130, 772, t2)
# Edit the delay if needed (uncomment and adjust):
#click_at(249, 229, 1)
#kb.write('9')
# Click Start Crawl and let the crawl run
click_at(244, 162, t4)
# Export the scraped data - edit based on your needs
click_at(401, 135, t3)
print('You are done!')
Now it’s up to you to choose whether you do it “manually”, by clicking all the buttons yourself, or let Python do it automatically for you. Again, we suggest reading the article below if you haven’t already; it’s the easiest way to understand why we recommend using the Instant Data Scraper.