Many larger companies offer free public APIs that give developers access to data for use in their own projects. But that’s not always the case. Sometimes you need to reach out and scrape the data for yourself!
Web scraper vs Web crawler?
Web scrapers and web crawlers are both tools that can be used to automate data extraction.
Web crawling
Web crawlers are used to read, copy, and store the content of websites for archiving or indexing purposes. A practical use case that everybody is familiar with is powering search engines. Web crawlers can also be used to generate SEO results, monitor SEO analytics, and perform website analysis. Large-scale web crawling is typically performed by large corporations like Google.
Web scraping
Web scrapers are used to extract large amounts of data from online sources. They’re often used by scientists to gather data for analysis. They can also be used for generating leads, comparing prices, analyzing the stock market, managing brand reputation, researching the market for new products, and collecting data sets for machine learning.
Ethics & Legality
Sometimes web scraping can fall into a legal grey area. In particular, you need to be careful to avoid stealing intellectual property or privileged data, or inadvertently executing a denial-of-service attack.
Before proceeding, you should consider the following:
- Do you have the right to access the data you want to scrape?
- Is the rate of scraping similar to the rate at which a human could normally consume the data?
Perfectly legal example: using the YouTube API to scrape the comments section of a video to select a random winner for a giveaway.
You’ll know that you have the right to access content if it’s something that you are normally able to view when visiting the site. Check the website’s terms of service before scraping their data if you’re unsure.
If your program sends too many requests, you could inadvertently overload the server and crash the site (an accidental denial of service, or DoS). A good idea is to limit yourself to one request every few seconds, or to use an official API that enforces rate limits automatically so you don’t have to worry.
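One way to keep to "one request every few seconds" is to process your targets one at a time with a pause in between. The following is a minimal sketch, not a definitive implementation: `sleep` and `runThrottled` are hypothetical helper names, `task` stands in for whatever function scrapes a single page, and the 3000 ms default is only an example value.

```javascript
// Pause for the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Run `task` on each item one at a time, waiting `delayMs` between calls.
// `task` is a placeholder for whatever fetches or scrapes a single URL.
async function runThrottled(items, task, delayMs = 3000) {
  const results = []
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await sleep(delayMs) // no wait needed before the first request
    results.push(await task(items[i]))
  }
  return results
}
```

Inside an async function you could call it as `await runThrottled(urls, scrapeOne)`, where `urls` and `scrapeOne` are your own list and scraping function. Running the requests sequentially rather than in parallel is the point: the server never sees more than one request from you at a time.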
DIY
You will need:
- Node
- Puppeteer
Steps
- Initialize your npm node module (creates a package.json file)
npm init -y
- Install the puppeteer dependency
npm i puppeteer
- Create the entry point file
touch index.js
- (Optional) In package.json, replace the “test” script with a “start” script so that you can launch your program with “npm start”. (If you skip this, you can still run your script with node index.)
// change this:
"scripts": {
  "test": "echo \"Error: no test specified\" && exit 1"
}
// to this:
"scripts": {
  "start": "node index"
}
- In your index.js file, require puppeteer and create an async function that launches the browser, opens the web page, performs whatever scraping actions you want, and then closes the browser.
const puppeteer = require('puppeteer')

async function run() {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://www.traversymedia.com')

  // this is where you can write your code to do whatever you want

  // Screenshots & PDFs
  await page.screenshot({ path: 'screenshot.png' }) // window screenshot
  await page.screenshot({ path: 'fullPageScreenshot.png', fullPage: true }) // full page screenshot
  await page.pdf({ path: 'fullPagePDF.pdf', format: 'A4' }) // save the page as a PDF document

  // Extract all of the HTML on the page
  const html = await page.content()
  console.log(html) // log the data

  // Extract a specific element, in this case the title of the page
  const title = await page.evaluate(() => document.title)
  console.log(title)

  // Extract all the text on the page
  const text = await page.evaluate(() => document.body.innerText)
  console.log(text)

  // Extract all the links on the page
  const links = await page.evaluate(() => Array.from(document.querySelectorAll('a'), (e) => e.href))
  console.log(links)

  // Extract an array of all the course titles: the h3 inside .card-body
  // within each .card element under the #courses parent
  const courses = await page.evaluate(() => Array.from(document.querySelectorAll('#courses .card'), (e) => ({
    title: e.querySelector('.card-body h3').innerText
  })))
  console.log(courses)

  await browser.close()
}

run()
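In the function above, if any step throws (for example, the page fails to load), the code never reaches browser.close() and the browser process can be left running. A possible way to guard against that is a try/finally wrapper. This is a sketch under assumptions: `withBrowser` is a hypothetical helper name, and `launch` is any function that resolves to an object with a close() method, such as `() => puppeteer.launch()`.

```javascript
// `launch` is any function that resolves to an object with a close() method,
// e.g. () => puppeteer.launch(). `task` receives that object.
async function withBrowser(launch, task) {
  const browser = await launch()
  try {
    return await task(browser)
  } finally {
    await browser.close() // runs whether task resolved or threw
  }
}
```

With Puppeteer this could be called as `withBrowser(() => puppeteer.launch(), async (browser) => { /* scraping steps */ })`, with the scraping steps moved into the callback.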
If you are looking to extract something specific, you may need to go to the web page and inspect its code so that you can narrow down your scraping request.
Save data to file
- Require the file system module at the top of index.js
const fs = require('fs')
- Save the courses array to a JSON file (place this inside run(), after courses is defined)
// saves the courses array to a JSON file instead of logging the data to the console
fs.writeFile('courses.json', JSON.stringify(courses), (err) => {
  if (err) throw err
  console.log('file saved')
})
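As a variation on the callback version, the same save step can be written with promises so it fits naturally inside an async function. The sketch below assumes a Node version that ships `fs/promises` (14 or later); `saveJson` is a hypothetical helper name, and the read-back step is only there to confirm the file contains valid JSON.

```javascript
const { writeFile, readFile } = require('fs/promises')

// Write `data` as indented JSON, then read it back to confirm it parses.
async function saveJson(filePath, data) {
  // The third argument to JSON.stringify indents the output by 2 spaces
  await writeFile(filePath, JSON.stringify(data, null, 2))
  return JSON.parse(await readFile(filePath, 'utf8'))
}
```

Inside run() this could be used as `await saveJson('courses.json', courses)`. Indenting the output makes the file easier to read by hand, at the cost of a slightly larger file.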
Further Reading
Web scrapers have scoured the internet, collected information, and dumped it into convenient databases for data analysts to access.