# Scraping Setup
In this section, we'll walk through setting up a scraping script with the `flyscrape new script.js` command, which generates a structured starting point for your web scraping project.
## The `flyscrape new` Command
The `flyscrape new` command generates a new scraping script with a predefined structure and sample code, giving you a quick, consistent starting point for your project.
## Creating a New Scraping Script
To create a new scraping script, use the `flyscrape new` command followed by the desired script filename. For example:
```bash
flyscrape new my_scraping_script.js
```
This command will generate a file named `my_scraping_script.js` in the current directory. You can then open and edit this file with your preferred code editor.
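Once you've filled in the script, you can execute it with `flyscrape run` (check `flyscrape --help` for the exact commands available in your version):
```bash
# Run the script and print the scraped results as JSON
flyscrape run my_scraping_script.js
```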
## Script Overview
Let's take a closer look at the structure and components of the generated scraping script:
```javascript
import { parse } from 'flyscrape';

export const options = {
    url: 'https://example.com/', // Specify the URL to start scraping from.
    depth: 1,                    // Specify how deep links should be followed.  (default = 0, no follow)
    allowedDomains: [],          // Specify the allowed domains. ['*'] for all. (default = domain from url)
    blockedDomains: [],          // Specify the blocked domains.                (default = none)
    allowedURLs: [],             // Specify the allowed URLs as regex.          (default = all allowed)
    blockedURLs: [],             // Specify the blocked URLs as regex.          (default = non-blocked)
    proxy: '',                   // Specify the HTTP(S) proxy to use.           (default = no proxy)
    rate: 100,                   // Specify the rate in requests per second.    (default = 100)
};

export default function ({ html, url }) {
    const $ = parse(html);

    // Your data extraction logic goes here
    return {
        // Return the structured data you've extracted
    };
}
```
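For example, to crawl two levels deep while staying on a single domain and throttling the request rate, you might adjust the options like this (the values below are illustrative, not defaults):
```javascript
export const options = {
    url: 'https://example.com/blog/',
    depth: 2,                        // Follow links two levels deep.
    allowedDomains: ['example.com'], // Stay on example.com only.
    allowedURLs: ['/blog/'],         // Only visit URLs matching this regex.
    rate: 10,                        // Throttle to 10 requests per second.
};
```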
## Implementing the Data Extraction Logic
In the generated scraping script, you'll find the comment `// Your data extraction logic goes here`. This is where you implement your custom extraction logic. The `$` object returned by `parse` exposes a [Cheerio](https://cheerio.js.org/)-style API, so you can navigate and extract data from the parsed HTML using familiar jQuery-like selectors.
Here's an example of how you might replace the comment with data extraction code:
```javascript
// Your data extraction logic goes here
const title = $('h1').text();
const description = $('p').text();
// You can extract more data as needed
```
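If the query API supports the usual Cheerio collection helpers (`.map()` and `.get()` — an assumption worth verifying against your flyscrape version), you can also gather repeated elements into an array:
```javascript
// Collect the text of every article headline on the page.
// Assumes Cheerio-style .map()/.get() are available in flyscrape's query API.
const headlines = $('article h2')
    .map((i, el) => $(el).text().trim())
    .get();
```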
## Returning the Extracted Data
After implementing your data extraction logic, structure the extracted data as a plain object and return it from the scraping function. This replaces the `// Return the structured data you've extracted` placeholder in the template.
Here's an example of how you might return the extracted data:
```javascript
return {
    title: title,
    description: description,
    // Add more fields as needed
};
```
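Putting the pieces together, here's what a minimal completed script might look like (the selectors are illustrative, not part of the template):
```javascript
import { parse } from 'flyscrape';

export const options = {
    url: 'https://example.com/',
};

export default function ({ html, url }) {
    const $ = parse(html);

    const title = $('h1').text();
    const description = $('p').text();

    return {
        url: url, // Record which page the data came from.
        title: title,
        description: description,
    };
}
```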
With this setup, you can effectively scrape and structure data from web pages to meet your specific requirements.
---
That wraps up the "Scraping Setup" section: creating scripts with `flyscrape new`, implementing extraction logic, and returning structured data. Next, the "Development Mode" section shows how to further streamline your web scraping workflow.