From 190056ee8d6a4eca61d92a79cc25aad645e69d4a Mon Sep 17 00:00:00 2001
From: Philipp Tanlak
Date: Mon, 13 Nov 2023 22:36:15 +0100
Subject: Move docs to flyscrape.com (#11)

---
 docs/using-flyscrape/scraping-setup.md | 80 ----------------------------------
 1 file changed, 80 deletions(-)
 delete mode 100644 docs/using-flyscrape/scraping-setup.md

(limited to 'docs/using-flyscrape/scraping-setup.md')

diff --git a/docs/using-flyscrape/scraping-setup.md b/docs/using-flyscrape/scraping-setup.md
deleted file mode 100644
index 2b3183b..0000000
--- a/docs/using-flyscrape/scraping-setup.md
+++ /dev/null
@@ -1,80 +0,0 @@
-# Scraping Setup
-
-In this section, we'll delve into the details of setting up your scraping script using the `flyscrape new script.js` command. This command is designed to streamline the process of creating a scraping script, providing you with a structured starting point for your web scraping endeavors.
-
-## The `flyscrape new` Command
-
-The `flyscrape new` command allows you to generate a new scraping script with a predefined structure and sample code. This is incredibly helpful because it provides a quick and easy way to begin your web scraping project.
-
-## Creating a New Scraping Script
-
-To create a new scraping script, use the `flyscrape new` command followed by the desired script filename. For example:
-
-```bash
-flyscrape new my_scraping_script.js
-```
-
-This command will generate a file named `my_scraping_script.js` in the current directory. You can then open and edit this file with your preferred code editor.
-
-## Script Overview
-
-Let's take a closer look at the structure and components of the generated scraping script:
-
-```javascript
-import { parse } from 'flyscrape';
-
-export const options = {
-    url: 'https://example.com/', // Specify the URL to start scraping from.
-    depth: 1, // Specify how deep links should be followed. (default = 0, no follow)
-    allowedDomains: [], // Specify the allowed domains. ['*'] for all. (default = domain from url)
-    blockedDomains: [], // Specify the blocked domains. (default = none)
-    allowedURLs: [], // Specify the allowed URLs as regex. (default = all allowed)
-    blockedURLs: [], // Specify the blocked URLs as regex. (default = non-blocked)
-    proxy: '', // Specify the HTTP(S) proxy to use. (default = no proxy)
-    rate: 100 // Specify the rate in requests per second. (default = 100)
-};
-
-export default function ({ html, url }) {
-    const $ = parse(html);
-
-    // Your data extraction logic goes here
-
-    return {
-        // Return the structured data you've extracted
-    };
-}
-```
-
-## Implementing the Data Extraction Logic
-
-In the generated scraping script, you'll find the comment "// Your data extraction logic goes here." This is the section where you should implement your custom data extraction logic. You can use tools like [Cheerio](https://cheerio.js.org/) or other libraries to navigate and extract data from the parsed HTML.
-
-Here's an example of how you might replace the comment with data extraction code:
-
-```javascript
-// Your data extraction logic goes here
-const title = $('h1').text();
-const description = $('p').text();
-
-// You can extract more data as needed
-```
-
-## Returning the Extracted Data
-
-After implementing your data extraction logic, you should structure the data you've extracted and return it from the scraping function. The comment "// Return the structured data you've extracted" is where you should place this code.
-
-Here's an example of how you might return the extracted data:
-
-```javascript
-return {
-    title: title,
-    description: description
-    // Add more fields as needed
-};
-```
-
-With this setup, you can effectively scrape and structure data from web pages to meet your specific requirements.
-
----
-
-This concludes the "Scraping Setup" section, which provides insights into creating scraping scripts using the `flyscrape new` command, implementing data extraction logic, and returning extracted data. Next, you can explore more advanced topics in the "Development Mode" section to streamline your web scraping workflow.
-- 
cgit v1.2.3