author    Philipp Tanlak <philipp.tanlak@gmail.com>  2023-10-19 17:54:18 +0200
committer Philipp Tanlak <philipp.tanlak@gmail.com>  2023-10-19 18:21:58 +0200
commit    0daefa86b400efe08245f4f2a386f7341b76b24e (patch)
tree      60d743bb9734a9d7f46701a5ac9026650a330d49 /docs
parent    03b3be0c3bbc70584e8988e1810dc28eacf4521f (diff)
docs: Add documentation (v0.3.0)
Diffstat (limited to 'docs')
-rw-r--r--  docs/configuration/caching.md              37
-rw-r--r--  docs/configuration/depth.md                23
-rw-r--r--  docs/configuration/domain-filter.md        44
-rw-r--r--  docs/configuration/link-following.md       29
-rw-r--r--  docs/configuration/proxies.md              13
-rw-r--r--  docs/configuration/rate-limiting.md        14
-rw-r--r--  docs/configuration/starting-url.md         14
-rw-r--r--  docs/configuration/url-filter.md           42
-rw-r--r--  docs/getting-started/installation.md       45
-rw-r--r--  docs/getting-started/running-flyscrape.md  43
-rw-r--r--  docs/readme.md                             22
-rw-r--r--  docs/using-flyscrape/development-mode.md   53
-rw-r--r--  docs/using-flyscrape/scraping-setup.md     80
-rw-r--r--  docs/using-flyscrape/start-scraping.md     51
14 files changed, 510 insertions, 0 deletions
diff --git a/docs/configuration/caching.md b/docs/configuration/caching.md
new file mode 100644
index 0000000..4a06435
--- /dev/null
+++ b/docs/configuration/caching.md
@@ -0,0 +1,37 @@
+# Caching
+
+The `cache` config option allows you to enable file-based request caching. When enabled, every request is cached together with its raw response. Once the cache is populated and you re-run the scraper, requests are served directly from the cache.
+
+This also allows you to modify your scraping script afterwards and collect new results immediately.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ cache: "file",
+ // ...
+};
+```
+
+### Cache File
+
+When caching is enabled via the `cache: "file"` option, a `.cache` file named after your scraping script will be created.
+
+Example:
+
+```bash
+$ flyscrape run hackernews.js # Will populate: hackernews.cache
+```
+
+### Shared cache
+
+In case you want to share a cache between different scraping scripts, you can specify where to store the cache file.
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ cache: "file:/some/path/shared.cache",
+ // ...
+};
+```
diff --git a/docs/configuration/depth.md b/docs/configuration/depth.md
new file mode 100644
index 0000000..cabb0fa
--- /dev/null
+++ b/docs/configuration/depth.md
@@ -0,0 +1,23 @@
+# Depth
+
+The `depth` config option allows you to specify how deep the scraping process should follow links from the initial URL.
+
+When no value is provided or `depth` is set to `0`, link following is disabled and only the initial URL is scraped.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ depth: 2,
+ // ...
+};
+```
+
+With the config from the example above, the scraper would follow links like this:
+
+```
+http://example.com/ (depth = 0, initial URL)
+↳ http://example.com/deeply (depth = 1)
+ ↳ http://example.com/deeply/nested (depth = 2)
+```
diff --git a/docs/configuration/domain-filter.md b/docs/configuration/domain-filter.md
new file mode 100644
index 0000000..e8adc30
--- /dev/null
+++ b/docs/configuration/domain-filter.md
@@ -0,0 +1,44 @@
+# Domain Filter
+
+The `allowedDomains` and `blockedDomains` config options allow you to specify lists of domains that are allowed or blocked during scraping.
+
+```javascript
+export const options = {
+ url: "http://example.com/",
+ allowedDomains: ["subdomain.example.com"],
+ // ...
+};
+```
+
+### `allowedDomains`
+
+This config option controls which additional domains are allowed to be visited during scraping. The domain of the initial URL is always allowed.
+
+You can also allow all domains to be accessible by setting `allowedDomains` to `["*"]`. To then further restrict access, you can specify `blockedDomains`.
+
+Example:
+
+```javascript
+export const options = {
+ url: "http://example.com/",
+ allowedDomains: ["*"],
+ // ...
+};
+```
+
+### `blockedDomains`
+
+This config option controls which additional domains are blocked from being accessed. By default, all domains other than the domain of the initial URL or those specified in `allowedDomains` are blocked.
+
+`blockedDomains` is best used in conjunction with `allowedDomains: ["*"]`, allowing the scraper to access all domains except those listed in `blockedDomains`.
+
+Example:
+
+```javascript
+export const options = {
+ url: "http://example.com/",
+ allowedDomains: ["*"],
+ blockedDomains: ["google.com", "bing.com"],
+ // ...
+};
+```
diff --git a/docs/configuration/link-following.md b/docs/configuration/link-following.md
new file mode 100644
index 0000000..6522ce8
--- /dev/null
+++ b/docs/configuration/link-following.md
@@ -0,0 +1,29 @@
+# Link Following
+
+The `follow` config option allows you to specify a list of CSS selectors that determine which links the scraper should follow.
+
+When no value is provided, the scraper follows all links found with the `a[href]` selector.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ follow: [".pagination > a[href]", ".nav a[href]"],
+ // ...
+};
+```
+
+### Following non `href` attributes
+
+For special cases where the link is not found in the `href` attribute, you can specify a selector that ends in a different attribute.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ follow: [".articles > div[data-url]"],
+ // ...
+};
+```
diff --git a/docs/configuration/proxies.md b/docs/configuration/proxies.md
new file mode 100644
index 0000000..19434dc
--- /dev/null
+++ b/docs/configuration/proxies.md
@@ -0,0 +1,13 @@
+# Proxies
+
+The `proxies` config option allows you to specify a list of HTTP(S) proxies that should be used during scraping. When multiple proxies are provided, the scraper will pick a proxy at random for each request.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ proxies: ["https://my-proxy.com:3128", "https://my-other-proxy.com:8080"],
+ // ...
+};
+```
diff --git a/docs/configuration/rate-limiting.md b/docs/configuration/rate-limiting.md
new file mode 100644
index 0000000..c3014d1
--- /dev/null
+++ b/docs/configuration/rate-limiting.md
@@ -0,0 +1,14 @@
+# Rate Limiting
+
+The `rate` config option allows you to specify the rate at which the scraper sends out requests. The rate is measured in _Requests per Second_ (RPS) and can be set as a whole or decimal number to account for shorter and longer request intervals.
+
+When no `rate` is specified, rate limiting is disabled and the scraper will send out requests as fast as it can.
+
+Example:
+
+```javascript
+export const options = {
+ url: "http://example.com/",
+ rate: 50,
+};
+```
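+
+Because decimal values are accepted, slower request intervals can also be expressed. As an illustrative sketch (the URL and value are placeholders), the following config would send roughly one request every two seconds:
+
+```javascript
+export const options = {
+  url: "http://example.com/",
+  rate: 0.5, // 0.5 requests per second = one request every 2 seconds
+};
+```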
diff --git a/docs/configuration/starting-url.md b/docs/configuration/starting-url.md
new file mode 100644
index 0000000..d5c0965
--- /dev/null
+++ b/docs/configuration/starting-url.md
@@ -0,0 +1,14 @@
+# Starting URL
+
+The `url` config option allows you to specify the initial URL at which the scraper should start its scraping process.
+
+When no value is provided, the scraper will not start and will exit immediately.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ // ...
+};
+```
diff --git a/docs/configuration/url-filter.md b/docs/configuration/url-filter.md
new file mode 100644
index 0000000..e2feda8
--- /dev/null
+++ b/docs/configuration/url-filter.md
@@ -0,0 +1,42 @@
+# URL Filter
+
+The `allowedURLs` and `blockedURLs` config options allow you to specify lists of URL patterns (in the form of regular expressions) that are allowed or blocked during scraping.
+
+```javascript
+export const options = {
+ url: "http://example.com/",
+ allowedURLs: ["/articles/.*", "/authors/.*"],
+ blockedURLs: ["/authors/admin"],
+ // ...
+};
+```
+
+### `allowedURLs`
+
+This config option controls which URLs are allowed to be visited during scraping. When no value is provided, all URLs are allowed to be visited unless otherwise blocked.
+
+When a list of URL patterns is provided, only URLs matching one or more of these patterns are allowed to be visited.
+
+Example:
+
+```javascript
+export const options = {
+ url: "http://example.com/",
+ allowedURLs: ["/products/"],
+};
+```
+
+### `blockedURLs`
+
+This config option controls which URLs are blocked from being visited during scraping.
+
+When a list of URL patterns is provided, URLs matching one or more of these patterns are blocked from being visited.
+
+Example:
+
+```javascript
+export const options = {
+ url: "http://example.com/",
+ blockedURLs: ["/restricted"],
+};
+```
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
new file mode 100644
index 0000000..2d14b7d
--- /dev/null
+++ b/docs/getting-started/installation.md
@@ -0,0 +1,45 @@
+# Installation
+
+Welcome to flyscrape! This section will guide you through the installation process, ensuring you have everything you need to start using this powerful web scraping tool.
+
+## Prerequisites
+
+Before installing flyscrape, make sure you have the following prerequisites:
+
+1. **Go Programming Language**: flyscrape is built with Go, so you need to have Go installed on your system. If you don't have it installed, you can download it from the [official Go website](https://golang.org/).
+
+## Installing flyscrape
+
+Once you have Go installed, you can proceed with installing flyscrape. Follow these simple steps:
+
+1. **Open a Terminal**: Launch your terminal or command prompt.
+
+2. **Install flyscrape**: Run the following command to install flyscrape using Go's package manager:
+
+ ```bash
+ go install github.com/philippta/flyscrape/cmd/flyscrape@latest
+ ```
+
+ This command will download the latest version of flyscrape and install it on your system.
+
+3. **Verify Installation**: To confirm that flyscrape has been successfully installed, you can run the following command:
+
+ ```bash
+ flyscrape
+ ```
+
+ If the installation was successful, this command will display the help text of flyscrape.
+
+## Updating flyscrape
+
+To keep your flyscrape installation up to date, you can use the following command:
+
+```bash
+go install github.com/philippta/flyscrape/cmd/flyscrape@latest
+```
+
+This command will fetch and install the latest version of flyscrape from the GitHub repository.
+
+Congratulations! You've successfully installed flyscrape on your system. Now you're ready to dive into web scraping and start extracting data from websites with ease.
+
+Next, you can learn how to run flyscrape and start creating your scraping scripts by following the "Running flyscrape" section in the "Getting Started" category.
diff --git a/docs/getting-started/running-flyscrape.md b/docs/getting-started/running-flyscrape.md
new file mode 100644
index 0000000..c042e4e
--- /dev/null
+++ b/docs/getting-started/running-flyscrape.md
@@ -0,0 +1,43 @@
+# Running flyscrape
+
+Once you've successfully installed flyscrape, you're ready to start using it to scrape data from websites. This section will guide you through the basic commands and steps to run flyscrape effectively.
+
+## Creating Your First Scraping Script
+
+Before you can run flyscrape, you'll need a scraping script to tell it what data to extract from a website. To create a new scraping script, you can use the `new` command followed by the script's filename. Here's an example:
+
+```bash
+flyscrape new my_first_script.js
+```
+
+This command will generate a sample scraping script named `my_first_script.js` in the current directory. You can edit this script to customize it according to your scraping needs.
+
+## Running a Scraping Script
+
+To execute a scraping script, you can use the `run` command followed by the script's filename. For example:
+
+```bash
+flyscrape run my_first_script.js
+```
+
+When you run this command, flyscrape will start retrieving and processing data from the website specified in your script.
+
+## Watching for Development
+
+During the development phase, you may want to make changes to your scraping script and see the results quickly. flyscrape provides a convenient way to do this using the `dev` command. It allows you to watch your scraping script for changes and automatically re-run it when you save your changes.
+
+Here's how to use the `dev` command:
+
+```bash
+flyscrape dev my_first_script.js
+```
+
+With the development mode active, you can iterate on your scraping script, fine-tune your data extraction queries, and see the results in real-time.
+
+## Script Output
+
+After running a scraping script, flyscrape will generate structured data based on your script's logic. You can customize your script to specify what data you want to extract and how it should be formatted.
+
+Congratulations! You've learned how to run flyscrape and create your first scraping script. You can now start gathering data from websites and transforming it into structured information.
+
+Next, explore more advanced topics in the "Using flyscrape" section to refine your scraping skills and learn how to set up more complex scraping scenarios.
diff --git a/docs/readme.md b/docs/readme.md
new file mode 100644
index 0000000..cb9eb5b
--- /dev/null
+++ b/docs/readme.md
@@ -0,0 +1,22 @@
+# Documentation
+
+#### Getting Started
+
+- [Installation](getting-started/installation.md)
+- [Running flyscrape](getting-started/running-flyscrape.md)
+
+#### Using flyscrape
+
+- [Scraping Setup](using-flyscrape/scraping-setup.md)
+- [Development Mode](using-flyscrape/development-mode.md)
+- [Start Scraping](using-flyscrape/start-scraping.md)
+
+#### Configuration
+- [Starting URL](configuration/starting-url.md)
+- [Depth](configuration/depth.md)
+- [Link Following](configuration/link-following.md)
+- [Domain Filter](configuration/domain-filter.md)
+- [URL Filter](configuration/url-filter.md)
+- [Rate Limiting](configuration/rate-limiting.md)
+- [Caching](configuration/caching.md)
+- [Proxies](configuration/proxies.md)
diff --git a/docs/using-flyscrape/development-mode.md b/docs/using-flyscrape/development-mode.md
new file mode 100644
index 0000000..b2da076
--- /dev/null
+++ b/docs/using-flyscrape/development-mode.md
@@ -0,0 +1,53 @@
+# Development Mode
+
+Development Mode in flyscrape allows you to streamline the process of creating and fine-tuning your scraping scripts. With the `flyscrape dev` command, you can watch your scraping script for changes and see the results in real-time, making it easier to iterate and perfect your data extraction process during development.
+
+## The `flyscrape dev` Command
+
+The `flyscrape dev` command is a powerful tool that enhances your development workflow by automating the execution of your scraping script when changes are detected. This feature is incredibly useful for several reasons:
+
+1. **Immediate Feedback**: With Development Mode, you can make changes to your scraping script and instantly see the impact of those changes. There's no need to manually run the script after each modification.
+
+2. **Efficiency**: It eliminates the need to repeatedly run the `flyscrape run` command while you fine-tune your scraping logic. This boosts your efficiency and accelerates development.
+
+3. **Real-time Debugging**: If you encounter issues or unexpected behavior in your scraping script, you can quickly identify and fix problems with real-time feedback.
+
+## Using the `flyscrape dev` Command
+
+To activate Development Mode, use the `flyscrape dev` command followed by the name of your scraping script. For example:
+
+```bash
+flyscrape dev my_scraping_script.js
+```
+
+This command will start watching your scraping script file (`my_scraping_script.js` in this case) for changes. Whenever you save changes to the script, flyscrape will automatically re-run it, allowing you to view the updated results in your terminal.
+
+## Tips for Development Mode
+
+Here are some tips to make the most of Development Mode:
+
+1. **Keep Your Editor Open**: Keep your code editor open and edit your scraping script as needed. When you save the changes, flyscrape will automatically pick them up.
+
+2. **Console Output**: Use `console.log()` statements within your scraping script to output debugging information to the console. This can be helpful for diagnosing issues; see the short example after this list.
+
+3. **Iterate and Experiment**: Take advantage of Development Mode to experiment with different data extraction queries and strategies. The rapid feedback loop makes it easy to iterate and find the right approach.
+
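+For example, borrowing the script structure shown in the "Scraping Setup" section, a minimal (hypothetical) script that logs the visited URL and the extracted title on every run could look like this:
+
+```javascript
+import { parse } from 'flyscrape';
+
+export default function ({ html, url }) {
+  const $ = parse(html);
+
+  // Print debugging information each time the script is re-run.
+  console.log(url, $('h1').text());
+
+  return {
+    title: $('h1').text(),
+  };
+}
+```
+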
+## Example Workflow
+
+Here's an example of how a typical workflow might look in Development Mode:
+
+1. Create a new scraping script using `flyscrape new`.
+
+2. Use `flyscrape dev` to start watching the script.
+
+3. Edit the script, add data extraction logic, and save the changes.
+
+4. Observe the results in real-time in the terminal.
+
+5. If needed, make further changes and continue iterating until you achieve the desired data extraction results.
+
+Development Mode is an invaluable tool for scraping script development, enabling you to build and refine your scripts efficiently and effectively.
+
+---
+
+This concludes the "Development Mode" section, which demonstrates how to use the `flyscrape dev` command to streamline your scraping script development process. Next, you can explore how to initiate scraping with the "Start scraping" section to gather data from websites.
diff --git a/docs/using-flyscrape/scraping-setup.md b/docs/using-flyscrape/scraping-setup.md
new file mode 100644
index 0000000..2b3183b
--- /dev/null
+++ b/docs/using-flyscrape/scraping-setup.md
@@ -0,0 +1,80 @@
+# Scraping Setup
+
+In this section, we'll delve into the details of setting up your scraping script using the `flyscrape new script.js` command. This command is designed to streamline the process of creating a scraping script, providing you with a structured starting point for your web scraping endeavors.
+
+## The `flyscrape new` Command
+
+The `flyscrape new` command allows you to generate a new scraping script with a predefined structure and sample code. This is incredibly helpful because it provides a quick and easy way to begin your web scraping project.
+
+## Creating a New Scraping Script
+
+To create a new scraping script, use the `flyscrape new` command followed by the desired script filename. For example:
+
+```bash
+flyscrape new my_scraping_script.js
+```
+
+This command will generate a file named `my_scraping_script.js` in the current directory. You can then open and edit this file with your preferred code editor.
+
+## Script Overview
+
+Let's take a closer look at the structure and components of the generated scraping script:
+
+```javascript
+import { parse } from 'flyscrape';
+
+export const options = {
+ url: 'https://example.com/', // Specify the URL to start scraping from.
+ depth: 1, // Specify how deep links should be followed. (default = 0, no follow)
+ allowedDomains: [], // Specify the allowed domains. ['*'] for all. (default = domain from url)
+ blockedDomains: [], // Specify the blocked domains. (default = none)
+ allowedURLs: [], // Specify the allowed URLs as regex. (default = all allowed)
+ blockedURLs: [], // Specify the blocked URLs as regex. (default = non-blocked)
+ proxy: '', // Specify the HTTP(S) proxy to use. (default = no proxy)
+ rate: 100 // Specify the rate in requests per second. (default = 100)
+};
+
+export default function ({ html, url }) {
+ const $ = parse(html);
+
+ // Your data extraction logic goes here
+
+ return {
+ // Return the structured data you've extracted
+ };
+}
+```
+
+## Implementing the Data Extraction Logic
+
+In the generated scraping script, you'll find the comment "// Your data extraction logic goes here." This is the section where you should implement your custom data extraction logic. You can use tools like [Cheerio](https://cheerio.js.org/) or other libraries to navigate and extract data from the parsed HTML.
+
+Here's an example of how you might replace the comment with data extraction code:
+
+```javascript
+// Your data extraction logic goes here
+const title = $('h1').text();
+const description = $('p').text();
+
+// You can extract more data as needed
+```
+
+## Returning the Extracted Data
+
+After implementing your data extraction logic, you should structure the data you've extracted and return it from the scraping function. The comment "// Return the structured data you've extracted" is where you should place this code.
+
+Here's an example of how you might return the extracted data:
+
+```javascript
+return {
+ title: title,
+ description: description
+ // Add more fields as needed
+};
+```
+
+With this setup, you can effectively scrape and structure data from web pages to meet your specific requirements.
+
+---
+
+This concludes the "Scraping Setup" section, which provides insights into creating scraping scripts using the `flyscrape new` command, implementing data extraction logic, and returning extracted data. Next, you can explore more advanced topics in the "Development Mode" section to streamline your web scraping workflow.
diff --git a/docs/using-flyscrape/start-scraping.md b/docs/using-flyscrape/start-scraping.md
new file mode 100644
index 0000000..97b92cc
--- /dev/null
+++ b/docs/using-flyscrape/start-scraping.md
@@ -0,0 +1,51 @@
+# Start Scraping
+
+In this section, we'll dive into the process of initiating web scraping using flyscrape. Now that you have created and fine-tuned your scraping script, it's time to run it and start gathering data from websites.
+
+## The `flyscrape run` Command
+
+The `flyscrape run` command is used to execute your scraping script and retrieve data from the specified website. This command is your gateway to turning your scraping logic into actionable results.
+
+## Running Your Scraping Script
+
+To run your scraping script, simply use the `flyscrape run` command followed by the name of your script file. For example:
+
+```bash
+flyscrape run my_scraping_script.js
+```
+
+This command will initiate the scraping process as defined in your script. Flyscrape will execute your script and stream the JSON output of the extracted data directly to your terminal.
+
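+The exact output depends on your script, but for a script that returns a `title` and a `description` (as in the "Scraping Setup" section), the data streamed for each scraped page would look roughly like this (illustrative values):
+
+```json
+{
+  "title": "Example Domain",
+  "description": "This domain is for use in illustrative examples in documents."
+}
+```
+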
+## Saving Scraped Data to a File
+
+You can easily save the JSON output of the scraped data to a file using standard shell redirection. For example, to save the scraped data to a file named `result.json`, you can use the following command:
+
+```bash
+flyscrape run my_scraping_script.js > result.json
+```
+
+This command will execute your scraping script and save the extracted data in the `result.json` file in the current directory.
+
+## Example Workflow
+
+Here's a simple workflow for starting web scraping with flyscrape, including saving the scraped data to a file:
+
+1. Create a scraping script using `flyscrape new` and fine-tune it using `flyscrape dev`.
+
+2. Save your script.
+
+3. Run the script using `flyscrape run`.
+
+4. Observe the terminal as flyscrape streams the JSON output of the extracted data in real-time.
+
+5. If you want to save the data to a file, use redirection as shown above (`flyscrape run my_scraping_script.js > result.json`).
+
+6. Customize the script to store, process, or further analyze the extracted data as needed.
+
+7. Continue scraping or iterate on your script for more complex scenarios.
+
+With this workflow, you can efficiently gather and process data from websites using flyscrape, with the option to save the extracted data to a file for later use or analysis.
+
+---
+
+This concludes the "Start Scraping" section, which covers the process of initiating web scraping with the `flyscrape run` command, including an example of how to save the scraped data to a file. Next, you can explore the various config options and advanced features in the "Configuration" section to further tailor your scraping experience.