Diffstat (limited to 'docs/configuration')
-rw-r--r--  docs/configuration/caching.md         37
-rw-r--r--  docs/configuration/depth.md           23
-rw-r--r--  docs/configuration/domain-filter.md   44
-rw-r--r--  docs/configuration/link-following.md  29
-rw-r--r--  docs/configuration/proxies.md         13
-rw-r--r--  docs/configuration/rate-limiting.md   14
-rw-r--r--  docs/configuration/starting-url.md    14
-rw-r--r--  docs/configuration/url-filter.md      42
8 files changed, 216 insertions, 0 deletions
diff --git a/docs/configuration/caching.md b/docs/configuration/caching.md
new file mode 100644
index 0000000..4a06435
--- /dev/null
+++ b/docs/configuration/caching.md
@@ -0,0 +1,37 @@
+# Caching
+
+The `cache` config option allows you to enable file-based request caching. When enabled, every request is cached together with its raw response. Once the cache is populated and you re-run the scraper, requests are served directly from the cache.
+
+This also allows you to modify your scraping script afterwards and collect new results immediately.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ cache: "file",
+ // ...
+};
+```
+
+### Cache File
+
+When caching is enabled using the `cache: "file"` option, a `.cache` file will be created with the name of your scraping script.
+
+Example:
+
+```bash
+$ flyscrape run hackernews.js # Will populate: hackernews.cache
+```
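+
+Once `hackernews.cache` exists, running the script again serves the requests from the cache instead of fetching them over the network:
+
+```bash
+$ flyscrape run hackernews.js # Served from: hackernews.cache
+```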
+
+### Shared Cache
+
+If you want to share a cache between multiple scraping scripts, you can specify where the cache file should be stored.
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ cache: "file:/some/path/shared.cache",
+ // ...
+};
+```
diff --git a/docs/configuration/depth.md b/docs/configuration/depth.md
new file mode 100644
index 0000000..cabb0fa
--- /dev/null
+++ b/docs/configuration/depth.md
@@ -0,0 +1,23 @@
+# Depth
+
+The `depth` config option allows you to specify how deep the scraping process should follow links from the initial URL.
+
+When no value is provided or `depth` is set to `0`, link following is disabled and only the initial URL is scraped.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ depth: 2,
+ // ...
+};
+```
+
+With the config provided in the example, the scraper would follow links like this:
+
+```
+http://example.com/ (depth = 0, initial URL)
+↳ http://example.com/deeply (depth = 1)
+ ↳ http://example.com/deeply/nested (depth = 2)
+```
diff --git a/docs/configuration/domain-filter.md b/docs/configuration/domain-filter.md
new file mode 100644
index 0000000..e8adc30
--- /dev/null
+++ b/docs/configuration/domain-filter.md
@@ -0,0 +1,44 @@
+# Domain Filter
+
+The `allowedDomains` and `blockedDomains` config options allow you to specify a list of domains that are accessible or blocked during scraping.
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ allowedDomains: ["subdomain.example.com"],
+ // ...
+};
+```
+
+### `allowedDomains`
+
+This config option controls which additional domains are allowed to be visited during scraping. The domain of the initial URL is always allowed.
+
+You can also allow access to all domains by setting `allowedDomains` to `["*"]`. To then further restrict access, you can specify `blockedDomains`.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ allowedDomains: ["*"],
+ // ...
+};
+```
+
+### `blockedDomains`
+
+This config option controls which additional domains are blocked from being accessed. By default, all domains other than the domain of the initial URL or those specified in `allowedDomains` are blocked.
+
+`blockedDomains` is best used in conjunction with `allowedDomains: ["*"]`, allowing the scraper to access all domains except those specified in `blockedDomains`.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ allowedDomains: ["*"],
+ blockedDomains: ["google.com", "bing.com"],
+ // ...
+};
+```
diff --git a/docs/configuration/link-following.md b/docs/configuration/link-following.md
new file mode 100644
index 0000000..6522ce8
--- /dev/null
+++ b/docs/configuration/link-following.md
@@ -0,0 +1,29 @@
+# Link Following
+
+The `follow` config option allows you to specify a list of CSS selectors that determine which links the scraper should follow.
+
+When no value is provided, the scraper will follow all links found with the `a[href]` selector.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ follow: [".pagination > a[href]", ".nav a[href]"],
+ // ...
+};
+```
+
+### Following non-`href` attributes
+
+For special cases where the link is not stored in the `href` attribute, you can specify a selector that ends with a different attribute, such as `data-url`.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ follow: [".articles > div[data-url]"],
+ // ...
+};
+```
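+
+Since all config options live in the same object, `follow` can be combined with other options such as `depth` (see the Depth page) to limit how far the selected links are followed. A sketch:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ depth: 2,
+ follow: [".pagination > a[href]"],
+};
+```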
diff --git a/docs/configuration/proxies.md b/docs/configuration/proxies.md
new file mode 100644
index 0000000..19434dc
--- /dev/null
+++ b/docs/configuration/proxies.md
@@ -0,0 +1,13 @@
+# Proxies
+
+The `proxies` config option allows you to specify a list of HTTP(S) proxies that should be used during scraping. When multiple proxies are provided, the scraper will pick a proxy at random for each request.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ proxies: ["https://my-proxy.com:3128", "https://my-other-proxy.com:8080"],
+ // ...
+};
+```
diff --git a/docs/configuration/rate-limiting.md b/docs/configuration/rate-limiting.md
new file mode 100644
index 0000000..c3014d1
--- /dev/null
+++ b/docs/configuration/rate-limiting.md
@@ -0,0 +1,14 @@
+# Rate Limiting
+
+The `rate` config option allows you to specify the rate at which the scraper sends out requests. The rate is measured in _Requests per Second_ (RPS) and can be set as a whole or decimal number to account for shorter and longer request intervals.
+
+When no `rate` is specified, rate limiting is disabled and the scraper will send out requests as fast as it can.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ rate: 50,
+};
+```
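+
+Because the rate is expressed in requests per second, a decimal value below `1` spreads requests further apart; for example, a rate of `0.5` corresponds to roughly one request every two seconds:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ rate: 0.5, // roughly one request every 2 seconds
+};
+```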
diff --git a/docs/configuration/starting-url.md b/docs/configuration/starting-url.md
new file mode 100644
index 0000000..d5c0965
--- /dev/null
+++ b/docs/configuration/starting-url.md
@@ -0,0 +1,14 @@
+# Starting URL
+
+The `url` config option allows you to specify the initial URL at which the scraper should start its scraping process.
+
+When no value is provided, the scraper will not start and will exit immediately.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ // ...
+};
+```
diff --git a/docs/configuration/url-filter.md b/docs/configuration/url-filter.md
new file mode 100644
index 0000000..e2feda8
--- /dev/null
+++ b/docs/configuration/url-filter.md
@@ -0,0 +1,42 @@
+# URL Filter
+
+The `allowedURLs` and `blockedURLs` config options allow you to specify a list of URL patterns (in the form of regular expressions) that are allowed or blocked during scraping.
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ allowedURLs: ["/articles/.*", "/authors/.*"],
+ blockedURLs: ["/authors/admin"],
+ // ...
+};
+```
+
+### `allowedURLs`
+
+This config option controls which URLs are allowed to be visited during scraping. When no value is provided, all URLs are allowed to be visited unless otherwise blocked.
+
+When a list of URL patterns is provided, only URLs matching one or more of these patterns are allowed to be visited.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ allowedURLs: ["/products/"],
+};
+```
+
+### `blockedURLs`
+
+This config option controls which URLs are blocked from being visited during scraping.
+
+When a list of URL patterns is provided, URLs matching one or more of these patterns are blocked from being visited.
+
+Example:
+
+```javascript
+export const config = {
+ url: "http://example.com/",
+ blockedURLs: ["/restricted"],
+};
+```