Diffstat (limited to 'docs/configuration')
-rw-r--r--  docs/configuration/caching.md         37
-rw-r--r--  docs/configuration/depth.md           23
-rw-r--r--  docs/configuration/domain-filter.md   44
-rw-r--r--  docs/configuration/link-following.md  29
-rw-r--r--  docs/configuration/proxies.md         13
-rw-r--r--  docs/configuration/rate-limiting.md   14
-rw-r--r--  docs/configuration/starting-url.md    14
-rw-r--r--  docs/configuration/url-filter.md      42
8 files changed, 216 insertions, 0 deletions
diff --git a/docs/configuration/caching.md b/docs/configuration/caching.md
new file mode 100644
index 0000000..4a06435
--- /dev/null
+++ b/docs/configuration/caching.md
@@ -0,0 +1,37 @@
+# Caching
+
+The `cache` config option allows you to enable file-based request caching. When enabled, every request is cached together with its raw response. Once the cache is populated and you re-run the scraper, requests are served directly from the cache.
+
+This also allows you to modify your scraping script afterwards and collect new results immediately.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    cache: "file",
+    // ...
+};
+```
+
+### Cache File
+
+When caching is enabled using the `cache: "file"` option, a `.cache` file will be created with the name of your scraping script.
+
+Example:
+
+```bash
+$ flyscrape run hackernews.js # Will populate: hackernews.cache
+```
+
+### Shared cache
+
+In case you want to share a cache between different scraping scripts, you can specify where the cache file should be stored.
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    cache: "file:/some/path/shared.cache",
+    // ...
+};
+```
diff --git a/docs/configuration/depth.md b/docs/configuration/depth.md
new file mode 100644
index 0000000..cabb0fa
--- /dev/null
+++ b/docs/configuration/depth.md
@@ -0,0 +1,23 @@
+# Depth
+
+The `depth` config option allows you to specify how deep the scraping process should follow links from the initial URL.
+
+When no value is provided or `depth` is set to `0`, link following is disabled and only the initial URL is scraped.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    depth: 2,
+    // ...
+};
+```
+
+With the config provided in the example, the scraper would follow links like this:
+
+```
+http://example.com/ (depth = 0, initial URL)
+↳ http://example.com/deeply (depth = 1)
+  ↳ http://example.com/deeply/nested (depth = 2)
+```
diff --git a/docs/configuration/domain-filter.md b/docs/configuration/domain-filter.md
new file mode 100644
index 0000000..e8adc30
--- /dev/null
+++ b/docs/configuration/domain-filter.md
@@ -0,0 +1,44 @@
+# Domain Filter
+
+The `allowedDomains` and `blockedDomains` config options allow you to specify a list of domains that are accessible or blocked during scraping.
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    allowedDomains: ["subdomain.example.com"],
+    // ...
+};
+```
+
+### `allowedDomains`
+
+This config option controls which additional domains are allowed to be visited during scraping. The domain of the initial URL is always allowed.
+
+You can also make all domains accessible by setting `allowedDomains` to `["*"]`. To then further restrict access, you can specify `blockedDomains`.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    allowedDomains: ["*"],
+    // ...
+};
+```
+
+### `blockedDomains`
+
+This config option controls which additional domains are blocked from being accessed. By default, all domains other than the domain of the initial URL or those specified in `allowedDomains` are blocked.
+
+`blockedDomains` is best used in conjunction with `allowedDomains: ["*"]`, allowing the scraping process to access all domains except those specified in `blockedDomains`.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    allowedDomains: ["*"],
+    blockedDomains: ["google.com", "bing.com"],
+    // ...
+};
+```
diff --git a/docs/configuration/link-following.md b/docs/configuration/link-following.md
new file mode 100644
index 0000000..6522ce8
--- /dev/null
+++ b/docs/configuration/link-following.md
@@ -0,0 +1,29 @@
+# Link Following
+
+The `follow` config option allows you to specify a list of CSS selectors that determine which links the scraper should follow.
+
+When no value is provided, the scraper follows all links found with the `a[href]` selector.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    follow: [".pagination > a[href]", ".nav a[href]"],
+    // ...
+};
+```
+
+### Following non-`href` attributes
+
+For special cases where the link is not found in the `href` attribute, you can specify a selector that ends in a different attribute.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    follow: [".articles > div[data-url]"],
+    // ...
+};
+```
diff --git a/docs/configuration/proxies.md b/docs/configuration/proxies.md
new file mode 100644
index 0000000..19434dc
--- /dev/null
+++ b/docs/configuration/proxies.md
@@ -0,0 +1,13 @@
+# Proxies
+
+The `proxies` config option allows you to specify a list of HTTP(S) proxies that should be used during scraping. When multiple proxies are provided, the scraper picks a proxy at random for each request.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    proxies: ["https://my-proxy.com:3128", "https://my-other-proxy.com:8080"],
+    // ...
+};
+```
diff --git a/docs/configuration/rate-limiting.md b/docs/configuration/rate-limiting.md
new file mode 100644
index 0000000..c3014d1
--- /dev/null
+++ b/docs/configuration/rate-limiting.md
@@ -0,0 +1,14 @@
+# Rate Limiting
+
+The `rate` config option allows you to specify the rate at which the scraper sends out requests. The rate is measured in _Requests per Second_ (RPS) and can be set as a whole or decimal number to allow for shorter or longer request intervals; a rate of `0.5`, for example, results in one request every two seconds.
+
+When no `rate` is specified, rate limiting is disabled and the scraper sends out requests as fast as it can.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    rate: 50,
+};
+```
diff --git a/docs/configuration/starting-url.md b/docs/configuration/starting-url.md
new file mode 100644
index 0000000..d5c0965
--- /dev/null
+++ b/docs/configuration/starting-url.md
@@ -0,0 +1,14 @@
+# Starting URL
+
+The `url` config option allows you to specify the initial URL at which the scraper should start its scraping process.
+
+When no value is provided, the scraper will not start and exits immediately.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    // ...
+};
+```
diff --git a/docs/configuration/url-filter.md b/docs/configuration/url-filter.md
new file mode 100644
index 0000000..e2feda8
--- /dev/null
+++ b/docs/configuration/url-filter.md
@@ -0,0 +1,42 @@
+# URL Filter
+
+The `allowedURLs` and `blockedURLs` config options allow you to specify a list of URL patterns (in the form of regular expressions) that are accessible or blocked during scraping.
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    allowedURLs: ["/articles/.*", "/authors/.*"],
+    blockedURLs: ["/authors/admin"],
+    // ...
+};
+```
+
+### `allowedURLs`
+
+This config option controls which URLs are allowed to be visited during scraping. When no value is provided, all URLs are allowed to be visited unless otherwise blocked.
+
+When a list of URL patterns is provided, only URLs matching one or more of these patterns are allowed to be visited.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    allowedURLs: ["/products/"],
+};
+```
+
+### `blockedURLs`
+
+This config option controls which URLs are blocked from being visited during scraping.
+
+When a list of URL patterns is provided, URLs matching one or more of these patterns are blocked from being visited.
+
+Example:
+
+```javascript
+export const config = {
+    url: "http://example.com/",
+    blockedURLs: ["/restricted"],
+};
+```
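Taken together, the options documented in this changeset can be combined in a single `config` export. The following sketch is illustrative only: it uses no options beyond those documented above, but the specific URL, selectors, domains, patterns, and rate are placeholder values, not part of this commit.

```javascript
// Illustrative combination of the documented options.
// All values are placeholders chosen for the example.
export const config = {
    // Starting URL: where the scrape begins.
    url: "http://example.com/",

    // Depth: follow links up to two levels away from the starting URL.
    depth: 2,

    // Link following: only follow pagination and navigation links.
    follow: [".pagination > a[href]", ".nav a[href]"],

    // Domain filter: allow every domain except the ones listed.
    allowedDomains: ["*"],
    blockedDomains: ["google.com", "bing.com"],

    // URL filter: regular expressions for allowed and blocked URLs.
    allowedURLs: ["/articles/.*"],
    blockedURLs: ["/articles/draft"],

    // Rate limiting: 0.5 RPS, i.e. one request every two seconds.
    rate: 0.5,

    // Caching: store raw responses in a .cache file next to the script.
    cache: "file",
};
```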