| author | Philipp Tanlak <philipp.tanlak@gmail.com> | 2023-11-13 22:36:15 +0100 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2023-11-13 22:36:15 +0100 |
| commit | 190056ee8d6a4eca61d92a79cc25aad645e69d4a (patch) | |
| tree | 423cb3dfb7ca92e4c1c48c1070f553bbadc4d890 /docs/configuration | |
| parent | eae10426cd805ecc0a0459b61639e48e6cd913ad (diff) | |
Move docs to flyscrape.com (#11)
Diffstat (limited to 'docs/configuration')
| mode | file | lines deleted |
|---|---|---|
| -rw-r--r-- | docs/configuration/caching.md | 37 |
| -rw-r--r-- | docs/configuration/depth.md | 23 |
| -rw-r--r-- | docs/configuration/domain-filter.md | 44 |
| -rw-r--r-- | docs/configuration/link-following.md | 29 |
| -rw-r--r-- | docs/configuration/proxies.md | 13 |
| -rw-r--r-- | docs/configuration/rate-limiting.md | 14 |
| -rw-r--r-- | docs/configuration/starting-url.md | 14 |
| -rw-r--r-- | docs/configuration/url-filter.md | 42 |
8 files changed, 0 insertions, 216 deletions
diff --git a/docs/configuration/caching.md b/docs/configuration/caching.md
deleted file mode 100644
index 4a06435..0000000
--- a/docs/configuration/caching.md
+++ /dev/null
@@ -1,37 +0,0 @@
-# Caching
-
-The `cache` config option allows you to enable file-based request caching. When enabled, every request is cached along with its raw response. When the cache is populated and you re-run the scraper, requests will be served directly from the cache.
-
-This also allows you to modify your scraping script afterwards and collect new results immediately.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    cache: "file",
-    // ...
-};
-```
-
-### Cache File
-
-When caching is enabled using the `cache: "file"` option, a `.cache` file will be created with the name of your scraping script.
-
-Example:
-
-```bash
-$ flyscrape run hackernews.js # Will populate: hackernews.cache
-```
-
-### Shared Cache
-
-In case you want to share a cache between different scraping scripts, you can specify where to store the cache file.
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    cache: "file:/some/path/shared.cache",
-    // ...
-};
-```
diff --git a/docs/configuration/depth.md b/docs/configuration/depth.md
deleted file mode 100644
index cabb0fa..0000000
--- a/docs/configuration/depth.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# Depth
-
-The `depth` config option allows you to specify how deep the scraping process should follow links from the initial URL.
-
-When no value is provided or `depth` is set to `0`, link following is disabled and only the initial URL is scraped.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    depth: 2,
-    // ...
-};
-```
-
-With the config provided in the example, the scraper would follow links like this:
-
-```
-http://example.com/ (depth = 0, initial URL)
-↳ http://example.com/deeply (depth = 1)
-  ↳ http://example.com/deeply/nested (depth = 2)
-```
diff --git a/docs/configuration/domain-filter.md b/docs/configuration/domain-filter.md
deleted file mode 100644
index e8adc30..0000000
--- a/docs/configuration/domain-filter.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# Domain Filter
-
-The `allowedDomains` and `blockedDomains` config options allow you to specify a list of domains that are accessible or blocked during scraping.
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    allowedDomains: ["subdomain.example.com"],
-    // ...
-};
-```
-
-### `allowedDomains`
-
-This config option controls which additional domains are allowed to be visited during scraping. The domain of the initial URL is always allowed.
-
-You can also make all domains accessible by setting `allowedDomains` to `["*"]`. To then further restrict access, you can specify `blockedDomains`.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    allowedDomains: ["*"],
-    // ...
-};
-```
-
-### `blockedDomains`
-
-This config option controls which additional domains are blocked from being accessed. By default, all domains other than the domain of the initial URL and those specified in `allowedDomains` are blocked.
-
-`blockedDomains` is best used in conjunction with `allowedDomains: ["*"]`, allowing the scraping process to access all domains except those specified in `blockedDomains`.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    allowedDomains: ["*"],
-    blockedDomains: ["google.com", "bing.com"],
-    // ...
-};
-```
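The three options above compose in a single script. As a rough sketch of caching, depth, and domain filtering working together (the default-export scrape function and the `doc.find` API are assumed from flyscrape's README, not defined in these pages):

```javascript
// Sketch: caching, depth and domain filters combined in one config.
export const config = {
    url: "http://example.com/",
    cache: "file",                  // repeat runs read responses from the script's .cache file
    depth: 2,                       // follow links up to two hops from the initial URL
    allowedDomains: ["*"],          // visit any domain...
    blockedDomains: ["google.com"], // ...except the ones listed here
};

// Assumed callback shape (from the flyscrape README): one JSON record per visited page.
export default function ({ doc, url }) {
    return { url, title: doc.find("title").text() };
}
```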
diff --git a/docs/configuration/link-following.md b/docs/configuration/link-following.md
deleted file mode 100644
index 6522ce8..0000000
--- a/docs/configuration/link-following.md
+++ /dev/null
@@ -1,29 +0,0 @@
-# Link Following
-
-The `follow` config option allows you to specify a list of CSS selectors that determine which links the scraper should follow.
-
-When no value is provided, the scraper will follow all links found with the `a[href]` selector.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    follow: [".pagination > a[href]", ".nav a[href]"],
-    // ...
-};
-```
-
-### Following non-`href` attributes
-
-For special cases where the link is not found in the `href` attribute, you can specify a selector that ends in a different attribute.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    follow: [".articles > div[data-url]"],
-    // ...
-};
-```
diff --git a/docs/configuration/proxies.md b/docs/configuration/proxies.md
deleted file mode 100644
index 19434dc..0000000
--- a/docs/configuration/proxies.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Proxies
-
-The `proxies` config option allows you to specify a list of HTTP(S) proxies that should be used during scraping. When multiple proxies are provided, the scraper will pick a proxy at random for each request.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    proxies: ["https://my-proxy.com:3128", "https://my-other-proxy.com:8080"],
-    // ...
-};
-```
diff --git a/docs/configuration/rate-limiting.md b/docs/configuration/rate-limiting.md
deleted file mode 100644
index c3014d1..0000000
--- a/docs/configuration/rate-limiting.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Rate Limiting
-
-The `rate` config option allows you to specify the rate at which the scraper sends out requests. The rate is measured in _Requests per Second_ (RPS) and can be set as a whole or decimal number to account for shorter and longer request intervals.
-
-When no `rate` is specified, rate limiting is disabled and the scraper will send out requests as fast as it can.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    rate: 50,
-};
-```
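Because `rate` accepts decimals, values below `1` stretch the interval between requests instead of shortening it. A minimal sketch (the interval arithmetic follows directly from the RPS definition above):

```javascript
// Sketch: a fractional rate slows the crawl down.
// 0.5 requests per second = one request every 1 / 0.5 = 2 seconds.
export const config = {
    url: "http://example.com/",
    rate: 0.5,
};
```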
diff --git a/docs/configuration/starting-url.md b/docs/configuration/starting-url.md
deleted file mode 100644
index d5c0965..0000000
--- a/docs/configuration/starting-url.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Starting URL
-
-The `url` config option allows you to specify the initial URL at which the scraper should start its scraping process.
-
-When no value is provided, the scraper will not start and will exit immediately.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    // ...
-};
-```
diff --git a/docs/configuration/url-filter.md b/docs/configuration/url-filter.md
deleted file mode 100644
index e2feda8..0000000
--- a/docs/configuration/url-filter.md
+++ /dev/null
@@ -1,42 +0,0 @@
-# URL Filter
-
-The `allowedURLs` and `blockedURLs` config options allow you to specify a list of URL patterns (in the form of regular expressions) that are accessible or blocked during scraping.
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    allowedURLs: ["/articles/.*", "/authors/.*"],
-    blockedURLs: ["/authors/admin"],
-    // ...
-};
-```
-
-### `allowedURLs`
-
-This config option controls which URLs are allowed to be visited during scraping. When no value is provided, all URLs are allowed unless otherwise blocked.
-
-When a list of URL patterns is provided, only URLs matching one or more of these patterns are allowed to be visited.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    allowedURLs: ["/products/"],
-};
-```
-
-### `blockedURLs`
-
-This config option controls which URLs are blocked from being visited during scraping.
-
-When a list of URL patterns is provided, URLs matching one or more of these patterns are blocked from being visited.
-
-Example:
-
-```javascript
-export const config = {
-    url: "http://example.com/",
-    blockedURLs: ["/restricted"],
-};
-```
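Since the patterns are regular expressions, an unescaped `.` or an unanchored pattern can match more URLs than intended. A minimal sketch of tighter patterns (whether flyscrape anchors patterns or matches substrings is not specified in these pages, so the `$` anchors below reflect standard regex practice rather than a documented requirement):

```javascript
// Sketch: escape regex metacharacters and anchor patterns that
// should match exact paths rather than substrings.
export const config = {
    url: "http://example.com/",
    allowedURLs: ["/articles/\\d+$", "/authors/[a-z-]+$"], // numeric ids, slug-style names
    blockedURLs: ["/authors/admin$"],                      // exactly this path
};
```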