| author | Philipp Tanlak <philipp.tanlak@gmail.com> | 2025-11-24 20:54:57 +0100 |
|---|---|---|
| committer | Philipp Tanlak <philipp.tanlak@gmail.com> | 2025-11-24 20:57:48 +0100 |
| commit | b1e2c8fd5cb5dfa46bc440a12eafaf56cd844b1c (patch) | |
| tree | 49d360fd6cbc6a2754efe93524ac47ff0fbe0f7d /content/docs | |
Docs
Diffstat (limited to 'content/docs')
| Mode | File | Lines |
|---|---|---|
| -rw-r--r-- | content/docs/_index.md | 34 |
| -rw-r--r-- | content/docs/api-reference.md | 61 |
| -rw-r--r-- | content/docs/configuration/_index.md | 25 |
| -rw-r--r-- | content/docs/configuration/browser-mode.md | 40 |
| -rw-r--r-- | content/docs/configuration/caching.md | 36 |
| -rw-r--r-- | content/docs/configuration/concurrency.md | 18 |
| -rw-r--r-- | content/docs/configuration/cookies.md | 36 |
| -rw-r--r-- | content/docs/configuration/depth.md | 24 |
| -rw-r--r-- | content/docs/configuration/domain-filter.md | 43 |
| -rw-r--r-- | content/docs/configuration/headers.md | 17 |
| -rw-r--r-- | content/docs/configuration/link-following.md | 33 |
| -rw-r--r-- | content/docs/configuration/output.md | 47 |
| -rw-r--r-- | content/docs/configuration/proxies.md | 33 |
| -rw-r--r-- | content/docs/configuration/rate-limiting.md | 15 |
| -rw-r--r-- | content/docs/configuration/retry.md | 26 |
| -rw-r--r-- | content/docs/configuration/starting-url.md | 29 |
| -rw-r--r-- | content/docs/configuration/url-filter.md | 42 |
| -rw-r--r-- | content/docs/full-example-script.md | 115 |
| -rw-r--r-- | content/docs/getting-started.md | 123 |
| -rw-r--r-- | content/docs/installation.md | 91 |
20 files changed, 888 insertions, 0 deletions
diff --git a/content/docs/_index.md b/content/docs/_index.md
new file mode 100644
index 0000000..184d57d
--- /dev/null
+++ b/content/docs/_index.md
@@ -0,0 +1,34 @@
---
title: 'Documentation'
linkTitle: 'Documentation'
sidebar:
  open: true
---

## Introduction

{{< cards >}}
  {{< card link="getting-started" title="Getting started" icon="play" >}}
  {{< card link="installation" title="Installation" icon="cog" >}}
  {{< card link="reference-script" title="Reference Script" icon="clipboard-check" >}}
  {{< card link="api-reference" title="API Reference" icon="credit-card" >}}
{{< /cards >}}

## Configuration

{{< cards >}}
  {{< card link="configuration/starting-url" title="Starting URL" icon="play" >}}
  {{< card link="configuration/depth" title="Depth" icon="arrow-down" >}}
  {{< card link="configuration/domain-filter" title="Domain Filter" icon="cube-transparent" >}}
  {{< card link="configuration/url-filter" title="URL Filter" icon="sparkles" >}}
  {{< card link="configuration/link-following" title="Link Following" icon="link" >}}
  {{< card link="configuration/concurrency" title="Concurrency" icon="paper-airplane" >}}
  {{< card link="configuration/rate-limiting" title="Rate Limiting" icon="chart-square-bar" >}}
  {{< card link="configuration/retry" title="Retry" icon="refresh" >}}
  {{< card link="configuration/caching" title="Caching" icon="template" >}}
  {{< card link="configuration/proxies" title="Proxies" icon="server" >}}
  {{< card link="configuration/cookies" title="Cookies" icon="finger-print" >}}
  {{< card link="configuration/headers" title="Headers" icon="sort-ascending" >}}
  {{< card link="configuration/browser-mode" title="Browser Mode" icon="desktop-computer" >}}
  {{< card link="configuration/output" title="Output File and Format" icon="presentation-chart-bar" >}}
{{< /cards >}}

diff --git a/content/docs/api-reference.md b/content/docs/api-reference.md
new file mode 100644
index 0000000..99ac2f9
--- /dev/null
+++ b/content/docs/api-reference.md
@@ -0,0 +1,61 @@
---
title: 'API Reference'
weight: 4
next: '/docs/configuration'
---

## Query API

```javascript{filename="Reference"}
// <div class="element" foo="bar">Hey</div>
const el = doc.find(".element")
el.text() // "Hey"
el.html() // `<div class="element">Hey</div>`
el.attr("foo") // "bar"
el.hasAttr("foo") // true
el.hasClass("element") // true

// <ul>
//   <li class="a">Item 1</li>
//   <li>Item 2</li>
//   <li>Item 3</li>
// </ul>
const list = doc.find("ul")
list.children() // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

const items = list.find("li")
items.length() // 3
items.first() // <li>Item 1</li>
items.last() // <li>Item 3</li>
items.get(1) // <li>Item 2</li>
items.get(1).prev() // <li>Item 1</li>
items.get(1).next() // <li>Item 3</li>
items.get(1).parent() // <ul>...</ul>
items.get(1).siblings() // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
items.map(item => item.text()) // ["Item 1", "Item 2", "Item 3"]
items.filter(item => item.hasClass("a")) // [<li class="a">Item 1</li>]
```

## Document Parsing

```javascript{filename="Reference"}
import { parse } from "flyscrape";

const doc = parse(`<div class="foo">bar</div>`);
const text = doc.find(".foo").text();
```

## File Downloads

```javascript{filename="Reference"}
import { download } from "flyscrape/http";

download("http://example.com/image.jpg") // downloads as "image.jpg"
download("http://example.com/image.jpg", "other.jpg") // downloads as "other.jpg"
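// A destination ending in "/" is treated as a directory and keeps the original filename: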
download("http://example.com/image.jpg", "dir/") // downloads as "dir/image.jpg"

// If the server offers a filename via the Content-Disposition header and no
// destination filename is provided, Flyscrape will honor the suggested filename.
// E.g. `Content-Disposition: attachment; filename="archive.zip"`
download("http://example.com/generate_archive.php", "dir/") // downloads as "dir/archive.zip"
```

diff --git a/content/docs/configuration/_index.md b/content/docs/configuration/_index.md
new file mode 100644
index 0000000..ac27c8b
--- /dev/null
+++ b/content/docs/configuration/_index.md
@@ -0,0 +1,25 @@
---
title: 'Configuration'
weight: 4
sidebar:
  open: true
next: '/docs/configuration/starting-url'
prev: '/docs/api-reference'
---

{{< cards >}}
  {{< card link="starting-url" title="Starting URL" icon="play" >}}
  {{< card link="depth" title="Depth" icon="arrow-down" >}}
  {{< card link="domain-filter" title="Domain Filter" icon="cube-transparent" >}}
  {{< card link="url-filter" title="URL Filter" icon="sparkles" >}}
  {{< card link="link-following" title="Link Following" icon="link" >}}
  {{< card link="concurrency" title="Concurrency" icon="paper-airplane" >}}
  {{< card link="rate-limiting" title="Rate Limiting" icon="chart-square-bar" >}}
  {{< card link="retry" title="Retry" icon="refresh" >}}
  {{< card link="caching" title="Caching" icon="template" >}}
  {{< card link="proxies" title="Proxies" icon="server" >}}
  {{< card link="cookies" title="Cookies" icon="finger-print" >}}
  {{< card link="headers" title="Headers" icon="sort-ascending" >}}
  {{< card link="browser-mode" title="Browser Mode" icon="desktop-computer" >}}
  {{< card link="output" title="Output File and Format" icon="presentation-chart-bar" >}}
{{< /cards >}}

diff --git a/content/docs/configuration/browser-mode.md b/content/docs/configuration/browser-mode.md
new file mode 100644
index 0000000..bbb2c1e
--- /dev/null
+++ b/content/docs/configuration/browser-mode.md
@@ -0,0 +1,40 @@
---
title: 'Browser Mode'
weight: 10
---

The Browser Mode controls the interaction with a headless Chromium browser. Enabling the browser mode allows `flyscrape` to download a Chromium browser once and use it to render JavaScript-heavy pages.

## Browser Mode

To enable Browser Mode, set the `browser` option to `true` in your configuration. This allows `flyscrape` to use a headless Chromium browser for rendering JavaScript during the scraping process.

```javascript {filename="Configuration"}
export const config = {
  browser: true,
};
```

In the above example, Browser Mode is enabled, allowing `flyscrape` to render pages that rely on JavaScript execution.

## Headless Option

The `headless` option, when combined with Browser Mode, controls whether the Chromium browser should run in headless mode or not. Headless mode means the browser operates without a graphical user interface, which can be useful for background processes.

```javascript {filename="Configuration"}
export const config = {
  browser: true,
  headless: false,
};
```

In this example, the Chromium browser will run in non-headless mode. If you set `headless` to `true`, the browser will run without a visible GUI.

```javascript {filename="Configuration"}
export const config = {
  browser: true,
  headless: true,
};
```

In this example, the Chromium browser will run in headless mode, suitable for scenarios where graphical rendering is unnecessary.
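For illustration only — the URL and selector below are placeholders — Browser Mode is typically combined with the link-following options from this section, so that links which only appear after the page has been rendered can still be picked up:

```javascript {filename="Configuration"}
export const config = {
  // Placeholder starting URL for a JavaScript-heavy site.
  url: "https://example.com/",
  // Render pages in the headless Chromium browser.
  browser: true,
  headless: true,
  // Placeholder selector for pagination links that only exist after rendering.
  follow: [".pagination > a[href]"],
};
```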
diff --git a/content/docs/configuration/caching.md b/content/docs/configuration/caching.md
new file mode 100644
index 0000000..2c6766a
--- /dev/null
+++ b/content/docs/configuration/caching.md
@@ -0,0 +1,36 @@
---
title: 'Caching'
weight: 7
---

The `cache` config option allows you to enable file-based request caching. When enabled, every request is cached together with its raw response. Once the cache is populated and you re-run the scraper, requests are served directly from the cache.

This also allows you to modify your scraping script afterwards and collect new results immediately.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  cache: "file",
  // ...
};
```

### Cache File

When caching is enabled using the `cache: "file"` option, a `.cache` file named after your scraping script will be created.

```bash {filename="Terminal"}
$ flyscrape run hackernews.js # Will populate: hackernews.cache
```

### Shared Cache

In case you want to share a cache between different scraping scripts, you can specify where to store the cache file.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  cache: "file:/some/path/shared.cache",
  // ...
};
```

diff --git a/content/docs/configuration/concurrency.md b/content/docs/configuration/concurrency.md
new file mode 100644
index 0000000..0e5e181
--- /dev/null
+++ b/content/docs/configuration/concurrency.md
@@ -0,0 +1,18 @@
---
title: 'Concurrency'
weight: 6
---

The concurrency setting controls the number of simultaneous requests the scraper can make. It is specified in the configuration object of your scraping script.

```javascript
export const config = {
  // Specify the number of concurrent requests.
  concurrency: 5,
};
```

In the above example, the scraper will make up to 5 requests at the same time.

If the concurrency setting is not specified, there is no limit to the number of concurrent requests.

diff --git a/content/docs/configuration/cookies.md b/content/docs/configuration/cookies.md
new file mode 100644
index 0000000..f73d495
--- /dev/null
+++ b/content/docs/configuration/cookies.md
@@ -0,0 +1,36 @@
---
title: 'Cookies'
weight: 9
---

The Cookies configuration in the `flyscrape` script's configuration object allows you to specify the behavior of the cookie store during the scraping process. Cookies are often used for authentication and session management on websites.

## Cookies Configuration

To configure the cookie store behavior, set the `cookies` field in your configuration. The `cookies` option supports three values: `"chrome"`, `"edge"`, and `"firefox"`. Each value corresponds to the cookie store of the respective local browser.

When the `cookies` option is set to `"chrome"`, `"edge"`, or `"firefox"`, `flyscrape` utilizes the cookie store of the user's installed browser.

```javascript {filename="Configuration"}
export const config = {
  cookies: "chrome",
};
```

In the above example, the `cookies` option is set to `"chrome"`, indicating that `flyscrape` should use the cookie store of the local Chrome browser.

```javascript {filename="Configuration"}
export const config = {
  cookies: "firefox",
};
```

In this example, the `cookies` option is set to `"firefox"`, instructing `flyscrape` to use the cookie store of the local Firefox browser.
```javascript {filename="Configuration"}
export const config = {
  cookies: "edge",
};
```

In this example, the `cookies` option is set to `"edge"`, indicating that `flyscrape` should use the cookie store of the local Edge browser.

diff --git a/content/docs/configuration/depth.md b/content/docs/configuration/depth.md
new file mode 100644
index 0000000..d100470
--- /dev/null
+++ b/content/docs/configuration/depth.md
@@ -0,0 +1,24 @@
---
title: 'Depth'
weight: 2
---

The `depth` config option allows you to specify how deep the scraping process should follow links from the initial URL.

When no value is provided or `depth` is set to `0`, link following is disabled and only the initial URL is scraped.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  depth: 2,
  // ...
};
```

With the config provided in the example, the scraper would follow links like this:

```
http://example.com/ (depth = 0, initial URL)
↳ http://example.com/deeply (depth = 1)
  ↳ http://example.com/deeply/nested (depth = 2)
```

diff --git a/content/docs/configuration/domain-filter.md b/content/docs/configuration/domain-filter.md
new file mode 100644
index 0000000..184ee2f
--- /dev/null
+++ b/content/docs/configuration/domain-filter.md
@@ -0,0 +1,43 @@
---
title: 'Domain Filter'
weight: 3
---

The `allowedDomains` and `blockedDomains` config options allow you to specify a list of domains that are accessible or blocked during scraping.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  allowedDomains: ["subdomain.example.com"],
  // ...
};
```

## Allowed Domains

This config option controls which additional domains are allowed to be visited during scraping. The domain of the initial URL is always allowed.

You can also allow all domains to be accessible by setting `allowedDomains` to `["*"]`. To then further restrict access, you can specify `blockedDomains`.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  allowedDomains: ["*"],
  // ...
};
```

## Blocked Domains

This config option controls which additional domains are blocked from being accessed. By default, all domains other than the domain of the initial URL or those specified in `allowedDomains` are blocked.

`blockedDomains` is best used in conjunction with `allowedDomains: ["*"]`, allowing the scraping process to access all domains except those specified in `blockedDomains`.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  allowedDomains: ["*"],
  blockedDomains: ["google.com", "bing.com"],
  // ...
};
```

diff --git a/content/docs/configuration/headers.md b/content/docs/configuration/headers.md
new file mode 100644
index 0000000..2b8f82c
--- /dev/null
+++ b/content/docs/configuration/headers.md
@@ -0,0 +1,17 @@
---
title: 'Headers'
weight: 9
---

The `headers` config option allows you to specify custom HTTP headers that are sent with each request.

```javascript {filename="Configuration"}
export const config = {
  headers: {
    "Authorization": "Bearer ey....",
    "User-Agent": "Mozilla/5.0 (Macintosh ...",
  },
  // ...
};
```
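For illustration only — the values shown are placeholders — custom headers can sit alongside the other request options from this section, such as the browser cookie store:

```javascript {filename="Configuration"}
export const config = {
  // Placeholder header; replace with whatever your target site expects.
  headers: {
    "User-Agent": "Mozilla/5.0 ...",
  },
  // Reuse the session cookies of the local Chrome browser.
  cookies: "chrome",
};
```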
diff --git a/content/docs/configuration/link-following.md b/content/docs/configuration/link-following.md
new file mode 100644
index 0000000..b9755f7
--- /dev/null
+++ b/content/docs/configuration/link-following.md
@@ -0,0 +1,33 @@
---
title: 'Link Following'
weight: 5
---

The `follow` config option allows you to specify a list of CSS selectors that determine which links the scraper should follow.

When no value is provided, the scraper follows all links found with the `a[href]` selector.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  follow: [
    ".pagination > a[href]",
    ".nav a[href]",
  ],
  // ...
};
```

## Following non-`href` attributes

For special cases where the link is not found in an `href` attribute, you can specify a selector that ends in a different attribute.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  follow: [
    ".articles > div[data-url]",
  ],
  // ...
};
```

diff --git a/content/docs/configuration/output.md b/content/docs/configuration/output.md
new file mode 100644
index 0000000..2470865
--- /dev/null
+++ b/content/docs/configuration/output.md
@@ -0,0 +1,47 @@
---
title: 'Output File and Format'
weight: 10
---

The output file and format are specified in the configuration object of your scraping script. They determine where the scraped data will be saved and in what format.

## Output File

The output file is the file where the scraped data will be saved. If not specified, the data will be printed to the standard output (stdout).

```javascript {filename="Configuration"}
export const config = {
  output: {
    // Specify the output file.
    file: "results.json",
  },
};
```

In the above example, the scraped data will be saved in a file named `results.json`.

## Output Format

The output format is the format in which the scraped data will be saved. The options are `json` and `ndjson`.

```javascript {filename="Configuration"}
export const config = {
  output: {
    // Specify the output format.
    format: "json",
  },
};
```

In the above example, the scraped data will be saved in JSON format.

```javascript {filename="Configuration"}
export const config = {
  output: {
    // Specify the output format.
    format: "ndjson",
  },
};
```

In this example, the scraped data will be saved in newline-delimited JSON (NDJSON) format. Each line in the output file will be a separate JSON object.

diff --git a/content/docs/configuration/proxies.md b/content/docs/configuration/proxies.md
new file mode 100644
index 0000000..913630d
--- /dev/null
+++ b/content/docs/configuration/proxies.md
@@ -0,0 +1,33 @@
---
title: 'Proxies'
weight: 8
---

The proxy feature allows you to route your scraping requests through a specified HTTP(S) proxy. This can be useful for bypassing IP-based rate limits or accessing region-restricted content.

```javascript
export const config = {
  // Specify a single HTTP(S) proxy URL.
  proxy: "http://someproxy.com:8043",
};
```

In the above example, all scraping requests will be routed through the proxy at `http://someproxy.com:8043`.

## Multiple Proxies

You can also specify multiple proxy URLs. The scraper will rotate between these proxies for each request.

```javascript
export const config = {
  // Specify multiple HTTP(S) proxy URLs.
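  // The scraper rotates between these for each request.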
  proxies: [
    "http://someproxy.com:8043",
    "http://someotherproxy.com:8043",
  ],
};
```

In this example, the scraper will pick randomly between the proxies at `http://someproxy.com:8043` and `http://someotherproxy.com:8043`.

Note: If both `proxy` and `proxies` are specified, all proxies will be respected.

diff --git a/content/docs/configuration/rate-limiting.md b/content/docs/configuration/rate-limiting.md
new file mode 100644
index 0000000..4b5bf9c
--- /dev/null
+++ b/content/docs/configuration/rate-limiting.md
@@ -0,0 +1,15 @@
---
title: 'Rate Limiting'
weight: 6
---

The `rate` config option allows you to specify the rate at which the scraper sends out requests. The rate is measured in _Requests per Minute_ (RPM).

When no `rate` is specified, rate limiting is disabled and the scraper will send out requests as fast as it can.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  rate: 100,
};
```

diff --git a/content/docs/configuration/retry.md b/content/docs/configuration/retry.md
new file mode 100644
index 0000000..cf00698
--- /dev/null
+++ b/content/docs/configuration/retry.md
@@ -0,0 +1,26 @@
---
title: 'Retry'
weight: 6
---

The retry feature allows the scraper to automatically retry failed requests. This is particularly useful when dealing with unstable networks or servers that occasionally return error status codes.

The retry feature is automatically enabled and will retry requests that return the following HTTP status codes:

- 403 Forbidden
- 408 Request Timeout
- 425 Too Early
- 429 Too Many Requests
- 500 Internal Server Error
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout

### Retry Delays

After a failed request, the scraper will wait for a certain amount of time before retrying the request. The delay increases with each consecutive failed attempt, according to the following schedule:

- 1st retry: 1 second delay
- 2nd retry: 2 seconds delay
- 3rd retry: 5 seconds delay
- 4th retry: 10 seconds delay

diff --git a/content/docs/configuration/starting-url.md b/content/docs/configuration/starting-url.md
new file mode 100644
index 0000000..6b60d7e
--- /dev/null
+++ b/content/docs/configuration/starting-url.md
@@ -0,0 +1,29 @@
---
title: 'Starting URL'
weight: 1
prev: '/docs/configuration'
---

The `url` config option allows you to specify the initial URL at which the scraper should start its scraping process.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  // ...
};
```

## Multiple starting URLs

In case you have more than one URL you want to scrape (or start from), you can specify them with the `urls` config option.

```javascript {filename="Configuration"}
export const config = {
  urls: [
    "http://example.com/",
    "http://anothersite.com/",
    "http://yetanothersite.com/",
  ],
  // ...
};
```
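As a purely illustrative sketch — the URLs are placeholders — multiple starting URLs combine naturally with the `depth` option from this section, so that each starting point is crawled one level deep:

```javascript {filename="Configuration"}
export const config = {
  // Placeholder starting URLs.
  urls: [
    "http://example.com/",
    "http://anothersite.com/",
  ],
  // Follow links one level deep from each starting URL.
  depth: 1,
};
```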
diff --git a/content/docs/configuration/url-filter.md b/content/docs/configuration/url-filter.md
new file mode 100644
index 0000000..80d3544
--- /dev/null
+++ b/content/docs/configuration/url-filter.md
@@ -0,0 +1,42 @@
---
title: 'URL Filter'
weight: 4
prev: /docs/getting-started
---

The `allowedURLs` and `blockedURLs` config options allow you to specify a list of URL patterns (in the form of regular expressions) that are allowed or blocked during scraping.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  allowedURLs: ["/articles/.*", "/authors/.*"],
  blockedURLs: ["/authors/admin"],
  // ...
};
```

## Allowed URLs

This config option controls which URLs are allowed to be visited during scraping. When no value is provided, all URLs are allowed to be visited unless otherwise blocked.

When a list of URL patterns is provided, only URLs matching one or more of these patterns are allowed to be visited.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  allowedURLs: ["/products/"],
};
```

## Blocked URLs

This config option controls which URLs are blocked from being visited during scraping.

When a list of URL patterns is provided, URLs matching one or more of these patterns are blocked from being visited.

```javascript {filename="Configuration"}
export const config = {
  url: "http://example.com/",
  blockedURLs: ["/restricted"],
};
```

diff --git a/content/docs/full-example-script.md b/content/docs/full-example-script.md
new file mode 100644
index 0000000..41cccf5
--- /dev/null
+++ b/content/docs/full-example-script.md
@@ -0,0 +1,115 @@
---
title: 'Full Example Script'
weight: 3
---

This script serves as a reference that shows all features of Flyscrape and how to use them. Feel free to copy and paste this as a starter script.

```javascript{filename="Reference"}
import { parse } from "flyscrape";
import { download } from "flyscrape/http";
import http from "flyscrape/http";

export const config = {
  // Specify the URL to start scraping from.
  url: "https://example.com/",

  // Specify multiple URLs to start scraping from. (default = [])
  urls: [
    "https://anothersite.com/",
    "https://yetanother.com/",
  ],

  // Enable rendering with headless browser. (default = false)
  browser: true,

  // Specify if browser should be headless or not. (default = true)
  headless: false,

  // Specify how deep links should be followed. (default = 0, no follow)
  depth: 5,

  // Specify the CSS selectors to follow. (default = ["a[href]"])
  follow: [".next > a", ".related a"],

  // Specify the allowed domains. ['*'] for all. (default = domain from url)
  allowedDomains: ["example.com", "anothersite.com"],

  // Specify the blocked domains. (default = none)
  blockedDomains: ["somesite.com"],

  // Specify the allowed URLs as regex. (default = all allowed)
  allowedURLs: ["/posts", "/articles/\\d+"],

  // Specify the blocked URLs as regex. (default = none)
  blockedURLs: ["/admin"],

  // Specify the rate in requests per minute. (default = no rate limit)
  rate: 60,

  // Specify the number of concurrent requests. (default = no limit)
  concurrency: 1,

  // Specify a single HTTP(S) proxy URL. (default = no proxy)
  // Note: Not compatible with browser mode.
  proxy: "http://someproxy.com:8043",

  // Specify multiple HTTP(S) proxy URLs. (default = no proxy)
  // Note: Not compatible with browser mode.
  proxies: [
    "http://someproxy.com:8043",
    "http://someotherproxy.com:8043",
  ],

  // Enable file-based request caching. (default = no cache)
  cache: "file",

  // Specify the HTTP request headers. (default = none)
  headers: {
    "Authorization": "Bearer ...",
    "User-Agent": "Mozilla ...",
  },

  // Use the cookie store of your local browser. (default = off)
  // Options: "chrome" | "edge" | "firefox"
  cookies: "chrome",

  // Specify the output options.
  output: {
    // Specify the output file. (default = stdout)
    file: "results.json",

    // Specify the output format. (default = json)
    // Options: "json" | "ndjson"
    format: "json",
  },
};

export default function ({ doc, url, absoluteURL }) {
  // doc - Contains the parsed HTML document
  // url - Contains the scraped URL
  // absoluteURL(...) - Transforms relative URLs into absolute URLs

  // Find all users.
  const userlist = doc.find(".user")

  // Download the profile picture of each user.
  userlist.each(user => {
    const name = user.find(".name").text()
    const pictureURL = absoluteURL(user.find("img").attr("src"));

    download(pictureURL, `profile-pictures/${name}.jpg`)
  })

  // Return each user's name, address and age.
  return {
    users: userlist.map(user => {
      const name = user.find(".name").text()
      const address = user.find(".address").text()
      const age = user.find(".age").text()

      return { name, address, age };
    })
  };
}
```

diff --git a/content/docs/getting-started.md b/content/docs/getting-started.md
new file mode 100644
index 0000000..386104d
--- /dev/null
+++ b/content/docs/getting-started.md
@@ -0,0 +1,123 @@
---
title: 'Getting Started'
weight: 1
---

In this quick guide we will go over the core functionalities of Flyscrape and how to use them. Make sure you've got `flyscrape` up and running on your system.

The quickest way to install Flyscrape on Mac, Linux or WSL is to run the following command. For more information, or for how to install it on Windows, check out the [installation instructions](/docs/installation).

```bash {filename="Terminal"}
curl -fsSL https://flyscrape.com/install | bash
```

## Overview

Flyscrape is a standalone scraping tool that works with so-called _scraping scripts_.

Scraping scripts let you define what data you want to extract from a website using familiar JavaScript code you might recognize from jQuery or cheerio. Inside your scraping script, you can also configure how Flyscrape should behave, e.g. what links to follow, what domains to access, how fast to send out requests, etc.

When you're happy with the initial version of your scraping script, you can run Flyscrape and it will go off and start scraping the websites you have defined.

## Your first Scraping Script

A new scraping script can be created using the `new` command. This script is meant as a helpful guide to let you explore the JavaScript API.

Go ahead and run the following command:

```sh {filename="Terminal"}
flyscrape new hackernews.js
```

This should have created a new file called `hackernews.js` in your current directory. You can open it up in your favorite text editor.

## Anatomy of a Scraping Script

Let's look at the previously created `hackernews.js` file and go through it together. Every scraping script consists of two main parts:

### Configuration

The configuration is used to control the scraping behavior. Here we can specify which URLs to scrape, how deep links should be followed, or which domains are allowed to be accessed. Besides these, there are a bunch more options to explore.

```javascript {filename="Configuration"}
export const config = {
  url: "https://hackernews.com",
  // depth: 0,
  // allowedDomains: [],
  // ...
}
```

### Data Extraction Logic

The data extraction logic defines what data to extract from a website. In this example, it grabs the posts from the website using the `doc` document object and extracts the individual links and their titles.
The `absoluteURL` function is used to ensure that every relative link is converted into an absolute one.

```javascript {filename="Data Extraction Logic"}
export default function({ doc, absoluteURL }) {
  const title = doc.find("title");
  const posts = doc.find(".athing");

  return {
    title: title.text(),
    posts: posts.map((post) => {
      const link = post.find(".titleline > a");

      return {
        title: link.text(),
        url: absoluteURL(link.attr("href")),
      };
    }),
  };
}
```

### Starting the Development Mode

Flyscrape has a built-in Development Mode that allows you to quickly iterate and see changes to your script immediately. It does so by watching your script for changes and re-running the Data Extraction Logic against a cached version of the website.

Let's try and fire that up using the following command:

```sh {filename="Terminal"}
flyscrape dev hackernews.js
```

You should now see the extracted data of your target website. Note that no links are followed in this mode, even when otherwise specified in the configuration.

Now let's change our script to extract some more data, like the user who submitted the post.

```diff {filename="hackernews.js"}
   return {
     title: title.text(),
     posts: posts.map((post) => {
       const link = post.find(".titleline > a");
+      const meta = post.next();

       return {
         title: link.text(),
         url: absoluteURL(link.attr("href")),
+        user: meta.find(".hnuser").text(),
       };
     }),
   };
```

When you now save the file and look at your terminal again, the changes should be reflected and the user added to each of the posts.

Once you're happy with the extraction logic, you can exit by pressing CTRL+C.

### Running the Scraper

Now that your scraping script is configured and the extraction logic is in place, you can use the `run` command to execute the scraper.

```sh {filename="Terminal"}
flyscrape run hackernews.js
```

This should output a JSON array of all scraped pages.

### Learn more

Once you're done experimenting, feel free to check out Flyscrape's other features. There are plenty of options to customize it for your specific needs.

- [Full Example Script](/docs/full-example-script)
- [API Reference](/docs/api-reference)

diff --git a/content/docs/installation.md b/content/docs/installation.md
new file mode 100644
index 0000000..4dfb213
--- /dev/null
+++ b/content/docs/installation.md
@@ -0,0 +1,91 @@
---
title: 'Installation'
weight: 2
---

## Recommended

The easiest way to install Flyscrape is to use the following command. Note: This only works on macOS, Linux and WSL (Windows Subsystem for Linux).

```bash {filename="Terminal"}
curl -fsSL https://flyscrape.com/install | bash
```

## Alternative 1: Homebrew (macOS)

If you are on macOS, you can install Flyscrape via [Homebrew](https://formulae.brew.sh/formula/flyscrape).

```bash {filename="Terminal"}
brew install flyscrape
```

Otherwise, you can download and install Flyscrape using one of the pre-compiled binaries.

## Alternative 2: Manual installation (all systems)

Whether you are on macOS, Linux or Windows, you can download one of the following archives to your local machine, or visit the [releases page on GitHub](https://github.com/philippta/flyscrape/releases).
#### macOS

{{< cards >}}
{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_darwin_arm64.tar.gz" title="macOS (Apple Silicon)" icon="cloud-download" >}}
{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_darwin_amd64.tar.gz" title="macOS (Intel)" icon="cloud-download" >}}
{{< /cards >}}

#### Linux

{{< cards >}}
{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_linux_amd64.tar.gz" title="Linux" icon="cloud-download" >}}
{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_linux_arm64.tar.gz" title="Linux (arm64)" icon="cloud-download" >}}
{{< /cards >}}

#### Windows

{{< cards >}}
{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_windows_amd64.zip" title="Windows" icon="cloud-download" >}}
{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_windows_arm64.zip" title="Windows (arm64)" icon="cloud-download" >}}
{{< /cards >}}

### Unpack

Unpack the downloaded archive by double-clicking on it or using the command line:

```bash {filename="Terminal"}
tar xf flyscrape_<os>_<arch>.tar.gz
```

After unpacking, you should find a folder with the same name as the archive, which contains the `flyscrape` executable. Change directory into it using:

```bash {filename="Terminal"}
cd flyscrape_<os>_<arch>/
```

### Install

To make the `flyscrape` executable globally available, move it to any location in your `$PATH`. A good default location is `/usr/local/bin`, so move it using the following command:

```bash {filename="Terminal"}
mv flyscrape /usr/local/bin/flyscrape
```

### Verify

From here on, you should be able to run `flyscrape` from any directory on your machine. To verify, run the following command. If everything went to plan, you should see Flyscrape's help text:

```text {filename="Terminal"}
flyscrape --help
```

```text {filename="Terminal"}
flyscrape is a standalone and scriptable web scraper for efficiently extracting data from websites.

Usage:

    flyscrape <command> [arguments]

Commands:

    new    creates a sample scraping script
    run    runs a scraping script
    dev    watches and re-runs a scraping script
```
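As an optional, purely illustrative last step — not an official part of the installation — you can create a minimal script to confirm that scraping works end to end; the URL is a placeholder:

```javascript {filename="smoke-test.js"}
// Minimal script to verify the installation end to end.
// Run it with: flyscrape run smoke-test.js
export const config = {
  url: "https://example.com/",
};

export default function({ doc }) {
  // Return the page title to confirm that fetching and parsing work.
  return { title: doc.find("title").text() };
}
```

Running it should print a small JSON result to your terminal.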