From b1e2c8fd5cb5dfa46bc440a12eafaf56cd844b1c Mon Sep 17 00:00:00 2001
From: Philipp Tanlak
Date: Mon, 24 Nov 2025 20:54:57 +0100
Subject: Docs

---
 content/_index.html                          | 196 +++++++++++++++++++++++
 content/cloud.html                           | 229 +++++++++++++++++++++++++++
 content/docs/_index.md                       |  34 ++++
 content/docs/api-reference.md                |  61 +++++++
 content/docs/configuration/_index.md         |  25 +++
 content/docs/configuration/browser-mode.md   |  40 +++++
 content/docs/configuration/caching.md        |  36 +++++
 content/docs/configuration/concurrency.md    |  18 +++
 content/docs/configuration/cookies.md        |  36 +++++
 content/docs/configuration/depth.md          |  24 +++
 content/docs/configuration/domain-filter.md  |  43 +++++
 content/docs/configuration/headers.md        |  17 ++
 content/docs/configuration/link-following.md |  33 ++++
 content/docs/configuration/output.md         |  47 ++++++
 content/docs/configuration/proxies.md        |  33 ++++
 content/docs/configuration/rate-limiting.md  |  15 ++
 content/docs/configuration/retry.md          |  26 +++
 content/docs/configuration/starting-url.md   |  29 ++++
 content/docs/configuration/url-filter.md     |  42 +++++
 content/docs/full-example-script.md          | 115 ++++++++++++++
 content/docs/getting-started.md              | 123 ++++++++++++++
 content/docs/installation.md                 |  91 +++++++++++
 content/proxy.html                           | 111 +++++++++++++
 23 files changed, 1424 insertions(+)
 create mode 100644 content/_index.html
 create mode 100644 content/cloud.html
 create mode 100644 content/docs/_index.md
 create mode 100644 content/docs/api-reference.md
 create mode 100644 content/docs/configuration/_index.md
 create mode 100644 content/docs/configuration/browser-mode.md
 create mode 100644 content/docs/configuration/caching.md
 create mode 100644 content/docs/configuration/concurrency.md
 create mode 100644 content/docs/configuration/cookies.md
 create mode 100644 content/docs/configuration/depth.md
 create mode 100644 content/docs/configuration/domain-filter.md
 create mode 100644 content/docs/configuration/headers.md
 create mode 100644 content/docs/configuration/link-following.md
 create mode 100644 content/docs/configuration/output.md
 create mode 100644 content/docs/configuration/proxies.md
 create mode 100644 content/docs/configuration/rate-limiting.md
 create mode 100644 content/docs/configuration/retry.md
 create mode 100644 content/docs/configuration/starting-url.md
 create mode 100644 content/docs/configuration/url-filter.md
 create mode 100644 content/docs/full-example-script.md
 create mode 100644 content/docs/getting-started.md
 create mode 100644 content/docs/installation.md
 create mode 100644 content/proxy.html

(limited to 'content')

diff --git a/content/_index.html b/content/_index.html
new file mode 100644
index 0000000..eafb0c1
--- /dev/null
+++ b/content/_index.html
@@ -0,0 +1,196 @@
+---
+title: Build custom scrapers in minutes
+layout: hextra-home
+images:
+  - images/ogimage.png
+---
+
+

+ Custom scrapers
+ built in minutes. +

+ +

+ Flyscrape is a modern toolkit for building custom scrapers in minutes.
It can render + JavaScript, use your browser's cookies, and requires no Node.js or Python to run.

+ + +
+ +
+ + +
+ + + +
+ +
+ +
+
+ +
+

+ Easy-peasy Setup. +

+

+ Flyscrape is a standalone scraping tool and does not need Node.js or Python. + Simply run flyscrape new + and you're ready to scrape. +
+
+ Visit + the Getting Started Guide +
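Getting a first script running takes two commands, taken from the Getting Started guide added in this patch (the script name is just an example):

```bash {filename="Terminal"}
flyscrape new hackernews.js   # create a sample scraping script
flyscrape dev hackernews.js   # watch the script and preview the extracted data
```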

+
+ + +
+ + +
+ +
+ + +
+

+ Browser / JS rendering. +

+

+ Browser Mode can help you scrape even the most difficult sites. Whether the site heavily relies on JavaScript or has Anti-Bot measures, it's always worth a shot. +
+
+ Visit + the Browser Mode Documentation +
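The switch itself is a single configuration option, as documented on the Browser Mode page added in this patch:

```javascript {filename="Configuration"}
export const config = {
  browser: true, // render pages with a headless Chromium browser
};
```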

+
+ +
+ + +
+ +
+ +
+

+ Access your Cookies. +

+

+ Give Flyscrape access to the cookie store of your personal browser. This makes scraping protected websites that require an active login session a piece of cake.
+
+ Visit + the Cookie Documentation +
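As the Cookie documentation added in this patch shows, pointing Flyscrape at your browser's cookie store is one line of configuration:

```javascript {filename="Configuration"}
export const config = {
  cookies: "chrome", // "firefox" and "edge" are supported as well
};
```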

+
+ + +
+ + +
+ +
+ + +
+

+ Precise Request Control. +

+

+ Precisely control how fast requests are processed, which links to follow, and which sites to avoid. + Eight different dials allow you to fine-tune Flyscrape's behaviour for virtually every website.
+
+ Browse all Configuration Options +
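A few of those dials, shown with illustrative values; all of them are covered in the configuration docs added in this patch:

```javascript {filename="Configuration"}
export const config = {
  url: "https://example.com/",
  depth: 2,       // follow links up to two levels deep
  rate: 60,       // send at most 60 requests per minute
  concurrency: 5, // keep at most 5 requests in flight
};
```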

+
+ +
+ + +
+ +
+ +
+

+ Extract exactly what you need. +

+

+ Flyscrape comes with the full power of JavaScript, allowing you to precisely define what you want to scrape from a website. + With its familiar jQuery- or cheerio-like API, selecting HTML elements becomes second nature.
+
+ Visit + the API Reference +
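A minimal extraction function in that style; the selectors here are illustrative, while the `doc` API itself is documented in the API Reference added in this patch:

```javascript {filename="Data Extraction Logic"}
export default function ({ doc, absoluteURL }) {
  return {
    headlines: doc.find("h2 a").map((link) => ({
      title: link.text(),
      url: absoluteURL(link.attr("href")),
    })),
  };
}
```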

+
+ + +
+ + +
+ + +

+ Want to give Flyscrape a try? +

+

+ Dive into our user-friendly guide and discover how to get started with ease. +

+ +
+
+ + + Get + Started {{< icon name="chevron-right">}} + +
+ + diff --git a/content/cloud.html b/content/cloud.html new file mode 100644 index 0000000..5227c8b --- /dev/null +++ b/content/cloud.html @@ -0,0 +1,229 @@ +--- +title: Flyscrape Cloud +layout: home +--- + +
+

+ Scraping at scale shouldn't be difficult. +

+ +

+ Get the data you need without the complexity you don't.
+ Flyscrape Cloud handles the scheduling, processing and infrastructure so you can focus on the data that matters to your business. +

+ + +
+ + +
+ + +
+ + +
+ + +
+
+
+

+ Focus on scripts, not servers. +

+

+ Write your Flyscrape scripts locally, then upload them to Flyscrape Cloud through our simple interface. We handle all the infrastructure, scaling, and reliability challenges while you focus on extracting the data your business needs.

+
+ +
+ +
+
+
+ + +
+ + +
+
+ + +
+
+

+ Schedule, run, repeat. +

+

+ Set up automated scraping schedules that run daily, weekly, or on custom intervals. + All your scraped data is securely stored and instantly accessible to your entire team. + Never worry about server availability or downtime again. +

+
+
+ + +
+ + +
+
+

+ Bridge technical and business teams. +

+

+ Engineers build the scraping scripts, analysts transform and query the data with SQL, + and business teams export the insights they need. Flyscrape Cloud creates a unified + workflow that makes web data accessible across your entire organization. +

+
+ +
+ +
+
+ + +
+ + +
+
+ + +
+
+

+ Built for the toughest scraping challenges. +

+

+ Access our managed proxy network and browser rendering capabilities to scrape even the + most challenging sites. We handle IP rotation, browser fingerprinting, and all the + technical complexities of modern web scraping at scale. +

+
+
+ + +
+ + +
+

+ How Teams Use Flyscrape Cloud +

+ +
+ +
+

E-Commerce Price Monitoring

+

Track competitor prices across thousands of products daily. Make informed pricing + decisions based on real market data.

+
+ + +
+

Market Intelligence

+

Monitor industry news, product launches, and competitor moves automatically. + Stay ahead with timely, structured market data.

+
+ + +
+

Content Aggregation

+

Collect relevant content from multiple sources on a schedule. Transform and + analyze the data to identify trends and opportunities.

+
+
+
+ + +
+

+ Get Early Access +

+

+ We're currently onboarding select companies to our early access program. + Join now to get personalized onboarding and influence our product roadmap. +

+ +
+
+ +
+
+ +
+
+ +
+ +
+
+ + +
+

+ Frequently Asked Questions +

+ +
+ +
+

How does Flyscrape Cloud differ from using Flyscrape open-source?

+

Flyscrape Cloud provides the infrastructure, scheduling, storage, and team collaboration features needed to run your Flyscrape scripts at scale. You write scripts locally using the open-source tool, then deploy them to our cloud platform for reliable execution.

+
+ + +
+

What size companies is Flyscrape Cloud designed for?

+

We've designed Flyscrape Cloud for small to medium-sized businesses that need reliable web scraping but don't want to invest in building and maintaining their own infrastructure. Our platform scales with your needs.

+
+ + +
+

How does pricing work?

+

Pricing is based on usage volume, with plans starting for small teams and scaling up as your needs grow. Early access participants receive preferred pricing. Contact us for details specific to your use case.

+
+ + +
+

What kind of support is provided?

+

Early access customers receive personalized onboarding and dedicated support to ensure your scraping operations run smoothly. We work closely with you to configure the platform for your specific needs.

+
+
+
+ + + diff --git a/content/docs/_index.md b/content/docs/_index.md new file mode 100644 index 0000000..184d57d --- /dev/null +++ b/content/docs/_index.md @@ -0,0 +1,34 @@ +--- +title: 'Documentation' +linkTitle: 'Documentation' +sidebar: + open: true +--- + +## Introduction + +{{< cards >}} + {{< card link="getting-started" title="Getting started" icon="play" >}} + {{< card link="installation" title="Installation" icon="cog" >}} + {{< card link="reference-script" title="Reference Script" icon="clipboard-check" >}} + {{< card link="api-reference" title="API Reference" icon="credit-card" >}} +{{< /cards >}} + +## Configuration + +{{< cards >}} + {{< card link="configuration/starting-url" title="Starting URL" icon="play" >}} + {{< card link="configuration/depth" title="Depth" icon="arrow-down" >}} + {{< card link="configuration/domain-filter" title="Domain Filter" icon="cube-transparent" >}} + {{< card link="configuration/url-filter" title="URL Filter" icon="sparkles" >}} + {{< card link="configuration/link-following" title="Link Following" icon="link" >}} + {{< card link="configuration/concurrency" title="Concurrency" icon="paper-airplane" >}} + {{< card link="configuration/rate-limiting" title="Rate Limiting" icon="chart-square-bar" >}} + {{< card link="configuration/retry" title="Retry" icon="refresh" >}} + {{< card link="configuration/caching" title="Caching" icon="template" >}} + {{< card link="configuration/proxies" title="Proxies" icon="server" >}} + {{< card link="configuration/cookies" title="Cookies" icon="finger-print" >}} + {{< card link="configuration/headers" title="Headers" icon="sort-ascending" >}} + {{< card link="configuration/browser-mode" title="Browser Mode" icon="desktop-computer" >}} + {{< card link="configuration/output" title="Output File and Format" icon="presentation-chart-bar" >}} +{{< /cards >}} diff --git a/content/docs/api-reference.md b/content/docs/api-reference.md new file mode 100644 index 0000000..99ac2f9 --- /dev/null +++ b/content/docs/api-reference.md @@ -0,0 +1,61 @@ +--- +title: 'API Reference' +weight: 4 +next: '/docs/configuration' +--- + +## Query API + +```javascript{filename="Reference"} +//
+// <div class="element" foo="bar">Hey</div>
+const el = doc.find(".element")
+el.text()                                    // "Hey"
+el.html()                                    // `<div class="element" foo="bar">Hey</div>`
+el.attr("foo")                               // "bar"
+el.hasAttr("foo")                            // true
+el.hasClass("element")                       // true
+
+// <ul>
+//   <li class="a">Item 1</li>
+//   <li>Item 2</li>
+//   <li>Item 3</li>
+// </ul>
+const list = doc.find("ul")
+list.children()                              // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
+
+const items = list.find("li")
+items.length()                               // 3
+items.first()                                // <li class="a">Item 1</li>
+items.last()                                 // <li>Item 3</li>
+items.get(1)                                 // <li>Item 2</li>
+items.get(1).prev()                          // <li class="a">Item 1</li>
+items.get(1).next()                          // <li>Item 3</li>
+items.get(1).parent()                        // <ul>...</ul>
+items.get(1).siblings()                      // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
+items.map(item => item.text())               // ["Item 1", "Item 2", "Item 3"]
+items.filter(item => item.hasClass("a"))     // [<li class="a">Item 1</li>]
+```
+
+## Document Parsing
+
+```javascript{filename="Reference"}
+import { parse } from "flyscrape";
+
+const doc = parse(`<div class="foo">bar</div>
    `); +const text = doc.find(".foo").text(); +``` + +## File Downloads + +```javascript{filename="Reference"} +import { download } from "flyscrape/http"; + +download("http://example.com/image.jpg") // downloads as "image.jpg" +download("http://example.com/image.jpg", "other.jpg") // downloads as "other.jpg" +download("http://example.com/image.jpg", "dir/") // downloads as "dir/image.jpg" + +// If the server offers a filename via the Content-Disposition header and no +// destination filename is provided, Flyscrape will honor the suggested filename. +// E.g. `Content-Disposition: attachment; filename="archive.zip"` +download("http://example.com/generate_archive.php", "dir/") // downloads as "dir/archive.zip" +``` diff --git a/content/docs/configuration/_index.md b/content/docs/configuration/_index.md new file mode 100644 index 0000000..ac27c8b --- /dev/null +++ b/content/docs/configuration/_index.md @@ -0,0 +1,25 @@ +--- +title: 'Configuration' +weight: 4 +sidebar: + open: true +next: '/docs/configuration/starting-url' +prev: '/docs/api-reference' +--- + +{{< cards >}} + {{< card link="starting-url" title="Starting URL" icon="play" >}} + {{< card link="depth" title="Depth" icon="arrow-down" >}} + {{< card link="domain-filter" title="Domain Filter" icon="cube-transparent" >}} + {{< card link="url-filter" title="URL Filter" icon="sparkles" >}} + {{< card link="link-following" title="Link Following" icon="link" >}} + {{< card link="concurrency" title="Concurrency" icon="paper-airplane" >}} + {{< card link="rate-limiting" title="Rate Limiting" icon="chart-square-bar" >}} + {{< card link="retry" title="Retry" icon="refresh" >}} + {{< card link="caching" title="Caching" icon="template" >}} + {{< card link="proxies" title="Proxies" icon="server" >}} + {{< card link="cookies" title="Cookies" icon="finger-print" >}} + {{< card link="headers" title="Headers" icon="sort-ascending" >}} + {{< card link="browser-mode" title="Browser Mode" icon="desktop-computer" >}} + {{< card link="output" title="Output File and Format" icon="presentation-chart-bar" >}} +{{< /cards >}} diff --git a/content/docs/configuration/browser-mode.md b/content/docs/configuration/browser-mode.md new file mode 100644 index 0000000..bbb2c1e --- /dev/null +++ b/content/docs/configuration/browser-mode.md @@ -0,0 +1,40 @@ +--- +title: 'Browser Mode' +weight: 10 +--- + +The Browser Mode controls the interaction with a headless Chromium browser. Enabling the browser mode allows `flyscrape` to download a Chromium browser once and use it to render JavaScript-heavy pages. + +## Browser Mode + +To enable Browser Mode, set the `browser` option to `true` in your configuration. This allows `flyscrape` to use a headless Chromium browser for rendering JavaScript during the scraping process. + +```javascript {filename="Configuration"} +export const config = { + browser: true, +}; +``` + +In the above example, Browser Mode is enabled, allowing `flyscrape` to render pages that rely on JavaScript execution. + +## Headless Option + +The `headless` option, when combined with Browser Mode, controls whether the Chromium browser should run in headless mode or not. Headless mode means the browser operates without a graphical user interface, which can be useful for background processes. + +```javascript {filename="Configuration"} +export const config = { + browser: true, + headless: false, +}; +``` + +In this example, the Chromium browser will run in non-headless mode. If you set `headless` to `true`, the browser will run without a visible GUI. 
+ +```javascript {filename="Configuration"} +export const config = { + browser: true, + headless: true, +}; +``` + +In this example, the Chromium browser will run in headless mode, suitable for scenarios where graphical rendering is unnecessary. diff --git a/content/docs/configuration/caching.md b/content/docs/configuration/caching.md new file mode 100644 index 0000000..2c6766a --- /dev/null +++ b/content/docs/configuration/caching.md @@ -0,0 +1,36 @@ +--- +title: 'Caching' +weight: 7 +--- + +The `cache` config option allows you to enable file-based request caching. When enabled every request cached with its raw response. When the cache is populated and you re-run the scraper, requests will be served directly from cache. + +This also allows you to modify your scraping script afterwards and collect new results immediately. + +```javascript {filename="Configuration"} +export const config = { + url: "http://example.com/", + cache: "file", + // ... +}; +``` + +### Cache File + +When caching is enabled using the `cache: "file"` option, a `.cache` file will be created with the name of your scraping script. + +```bash {filename="Terminal"} +$ flyscrape run hackernews.js # Will populate: hackernews.cache +``` + +### Shared cache + +In case you want to share a cache between different scraping scripts, you can specify where to store the cache file. + +```javascript {filename="Configuration"} +export const config = { + url: "http://example.com/", + cache: "file:/some/path/shared.cache", + // ... +}; +``` diff --git a/content/docs/configuration/concurrency.md b/content/docs/configuration/concurrency.md new file mode 100644 index 0000000..0e5e181 --- /dev/null +++ b/content/docs/configuration/concurrency.md @@ -0,0 +1,18 @@ +--- +title: 'Concurrency' +weight: 6 +--- + +The concurrency setting controls the number of simultaneous requests that the scraper can make. This is specified in the configuration object of your scraping script. + +```javascript +export const config = { + // Specify the number of concurrent requests. + concurrency: 5, +}; +``` + +In the above example, the scraper will make up to 5 requests at the same time. + +If the concurrency setting is not specified, there is no limit to the number of concurrent requests. + diff --git a/content/docs/configuration/cookies.md b/content/docs/configuration/cookies.md new file mode 100644 index 0000000..f73d495 --- /dev/null +++ b/content/docs/configuration/cookies.md @@ -0,0 +1,36 @@ +--- +title: 'Cookies' +weight: 9 +--- + +The Cookies configuration in the `flyscrape` script's configuration object allows you to specify the behavior of the cookie store during the scraping process. Cookies are often used for authentication and session management on websites. + +## Cookies Configuration + +To configure the cookie store behavior, set the `cookies` field in your configuration. The `cookies` option supports three values: `"chrome"`, `"edge"`, and `"firefox"`. Each value corresponds to using the cookie store of the respective local browser. + +When the `cookies` option is set to `"chrome"`, `"edge"`, or `"firefox"`, `flyscrape` utilizes the cookie store of the user's installed browser. + +```javascript {filename="Configuration"} +export const config = { + cookies: "chrome", +}; +``` + +In the above example, the `cookies` option is set to `"chrome"`, indicating that `flyscrape` should use the cookie store of the local Chrome browser. 
+ +```javascript {filename="Configuration"} +export const config = { + cookies: "firefox", +}; +``` + +In this example, the `cookies` option is set to `"firefox"`, instructing `flyscrape` to use the cookie store of the local Firefox browser. + +```javascript {filename="Configuration"} +export const config = { + cookies: "edge", +}; +``` + +In this example, the `cookies` option is set to `"edge"`, indicating that `flyscrape` should use the cookie store of the local Edge browser. diff --git a/content/docs/configuration/depth.md b/content/docs/configuration/depth.md new file mode 100644 index 0000000..d100470 --- /dev/null +++ b/content/docs/configuration/depth.md @@ -0,0 +1,24 @@ +--- +title: 'Depth' +weight: 2 +--- + +The `depth` config option allows you to specify how deep the scraping process should follow links from the initial URL. + +When no value is provided or `depth` is set to `0` link following is disabled and it will only scrape the initial URL. + +```javascript {filename="Configuration"} +export const config = { + url: "http://example.com/", + depth: 2, + // ... +}; +``` + +With the config provided in the example the scraper would follow links like this: + +``` +http://example.com/ (depth = 0, initial URL) +↳ http://example.com/deeply (depth = 1) + ↳ http://example.com/deeply/nested (depth = 2) +``` diff --git a/content/docs/configuration/domain-filter.md b/content/docs/configuration/domain-filter.md new file mode 100644 index 0000000..184ee2f --- /dev/null +++ b/content/docs/configuration/domain-filter.md @@ -0,0 +1,43 @@ +--- +title: 'Domain Filter' +weight: 3 +--- + +The `allowedDomains` and `blockedDomains` config options allow you to specify a list of domains which are accessible or blocked during scraping. + +```javascript {filename="Configuration"} +export const options = { + url: "http://example.com/", + allowedDomains: ["subdomain.example.com"], + // ... +}; +``` + +## Allowed Domains + +This config option controls which additional domains are allowed to be visted during scraping. The domain of the initial URL is always allowed. + +You can also allow all domains to be accessible by setting `allowedDomains` to `["*"]`. To then further restrict access, you can specify `blockedDomains`. + +```javascript {filename="Configuration"} +export const options = { + url: "http://example.com/", + allowedDomains: ["*"], + // ... +}; +``` + +## Blocked Domains + +This config option controls which additional domains are blocked from being accessed. By default all domains other than the domain of the initial URL or those specified in `allowedDomains` are blocked. + +You can best use `blockedDomains` in conjunction with `allowedDomains: ["*"]`, allowing the scraping process to access all domains except what's specified in `blockedDomains`. + +```javascript {filename="Configuration"} +export const options = { + url: "http://example.com/", + allowedDomains: ["*"], + blockedDomains: ["google.com", "bing.com"], + // ... +}; +``` diff --git a/content/docs/configuration/headers.md b/content/docs/configuration/headers.md new file mode 100644 index 0000000..2b8f82c --- /dev/null +++ b/content/docs/configuration/headers.md @@ -0,0 +1,17 @@ +--- +title: 'Headers' +weight: 9 +--- + +The `headers` config option allows you to specify the custom HTTP headers sent with each request. + +```javascript {filename="Configuration"} +export const config = { + headers: { + "Authorization": "Bearer ey....", + "User-Agent": "Mozilla/5.0 (Macintosh ...", + }, + // ... 
+}; +``` + diff --git a/content/docs/configuration/link-following.md b/content/docs/configuration/link-following.md new file mode 100644 index 0000000..b9755f7 --- /dev/null +++ b/content/docs/configuration/link-following.md @@ -0,0 +1,33 @@ +--- +title: 'Link Following' +weight: 5 +--- + +The `follow` config option allows you to specify a list of CSS selectors that determine which links the scraper should follow. + +When no value is provided the scraper will follow all links found with the `a[href]` selector. + +```javascript {filename="Configuration"} +export const config = { + url: "http://example.com/", + follow: [ + ".pagination > a[href]", + ".nav a[href]", + ], + // ... +}; +``` + +## Following non `href` attributes + +For special cases where the link is not to be found in the `href`, you specify a selector with a different ending attribute. + +```javascript {filename="Configuration"} +export const config = { + url: "http://example.com/", + follow: [ + ".articles > div[data-url]", + ], + // ... +}; +``` diff --git a/content/docs/configuration/output.md b/content/docs/configuration/output.md new file mode 100644 index 0000000..2470865 --- /dev/null +++ b/content/docs/configuration/output.md @@ -0,0 +1,47 @@ +--- +title: 'Output File and Format' +weight: 10 +--- + +The output file and format are specified in the configuration object of your scraping script. They determine where the scraped data will be saved and in what format. + +## Output File + +The output file is the file where the scraped data will be saved. If not specified, the data will be printed to the standard output (stdout). + +```javascript {filename="Configuration"} +export const config = { + output: { + // Specify the output file. + file: "results.json", + }, +}; +``` + +In the above example, the scraped data will be saved in a file named `results.json`. + +## Output Format + +The output format is the format in which the scraped data will be saved. The options are `json` and `ndjson`. + +```javascript {filename="Configuration"} +export const config = { + output: { + // Specify the output format. + format: "json", + }, +}; +``` + +In the above example, the scraped data will be saved in JSON format. + +```javascript {filename="Configuration"} +export const config = { + output: { + // Specify the output format. + format: "ndjson", + }, +}; +``` + +In this example, the scraped data will be saved in newline-delimited JSON (NDJSON) format. Each line in the output file will be a separate JSON object. diff --git a/content/docs/configuration/proxies.md b/content/docs/configuration/proxies.md new file mode 100644 index 0000000..913630d --- /dev/null +++ b/content/docs/configuration/proxies.md @@ -0,0 +1,33 @@ +--- +title: 'Proxies' +weight: 8 +--- + +The proxy feature allows you to route your scraping requests through a specified HTTP(S) proxy. This can be useful for bypassing IP-based rate limits or accessing region-restricted content. + +```javascript +export const config = { + // Specify a single HTTP(S) proxy URL. + proxy: "http://someproxy.com:8043", +}; +``` + +In the above example, all scraping requests will be routed through the proxy at `http://someproxy.com:8043`. + +## Multiple Proxies + +You can also specify multiple proxy URLs. The scraper will rotate between these proxies for each request. + +```javascript +export const config = { + // Specify multiple HTTP(S) proxy URLs. 
+ proxies: [ + "http://someproxy.com:8043", + "http://someotherproxy.com:8043", + ], +}; +``` + +In this example, the scraper will randomly pick between the proxies at `http://someproxy.com:8043` and `http://someotherproxy.com:8043`. + +Note: If both `proxy` and `proxies` are specified, all proxies will be respected. diff --git a/content/docs/configuration/rate-limiting.md b/content/docs/configuration/rate-limiting.md new file mode 100644 index 0000000..4b5bf9c --- /dev/null +++ b/content/docs/configuration/rate-limiting.md @@ -0,0 +1,15 @@ +--- +title: 'Rate Limiting' +weight: 6 +--- + +The `rate` config option allows you to specify at which rate the scraper should send out requests. The rate is measured in _Requests per Minute_ (RPM). + +When no `rate` is specified, rate limiting is disabled and the scraper will send out requests as fast as it can. + +```javascript {filename="Configuration"} +export const options = { + url: "http://example.com/", + rate: 100, +}; +``` diff --git a/content/docs/configuration/retry.md b/content/docs/configuration/retry.md new file mode 100644 index 0000000..cf00698 --- /dev/null +++ b/content/docs/configuration/retry.md @@ -0,0 +1,26 @@ +--- +title: 'Retry' +weight: 6 +--- + +The retry feature allows the scraper to automatically retry failed requests. This is particularly useful when dealing with unstable networks or servers that occasionally return error status codes. + +The retry feature is automatically enabled and will retry requests that return the following HTTP status codes: + +- 403 Forbidden +- 408 Request Timeout +- 425 Too Early +- 429 Too Many Requests +- 500 Internal Server Error +- 502 Bad Gateway +- 503 Service Unavailable +- 504 Gateway Timeout + +### Retry Delays + +After a failed request, the scraper will wait for a certain amount of time before retrying the request. The delay increases with each consecutive failed attempt, according to the following schedule: + +- 1st retry: 1 second delay +- 2nd retry: 2 seconds delay +- 3rd retry: 5 seconds delay +- 4th retry: 10 seconds delay diff --git a/content/docs/configuration/starting-url.md b/content/docs/configuration/starting-url.md new file mode 100644 index 0000000..6b60d7e --- /dev/null +++ b/content/docs/configuration/starting-url.md @@ -0,0 +1,29 @@ +--- +title: 'Starting URL' +weight: 1 +prev: '/docs/configuration' +--- + +The `url` config option allows you to specify the initial URL at which the scraper should start its scraping process. + +```javascript {filename="Configuration"} +export const config = { + url: "http://example.com/", + // ... +}; +``` + +## Multiple starting URLs + +In case you have more than one URL you want to scrape (or to start from) you can specify them with the `urls` config option. + +```javascript {filename="Configuration"} +export const config = { + urls: [ + "http://example.com/", + "http://anothersite.com/", + "http://yetanothersite.com/", + ], + // ... +}; +``` diff --git a/content/docs/configuration/url-filter.md b/content/docs/configuration/url-filter.md new file mode 100644 index 0000000..80d3544 --- /dev/null +++ b/content/docs/configuration/url-filter.md @@ -0,0 +1,42 @@ +--- +title: 'URL Filter' +weight: 4 +prev: /docs/getting-started +--- + +The `allowedURLs` and `blockedURLs` config options allow you to specify a list of URL patterns (in form of regular expressions) which are accessible or blocked during scraping. 
+ +```javascript {filename="Configuration"} +export const options = { + url: "http://example.com/", + allowedURLs: ["/articles/.*", "/authors/.*"], + blockedURLs: ["/authors/admin"], + // ... +}; +``` + +## Allowed URLs + +This config option controls which URLs are allowed to be visted during scraping. When no value is provided all URLs are allowed to be visited if not otherwise blocked. + +When a list of URL patterns is provided, only URLs matching one or more of these patterns are allowed to be visted. + +```javascript {filename="Configuration"} +export const options = { + url: "http://example.com/", + allowedURLs: ["/products/"], +}; +``` + +## Blocked URLs + +This config option controls which URLs are blocked from being visted during scraping. + +When a list of URL patterns is provided, URLs matching one or more of these patterns are blocked from to be visted. + +```javascript {filename="Configuration"} +export const options = { + url: "http://example.com/", + blockedURLs: ["/restricted"], +}; +``` diff --git a/content/docs/full-example-script.md b/content/docs/full-example-script.md new file mode 100644 index 0000000..41cccf5 --- /dev/null +++ b/content/docs/full-example-script.md @@ -0,0 +1,115 @@ +--- +title: 'Full Example Script' +weight: 3 +--- + +This script serves as a reference that show all features of Flyscrape and how to use them. Feel free to copy and paste this as a starter script. + +```javascript{filename="Reference"} +import { parse } from "flyscrape"; +import { download } from "flyscrape/http"; +import http from "flyscrape/http"; + +export const config = { + // Specify the URL to start scraping from. + url: "https://example.com/", + + // Specify the multiple URLs to start scraping from. (default = []) + urls: [ + "https://anothersite.com/", + "https://yetanother.com/", + ], + + // Enable rendering with headless browser. (default = false) + browser: true, + + // Specify if browser should be headless or not. (default = true) + headless: false, + + // Specify how deep links should be followed. (default = 0, no follow) + depth: 5, + + // Speficy the css selectors to follow. (default = ["a[href]"]) + follow: [".next > a", ".related a"], + + // Specify the allowed domains. ['*'] for all. (default = domain from url) + allowedDomains: ["example.com", "anothersite.com"], + + // Specify the blocked domains. (default = none) + blockedDomains: ["somesite.com"], + + // Specify the allowed URLs as regex. (default = all allowed) + allowedURLs: ["/posts", "/articles/\d+"], + + // Specify the blocked URLs as regex. (default = none) + blockedURLs: ["/admin"], + + // Specify the rate in requests per minute. (default = no rate limit) + rate: 60, + + // Specify the number of concurrent requests. (default = no limit) + concurrency: 1, + + // Specify a single HTTP(S) proxy URL. (default = no proxy) + // Note: Not compatible with browser mode. + proxy: "http://someproxy.com:8043", + + // Specify multiple HTTP(S) proxy URLs. (default = no proxy) + // Note: Not compatible with browser mode. + proxies: [ + "http://someproxy.com:8043", + "http://someotherproxy.com:8043", + ], + + // Enable file-based request caching. (default = no cache) + cache: "file", + + // Specify the HTTP request header. (default = none) + headers: { + "Authorization": "Bearer ...", + "User-Agent": "Mozilla ...", + }, + + // Use the cookie store of your local browser. (default = off) + // Options: "chrome" | "edge" | "firefox" + cookies: "chrome", + + // Specify the output options. + output: { + // Specify the output file. 
(default = stdout)
+        file: "results.json",
+
+        // Specify the output format. (default = json)
+        // Options: "json" | "ndjson"
+        format: "json",
+    },
+};
+
+export default function ({ doc, url, absoluteURL }) {
+    // doc              - Contains the parsed HTML document
+    // url              - Contains the scraped URL
+    // absoluteURL(...) - Transforms relative URLs into absolute URLs
+
+    // Find all users.
+    const userlist = doc.find(".user")
+
+    // Download the profile picture of each user.
+    userlist.each(user => {
+        const name = user.find(".name").text()
+        const pictureURL = absoluteURL(user.find("img").attr("src"));
+
+        download(pictureURL, `profile-pictures/${name}.jpg`)
+    })
+
+    // Return each user's name, address and age.
+    return {
+        users: userlist.map(user => {
+            const name = user.find(".name").text()
+            const address = user.find(".address").text()
+            const age = user.find(".age").text()
+
+            return { name, address, age };
+        })
+    };
+}
+```

diff --git a/content/docs/getting-started.md b/content/docs/getting-started.md
new file mode 100644
index 0000000..386104d
--- /dev/null
+++ b/content/docs/getting-started.md
@@ -0,0 +1,123 @@
+---
+title: 'Getting Started'
+weight: 1
+---
+
+In this quick guide we will go over the core functionality of Flyscrape and how to use it. Make sure you've got `flyscrape` up and running on your system.
+
+The quickest way to install Flyscrape on Mac, Linux or WSL is to run the following command. For more information, or to install it on Windows, check out the [installation instructions](/docs/installation).
+
+```bash {filename="Terminal"}
+curl -fsSL https://flyscrape.com/install | bash
+```
+
+## Overview
+
+Flyscrape is a standalone scraping tool that works with so-called _scraping scripts_.
+
+Scraping scripts let you define what data you want to extract from a website using familiar JavaScript code you might recognize from jQuery or cheerio. Inside your scraping script, you can also configure how Flyscrape should behave, e.g. what links to follow, what domains to access, how fast to send out requests, etc.
+
+When you're happy with the initial version of your scraping script, you can run Flyscrape and it will go off and start scraping the websites you have defined.
+
+## Your first Scraping Script
+
+A new scraping script can be created using the `new` command. This script is meant as a helpful guide to let you explore the JavaScript API.
+
+Go ahead and run the following command:
+```sh {filename="Terminal"}
+flyscrape new hackernews.js
+```
+
+This should have created a new file called `hackernews.js` in your current directory. You can open it up in your favorite text editor.
+
+## Anatomy of a Scraping Script
+
+Let's look at the previously created `hackernews.js` file and go through it together. Every scraping script consists of two main parts:
+
+### Configuration
+
+The configuration is used to control the scraping behaviour. Here we can specify which URLs to scrape, how deep links should be followed, or which domains are allowed to be accessed. Besides these, there are plenty more options to explore.
+
+```javascript {filename="Configuration"}
+export const config = {
+  url: "https://hackernews.com",
+  // depth: 0,
+  // allowedDomains: [],
+  // ...
+}
+```
+
+### Data Extraction Logic
+
+The data extraction logic defines what data to extract from a website. In this example, it grabs the posts from the website using the `doc` document object and extracts the individual links and their titles.
The `absoluteURL` function is used to ensure that every relative link is converted into an absolute one.
+
+```javascript {filename="Data Extraction Logic"}
+export default function({ doc, absoluteURL }) {
+  const title = doc.find("title");
+  const posts = doc.find(".athing");
+
+  return {
+    title: title.text(),
+    posts: posts.map((post) => {
+      const link = post.find(".titleline > a");
+
+      return {
+        title: link.text(),
+        url: absoluteURL(link.attr("href")),
+      };
+    }),
+  };
+}
+```
+
+### Starting the Development Mode
+
+Flyscrape has a built-in Development Mode that allows you to quickly iterate and see changes to your script immediately.
+It does so by watching your script for changes and re-running the Data Extraction Logic against a cached version of the website.
+
+Let's fire that up using the following command:
+
+```sh {filename="Terminal"}
+flyscrape dev hackernews.js
+```
+
+You should now see the extracted data of your target website. Note that no links are followed in this mode, even when otherwise specified in the configuration.
+
+Now let's change our script to extract some more data, like the user who submitted the post.
+
+```diff {filename="hackernews.js"}
+   return {
+     title: title.text(),
+     posts: posts.map((post) => {
+       const link = post.find(".titleline > a");
++      const meta = post.next();
+
+       return {
+         title: link.text(),
+         url: absoluteURL(link.attr("href")),
++        user: meta.find(".hnuser").text(),
+       };
+     }),
+   };
+```
+
+When you save the file and look at your terminal again, the changes should be reflected and the user added to each of the posts.
+
+Once you're happy with the extraction logic, you can exit by pressing CTRL+C.
+
+### Running the Scraper
+
+Now that your scraping script is configured and the extraction logic is in place, you can use the `run` command to execute the scraper.
+
+```sh {filename="Terminal"}
+flyscrape run hackernews.js
+```
+
+This should output a JSON array of all scraped pages.
+
+### Learn more
+
+Once you're done experimenting, feel free to check out many of Flyscrape's other features. There is plenty to customize for your specific needs.
+
+- [Full Example Script](/docs/full-example-script)
+- [API Reference](/docs/api-reference)

diff --git a/content/docs/installation.md b/content/docs/installation.md
new file mode 100644
index 0000000..4dfb213
--- /dev/null
+++ b/content/docs/installation.md
@@ -0,0 +1,91 @@
+---
+title: 'Installation'
+weight: 2
+---
+
+## Recommended
+
+The easiest way to install Flyscrape is to use the following command. Note: This only works on macOS, Linux and WSL (Windows Subsystem for Linux).
+
+```bash {filename="Terminal"}
+curl -fsSL https://flyscrape.com/install | bash
+```
+
+## Alternative 1: Homebrew (macOS)
+
+If you are on macOS, you can install Flyscrape via [Homebrew](https://formulae.brew.sh/formula/flyscrape).
+
+```bash {filename="Terminal"}
+brew install flyscrape
+```
+
+Otherwise you can download and install Flyscrape by using one of the pre-compiled binaries.
+
+## Alternative 2: Manual installation (all systems)
+
+Whether you are on macOS, Linux or Windows, you can download one of the following archives
+to your local machine or visit the [releases page on GitHub](https://github.com/philippta/flyscrape/releases).
+ +#### macOS +{{< cards >}} +{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_darwin_arm64.tar.gz" title="macOS (Apple Silicon)" icon="cloud-download" >}} +{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_darwin_amd64.tar.gz" title="macOS (Intel)" icon="cloud-download" >}} +{{< /cards >}} + +#### Linux +{{< cards >}} +{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_linux_amd64.tar.gz" title="Linux" icon="cloud-download" >}} +{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_linux_arm64.tar.gz" title="Linux (arm64)" icon="cloud-download" >}} +{{< /cards >}} + +#### Windows +{{< cards >}} +{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_windows_amd64.zip" title="Windows" icon="cloud-download" >}} +{{< card link="https://github.com/philippta/flyscrape/releases/latest/download/flyscrape_0.9.0_windows_arm64.zip" title="Windows" icon="cloud-download" >}} +{{< /cards >}} + +### Unpack + +Unpack the downloaded archive by double-clicking on it or using the command line: + +```bash {filename="Terminal"} +tar xf flyscrape__.tar.gz +``` + +After unpacking you should find a folder with the same name as the archive, which contains the `flyscrape` executable. +Change directory into it using: + +```bash {filename="Terminal"} +cd flyscrape__/ +``` + +### Install + +In order to make the `flyscrape` executable globally available, you can move it to either location in your `$PATH` variable. +A good default location for that is `/usr/local/bin`. So move it using the following command: + +```bash {filename="Terminal"} +mv flyscrape /usr/local/bin/flyscrape +``` + +### Verify + +From here on you should be able to run `flyscrape` from any directory on your machine. To verify you can run the following command. +If everything went to plan you should see Flyscrapes help text: + +```text {filename="Terminal"} +flyscrape --help +``` +```text {filename="Terminal"} +flyscrape is a standalone and scriptable web scraper for efficiently extracting data from websites. + +Usage: + + flyscrape [arguments] + +Commands: + + new creates a sample scraping script + run runs a scraping script + dev watches and re-runs a scraping script +``` diff --git a/content/proxy.html b/content/proxy.html new file mode 100644 index 0000000..7977164 --- /dev/null +++ b/content/proxy.html @@ -0,0 +1,111 @@ +--- +title: Flyscrape Proxyᴮᴱᵀᴬ +layout: home +--- + +
    +

    + Stop getting blocked. +

    + +

+ Flyscrape Proxyᴮᴱᵀᴬ is a proxy service that allows you to get around firewalls or render websites with a real, undetected browser. Choose between countries, use Auto IP Rotation, or enable browser rendering.

    + + +
    + + +
    + +
    + +
    +
    + +
    +

    + Perfect Bot / Human Score +

    +

+ Enable browser rendering with browser=true to get around the most difficult anti-bot challenges or simply to render JavaScript-intensive websites. Using a real, undetected browser makes it appear as if the request came from a human.

    +
    + + +
    + + +
    + +
    + + +
    +

    + Ditch your Proxy List +

    +

+ Stop worrying about maintaining a huge list of proxies. With Flyscrape Proxyᴮᴱᵀᴬ, you only need a single proxy URL that automatically rotates your IP address across the entire globe on every request.

    +
    + +
    + + +
    + +
    + +
    +

    + Bypass Geo-Restrictions +

    +

+ Send requests from countries all across the globe. The country selector country=netherlands allows you to pick from a list of 40 different countries to bypass any geo-blocking firewall.

    +
    + + +
    + + + +
    + + +

    + Want to give Flyscrape Proxyᴮᴱᵀᴬ a try? +

    +

    + Get your proxy URL and access token now by signing up with Flyscrape Proxyᴮᴱᵀᴬ.
    It's completely free. No credit card required. No BS. +

    + +
    + + + -- cgit v1.2.3