flyscrape is a standalone and scriptable web scraper, combining the speed of Go with the flexibility of JavaScript.

Focus on data extraction rather than request juggling.
Installation · Documentation · Releases
## Features

- **Highly Configurable:** 10 options to fine-tune your scraper.
- **Standalone:** flyscrape comes as a single binary executable.
- **Scriptable:** Use JavaScript to write your data extraction logic.
- **Simple API:** Extract data from HTML pages with a familiar API.
- **Fast Iteration:** Use the development mode to get quick feedback.
- **Request Caching:** Re-run scripts on websites you already scraped.
- **Zero Dependencies:** No need to fill up your disk with npm packages.

## Example script

```javascript
export const config = {
  url: "https://news.ycombinator.com/",

  // urls: [],            // Specify additional URLs to start from. (default = none)
  // depth: 0,            // Specify how deep links should be followed. (default = 0, no follow)
  // follow: [],          // Specify the CSS selectors to follow. (default = ["a[href]"])
  // allowedDomains: [],  // Specify the allowed domains. ['*'] for all. (default = domain from url)
  // blockedDomains: [],  // Specify the blocked domains. (default = none)
  // allowedURLs: [],     // Specify the allowed URLs as regex. (default = all allowed)
  // blockedURLs: [],     // Specify the blocked URLs as regex. (default = none)
  // rate: 100,           // Specify the rate in requests per second. (default = no rate limit)
  // proxies: [],         // Specify the HTTP(S) proxy URLs. (default = no proxy)
  // cache: "file",       // Enable file-based request caching. (default = no cache)
}

export default function ({ doc, absoluteURL }) {
  const title = doc.find("title");
  const posts = doc.find(".athing");

  return {
    title: title.text(),
    posts: posts.map((post) => {
      const link = post.find(".titleline > a");
      return {
        title: link.text(),
        url: link.attr("href"),
      };
    }),
  }
}
```

```bash
$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/",
    "data": {
      "title": "Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - A standalone and scriptable web scraper",
          "url": "https://flyscrape.com/"
        },
        ...
      ]
    }
  }
]
```

Check out the [examples folder](examples) for more detailed examples.

## Installation

### Pre-compiled binary

`flyscrape` is available for macOS, Linux and Windows as a downloadable binary from the [releases page](https://github.com/philippta/flyscrape/releases).

### Compile from source

To compile flyscrape from source, follow these steps:

1. Install Go: Make sure you have Go installed on your system. If not, you can download it from [https://golang.org/](https://golang.org/).

2. Install flyscrape: Open a terminal and run the following command:

   ```bash
   go install github.com/philippta/flyscrape/cmd/flyscrape@latest
   ```

## Usage

```
Usage:

    flyscrape run SCRIPT [config flags]

Examples:

    # Run the script.
    $ flyscrape run example.js

    # Set the URL as argument.
    $ flyscrape run example.js --url "http://other.com"

    # Enable proxy support.
    $ flyscrape run example.js --proxies "http://someproxy:8043"

    # Follow paginated links.
    $ flyscrape run example.js --depth 5 --follow ".next-button > a"
```
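The development mode mentioned under Features is meant for quickly iterating on your selectors. As a sketch of that workflow (the `dev` subcommand name is assumed here; consult the built-in help for the exact invocation), it re-runs the script whenever the file changes:

```bash
# Assumed workflow: watch the script and re-run the extraction on
# every save, printing the freshly extracted data each time.
$ flyscrape dev hackernews.js
```

Combined with the `cache: "file"` option, repeated runs can be served from the local request cache instead of re-fetching the page.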
## Configuration

Below is an example scraping script that showcases the capabilities of flyscrape. For full documentation of all configuration options, visit the [documentation page](docs/readme.md#configuration).

```javascript
export const config = {
  url: "https://example.com/",    // Specify the URL to start scraping from.

  urls: [                         // Specify the URL(s) to start scraping from. If both `url` and `urls`
    "https://example.com/foo",    // are provided, all of the specified URLs will be scraped.
    "https://example.com/bar",
  ],

  depth: 0,            // Specify how deep links should be followed. (default = 0, no follow)
  follow: [],          // Specify the CSS selectors to follow. (default = ["a[href]"])
  allowedDomains: [],  // Specify the allowed domains. ['*'] for all. (default = domain from url)
  blockedDomains: [],  // Specify the blocked domains. (default = none)
  allowedURLs: [],     // Specify the allowed URLs as regex. (default = all allowed)
  blockedURLs: [],     // Specify the blocked URLs as regex. (default = none)
  rate: 100,           // Specify the rate in requests per second. (default = no rate limit)
  proxies: [],         // Specify the HTTP(S) proxy URLs. (default = no proxy)
  cache: "file",       // Enable file-based request caching. (default = no cache)

  headers: {           // Specify the HTTP request headers. (default = none)
    "Authorization": "Basic ZGVtbzpwQDU1dzByZA==",
    "User-Agent": "Gecko/1.0",
  },
};

export function setup() {
  // Optional setup function, called once before scraping starts.
  // Can be used for authentication.
}

export default function ({ doc, url, absoluteURL }) {
  // doc              - Contains the parsed HTML document.
  // url              - Contains the scraped URL.
  // absoluteURL(...) - Transforms relative URLs into absolute URLs.
}
```

## Query API
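The examples above already exercise the core query methods. Below is a minimal sketch using only those calls (`find`, `text`, `attr`, `map`) plus the `absoluteURL` helper; the selectors and URLs are made up for illustration:

```javascript
export default function ({ doc, absoluteURL }) {
  // Select the first matching element and read its inner text.
  const heading = doc.find("h1");
  const title = heading.text();

  // Read an attribute and resolve it against the page URL.
  // ("a.readmore" is a hypothetical selector for illustration.)
  const link = doc.find("a.readmore");
  const href = absoluteURL(link.attr("href"));

  // Select a list of elements and map over them.
  const items = doc.find("ul > li").map((item) => item.text());

  return { title, href, items };
}
```

Anything beyond these methods is out of scope here; see the [documentation page](docs/readme.md) for the complete query API.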