flyscrape is a standalone and scriptable web scraper, combining the speed of Go with the flexibility of JavaScript. Focus on data extraction rather than request juggling.



## Features

- **Highly Configurable:** 10 options to fine-tune your scraper.
- **Standalone:** flyscrape comes as a single binary executable.
- **Scriptable:** Use JavaScript to write your data extraction logic.
- **Simple API:** Extract data from HTML pages with a familiar API.
- **Fast Iteration:** Use the development mode to get quick feedback.
- **Request Caching:** Re-run scripts on websites you already scraped.
- **Zero Dependencies:** No need to fill up your disk with npm packages.

## Example script

```javascript
export const config = {
    url: "https://news.ycombinator.com/",
}

export default function ({ doc, absoluteURL }) {
    const title = doc.find("title");
    const posts = doc.find(".athing");

    return {
        title: title.text(),
        posts: posts.map((post) => {
            const link = post.find(".titleline > a");

            return {
                title: link.text(),
                url: link.attr("href"),
            };
        }),
    }
}
```

```bash
$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/",
    "data": {
      "title": "Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - A standalone and scriptable web scraper",
          "url": "https://flyscrape.com/"
        },
        ...
      ]
    }
  }
]
```

## Installation

### Pre-compiled binary

`flyscrape` is available for macOS, Linux and Windows as a downloadable binary from the [releases page](https://github.com/philippta/flyscrape/releases).

### Compile from source

To compile flyscrape from source, follow these steps:

1. Install Go: Make sure you have Go installed on your system. If not, you can download it from [https://golang.org/](https://golang.org/).

2. Install flyscrape: Open a terminal and run the following command:

   ```bash
   go install github.com/philippta/flyscrape/cmd/flyscrape@latest
   ```

## Usage

```
flyscrape is a standalone and scriptable web scraper for efficiently extracting data from websites.

Usage:

    flyscrape <command> [arguments]

Commands:

    new    creates a sample scraping script
    run    runs a scraping script
    dev    watches and re-runs a scraping script
```

## Configuration

Below is an example scraping script that showcases the capabilities of flyscrape. For full documentation of all configuration options, visit the [documentation page](docs/readme.md#configuration).

```javascript
export const config = {
    url: "https://example.com/",  // Specify the URL to start scraping from.
    depth: 0,                     // Specify how deep links should be followed.  (default = 0, no follow)
    follow: [],                   // Specify the CSS selectors to follow.        (default = ["a[href]"])
    allowedDomains: [],           // Specify the allowed domains. ['*'] for all. (default = domain from url)
    blockedDomains: [],           // Specify the blocked domains.                (default = none)
    allowedURLs: [],              // Specify the allowed URLs as regex.          (default = all allowed)
    blockedURLs: [],              // Specify the blocked URLs as regex.          (default = none)
    rate: 100,                    // Specify the rate in requests per second.    (default = no rate limit)
    proxies: [],                  // Specify the HTTP(S) proxy URLs.             (default = no proxy)
    cache: "file",                // Enable file-based request caching.          (default = no cache)
};

export default function ({ doc, url, absoluteURL }) {
    // doc              - Contains the parsed HTML document
    // url              - Contains the scraped URL
    // absoluteURL(...) - Transforms relative URLs into absolute URLs
}
```

## Query API

```javascript
// <div class="element" foo="bar">Hey</div>
const el = doc.find(".element")
el.text()                                 // "Hey"
el.html()                                 // `<div class="element" foo="bar">Hey</div>`
el.attr("foo")                            // "bar"
el.hasAttr("foo")                         // true
el.hasClass("element")                    // true

// <ul>
//   <li class="a">Item 1</li>
//   <li>Item 2</li>
//   <li>Item 3</li>
// </ul>
const list = doc.find("ul")
list.children()                           // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

const items = list.find("li")
items.length()                            // 3
items.first()                             // <li class="a">Item 1</li>
items.last()                              // <li>Item 3</li>
items.get(1)                              // <li>Item 2</li>
items.get(1).prev()                       // <li class="a">Item 1</li>
items.get(1).next()                       // <li>Item 3</li>
items.get(1).parent()                     // <ul>...</ul>
items.get(1).siblings()                   // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
items.map(item => item.text())            // ["Item 1", "Item 2", "Item 3"]
items.filter(item => item.hasClass("a"))  // [<li class="a">Item 1</li>]
```

## Issues and Suggestions

If you encounter any issues or have suggestions for improvement, please [submit an issue](https://github.com/philippta/flyscrape/issues).
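
## Sketch: resolving relative URLs

The Configuration section describes `absoluteURL(...)` as a helper that transforms relative URLs into absolute ones. This matters for pages such as Hacker News, where some `href` values are relative. The snippet below is only a sketch of that idea using the standard `URL` constructor, so it can be run outside flyscrape. `resolveAgainst` and the sample links are hypothetical and not part of flyscrape. Passing the base URL explicitly is an assumption made for illustration and may not match how flyscrape's own helper is called.

```javascript
// Hypothetical stand-in for the idea behind absoluteURL(...).
// It is NOT flyscrape's implementation; it uses the standard URL API.
function resolveAgainst(base, link) {
    // new URL(link, base) resolves `link` relative to `base`.
    // Links that are already absolute are returned unchanged.
    return new URL(link, base).href;
}

// Sample values chosen for illustration only.
const base = "https://news.ycombinator.com/";
const links = ["item?id=1", "/newest", "https://flyscrape.com/"];

const resolved = links.map((link) => resolveAgainst(base, link));
console.log(resolved);
```

Inside a scraping script, the corresponding step would presumably be to wrap the extracted attribute, for example `absoluteURL(link.attr("href"))`, before returning it.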