flyscrape is a standalone and scriptable web scraper, combining the speed of Go with the flexibility of JavaScript. Focus on data extraction rather than request juggling.


## Features

- Domains and URL filtering
- Depth control
- Request caching
- Rate limiting
- HTTP(s) Proxy support
- Development mode
- Single binary executable

## Example script

```javascript
export const config = {
    url: "https://news.ycombinator.com/",
}

export default function ({ doc, absoluteURL }) {
    const title = doc.find("title");
    const posts = doc.find(".athing");

    return {
        title: title.text(),
        posts: posts.map((post) => {
            const link = post.find(".titleline > a");

            return {
                title: link.text(),
                url: link.attr("href"),
            };
        }),
    }
}
```

```bash
$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/",
    "data": {
      "title": "Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - An expressive and elegant web scraper",
          "url": "https://flyscrape.com"
        },
        ...
      ]
    }
  }
]
```

## Installation

### Pre-compiled binary

`flyscrape` is available for macOS, Linux and Windows as a downloadable binary from the [releases page](https://github.com/philippta/flyscrape/releases).

### Compile from source

To compile flyscrape from source, follow these steps:

1. Install Go: Make sure you have Go installed on your system. If not, you can download it from [https://golang.org/](https://golang.org/).

2. Install flyscrape: Open a terminal and run the following command:

   ```bash
   go install github.com/philippta/flyscrape/cmd/flyscrape@latest
   ```

## Usage

```
flyscrape is a standalone and scriptable web scraper for efficiently extracting data from websites.

Usage:

    flyscrape [arguments]

Commands:

    new    creates a sample scraping script
    run    runs a scraping script
    dev    watches and re-runs a scraping script
```

### Create a new sample scraping script

The `new` command creates a boilerplate sample script to help you get started.

```
flyscrape new example.js
```

### Watch the script for changes during development

The `dev` command allows you to watch your scraping script for changes and quickly iterate during development.
In development mode, flyscrape will not follow any links and request caching is enabled.

```
flyscrape dev example.js
```

### Run the scraping script

The `run` command allows you to run your script.

```
flyscrape run example.js
```

## Configuration

Below is an example scraping script that showcases the capabilities of flyscrape:

```javascript
export const config = {
    url: "https://example.com/",   // Specify the URL to start scraping from.
    depth: 0,                      // Specify how deep links should be followed. (default = 0, no follow)
    follow: [],                    // Specify the CSS selectors to follow.       (default = ["a[href]"])
    allowedDomains: [],            // Specify the allowed domains. ['*'] for all. (default = domain from url)
    blockedDomains: [],            // Specify the blocked domains.               (default = none)
    allowedURLs: [],               // Specify the allowed URLs as regex.         (default = all allowed)
    blockedURLs: [],               // Specify the blocked URLs as regex.         (default = none)
    rate: 100,                     // Specify the rate in requests per second.   (default = no rate limit)
    proxies: [],                   // Specify the HTTP(s) proxy URLs.            (default = no proxy)
    cache: "file",                 // Enable file-based request caching.         (default = no cache)
};

export default function ({ doc, url, absoluteURL }) {
    // doc              - Contains the parsed HTML document
    // url              - Contains the scraped URL
    // absoluteURL(...) - Transforms relative URLs into absolute URLs
}
```

## Query API

```javascript
// <div class="element" foo="bar">Hey</div>
const el = doc.find(".element")
el.text()                                 // "Hey"
el.html()                                 // `<div class="element" foo="bar">Hey</div>`
el.attr("foo")                            // "bar"
el.hasAttr("foo")                         // true
el.hasClass("element")                    // true

// <ul>
//   <li class="a">Item 1</li>
//   <li>Item 2</li>
//   <li>Item 3</li>
// </ul>
const list = doc.find("ul")
list.children()                           // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

const items = list.find("li")
items.length()                            // 3
items.first()                             // <li class="a">Item 1</li>
items.last()                              // <li>Item 3</li>
items.get(1)                              // <li>Item 2</li>
items.get(1).prev()                       // <li class="a">Item 1</li>
items.get(1).next()                       // <li>Item 3</li>
items.get(1).parent()                     // <ul>...</ul>
items.get(1).siblings()                   // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
items.map(item => item.text())            // ["Item 1", "Item 2", "Item 3"]
items.filter(item => item.hasClass("a"))  // [<li class="a">Item 1</li>]
```

## Contributing

We welcome contributions from the community! If you encounter any issues or have suggestions for improvement, please [submit an issue](https://github.com/philippta/flyscrape/issues).
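
As a supplement to the `allowedURLs` and `blockedURLs` options from the Configuration section: because both take regular expressions, it can help to try patterns against a few sample URLs before starting a crawl. The snippet below is a minimal sketch in plain JavaScript and is not part of flyscrape. The `isAllowed` helper, the sample patterns and the rule that a blocked pattern wins over an allowed one are assumptions made for illustration. Only the defaults (empty allow list = all allowed, empty block list = none blocked) come from the configuration comments. Since flyscrape combines Go with JavaScript, the regular expression dialect it applies to these options may differ from JavaScript's in some details.

```javascript
// Hypothetical helper, NOT part of flyscrape: approximates an allow/block
// decision so that patterns can be tried out before they go into `config`.
function isAllowed(url, allowedURLs, blockedURLs) {
  const matches = (pattern) => new RegExp(pattern).test(url);

  // Assumption: a matching blocked pattern always wins.
  if (blockedURLs.some(matches)) {
    return false;
  }

  // Per the config comments, an empty allow list means "all allowed".
  return allowedURLs.length === 0 || allowedURLs.some(matches);
}

// Sample patterns (illustrative only).
const allowedURLs = ["^https://example\\.com/blog/"];
const blockedURLs = ["\\.pdf$"];

const candidates = [
  "https://example.com/blog/post-1",   // matches the allow pattern
  "https://example.com/blog/file.pdf", // matches the block pattern
  "https://example.com/about",         // matches neither pattern
];

for (const url of candidates) {
  console.log(url, isAllowed(url, allowedURLs, blockedURLs));
}
```

Once the patterns behave as intended on the sample URLs, the same strings can be copied into the `allowedURLs` and `blockedURLs` arrays of a script's `config`.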