flyscrape is a standalone and scriptable web scraper that combines the speed of Go with the flexibility of JavaScript, letting you focus on data extraction rather than request juggling.


Installation · Documentation · Releases

## Features

- **Highly Configurable:** 10 options to fine-tune your scraper.
- **Standalone:** flyscrape comes as a single binary executable.
- **Scriptable:** Use JavaScript to write your data extraction logic.
- **Simple API:** Extract data from HTML pages with a familiar API.
- **Fast Iteration:** Use the development mode to get quick feedback.
- **Request Caching:** Re-run scripts on websites you already scraped.
- **Zero Dependencies:** No need to fill up your disk with npm packages.

## Example script

```javascript
export const config = {
  url: "https://news.ycombinator.com/",
}

export default function ({ doc, absoluteURL }) {
  const title = doc.find("title");
  const posts = doc.find(".athing");

  return {
    title: title.text(),
    posts: posts.map((post) => {
      const link = post.find(".titleline > a");

      return {
        title: link.text(),
        url: link.attr("href"),
      };
    }),
  }
}
```

```bash
$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/",
    "data": {
      "title": "Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - A standalone and scriptable web scraper",
          "url": "https://flyscrape.com/"
        },
        ...
      ]
    }
  }
]
```

Check out the [examples folder](examples) for more detailed examples.

## Installation

### Pre-compiled binary

`flyscrape` is available for macOS, Linux and Windows as a downloadable binary from the [releases page](https://github.com/philippta/flyscrape/releases).

### Compile from source

To compile flyscrape from source, follow these steps:

1. Install Go: Make sure you have Go installed on your system. If not, you can download it from [https://golang.org/](https://golang.org/).

2. Install flyscrape: Open a terminal and run the following command:

   ```bash
   go install github.com/philippta/flyscrape/cmd/flyscrape@latest
   ```

## Usage

```
flyscrape is a standalone and scriptable web scraper for efficiently extracting data from websites.

Usage:

    flyscrape <command> [arguments]

Commands:

    new    creates a sample scraping script
    run    runs a scraping script
    dev    watches and re-runs a scraping script
```

## Configuration

Below is an example scraping script that showcases the capabilities of flyscrape. For a full documentation of all configuration options, visit the [documentation page](docs/readme.md#configuration).

```javascript
export const config = {
  url: "https://example.com/", // Specify the URL to start scraping from.
  depth: 0,                    // Specify how deep links should be followed.   (default = 0, no follow)
  follow: [],                  // Specify the CSS selectors to follow.         (default = ["a[href]"])
  allowedDomains: [],          // Specify the allowed domains. ['*'] for all.  (default = domain from url)
  blockedDomains: [],          // Specify the blocked domains.                 (default = none)
  allowedURLs: [],             // Specify the allowed URLs as regex.           (default = all allowed)
  blockedURLs: [],             // Specify the blocked URLs as regex.           (default = none)
  rate: 100,                   // Specify the rate in requests per second.     (default = no rate limit)
  proxies: [],                 // Specify the HTTP(S) proxy URLs.              (default = no proxy)
  cache: "file",               // Enable file-based request caching.           (default = no cache)
};

export function setup() {
  // Optional setup function, called once before scraping starts.
  // Can be used for authentication.
}

export default function ({ doc, url, absoluteURL }) {
  // doc              - Contains the parsed HTML document
  // url              - Contains the scraped URL
  // absoluteURL(...) - Transforms relative URLs into absolute URLs
}
```
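The `setup` hook above is left empty. As a rough sketch of what authentication could look like (the login URL and form field names here are hypothetical, and how the resulting session is reused depends on the target site), it could submit a login form using the `http` module described under Flyscrape API below:

```javascript
import http from "flyscrape/http";

export const config = {
  url: "https://example.com/account", // hypothetical page that requires a login
};

export function setup() {
  // Hypothetical login endpoint and credentials; adjust for the real site.
  const response = http.postForm("https://example.com/login", {
    "username": "foo",
    "password": "bar",
  });

  // Abort early if the login request did not succeed.
  if (response.error !== "" || response.status !== 200) {
    throw new Error("login failed: " + response.error);
  }
}

export default function ({ doc }) {
  return { title: doc.find("title").text() };
}
```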
## Query API

```javascript
// <div class="element" foo="bar">Hey</div>
const el = doc.find(".element")
el.text()                // "Hey"
el.html()                // `<div class="element" foo="bar">Hey</div>`
el.attr("foo")           // "bar"
el.hasAttr("foo")        // true
el.hasClass("element")   // true

// <ul>
//   <li class="a">Item 1</li>
//   <li>Item 2</li>
//   <li>Item 3</li>
// </ul>
const list = doc.find("ul")
list.children()          // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

const items = list.find("li")
items.length()           // 3
items.first()            // <li class="a">Item 1</li>
items.last()             // <li>Item 3</li>
items.get(1)             // <li>Item 2</li>
items.get(1).prev()      // <li class="a">Item 1</li>
items.get(1).next()      // <li>Item 3</li>
items.get(1).parent()    // <ul>...</ul>
items.get(1).siblings()  // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

items.map(item => item.text())            // ["Item 1", "Item 2", "Item 3"]
items.filter(item => item.hasClass("a"))  // [<li class="a">Item 1</li>]
```
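Putting the query methods together with the configuration options above, a small sketch (the site and selectors here are hypothetical) that follows links one level deep and returns them as absolute URLs could look like this:

```javascript
export const config = {
  url: "https://example.com/",
  depth: 1,             // also scrape pages linked from the start page
  follow: ["a[href]"],  // the default selector, shown here for clarity
};

export default function ({ doc, url, absoluteURL }) {
  const links = doc.find("a");

  return {
    page: url,
    title: doc.find("title").text(),
    // Convert relative hrefs into absolute URLs before returning them.
    links: links.map((link) => absoluteURL(link.attr("href"))),
  };
}
```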
## Flyscrape API

```javascript
import { parse } from "flyscrape";

const doc = parse(`<div class="foo">bar</div>`);
const text = doc.find(".foo").text();
```

```javascript
import http from "flyscrape/http";

const response = http.get("https://example.com")

const response = http.postForm("https://example.com", {
  "username": "foo",
  "password": "bar",
})

const response = http.postJSON("https://example.com", {
  "username": "foo",
  "password": "bar",
})

// Contents of response
{
  body: "...",
  status: 200,
  headers: {
    "Content-Type": "text/html",
    // ...
  },
  error: "",
}
```

## Issues and Suggestions

If you encounter any issues or have suggestions for improvement, please [submit an issue](https://github.com/philippta/flyscrape/issues).