flyscrape is a standalone and scriptable web scraper, combining the speed of Go with the flexibility of JavaScript. Focus on data extraction rather than request juggling.
## Features
- **Highly Configurable:** 10 options to fine-tune your scraper.
- **Standalone:** flyscrape comes as a single binary executable.
- **Scriptable:** Use JavaScript to write your data extraction logic.
- **Simple API:** Extract data from HTML pages with a familiar API.
- **Fast Iteration:** Use the development mode to get quick feedback.
- **Request Caching:** Re-run scripts on websites you already scraped.
- **Zero Dependencies:** No need to fill up your disk with npm packages.
## Example script
```javascript
export const config = {
    url: "https://news.ycombinator.com/",
}

export default function ({ doc, absoluteURL }) {
    const title = doc.find("title");
    const posts = doc.find(".athing");

    return {
        title: title.text(),
        posts: posts.map((post) => {
            const link = post.find(".titleline > a");

            return {
                title: link.text(),
                url: link.attr("href"),
            };
        }),
    }
}
```
```bash
$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/",
    "data": {
      "title": "Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - A standalone and scriptable web scraper",
          "url": "https://flyscrape.com/"
        },
        ...
      ]
    }
  }
]
```
Check out the [examples folder](examples) for more detailed examples.
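
The output above is an array with one entry per scraped page, each holding the page `url` and the `data` object returned by the script. As a minimal sketch of consuming that shape in plain JavaScript, the snippet below flattens the posts of all pages into one list. The sample values and the `flattenPosts` helper are hypothetical and not part of flyscrape.

```javascript
// Sample results shaped like the output above; the values are illustrative.
const results = [
    {
        url: "https://news.ycombinator.com/",
        data: {
            title: "Hacker News",
            posts: [
                { title: "Show HN: flyscrape", url: "https://flyscrape.com/" },
                { title: "Ask HN: Example", url: "item?id=1" },
            ],
        },
    },
];

// Hypothetical helper: collect the posts of every scraped page and tag
// each post with the URL of the page it was found on.
function flattenPosts(results) {
    return results.flatMap((page) =>
        page.data.posts.map((post) => ({ ...post, page: page.url }))
    );
}

console.log(flattenPosts(results));
```

Keeping the page URL next to each post makes it possible to tell entries apart once several pages are scraped in one run.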
## Installation
### Pre-compiled binary
`flyscrape` is available for macOS, Linux and Windows as a downloadable binary from the [releases page](https://github.com/philippta/flyscrape/releases).
### Compile from source
To compile flyscrape from source, follow these steps:
1. Install Go: Make sure you have Go installed on your system. If not, you can download it from [https://golang.org/](https://golang.org/).
2. Install flyscrape: Open a terminal and run the following command:
```bash
go install github.com/philippta/flyscrape/cmd/flyscrape@latest
```
## Usage
```
flyscrape is a standalone and scriptable web scraper for efficiently extracting data from websites.

Usage:

    flyscrape <command> [arguments]

Commands:

    new    creates a sample scraping script
    run    runs a scraping script
    dev    watches and re-runs a scraping script
## Configuration
Below is an example scraping script that showcases the capabilities of flyscrape. For full documentation of all configuration options, visit the [documentation page](docs/readme.md#configuration).
```javascript
export const config = {
    url: "https://example.com/", // Specify the URL to start scraping from.
    depth: 0,                    // Specify how deep links should be followed. (default = 0, no follow)
    follow: [],                  // Specify the CSS selectors to follow. (default = ["a[href]"])
    allowedDomains: [],          // Specify the allowed domains. ['*'] for all. (default = domain from url)
    blockedDomains: [],          // Specify the blocked domains. (default = none)
    allowedURLs: [],             // Specify the allowed URLs as regex. (default = all allowed)
    blockedURLs: [],             // Specify the blocked URLs as regex. (default = none)
    rate: 100,                   // Specify the rate in requests per second. (default = no rate limit)
    proxies: [],                 // Specify the HTTP(S) proxy URLs. (default = no proxy)
    cache: "file",               // Enable file-based request caching. (default = no cache)
};

export function setup() {
    // Optional setup function, called once before scraping starts.
    // Can be used for authentication.
}

export default function ({ doc, url, absoluteURL }) {
    // doc              - Contains the parsed HTML document
    // url              - Contains the scraped URL
    // absoluteURL(...) - Transforms relative URLs into absolute URLs
}
```
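
To illustrate what turning a relative URL into an absolute one means, here is a minimal sketch built on the standard `URL` class. The `resolveAgainst` function is a hypothetical stand-in that runs outside flyscrape. It is not flyscrape's `absoluteURL` helper, whose exact signature and behavior may differ.

```javascript
// Hypothetical stand-in, NOT flyscrape's absoluteURL: resolves a link found
// on a page against that page's URL using the standard URL class.
function resolveAgainst(pageURL, href) {
    return new URL(href, pageURL).href;
}

const page = "https://news.ycombinator.com/news?p=2";

console.log(resolveAgainst(page, "item?id=1"));              // https://news.ycombinator.com/item?id=1
console.log(resolveAgainst(page, "/newest"));                // https://news.ycombinator.com/newest
console.log(resolveAgainst(page, "https://flyscrape.com/")); // https://flyscrape.com/
```

Links that are already absolute pass through unchanged, so a helper like this can be applied to every extracted `href` without checking its form first.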
## Query API
```javascript
// Given markup such as <div class="element">Hey</div>:
const el = doc.find(".element")
el.text() // "Hey"
el.html() // `