author     Philipp Tanlak <philipp.tanlak@gmail.com>  2023-10-12 19:21:38 +0200
committer  Philipp Tanlak <philipp.tanlak@gmail.com>  2023-10-12 19:21:38 +0200
commit     40f59fa7b19059b441ea766f0de859c6dd52f77e (patch)
tree       75a3c6899a0ee758ec9a1cdeb12276d3cdc74ae3 /README.md
parent     dfbacde1fdb95452233308731c0670abf3ac94bf (diff)

Add filter function and update readme (tag: v0.2.0)

Diffstat (limited to 'README.md'):
-rw-r--r--  README.md | 179
1 file changed, 137 insertions(+), 42 deletions(-)
diff --git a/README.md b/README.md
index ef8182a..303371b 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,76 @@
-# flyscrape
+<br />
-flyscrape is an elegant scraping tool for efficiently extracting data from websites. Whether you're a developer, data analyst, or researcher, flyscrape empowers you to effortlessly gather information from web pages and transform it into structured data.
+<p align="center">
+
+<picture>
+ <source media="(prefers-color-scheme: dark)" srcset="docs/logo-alt.png">
+ <source media="(prefers-color-scheme: light)" srcset="docs/logo.png">
+ <img width="200" src="docs/logo.png">
+</picture>
+
+</p>
+
+<br />
+
+<p align="center">
+<b>flyscrape</b> is an expressive and elegant web scraper, combining the speed of Go with the <br/> flexibility of JavaScript. — Focus on data extraction rather than request juggling.
+</p>
+
+<br />
## Features
-- **Simple and Intuitive**: **flyscrape** offers an easy-to-use command-line interface that allows you to interact with scraping scripts effortlessly.
+- Domains and URL filtering
+- Depth control
+- Request caching
+- Rate limiting
+- Development mode
+- Single binary executable
-- **Create New Scripts**: The `new` command enables you to generate sample scraping scripts quickly, providing you with a solid starting point for your scraping endeavors.
-- **Run Scripts**: Execute your scraping script using the `run` command, and watch as **flyscrape** retrieves and processes data from the specified website.
+## Example script
+
+```javascript
+export const config = {
+ url: "https://news.ycombinator.com/",
+}
+
+export default function ({ doc, absoluteURL }) {
+ const title = doc.find("title");
+ const posts = doc.find(".athing");
+
+ return {
+ title: title.text(),
+ posts: posts.map((post) => {
+ const link = post.find(".titleline > a");
+
+ return {
+ title: link.text(),
+ url: link.attr("href"),
+ };
+ }),
+ }
+}
+```
-- **Watch for Development**: The `dev` command allows you to watch your scraping script for changes and quickly iterate during development, helping you find the right data extraction queries.
+```bash
+$ flyscrape run hackernews.js
+[
+ {
+ "title": "Hacker News",
+ "url": "https://news.ycombinator.com/",
+ "data": {
+ "posts": [
+ {
+ "title": "Show HN: flyscrape - An expressive and elegant web scraper",
+ "url": "https://flyscrape.com"
+ },
+ ...
+ ],
+ }
+ }
+]
+```
## Installation
@@ -26,66 +86,101 @@ To install **flyscrape**, follow these simple steps:
## Usage
-**flyscrape** offers several commands to assist you in your scraping journey:
+```
+$ flyscrape
+flyscrape is an elegant scraping tool for efficiently extracting data from websites.
-### Creating a New Script
+Usage:
-Use the `new` command to create a new scraping script:
+ flyscrape <command> [arguments]
+
+Commands:
+
+ new creates a sample scraping script
+ run runs a scraping script
+ dev watches and re-runs a scraping script
-```bash
-flyscrape new example.js
```
-### Running a Script
+### Create a new sample scraping script
-Execute your scraping script using the `run` command:
+The `new` command creates a boilerplate sample script that helps you get started.
-```bash
-flyscrape run example.js
+```
+flyscrape new example.js
```
-### Watching for Development
+### Watch the script for changes during development
-The `dev` command allows you to watch your scraping script for changes and quickly iterate during development:
+The `dev` command allows you to watch your scraping script for changes and quickly iterate during development. In development mode, flyscrape will not follow any links and request caching is enabled.
-```bash
+```
flyscrape dev example.js
```
-## Example Script
+### Run the scraping script
+
+The `run` command allows you to run your script to its fullest extent.
+
+```
+flyscrape run example.js
+```
+
+## Configuration
Below is an example scraping script that showcases the capabilities of **flyscrape**:
```javascript
export const config = {
- url: "https://news.ycombinator.com/", // Specify the URL to start scraping from.
- // depth: 0, // Specify how deep links should be followed. (default = 0, no follow)
- // allowedDomains: [], // Specify the allowed domains. ['*'] for all. (default = domain from url)
- // blockedDomains: [], // Specify the blocked domains. (default = none)
- // allowedURLs: [], // Specify the allowed URLs as regex. (default = all allowed)
- // blockedURLs: [], // Specify the blocked URLs as regex. (default = non blocked)
- // rate: 100, // Specify the rate in requests per second. (default = 100)
- // cache: "file", // Enable file-based request caching. (default = no cache)
+ url: "https://example.com/", // Specify the URL to start scraping from.
+ depth: 0, // Specify how deep links should be followed. (default = 0, no follow)
+ allowedDomains: [], // Specify the allowed domains. ['*'] for all. (default = domain from url)
+ blockedDomains: [], // Specify the blocked domains. (default = none)
+ allowedURLs: [], // Specify the allowed URLs as regex. (default = all allowed)
+ blockedURLs: [], // Specify the blocked URLs as regex. (default = none)
+ rate: 100, // Specify the rate in requests per second. (default = no rate limit)
+ cache: "file", // Enable file-based request caching. (default = no cache)
};
-export default function({ doc, absoluteURL }) {
- const title = doc.find("title");
- const posts = doc.find(".athing");
-
- return {
- title: title.text(),
- posts: posts.map((post) => {
- const link = post.find(".titleline > a");
-
- return {
- title: link.text(),
- url: absoluteURL(link.attr("href")),
- };
- }),
- };
+export default function ({ doc, url, absoluteURL }) {
+ // doc - Contains the parsed HTML document
+ // url - Contains the scraped URL
+ // absoluteURL(...) - Transforms relative URLs into absolute URLs
}
```
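
The `absoluteURL` helper is handy because pages often link with relative hrefs. As a sketch of its presumable behavior (resolving an href against the page URL, as the WHATWG URL API does; this is an illustration, not flyscrape's actual implementation):

```javascript
// Sketch only: absoluteURL presumably resolves a relative href against
// the page URL, like `new URL(href, base)` in the WHATWG URL API.
const pageURL = "https://news.ycombinator.com/";
const absoluteURL = (href) => new URL(href, pageURL).toString();

console.log(absoluteURL("item?id=1"));
// "https://news.ycombinator.com/item?id=1"
console.log(absoluteURL("https://example.com/page"));
// Already-absolute URLs pass through unchanged: "https://example.com/page"
```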
+## Query API
+
+```javascript
+// <div class="element" foo="bar">Hey</div>
+const el = doc.find(".element")
+el.text() // "Hey"
+el.html() // `<div class="element">Hey</div>`
+el.attr("foo") // "bar"
+el.hasAttr("foo") // true
+el.hasClass("element") // true
+
+// <ul>
+// <li class="a">Item 1</li>
+// <li>Item 2</li>
+// <li>Item 3</li>
+// </ul>
+const list = doc.find("ul")
+list.children() // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
+
+const items = list.find("li")
+items.length() // 3
+items.first() // <li>Item 1</li>
+items.last() // <li>Item 3</li>
+items.get(1) // <li>Item 2</li>
+items.get(1).prev() // <li>Item 1</li>
+items.get(1).next() // <li>Item 3</li>
+items.get(1).parent() // <ul>...</ul>
+items.get(1).siblings() // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
+items.map(item => item.text()) // ["Item 1", "Item 2", "Item 3"]
+items.filter(item => item.hasClass("a")) // [<li class="a">Item 1</li>]
+```
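
Since `filter` returns a list of elements and `map` returns a plain array, the two can be chained. A minimal sketch of that pattern, using hypothetical stand-in objects in place of flyscrape's wrapped elements:

```javascript
// Stand-ins for flyscrape's wrapped elements, purely to illustrate
// chaining filter and map; the real objects come from doc.find(...).
const items = [
  { hasClass: (c) => c === "a", text: () => "Item 1" },
  { hasClass: () => false,      text: () => "Item 2" },
  { hasClass: () => false,      text: () => "Item 3" },
];

const texts = items
  .filter((item) => item.hasClass("a")) // keep only items with class "a"
  .map((item) => item.text());          // then extract their text

console.log(texts); // [ 'Item 1' ]
```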
+
## Contributing
We welcome contributions from the community! If you encounter any issues or have suggestions for improvement, please [submit an issue](https://github.com/philippta/flyscrape/issues).