PulsarRPA
Automate webpages at scale and scrape web data completely and accurately with high-performance, distributed RPA.
= What is Pulsar?
Vincent Zhang <[email protected]>
3.0, July 29, 2022: Pulsar README
:toc:
:icons: font
:url-quickref: https://docs.asciidoctor.org/asciidoc/latest/syntax-quick-reference/
English | link:README-CN.adoc[简体中文]
== Introduction
Pulsar is the ultimate open source solution to scrape web data at scale.
Extracting web data at scale is extremely hard. #Websites change frequently and are becoming more complex, meaning the web data collected is often inaccurate or incomplete#. Pulsar provides a series of advanced features to solve this problem.
Pulsar supports the Network As A Database paradigm: we can turn the Web into tables and charts using simple SQL, and we can even query the Web with SQL directly.
We also plan to release an advanced AI that automatically extracts every field in a webpage with notable accuracy.
== Features
- Web spider: browser rendering, ajax data crawling
- High performance: highly optimized, rendering hundreds of pages in parallel on a single machine without being blocked
- Low cost: scraping 100,000 browser-rendered e-commerce webpages, or n * 10,000,000 data points, per day requires only an 8-core CPU and 32 GB of memory
- Data quantity assurance: smart retry, accurate scheduling, web data lifecycle management
- Large scale: fully distributed, designed for large scale crawling
- Simple API: single line of code to scrape, or single SQL to turn a website into a table
- X-SQL: extended SQL to manage web data: Web crawling, scraping, Web content mining, Web BI
- Bot stealth: web driver stealth, IP rotation, privacy context rotation, never get banned
- RPA: simulate human behaviors, crawl SPAs, or do something else awesome
- Big data: various backend storage support: MongoDB/HBase/Gora
- Logs & metrics: monitored closely and every event is recorded
== Get started
Most scraping attempts can start with (almost) a single line of code:
Kotlin:

[source,kotlin,options="nowrap"]
----
fun main() = PulsarContexts.createSession().scrapeOutPages(
    "https://www.amazon.com/", "-outLink a[href~=/dp/]", listOf("#title", "#acrCustomerReviewText"))
----
The code above scrapes the fields specified by the CSS selectors `#title` and `#acrCustomerReviewText` from a set of product pages. The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/sites/topEc/english/amazon/AmazonCrawler.kt[kotlin].
Most real-world scraping projects can start with the following code snippet:
Kotlin:

[source,kotlin]
----
fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: Document ->
        // use the document
        // ...
        // and then extract further hyperlinks
        context.submitAll(document.selectHyperlinks("a[href~=/dp/]"))
    }
    val urls = LinkExtractors.fromResource("seeds10.txt")
        .map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls).await()
}
----
The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/ContinuousCrawler.kt[kotlin].
== Core concepts

The core Pulsar concepts include the following:
- Web Scraping
- Network As A Database
- Auto Extract
- Browser Rendering
- Pulsar Context
- Pulsar Session
- URLs
- Hyperlinks
- Load Options
- Event Handler
- X-SQL
Check link:docs/concepts.adoc#_the_core_concepts_of_pulsar[Pulsar concepts] for details.
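To see how a few of these concepts fit together, here is a minimal sketch built from the APIs used elsewhere in this README; the import path follows the 1.9.x artifacts and is an assumption that may vary between versions:

[source,kotlin]
----
// a minimal sketch tying the core concepts together;
// the import path below is an assumption based on the 1.9.x artifacts
import ai.platon.pulsar.context.PulsarContexts

fun main() {
    // a Pulsar context manages crawl-wide resources and creates sessions
    val context = PulsarContexts.create()
    // a Pulsar session is the main entry point for loading and scraping
    val session = context.createSession()
    // a URL is loaded with load options that control how it is fetched
    val page = session.load("https://www.amazon.com/dp/B09V3KXJPB", "-expires 1d")
    // the page content is parsed into a document for extraction
    val document = session.parse(page)
    println(document.title())
}
----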
== Usage: use Pulsar as a library

The simplest way to leverage the power of Pulsar is to add it to your project as a library.
Maven:

[source,xml]
----
<dependency>
    <groupId>ai.platon.pulsar</groupId>
    <artifactId>pulsar-all</artifactId>
    <version>1.9.12</version>
</dependency>
----
Gradle:

[source,kotlin]
----
implementation("ai.platon.pulsar:pulsar-all:1.9.12")
----
=== Basic usage
Kotlin:

[source,kotlin]
----
// create a pulsar session
val session = PulsarContexts.createSession()
// the main url we are playing with
val url = "https://list.jd.com/list.html?cat=652,12345,12349"
// load a page, fetching it from the web if it has expired or if it is the first fetch
val page = session.load(url, "-expires 1d")
// parse the page content into a Jsoup document
val document = session.parse(page)
// do something with the document
// ...

// or, load and parse in one call
val document2 = session.loadDocument(url, "-expires 1d")
// do something with the document
// ...

// load all pages whose links are specified by -outLink
val pages = session.loadOutPages(url, "-expires 1d -itemExpires 7d -outLink a[href~=item]")
// load, parse and scrape fields
val fields = session.scrape(url, "-expires 1d", "li[data-sku]", listOf(".p-name em", ".p-price"))
// load, parse and scrape named fields
val fields2 = session.scrape(url, "-i 1d", "li[data-sku]", mapOf("name" to ".p-name em", "price" to ".p-price"))
----
The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/BasicUsage.kt[kotlin], link:pulsar-app/pulsar-examples/src/main/java/ai/platon/pulsar/examples/BasicUsage.java[java].
==== Load options

Note that most of our scraping methods accept a parameter called load arguments, or load options, which controls how a webpage is loaded and fetched.
[source]
----
-expires     // The expiry time of a page
-itemExpires // The expiry time of item pages in batch scraping methods
-outLink     // The selector for out links to scrape
-refresh     // Force (re)fetch the page, just like hitting the refresh button in a real browser
-parse       // Trigger the parse phase
-resource    // Fetch the url as a resource, without browser rendering
----
Check link:docs/concepts.adoc#_load_options[Load Options] for details.
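For example, several options can be combined in a single argument string; a short sketch using the `session` from the basic-usage example above (the URLs are illustrative):

[source,kotlin]
----
// fetch from the web at most once a day, and trigger the parse phase after fetching
val page = session.load("https://list.jd.com/list.html?cat=652,12345,12349", "-expires 1d -parse")
// force a re-fetch, just like hitting the refresh button in a real browser
val freshPage = session.load("https://list.jd.com/list.html?cat=652,12345,12349", "-refresh")
// fetch a url as a plain resource, without browser rendering
val resource = session.load("https://list.jd.com/robots.txt", "-resource")
----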
=== Extracting web data
Pulsar uses https://jsoup.org/[jsoup] to extract data from HTML documents. Jsoup parses HTML into the same DOM as modern browsers do. Check https://jsoup.org/cookbook/extracting-data/selector-syntax[selector-syntax] for all the supported CSS selectors.
Kotlin:

[source,kotlin]
----
val document = session.loadDocument(url, "-expires 1d")
val price = document.selectFirst(".price")?.text()
----
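Beyond `selectFirst`, the rest of Jsoup's selection API works the same way; a brief sketch (the selectors are illustrative and assume the JD list page used above):

[source,kotlin]
----
// extract the text of every matching element
val prices = document.select("li[data-sku] .p-price").map { it.text() }
// read an absolute-url attribute from the first match, falling back to an empty string
val firstImage = document.selectFirst("img[src]")?.attr("abs:src") ?: ""
----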
=== Continuous crawls

It's really simple to scrape a massive url collection or run continuous crawls in Pulsar.
Kotlin:

[source,kotlin]
----
fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: Document ->
        // do something wonderful with the document
        println(document.title() + "\t|\t" + document.baseUri())
    }
    val urls = LinkExtractors.fromResource("seeds.txt")
        .map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls)
    // feel free to submit millions of urls here
    context.submitAll(urls)
    // ...
    context.await()
}
----
Java:

[source,java]
----
public class ContinuousCrawler {

    private static void onParse(WebPage page, Document document) {
        // do something wonderful with the document
        System.out.println(document.title() + "\t|\t" + document.baseUri());
    }

    public static void main(String[] args) {
        PulsarContext context = PulsarContexts.create();
        List<Hyperlink> urls = LinkExtractors.fromResource("seeds.txt")
                .stream()
                .map(seed -> new ParsableHyperlink(seed, ContinuousCrawler::onParse))
                .collect(Collectors.toList());
        context.submitAll(urls);
        // feel free to submit millions of urls here
        context.submitAll(urls);
        // ...
        context.await();
    }
}
----
The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/MassiveCrawler.kt[kotlin], link:pulsar-app/pulsar-examples/src/main/java/ai/platon/pulsar/examples/ContinuousCrawler.java[java].
=== Use X-SQL to query the web
Scrape a single page:
[source,sql]
----
select
    dom_first_text(dom, '#productTitle') as title,
    dom_first_text(dom, '#bylineInfo') as brand,
    dom_first_text(dom, '#price tr td:matches(^Price) ~ td, #corePrice_desktop tr td:matches(^Price) ~ td') as price,
    dom_first_text(dom, '#acrCustomerReviewText') as ratings,
    str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1s -njr 3', 'body');
----
Execute the X-SQL:
[source,kotlin]
----
val context = SQLContexts.create()
val rs = context.executeQuery(sql)
println(ResultSetFormatter(rs, withHeader = true))
----
The result is as follows:
----
TITLE                                                    | BRAND                  | PRICE   | RATINGS       | SCORE
HUAWEI P20 Lite (32GB + 4GB RAM) 5.84" FHD+ Display ... | Visit the HUAWEI Store | $1.9.12 | 1,349 ratings | 4.40
----
The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/XSQLDemo.kt[kotlin].
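Since `ResultSetFormatter` consumes a JDBC `ResultSet`, the rows can also be read programmatically; a sketch assuming `executeQuery` returns a standard `java.sql.ResultSet` (the column names are the aliases from the query above):

[source,kotlin]
----
// iterate the result rows; column names are the aliases declared in the X-SQL
while (rs.next()) {
    println("${rs.getString("title")} | ${rs.getString("price")} | ${rs.getString("score")}")
}
----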
== Usage: run Pulsar as a REST service

When Pulsar runs as a REST service, X-SQL can be used to scrape webpages, or to query web data directly, at any time and from anywhere, without opening an IDE.
=== Build from source
[source,shell]
----
git clone https://github.com/platonai/pulsar.git
cd pulsar && bin/build-run.sh
----
For Chinese developers, we strongly suggest you follow link:bin/tools/maven/maven-settings.adoc[this] instruction to speed up the build.
=== Use X-SQL to query the web
Start the Pulsar server if it is not already running:
[source,shell]
----
bin/pulsar
----
Scrape a webpage in another terminal window:
[source,shell]
----
bin/scrape.sh
----
The bash script is quite simple; it just uses curl to post an X-SQL statement:

[source,shell]
----
curl -X POST --location "http://localhost:8182/api/x/e" -H "Content-Type: text/plain" -d "
select
    dom_base_uri(dom) as url,
    dom_first_text(dom, '#productTitle') as title,
    str_substring_after(dom_first_href(dom, '#wayfinding-breadcrumbs_container ul li:last-child a'), '&node=') as category,
    dom_first_slim_html(dom, '#bylineInfo') as brand,
    cast(dom_all_slim_htmls(dom, '#imageBlock img') as varchar) as gallery,
    dom_first_slim_html(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width > 400)') as img,
    dom_first_text(dom, '#price tr td:contains(List Price) ~ td') as listprice,
    dom_first_text(dom, '#price tr td:matches(^Price) ~ td') as price,
    str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1d -njr 3', 'body');"
----
The example code can be found here: link:bin/scrape.sh[bash], link:bin/scrape.bat[batch], link:pulsar-client/src/main/java/ai/platon/pulsar/client/Scraper.java[java], link:pulsar-client/src/main/kotlin/ai/platon/pulsar/client/Scraper.kt[kotlin], link:pulsar-client/src/main/php/Scraper.php[php].
The response, in JSON format, is as follows:
[source,json]
----
{
    "uuid": "cc611841-1f2b-4b6b-bcdd-ce822d97a2ad",
    "statusCode": 200,
    "pageStatusCode": 200,
    "pageContentBytes": 1607636,
    "resultSet": [
        {
            "title": "Tara Toys Ariel Necklace Activity Set - Amazon Exclusive (51394)",
            "listprice": "$19.99",
            "price": "$12.99",
            "categories": "Toys & Games|Arts & Crafts|Craft Kits|Jewelry",
            "baseuri": "https://www.amazon.com/dp/B00BTX5926"
        }
    ],
    "pageStatus": "OK",
    "status": "OK"
}
----
== Usage: experience Pulsar in an executable jar
We have released a standalone, executable jar, based on Pulsar. Download link:https://github.com/platonai/exotic#download[Exotic] and enjoy it with a single command line:
[source,shell]
----
java -jar exotic-standalone.jar
----
== Requirements
- Memory 4G+
- Maven 3.2+
- The latest version of the Java 11 JDK
- java and jar on the PATH
- Google Chrome 90+
Pulsar is tested on Ubuntu 18.04, Ubuntu 20.04, Windows 7, Windows 11, and WSL; any other operating system that meets the requirements should work as well.
== Advanced topics

Check link:docs/faq/advanced-topics.adoc[advanced topics] for answers to the following questions:
- What’s so difficult about scraping web data at scale?
- How to scrape a million product pages from an e-commerce website a day?
- How to scrape pages behind a login?
- How to download resources directly within a browser context?
- How to scrape a single page application (SPA)?
** Resource mode
** RPA mode
- How to make sure all fields are extracted correctly?
- How to crawl paginated links?
- How to crawl newly discovered links?
- How to crawl the entire website?
- How to simulate human behaviors?
- How to schedule priority tasks?
- How to start a task at a fixed time point?
- How to drop a scheduled task?
- How to know the status of a task?
- How to know what's going on in the system?
- How to automatically generate the css selectors for fields to scrape?
- How to extract content from websites using machine learning automatically with commercial accuracy?
- How to scrape amazon.com to match industrial needs?
== Compare with other solutions

In general, the features mentioned in the Features section are well supported by Pulsar, but not by other solutions.
Check link:docs/faq/solution-comparison.adoc[solution comparison] for a detailed comparison with other solutions:
- Pulsar vs selenium/puppeteer/playwright
- Pulsar vs nutch
- Pulsar vs scrapy+splash
== Technical details

Check link:docs/faq/technical-details.adoc[technical details] for answers to the following questions:
- How to rotate my ip addresses?
- How to hide my bot from being detected?
- How & why to simulate human behaviors?
- How to render as many pages as possible on a single machine without being blocked?