PulsarRPA
Automate webpages at scale and scrape web data completely and accurately with high-performance, distributed RPA.
= What is Pulsar?
Vincent Zhang <[email protected]>
3.0, July 29, 2022: Pulsar README
:toc:
:icons: font
:url-quickref: https://docs.asciidoctor.org/asciidoc/latest/syntax-quick-reference/
English | link:README-CN.adoc[简体中文]
== Introduction
Pulsar is the ultimate open source solution to scrape web data at scale.
Extracting web data at scale is extremely hard. #Websites change frequently and are becoming more complex, meaning the web data collected is often inaccurate or incomplete#. Pulsar provides a series of advanced features to solve this problem.
Pulsar supports the Network As A Database paradigm: we can turn the Web into tables and charts using simple SQL, and we can even query the Web with SQL directly.
We also plan to release an advanced AI that automatically extracts every field in a webpage with notable accuracy.
== Features
- Web spider: browser rendering, ajax data crawling
- High performance: highly optimized, rendering hundreds of pages in parallel on a single machine without being blocked
- Low cost: scraping 100,000 browser-rendered e-commerce webpages, or n * 10,000,000 data points, per day requires only an 8-core CPU and 32 GB of memory
- Data quantity assurance: smart retry, accurate scheduling, web data lifecycle management
- Large scale: fully distributed, designed for large scale crawling
- Simple API: single line of code to scrape, or single SQL to turn a website into a table
- X-SQL: extended SQL to manage web data: Web crawling, scraping, Web content mining, Web BI
- Bot stealth: web driver stealth, IP rotation, privacy context rotation, never get banned
- RPA: simulate human behaviors, crawl SPAs, or do something else awesome
- Big data: various backend storage support: MongoDB/HBase/Gora
- Logs & metrics: monitored closely and every event is recorded
== Get started
Most scraping attempts can start with (almost) a single line of code:
Kotlin:

[source,kotlin,options="nowrap"]
----
fun main() = PulsarContexts.createSession().scrapeOutPages(
    "https://www.amazon.com/", "-outLink a[href~=/dp/]", listOf("#title", "#acrCustomerReviewText"))
----
The code above scrapes the fields specified by the CSS selectors `#title` and `#acrCustomerReviewText` from a set of product pages. The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/sites/topEc/english/amazon/AmazonCrawler.kt[kotlin].
Most real-world scraping projects can start with the following code snippet:
Kotlin:

[source,kotlin]
----
fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: Document ->
        // use the document
        // ...
        // and then extract further hyperlinks
        context.submitAll(document.selectHyperlinks("a[href~=/dp/]"))
    }
    val urls = LinkExtractors.fromResource("seeds10.txt")
        .map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls).await()
}
----
The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/ContinuousCrawler.kt[kotlin].
== Core concepts

The core Pulsar concepts include the following:
- Web Scraping
- Network As A Database
- Auto Extract
- Browser Rendering
- Pulsar Context
- Pulsar Session
- URLs
- Hyperlinks
- Load Options
- Event Handler
- X-SQL
Check link:docs/concepts.adoc#_the_core_concepts_of_pulsar[Pulsar concepts] for details.
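To see how a few of these concepts fit together, here is a minimal sketch built from the APIs used elsewhere in this README; the import path follows the 1.9.x artifacts and is an assumption that may vary between versions:

[source,kotlin]
----
// a minimal sketch tying the core concepts together;
// the import path below is an assumption based on the 1.9.x artifacts
import ai.platon.pulsar.context.PulsarContexts

fun main() {
    // a Pulsar context manages crawl-wide resources and creates sessions
    val context = PulsarContexts.create()
    // a Pulsar session is the main entry point for loading and scraping
    val session = context.createSession()
    // a URL is loaded with load options that control how it is fetched
    val page = session.load("https://www.amazon.com/dp/B09V3KXJPB", "-expires 1d")
    // the page content is parsed into a document for extraction
    val document = session.parse(page)
    println(document.title())
}
----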
== Usage: use Pulsar as a library

The simplest way to leverage the power of Pulsar is to add it to your project as a library.
Maven:

[source,xml]
----
<dependency>
    <groupId>ai.platon.pulsar</groupId>
    <artifactId>pulsar-all</artifactId>
    <version>1.9.12</version>
</dependency>
----
Gradle:

[source,kotlin]
----
implementation("ai.platon.pulsar:pulsar-all:1.9.12")
----
=== Basic usage
Kotlin:

[source,kotlin]
----
// create a pulsar session
val session = PulsarContexts.createSession()
// the main url we are playing with
val url = "https://list.jd.com/list.html?cat=652,12345,12349"
// load a page, fetching it from the web if it has expired or if it is the first fetch
val page = session.load(url, "-expires 1d")
// parse the page content into a Jsoup document
val document = session.parse(page)
// do something with the document
// ...

// or, load and parse in one call
val document2 = session.loadDocument(url, "-expires 1d")
// do something with the document
// ...

// load all pages whose links are specified by -outLink
val pages = session.loadOutPages(url, "-expires 1d -itemExpires 7d -outLink a[href~=item]")
// load, parse and scrape fields
val fields = session.scrape(url, "-expires 1d", "li[data-sku]", listOf(".p-name em", ".p-price"))
// load, parse and scrape named fields
val fields2 = session.scrape(url, "-i 1d", "li[data-sku]", mapOf("name" to ".p-name em", "price" to ".p-price"))
----
The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/BasicUsage.kt[kotlin], link:pulsar-app/pulsar-examples/src/main/java/ai/platon/pulsar/examples/BasicUsage.java[java].
==== Load options

Note that most of our scraping methods accept a parameter called load arguments, or load options, which controls how a webpage is loaded and fetched.
[source]
----
-expires     // The expiry time of a page
-itemExpires // The expiry time of item pages in batch scraping methods
-outLink     // The selector for out links to scrape
-refresh     // Force (re)fetch the page, just like hitting the refresh button in a real browser
-parse       // Trigger the parse phase
-resource    // Fetch the url as a resource, without browser rendering
----
Check link:docs/concepts.adoc#_load_options[Load Options] for details.
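For example, several options can be combined in a single argument string; a short sketch using the `session` from the basic-usage example above (the URLs are illustrative):

[source,kotlin]
----
// fetch from the web at most once a day, and trigger the parse phase after fetching
val page = session.load("https://list.jd.com/list.html?cat=652,12345,12349", "-expires 1d -parse")
// force a re-fetch, just like hitting the refresh button in a real browser
val freshPage = session.load("https://list.jd.com/list.html?cat=652,12345,12349", "-refresh")
// fetch a url as a plain resource, without browser rendering
val resource = session.load("https://list.jd.com/robots.txt", "-resource")
----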
=== Extracting web data
Pulsar uses https://jsoup.org/[jsoup] to extract data from HTML documents. Jsoup parses HTML into the same DOM as modern browsers do. Check https://jsoup.org/cookbook/extracting-data/selector-syntax[selector-syntax] for all the supported CSS selectors.
Kotlin:

[source,kotlin]
----
val document = session.loadDocument(url, "-expires 1d")
val price = document.selectFirst(".price")?.text()
----
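Beyond `selectFirst`, the rest of Jsoup's selection API works the same way; a brief sketch (the selectors are illustrative and assume the JD list page used above):

[source,kotlin]
----
// extract the text of every matching element
val prices = document.select("li[data-sku] .p-price").map { it.text() }
// read an absolute-url attribute from the first match, falling back to an empty string
val firstImage = document.selectFirst("img[src]")?.attr("abs:src") ?: ""
----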
=== Continuous crawls

It's really simple to scrape a massive url collection or run continuous crawls in Pulsar.
Kotlin:

[source,kotlin]
----
fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: Document ->
        // do something wonderful with the document
        println(document.title() + "\t|\t" + document.baseUri())
    }
    val urls = LinkExtractors.fromResource("seeds.txt")
        .map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls)
    // feel free to submit millions of urls here
    context.submitAll(urls)
    // ...
    context.await()
}
----
Java:

[source,java]
----
public class ContinuousCrawler {

    private static void onParse(WebPage page, Document document) {
        // do something wonderful with the document
        System.out.println(document.title() + "\t|\t" + document.baseUri());
    }

    public static void main(String[] args) {
        PulsarContext context = PulsarContexts.create();
        List<Hyperlink> urls = LinkExtractors.fromResource("seeds.txt")
                .stream()
                .map(seed -> new ParsableHyperlink(seed, ContinuousCrawler::onParse))
                .collect(Collectors.toList());
        context.submitAll(urls);
        // feel free to submit millions of urls here
        context.submitAll(urls);
        // ...
        context.await();
    }
}
----
The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/MassiveCrawler.kt[kotlin], link:pulsar-app/pulsar-examples/src/main/java/ai/platon/pulsar/examples/ContinuousCrawler.java[java].
=== Use X-SQL to query the web
Scrape a single page:
[source,sql]
----
select
    dom_first_text(dom, '#productTitle') as title,
    dom_first_text(dom, '#bylineInfo') as brand,
    dom_first_text(dom, '#price tr td:matches(^Price) ~ td, #corePrice_desktop tr td:matches(^Price) ~ td') as price,
    dom_first_text(dom, '#acrCustomerReviewText') as ratings,
    str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1s -njr 3', 'body');
----
Execute the X-SQL:
[source,kotlin]
----
val context = SQLContexts.create()
val rs = context.executeQuery(sql)
println(ResultSetFormatter(rs, withHeader = true))
----
The result is as follows:
----
TITLE                                                    | BRAND                  | PRICE   | RATINGS       | SCORE
HUAWEI P20 Lite (32GB + 4GB RAM) 5.84" FHD+ Display ... | Visit the HUAWEI Store | $1.9.12 | 1,349 ratings | 4.40
----
The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/XSQLDemo.kt[kotlin].
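Since `ResultSetFormatter` consumes a JDBC `ResultSet`, the rows can also be read programmatically; a sketch assuming `executeQuery` returns a standard `java.sql.ResultSet` (the column names are the aliases from the query above):

[source,kotlin]
----
// iterate the result rows; column names are the aliases declared in the X-SQL
while (rs.next()) {
    println("${rs.getString("title")} | ${rs.getString("price")} | ${rs.getString("score")}")
}
----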
== Usage: run Pulsar as a REST service

When Pulsar runs as a REST service, X-SQL can be used to scrape webpages, or to query web data directly, at any time and from anywhere, without opening an IDE.
=== Build from source
[source,shell]
----
git clone https://github.com/platonai/pulsar.git
cd pulsar && bin/build-run.sh
----
For Chinese developers, we strongly suggest you follow link:bin/tools/maven/maven-settings.adoc[this] instruction to speed up the build.
=== Use X-SQL to query the web
Start the Pulsar server if it is not already running:
[source,shell]
----
bin/pulsar
----
Scrape a webpage in another terminal window:
[source,shell]
----
bin/scrape.sh
----
The bash script is quite simple; it just uses curl to post an X-SQL statement:

[source,shell]
----
curl -X POST --location "http://localhost:8182/api/x/e" -H "Content-Type: text/plain" -d "
select
    dom_base_uri(dom) as url,
    dom_first_text(dom, '#productTitle') as title,
    str_substring_after(dom_first_href(dom, '#wayfinding-breadcrumbs_container ul li:last-child a'), '&node=') as category,
    dom_first_slim_html(dom, '#bylineInfo') as brand,
    cast(dom_all_slim_htmls(dom, '#imageBlock img') as varchar) as gallery,
    dom_first_slim_html(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width > 400)') as img,
    dom_first_text(dom, '#price tr td:contains(List Price) ~ td') as listprice,
    dom_first_text(dom, '#price tr td:matches(^Price) ~ td') as price,
    str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1d -njr 3', 'body');"
----
The example code can be found here: link:bin/scrape.sh[bash], link:bin/scrape.bat[batch], link:pulsar-client/src/main/java/ai/platon/pulsar/client/Scraper.java[java], link:pulsar-client/src/main/kotlin/ai/platon/pulsar/client/Scraper.kt[kotlin], link:pulsar-client/src/main/php/Scraper.php[php].
The response, in JSON format, is as follows:
[source,json]
----
{
    "uuid": "cc611841-1f2b-4b6b-bcdd-ce822d97a2ad",
    "statusCode": 200,
    "pageStatusCode": 200,
    "pageContentBytes": 1607636,
    "resultSet": [
        {
            "title": "Tara Toys Ariel Necklace Activity Set - Amazon Exclusive (51394)",
            "listprice": "$19.99",
            "price": "$12.99",
            "categories": "Toys & Games|Arts & Crafts|Craft Kits|Jewelry",
            "baseuri": "https://www.amazon.com/dp/B00BTX5926"
        }
    ],
    "pageStatus": "OK",
    "status": "OK"
}
----
== Usage: experience Pulsar in an executable jar
We have released a standalone, executable jar, based on Pulsar. Download link:https://github.com/platonai/exotic#download[Exotic] and enjoy it with a single command line:
[source,shell]
----
java -jar exotic-standalone.jar
----
== Requirements
- Memory 4G+
- Maven 3.2+
- The latest version of the Java 11 JDK
- java and jar on the PATH
- Google Chrome 90+
Pulsar is tested on Ubuntu 18.04, Ubuntu 20.04, Windows 7, Windows 11, and WSL; any other operating system that meets the requirements should work as well.
== Advanced topics

Check link:docs/faq/advanced-topics.adoc[advanced topics] for answers to the following questions:
- What’s so difficult about scraping web data at scale?
- How to scrape a million product pages from an e-commerce website a day?
- How to scrape pages behind a login?
- How to download resources directly within a browser context?
- How to scrape a single page application (SPA)?
** Resource mode
** RPA mode
- How to make sure all fields are extracted correctly?
- How to crawl paginated links?
- How to crawl newly discovered links?
- How to crawl the entire website?
- How to simulate human behaviors?
- How to schedule priority tasks?
- How to start a task at a fixed time point?
- How to drop a scheduled task?
- How to know the status of a task?
- How to know what's going on in the system?
- How to automatically generate the css selectors for fields to scrape?
- How to extract content from websites using machine learning automatically with commercial accuracy?
- How to scrape amazon.com to match industrial needs?
== Compare with other solutions

In general, the features mentioned in the Features section are well supported by Pulsar, but not by other solutions.
Check link:docs/faq/solution-comparison.adoc[solution comparison] for a detailed comparison with other solutions:
- Pulsar vs selenium/puppeteer/playwright
- Pulsar vs nutch
- Pulsar vs scrapy+splash
== Technical details

Check link:docs/faq/technical-details.adoc[technical details] for answers to the following questions:
- How to rotate my ip addresses?
- How to hide my bot from being detected?
- How & why to simulate human behaviors?
- How to render as many pages as possible on a single machine without being blocked?