
Automate webpages at scale and scrape web data completely and accurately with high-performance, distributed RPA.

= What is Pulsar?
Vincent Zhang <[email protected]>
3.0, July 29, 2022: Pulsar README
:toc:
:icons: font
:url-quickref: https://docs.asciidoctor.org/asciidoc/latest/syntax-quick-reference/

English | link:README-CN.adoc[简体中文]

== Introduction

Pulsar is the ultimate open source solution to scrape web data at scale.

Extracting web data at scale is extremely hard. #Websites change frequently and are becoming more complex, meaning web data collected is often inaccurate or incomplete#. Pulsar provides a series of advanced features to solve this problem.

Pulsar supports the Network As A Database paradigm: it turns the Web into tables and charts using simple SQL, and it even lets us query the web with SQL directly.
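For example, with a SQL context (covered in detail later in this README), a single X-SQL statement reads a live page as if it were a table. A minimal sketch; the URL and CSS selectors are purely illustrative:

[source,kotlin]
----
// a minimal sketch: query a live webpage as if it were a table
// (the URL and CSS selectors below are illustrative, not real)
val context = SQLContexts.create()
val rs = context.executeQuery("""
    select dom_first_text(dom, '#title') as title, dom_first_text(dom, '#price') as price
    from load_and_select('https://www.example.com/item/123', 'body')
""")
----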

We also plan to release an advanced AI that automatically extracts every field in a webpage with notable accuracy.

== Features

  • Web spider: browser rendering, ajax data crawling
  • High performance: highly optimized, renders hundreds of pages in parallel on a single machine without being blocked
  • Low cost: scraping 100,000 browser-rendered e-commerce webpages, or n * 10,000,000 data points, each day requires only an 8-core CPU and 32 GB of memory
  • Data quantity assurance: smart retry, accurate scheduling, web data lifecycle management
  • Large scale: fully distributed, designed for large-scale crawling
  • Simple API: a single line of code to scrape, or a single SQL statement to turn a website into a table
  • X-SQL: extended SQL to manage web data: web crawling, scraping, web content mining, web BI
  • Bot stealth: web driver stealth, IP rotation, privacy context rotation, never get banned
  • RPA: simulate human behaviors, crawl SPAs, or do something else awesome
  • Big data: support for various backend storage: MongoDB/HBase/Gora
  • Logs & metrics: closely monitored, every event recorded

== Get started

Most scraping attempts can start with (almost) a single line of code:

Kotlin

[source,kotlin,options="nowrap"]
----
fun main() = PulsarContexts.createSession().scrapeOutPages(
    "https://www.amazon.com/", "-outLink a[href~=/dp/]", listOf("#title", "#acrCustomerReviewText"))
----

The code above scrapes the fields specified by the CSS selectors #title and #acrCustomerReviewText from a set of product pages. The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/sites/topEc/english/amazon/AmazonCrawler.kt[kotlin].
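scrapeOutPages also returns the scraped fields, so a variant that keeps and prints the result might look like the following sketch (assuming one field map is returned per out page):

[source,kotlin]
----
// a sketch: keep the result of scrapeOutPages and print each page's fields
fun main() {
    val fields = PulsarContexts.createSession().scrapeOutPages(
        "https://www.amazon.com/", "-outLink a[href~=/dp/]", listOf("#title", "#acrCustomerReviewText"))
    fields.forEach { println(it) }
}
----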

Most #real-world# scraping projects can start with the following code snippet:

Kotlin

[source,kotlin]
----
fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: Document ->
        // use the document
        // ...
        // and then extract further hyperlinks
        context.submitAll(document.selectHyperlinks("a[href~=/dp/]"))
    }
    val urls = LinkExtractors.fromResource("seeds10.txt")
        .map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls).await()
}
----

The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/ContinuousCrawler.kt[kotlin].

== Core concepts

The core Pulsar concepts include the following:

  • Web Scraping
  • Network As A Database
  • Auto Extract
  • Browser Rendering
  • Pulsar Context
  • Pulsar Session
  • URLs
  • Hyperlinks
  • Load Options
  • Event Handler
  • X-SQL

Check link:docs/concepts.adoc#_the_core_concepts_of_pulsar[Pulsar concepts] for details.

== Usage: use Pulsar as a library

The simplest way to leverage the power of Pulsar is to add it to your project as a library.

Maven:

[source,xml]
----
<dependency>
    <groupId>ai.platon.pulsar</groupId>
    <artifactId>pulsar-all</artifactId>
    <version>1.9.12</version>
</dependency>
----

Gradle:

[source,kotlin]
----
implementation("ai.platon.pulsar:pulsar-all:1.9.12")
----

=== Basic usage

Kotlin

[source,kotlin]
----
// create a pulsar session
val session = PulsarContexts.createSession()
// the main url we are playing with
val url = "https://list.jd.com/list.html?cat=652,12345,12349"
// load a page, fetch it from the web if it has expired or if it's the first time to fetch
val page = session.load(url, "-expires 1d")
// parse the page content into a Jsoup document
val document = session.parse(page)
// do something with the document
// ...

// or, load and parse
val document2 = session.loadDocument(url, "-expires 1d")
// do something with the document
// ...

// load all pages with links specified by -outLink
val pages = session.loadOutPages(url, "-expires 1d -itemExpires 7d -outLink a[href~=item]")
// load, parse and scrape fields
val fields = session.scrape(url, "-expires 1d", "li[data-sku]", listOf(".p-name em", ".p-price"))
// load, parse and scrape named fields
val fields2 = session.scrape(url, "-i 1d", "li[data-sku]", mapOf("name" to ".p-name em", "price" to ".p-price"))
----

The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/BasicUsage.kt[kotlin], link:pulsar-app/pulsar-examples/src/main/java/ai/platon/pulsar/examples/BasicUsage.java[java].
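The scrape calls above return the extracted fields directly, so they can be consumed right away. A minimal sketch, reusing fields2 from the snippet above:

[source,kotlin]
----
// a minimal sketch: print the named fields scraped above;
// each entry should correspond to one element matched by "li[data-sku]"
fields2.forEach { println(it) }
----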

=== Load options

Note that most of our scraping methods accept a parameter called load arguments, or load options, to control how to load/fetch a webpage.

----
-expires     // The expiry time of a page
-itemExpires // The expiry time of item pages in some batch scraping methods
-outLink     // The selector for out links to scrape
-refresh     // Force (re)fetch the page, just like hitting the refresh button on a real browser
-parse       // Trigger the parse phase
-resource    // Fetch the url as a resource without browser rendering
----

Check link:docs/concepts.adoc#_load_options[Load Options] for details.
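Several load options can be combined in a single argument string. Below is a minimal sketch, assuming the session from the basic usage snippet; the URLs are purely illustrative:

[source,kotlin]
----
// a sketch: combine load options in one argument string (illustrative URLs)
// force a refetch and trigger the parse phase in a single call
val page = session.load("https://www.example.com/", "-refresh -parse")
// fetch a url as a plain resource, skipping browser rendering
val resource = session.load("https://www.example.com/robots.txt", "-resource")
----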

=== Extracting web data

Pulsar uses https://jsoup.org/[jsoup] to extract data from html documents. Jsoup parses HTML to the same DOM as modern browsers do. Check https://jsoup.org/cookbook/extracting-data/selector-syntax[selector-syntax] for all the supported CSS selectors.

Kotlin

[source,kotlin]
----
val document = session.loadDocument(url, "-expires 1d")
val price = document.selectFirst(".price").text()
----
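Beyond selectFirst, the full jsoup selector API is available on the document. A short sketch of common patterns; the selectors are illustrative and assume the matched elements exist:

[source,kotlin]
----
// a sketch of common jsoup extraction patterns; the selectors are illustrative
val names = document.select("li[data-sku] .p-name em").map { it.text() }
// "abs:href" resolves the link against the page's base uri
val firstLink = document.selectFirst("a[href~=item]").attr("abs:href")
----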

=== Continuous crawls

It's really simple to scrape a massive url collection or run continuous crawls in Pulsar.

Kotlin

[source,kotlin]
----
fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: Document ->
        // do something wonderful with the document
        println(document.title() + "\t|\t" + document.baseUri())
    }
    val urls = LinkExtractors.fromResource("seeds.txt")
        .map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls)
    // feel free to submit millions of urls here
    context.submitAll(urls)
    // ...
    context.await()
}
----

Java

[source,java]
----
public class ContinuousCrawler {

    private static void onParse(WebPage page, Document document) {
        // do something wonderful with the document
        System.out.println(document.title() + "\t|\t" + document.baseUri());
    }

    public static void main(String[] args) {
        PulsarContext context = PulsarContexts.create();

        List<Hyperlink> urls = LinkExtractors.fromResource("seeds.txt")
                .stream()
                .map(seed -> new ParsableHyperlink(seed, ContinuousCrawler::onParse))
                .collect(Collectors.toList());
        context.submitAll(urls);
        // feel free to submit millions of urls here
        context.submitAll(urls);
        // ...
        context.await();
    }
}
----

The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/MassiveCrawler.kt[kotlin], link:pulsar-app/pulsar-examples/src/main/java/ai/platon/pulsar/examples/ContinuousCrawler.java[java].

=== Use X-SQL to query the web

Scrape a single page:

[source,sql]
----
select
    dom_first_text(dom, '#productTitle') as title,
    dom_first_text(dom, '#bylineInfo') as brand,
    dom_first_text(dom, '#price tr td:matches(^Price) ~ td, #corePrice_desktop tr td:matches(^Price) ~ td') as price,
    dom_first_text(dom, '#acrCustomerReviewText') as ratings,
    str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1s -njr 3', 'body');
----

Execute the X-SQL:

[source,kotlin]
----
val context = SQLContexts.create()
val rs = context.executeQuery(sql)
println(ResultSetFormatter(rs, withHeader = true))
----

The result is as follows:


----
TITLE                                                   | BRAND                  | PRICE   | RATINGS       | SCORE
HUAWEI P20 Lite (32GB + 4GB RAM) 5.84" FHD+ Display ... | Visit the HUAWEI Store | $1.9.12 | 1,349 ratings | 4.40
----

The example code can be found here: link:pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/XSQLDemo.kt[kotlin].

== Usage: run Pulsar as a REST service

When Pulsar runs as a REST service, X-SQL can be used to scrape webpages or query web data directly, at any time and from anywhere, without opening an IDE.

=== Build from source

[source,shell]
----
git clone https://github.com/platonai/pulsar.git
cd pulsar && bin/build-run.sh
----

For Chinese developers, we strongly suggest you follow link:bin/tools/maven/maven-settings.adoc[these] instructions to speed up the build.

=== Use X-SQL to query the web

Start the pulsar server if not started:

[source,shell]
----
bin/pulsar
----

Scrape a webpage in another terminal window:

[source,shell]
----
bin/scrape.sh
----

The bash script is quite simple; it just uses curl to post an X-SQL:

[source,shell]
----
curl -X POST --location "http://localhost:8182/api/x/e" -H "Content-Type: text/plain" -d "
select
    dom_base_uri(dom) as url,
    dom_first_text(dom, '#productTitle') as title,
    str_substring_after(dom_first_href(dom, '#wayfinding-breadcrumbs_container ul li:last-child a'), '&node=') as category,
    dom_first_slim_html(dom, '#bylineInfo') as brand,
    cast(dom_all_slim_htmls(dom, '#imageBlock img') as varchar) as gallery,
    dom_first_slim_html(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width > 400)') as img,
    dom_first_text(dom, '#price tr td:contains(List Price) ~ td') as listprice,
    dom_first_text(dom, '#price tr td:matches(^Price) ~ td') as price,
    str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1d -njr 3', 'body');"
----

The example code can be found here: link:bin/scrape.sh[bash], link:bin/scrape.bat[batch], link:pulsar-client/src/main/java/ai/platon/pulsar/client/Scraper.java[java], link:pulsar-client/src/main/kotlin/ai/platon/pulsar/client/Scraper.kt[kotlin], link:pulsar-client/src/main/php/Scraper.php[php].

The response is as follows, in JSON format:

[source,json]
----
{
    "uuid": "cc611841-1f2b-4b6b-bcdd-ce822d97a2ad",
    "statusCode": 200,
    "pageStatusCode": 200,
    "pageContentBytes": 1607636,
    "resultSet": [
        {
            "title": "Tara Toys Ariel Necklace Activity Set - Amazon Exclusive (51394)",
            "listprice": "$19.99",
            "price": "$12.99",
            "categories": "Toys & Games|Arts & Crafts|Craft Kits|Jewelry",
            "baseuri": "https://www.amazon.com/dp/B00BTX5926"
        }
    ],
    "pageStatus": "OK",
    "status": "OK"
}
----

== Usage: experience Pulsar in an executable jar

We have released a standalone, executable jar, based on Pulsar. Download link:https://github.com/platonai/exotic#download[Exotic] and enjoy it with a single command line:

[source,shell]
----
java -jar exotic-standalone.jar
----

== Requirements

  • Memory 4G+
  • Maven 3.2+
  • The latest version of the Java 11 JDK
  • java and jar on the PATH
  • Google Chrome 90+

Pulsar is tested on Ubuntu 18.04, Ubuntu 20.04, Windows 7, Windows 11, and WSL; any other operating system that meets the requirements should work as well.

== Advanced topics

Check link:docs/faq/advanced-topics.adoc[advanced topics] for answers to the following questions:

  • What’s so difficult about scraping web data at scale?
  • How to scrape a million product pages from an e-commerce website a day?
  • How to scrape pages behind a login?
  • How to download resources directly within a browser context?
  • How to scrape a single page application (SPA)?
    ◦ Resource mode
    ◦ RPA mode
  • How to make sure all fields are extracted correctly?
  • How to crawl paginated links?
  • How to crawl newly discovered links?
  • How to crawl the entire website?
  • How to simulate human behaviors?
  • How to schedule priority tasks?
  • How to start a task at a fixed time point?
  • How to drop a scheduled task?
  • How to know the status of a task?
  • How to know what's going on in the system?
  • How to automatically generate the css selectors for fields to scrape?
  • How to extract content from websites using machine learning automatically with commercial accuracy?
  • How to scrape amazon.com to meet industrial needs?

== Compare with other solutions

In general, the features listed in the Features section are well supported by Pulsar but not by other solutions.

Check link:docs/faq/solution-comparison.adoc[solution comparison] for a detailed comparison with other solutions:

  • Pulsar vs selenium/puppeteer/playwright
  • Pulsar vs nutch
  • Pulsar vs scrapy+splash

== Technical details

Check link:docs/faq/technical-details.adoc[technical details] for answers to the following questions:

  • How to rotate my ip addresses?
  • How to hide my bot from being detected?
  • How & why to simulate human behaviors?
  • How to render as many pages as possible on a single machine without being blocked?