
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Results 56 sparkler issues

#### Task Description This is a task that is currently being worked on in order to provide Elasticsearch as a backend storage engine option for Sparkler. This builds upon the...

The [CI build is failing](https://github.com/USCDataScience/sparkler/runs/2160833582?check_suite_focus=true) ``` #10 35.54 Collecting scipy==1.4.1 #10 35.56 Downloading scipy-1.4.1.tar.gz (24.6 MB) #10 38.26 Installing build dependencies: started #10 104.0 Installing build dependencies: still running... #10...

Would something like Apache Beam be a more modern way of doing the same Spark stuff, but in an agnostic fashion? This would allow us to be less dependent on...

As part of the Elasticsearch for Sparkler set of issues, @lewismc requested that the team create separate Maven profiles for the Solr and Elasticsearch dependencies so we don't pull in unnecessary...

The first task is defining and expressing the **focus crawling** specification. The second subtask will be implementing that specification in Sparkler. Currently, we have support for URL-based focus/filters. This...

First, thanks for the project. Sounds great. I am wondering if there is any chance to extract particular text items and images from web pages and map these extracted fields...

enhancement
Discussion
volunteer wanted

## Background
The injector uses a URLValidator utility to validate URLs before injection.
## Problem
The URL validator used in the injector is too strict and often rejects valid URLs. Example: we...
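To illustrate the kind of lenient check this issue seems to be asking for, here is a minimal Python sketch built on the standard library's `urllib.parse`. It is not Sparkler's `URLValidator`; the function name `is_crawlable_url` and the set of accepted schemes are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Assumption: only web URLs are worth injecting into a crawl.
ALLOWED_SCHEMES = {"http", "https"}


def is_crawlable_url(url: str) -> bool:
    """Return True if the URL has an allowed scheme and a non-empty host.

    Deliberately permissive: it does not reject underscores in host names,
    unusual TLDs, or long query strings, which stricter validators often do.
    """
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for malformed input such as an
        # unterminated IPv6 literal.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)
```

The design choice is to check only what a fetcher strictly needs (scheme and host) and leave everything else to the fetch step, so that valid but unusual URLs are not dropped at injection time.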

There are two FIXMEs in the configuration: First, support loading `sparkler-defaults.yaml` and `sparkler-site.yaml`. The common practice is that `*-default.yaml` provides the default and recommended values from developers. The `*-site.yaml` should be used by users...
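The defaults-plus-site layering described here can be sketched as a recursive merge in which site values override defaults. This is a Python illustration only, not Sparkler's configuration loader: plain dicts stand in for the parsed YAML files, and `merge_config` and all keys shown are hypothetical.

```python
from copy import deepcopy


def merge_config(defaults: dict, site: dict) -> dict:
    """Return a new dict where values from `site` override `defaults`.

    Nested dicts are merged key by key; any other value (scalars, lists)
    is replaced wholesale by the site value. Neither input is modified.
    """
    merged = deepcopy(defaults)
    for key, value in site.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for the two parsed YAML files.
defaults = {"crawl": {"delay_ms": 1000, "threads": 4}, "backend": "solr"}
site = {"crawl": {"delay_ms": 250}}

config = merge_config(defaults, site)
```

With this approach a user only lists the keys they want to change in the site file, and everything else falls back to the developer-provided defaults.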

Some high level remaining tasks:
- [x] Add solr relation
- [x] Pick up spark details from relation
- [x] Pick up solr details from relation
- [x] Finish write...

+ review if this can be generalised as `Parser`
+ Generalise schema to fit all possible extractions that may come up in the future