sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Bumps [tmpl](https://github.com/daaku/nodejs-tmpl) from 1.0.4 to 1.0.5. Commits See full diff in compare view Dependabot will resolve any conflicts with this PR as long as you don't alter...
Bumps [jetty-server](https://github.com/eclipse/jetty.project) from 9.4.0.v20161208 to 9.4.41.v20210516. Release notes Sourced from jetty-server's releases. 9.4.41.v20210516 Changelog This release resolves CVE-2021-28169 #6099 Cipher preference may break SNI if certificates have different key types...
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.4.4 to 1.5.10. Commits 8cd4c6c 1.5.10 ce7a01f [fix] Improve handling of empty port 0071490 [doc] Update JSDoc comment a7044e3 [minor] Use more descriptive variable name d547792 [security]...
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.7 to 1.14.9. Commits 13136e9 Release version 1.14.9 of the npm package. 2ec9b0b Keep headers when upgrading from HTTP to HTTPS. 5fc74dd Reduce nesting. 3d81dc3 Release version...
Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=<>] unknown field 'contenthash'
#### Issue Description I am trying to build and run the sparkler from the source. I am following the example given in the readme. I have injected a url and...
#### Issue Description
Build fails
```
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for sparkler-parent 0.2.2-SNAPSHOT:
[INFO]
[INFO] sparkler-parent .................................... SUCCESS [ 0.003 s]
[INFO] sparkler-tests-base ................................ SUCCESS [ 1.374 s]
[INFO]...
```
I dunno if there is anything obvious that springs to mind here @thammegowda or @karanjeets from back in the day. When I run Sparkler as a spark submit job on...
**Task Description** Most of the Elasticsearch implementation has already been written. There are still two major problems that need to be resolved: 1. [ElasticsearchResultIterator](https://github.com/felixloesing/sparkler/blob/8aad32886b223bd89ae9a3a27aa883bfdb730a2b/sparkler-core/sparkler-app/src/main/scala/edu/usc/irds/sparkler/storage/elasticsearch/ElasticsearchResultIterator.scala#L103) needs to implement deserialize(). We are...
#### Issue Description We are implementing unit tests to test general functionalities of Sparkler and later our connector to Elasticsearch. We will populate details as we write the tests.
#### Issue Description Please describe your issue, along with: - expected behavior - encountered behavior The crawler crashes unexpectedly after a while, claiming that resource limits have been reached. ####...