ArchiveSpark
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
warcio raises `warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response` at the second WARC record (after the warc-info record) in a WARC file written with ArchiveSpark. Both state that they...
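The first line quoted in that error suggests the reader met a header (`WARC-Type: response`) where the WARC format expects a version line such as `WARC/1.0` at the start of every record. As a minimal sketch for narrowing this down, the standard-library-only checker below walks an uncompressed WARC stream and reports the first line of each record. The function name `check_warc_version_lines` and the sample bytes are hypothetical illustrations, not part of warcio or ArchiveSpark, and the parsing is deliberately simplified (it trusts `Content-Length` and does not validate anything else).

```python
import io


def check_warc_version_lines(stream):
    """Yield (index, first_line, ok) for each record in an uncompressed WARC stream.

    `ok` is True when the record starts with a "WARC/" version line.
    Hypothetical helper for diagnosis only; not a full WARC parser.
    """
    index = 0
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        first_line = line.rstrip(b"\r\n").decode("utf-8", "replace")
        ok = first_line.startswith("WARC/")
        content_length = 0
        # If the version line is missing, the first line is itself a header.
        header_line = stream.readline() if ok else line
        while header_line and header_line.strip():
            name, _, value = header_line.decode("utf-8", "replace").partition(":")
            if name.strip().lower() == "content-length":
                content_length = int(value.strip())
            header_line = stream.readline()
        stream.read(content_length)  # skip the record body
        yield index, first_line, ok
        index += 1


# Made-up sample: a valid warcinfo record followed by a record
# that lacks its version line, mirroring the reported symptom.
sample = (
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC-Type: response\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
)
results = list(check_warc_version_lines(io.BytesIO(sample)))
for index, first_line, ok in results:
    print(index, repr(first_line), "ok" if ok else "missing version line")
```

To run this against a real file, pass a binary file object instead of the `io.BytesIO` sample; a gzip-compressed WARC would first need to be opened through `gzip.open(path, "rb")`.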
Running ArchiveSpark from Docker: the enrich function is not adding any payload when `peekJson` is called. The payloads in my WARC files are binary. Could that be the problem? If...
Hello! I've been using `ArchiveSpark` with the CommonCrawl files stored on S3. I found a few items that needed small fixes and I thought I'd send in a PR. I...
Hi, first of all, thank you for open-sourcing this great tool. I have been trying to learn the syntax and start playing around with archives using the "Downloading_WARC_from_Wayback" notebook using...
Hello, is it possible to make a new build available on Maven with more recent Scala/Spark versions? The versions last used in 2019 (Scala 2.11) are no longer compatible with...