ArchiveSpark issues

WARC files written in ArchiveSpark incompatible with warcio

warcio raises `warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response` at the second WARC record (after the warc-info record) in a WARC file written with ArchiveSpark. Both state that they...

parismic
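
The error quoted above says the parser found `WARC-Type: response` where it expected the first line of a record. In the WARC format every record starts with a version line such as `WARC/1.0`, so this message suggests the reader was not positioned at a record boundary after the warc-info record. The following is a minimal, stdlib-only sketch for locating the first such position in an uncompressed WARC stream. It is not warcio's parser and does not claim to identify the cause in ArchiveSpark; `scan_records` and the sample bytes are hypothetical names and data introduced only for illustration.

```python
import io


def scan_records(stream):
    """Yield (offset, first_line, ok) for each record in an uncompressed WARC stream.

    Hypothetical helper: it walks records using Content-Length and stops at the
    first record whose first line is not a "WARC/x.y" version line.
    """
    while True:
        offset = stream.tell()
        line = stream.readline()
        # Skip the blank lines that separate consecutive records.
        while line in (b"\r\n", b"\n"):
            offset = stream.tell()
            line = stream.readline()
        if not line:
            return  # end of stream
        first = line.rstrip(b"\r\n").decode("utf-8", "replace")
        ok = first.startswith("WARC/")
        yield offset, first, ok
        if not ok:
            return  # out of sync; nothing after this point can be trusted
        # Read header lines up to the blank line, remembering Content-Length.
        length = 0
        while True:
            header = stream.readline()
            if header in (b"\r\n", b"\n", b""):
                break
            name, _, value = header.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        # Jump over the content block to the start of the record separator.
        stream.seek(length, io.SEEK_CUR)


# Illustrative data (an assumption, not taken from a real ArchiveSpark file).
INFO = b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
GOOD = INFO + b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 2\r\n\r\nhi\r\n\r\n"
BAD = INFO + b"WARC-Type: response\r\nContent-Length: 2\r\n\r\nhi\r\n\r\n"

for offset, first, ok in scan_records(io.BytesIO(BAD)):
    print(offset, repr(first), "ok" if ok else "NOT a version line")
```

For a gzip-compressed `.warc.gz` file, the stream would need to be decompressed first; the offsets reported then refer to the decompressed bytes.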

Can ArchiveSpark read and process binary payload in warc files?

I am running ArchiveSpark from Docker. The enrich function is not adding any payload when peekJson is called. The payloads in my WARC files are binary. Could that be the problem? If...

aysunakarsu

Small changes to support reading CommonCrawl files from S3

Hello! I've been using `ArchiveSpark` with the CommonCrawl files stored on S3. I found a few items that needed small fixes and I thought I'd send in a PR. I...

mmisiewicz

Unknown connection error when downloading from wayback

Hi, first of all, thanks for open-sourcing this great tool. I have been trying to learn the syntax and start playing around with archives using the "Downloading_WARC_from_Wayback" notebook using...

thusithaC

Updated build for scala 2.12/spark 3.1.2+?

Hello, is it possible to make a new build available on Maven with more recent Scala/Spark versions? The versions last used in 2019 (Scala 2.11) are no longer compatible with...

lesleyodu
