ArchiveSpark
Can ArchiveSpark read and process binary payloads in WARC files?
I am running ArchiveSpark from Docker. The enrich function is not adding any payload when peekJson is called. The payloads in my WARC files are binary. Could that be the problem? If so, is there a way to make ArchiveSpark work with WARC files that have binary payloads? Thanks
Better late than never. You can use the access method and read the payload from the input stream.
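For anyone landing here later, a minimal sketch of the stream-handling part, using only the JVM standard library. The exact signature of ArchiveSpark's access method is not given in this thread, so the stream below is a stand-in `ByteArrayInputStream`, and `BinaryPayloadSketch`, `readAllBytes` and `toBase64` are hypothetical names used only for illustration. The idea is to read the raw bytes without decoding them as text, then Base64-encode them if they need to appear in JSON output.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.util.Base64

object BinaryPayloadSketch {
  // Drain an InputStream into a byte array without assuming any text encoding.
  def readAllBytes(in: InputStream): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val buffer = new Array[Byte](8192)
    var n = in.read(buffer)
    while (n != -1) {
      out.write(buffer, 0, n)
      n = in.read(buffer)
    }
    out.toByteArray
  }

  // Base64 gives a JSON-safe text form of arbitrary bytes.
  def toBase64(bytes: Array[Byte]): String =
    Base64.getEncoder.encodeToString(bytes)
}

// Stand-in for a payload stream (assumption): with ArchiveSpark, the stream
// would come from the record's access method instead.
val payload = new ByteArrayInputStream(Array[Byte](0x50, 0x4b, 0x03, 0x04))
val encoded = BinaryPayloadSketch.toBase64(BinaryPayloadSketch.readAllBytes(payload))
```

Keeping the payload as bytes avoids the corruption that can happen when binary data is pushed through a string conversion.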