Mat Kelly

550 comments by Mat Kelly

It looks like a `heritrixPath + 'jobs/'` directory does not exist by default. Unsure where Heritrix 3.4 is saving job information by default, but this may be the source of...

Adding the `jobs/` directory allows it to be populated when running a crawl with Heritrix 3.4.0, but `index.cdx` remains at 0 bytes.

It seems like the WAIL logic is tied to `.warc` files and does not index `.warc.gz` via the cdx-indexer. This should be a straightforward programmatic fix.
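A minimal sketch of the kind of fix this points to: gather both `.warc` and `.warc.gz` files before handing them to the indexer. The function name `find_warcs` and the idea of scanning a single archive directory are assumptions for illustration, not WAIL's actual code.

```python
from pathlib import Path


def find_warcs(archive_dir):
    """Return sorted paths of .warc and .warc.gz files directly inside archive_dir."""
    # Hypothetical helper: str.endswith accepts a tuple, so both
    # uncompressed and gzipped WARCs are matched in one pass.
    suffixes = (".warc", ".warc.gz")
    return sorted(
        str(p)
        for p in Path(archive_dir).iterdir()
        if p.is_file() and p.name.endswith(suffixes)
    )
```

The resulting list could then be passed to whatever invokes the cdx-indexer, one file at a time.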

There still appears to be an issue with the CDXJ merging procedure in 1df85bcbadc67091c4a1d784d098aef71cfc7b92. The new CDXZ.GZ file is created but is reset to 0 bytes when merged with the existing `index.cdx`.

`allCDXesPath = config.wailPath + "/archiveIndexes/*.cdx"` is probably the culprit.
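One way a glob like this could empty the index, offered only as an unconfirmed guess: if the merged output also matches `*.cdx` and is opened for writing before the inputs are read, it is truncated first. The sketch below shows a merge that avoids that by reading everything up front and writing to a temporary file. The function name `merge_cdx`, the `.tmp` suffix, and the plain sort-and-de-duplicate strategy are assumptions, not WAIL's implementation.

```python
import glob
import os


def merge_cdx(index_dir, output_name="index.cdx"):
    """Merge every *.cdx file in index_dir into output_name, sorted and de-duplicated."""
    output_path = os.path.join(index_dir, output_name)
    lines = set()
    # Read all inputs (including any existing output file) before writing,
    # so an output that matches the glob is never truncated mid-merge.
    for path in glob.glob(os.path.join(index_dir, "*.cdx")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    lines.add(line)
    # Write to a temporary file, then swap it into place.
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as out:
        for line in sorted(lines):
            out.write(line + "\n")
    os.replace(tmp_path, output_path)
    return output_path
```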

The current master 7027b9b with OpenWayback 2.4.0 seems to work fine with regard to indexing, so maybe let's hold off on integrating the latest Heritrix just yet until we figure...

Tried this again by pulling the latest release of Heritrix into the latest WAIL master (distrib package at https://github.com/internetarchive/heritrix3/releases/tag/3.4.0-20200518), started a crawl from the WAIL UI, and Heritrix never started....

@ldko noted on the IIPC Slack #heritrix channel that [Heritrix dropped support for Java 7](https://github.com/internetarchive/heritrix3/pull/276) (August 2019) and suggested trying Java 8 through 11.

- [ ] Bundle Java 11 with...
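A quick way to check whether the Java on hand falls in that range is to parse the text printed by `java -version`. This is a small sketch, not part of WAIL; the function names are hypothetical, and the 8 to 11 bounds are taken from the suggestion above.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_251").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def is_supported(version_output, lowest=8, highest=11):
    """Return True when the parsed major version lies in [lowest, highest]."""
    major = java_major_version(version_output)
    return major is not None and lowest <= major <= highest
```

Note that `java -version` writes to standard error, so a caller using `subprocess.run` would need to capture `stderr` rather than `stdout`.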

In Java 11 there are some larger files in the JDK, such as `Contents/Home/lib/modules` (137.4 MB) and `Contents/Home/lib/src.zip` (57.5 MB), that don't play well with git. Removing these for testing so the...
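To spot such files before committing, a short script can walk the bundled JDK and list anything over a size threshold. This is only a sketch: `find_large_files` is a hypothetical helper, and the 100 MiB default reflects GitHub's per-file push limit rather than anything in WAIL itself.

```python
import os


def find_large_files(root, limit_bytes=100 * 1024 * 1024):
    """Return (path, size) pairs for files under root larger than limit_bytes."""
    large = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link's target is not counted or followed.
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((path, size))
    # Largest files first.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Lowering `limit_bytes` to 50 MiB would also flag a file the size of `src.zip`.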

The issue-345-java11 branch was never pushed to GitHub. The files were too big, and removing them caused a segfault on Heritrix launch. Added the full JDK back into the WAIL source locally without pushing...