cdxj-indexer
CDX files generated are not sorted
Similar to the wayback indexer, this indexer doesn't produce a sorted CDX file, so when you try to use the output with pywb it fails to find links correctly. Was there a particular design decision behind this behaviour?
I should add that I am only looking at using CDX files. This is because I want to test out pywb and openwayback and, as far as I can tell from the docs and code, openwayback 2.3.2 doesn't support CDXJ. I found some mention of CDXJ and openwayback in reference to openwayback 3.0.0, but as it is a stale branch on GitHub I assume it has been abandoned.
This package is still in development and I just haven't had a chance to add sorting yet.
Perhaps it should be the default option, or via -s flag (consistent with the cdx-indexer in pywb).
Of course, you can also just pipe the index through a cmdline sort tool, `cdxj-indexer
Or you could actually use the cdx-indexer in pywb, which defaults to regular CDX. Basically, this package is an effort to split that functionality into its own package in a cleaner way, but I haven't had a chance to make much progress on it.
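For anyone who wants to do the sort step mentioned above without a shell, here is a minimal Python sketch. It assumes the index is UTF-8 text with one record per line, that an optional header line starts with ` CDX`, and that plain byte order (what `LC_ALL=C sort` gives) is what the lookup expects. The function names `sort_cdx_lines` and `sort_cdx_file` are made up for illustration and are not part of cdxj-indexer or pywb.

```python
def sort_cdx_lines(lines):
    """Return CDX lines in byte order, keeping a header line (if any) first."""
    # A CDX header line conventionally starts with " CDX"; keep at most one.
    header = [ln for ln in lines if ln.startswith(" CDX")]
    # Drop blank lines and header lines from the records to be sorted.
    records = [ln for ln in lines if ln.strip() and not ln.startswith(" CDX")]
    # Compare on the encoded bytes so the order is locale-independent.
    records.sort(key=lambda ln: ln.encode("utf-8"))
    return header[:1] + records


def sort_cdx_file(src_path, dst_path):
    """Read an unsorted index from src_path and write a sorted copy to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        lines = src.read().splitlines()
    with open(dst_path, "w", encoding="utf-8") as dst:
        for ln in sort_cdx_lines(lines):
            dst.write(ln + "\n")
```

This loads the whole index into memory, so it only suits indexes that fit comfortably in RAM; for larger ones an external sort tool is the better choice.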
Ah, didn't realise the pywb cdx-indexer would run separately from pywb. Doing some testing, I'm finding that in all cases your cdx(j) indexers are significantly faster than the openwayback versions, good job! :)
Several years later, coming back to this project... this will be fixed in the 1.1.0 release, finally :)