cdxj-indexer

CDX files generated are not sorted

Open thomaspreece opened this issue 7 years ago • 3 comments

Similar to the wayback indexer, this indexer doesn't produce a sorted CDX file, so when you try to use the file with pywb, it fails to find links correctly. Was there a particular design decision behind this behaviour?

I should add that I am only looking at using CDX files. This is because I want to test out pywb and openwayback and, as far as I can find out (from docs/code), openwayback 2.3.2 doesn't support CDXJ. I found some mention of CDXJ and openwayback in reference to openwayback 3.0.0, but as it is a stale branch on GitHub I assume it has been abandoned.

thomaspreece avatar Oct 20 '17 10:10 thomaspreece

This package is still in development and I just haven't had a chance to add sorting yet.

Perhaps it should be the default option, or available via a -s flag (consistent with the cdx-indexer in pywb).

Of course, you can also just pipe the index through a cmdline sort tool: `cdxj-indexer | sort > file.cdx`

Or you could actually use the cdx-indexer in pywb, which defaults to regular CDX. Basically, this package is an effort to split that functionality into its own package in a cleaner way, but I haven't had a chance to make as much progress on it.

ikreymer avatar Oct 21 '17 01:10 ikreymer
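
For readers who would rather do the sort step in a script than in the shell, here is a minimal sketch. It assumes plain-text CDX lines and a locale-independent byte-order sort (roughly what `LC_ALL=C sort` gives), which is the ordering a lookup over a sorted index generally expects. `sort_cdx_lines` and the sample lines are hypothetical and are not part of cdxj-indexer or pywb.

```python
def sort_cdx_lines(lines):
    """Return CDX lines in byte order, keeping a ' CDX' header line first.

    Hypothetical helper for illustration only.
    """
    # A classic CDX file may start with a field-legend line such as " CDX N b a ...".
    header = [ln for ln in lines if ln.startswith(" CDX")]
    # Everything else that is not blank is treated as an index record.
    records = [ln for ln in lines if ln.strip() and not ln.startswith(" CDX")]
    # Sort on the encoded bytes so the result does not depend on the locale.
    records.sort(key=lambda ln: ln.encode("utf-8"))
    # Keep at most one header line, followed by the sorted records.
    return header[:1] + records


# Made-up, shortened sample lines (not real indexer output).
sample = [
    "org,example)/b 20170101000000 http://example.org/b\n",
    " CDX N b a\n",
    "com,example)/ 20170101000000 http://example.com/\n",
]
sorted_sample = sort_cdx_lines(sample)
```

With the sample above, the header line stays first and the `com,example)/` record moves ahead of the `org,example)/b` record.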

Ah, I didn't realise the pywb cdx-indexer could be run separately from pywb. From some testing, I'm finding that in all cases your cdx(j) indexers are significantly faster than the openwayback versions, good job! :)

thomaspreece avatar Oct 23 '17 08:10 thomaspreece

Several years later, coming back to this project... this will be fixed in the 1.1.0 release, finally :)

ikreymer avatar Aug 12 '20 00:08 ikreymer