traverseda
```
time cdx-indexer --sort ./0.cdx collections/wiki/archive/0.warc collections/wiki/archive/1.warc
7382.09user 33.70system 2:05:19elapsed 98%CPU (0avgtext+0avgdata 6278660maxresident)k
```
40 minutes for one file, 120 minutes for two files.
If I get this fixed so it runs in linear-ish time on https://github.com/webrecorder/cdxj-indexer is it reasonable for pywb to depend on it? I think the only thing cdxj-indexer is missing...
I'd consider `O(n log(n))` to be linear-ish ;p Python's timsort can even do better than that, on average, somewhere between `O(n)` and `O(n log(n))`. Since my input is already semi-sorted,...
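To illustrate the point about semi-sorted input, here is a minimal sketch that counts how many comparisons Python's built-in `sorted` makes on sorted, nearly sorted, and shuffled lists. The helper `count_comparisons`, the list size, and the number of swaps are all made up for this illustration; none of it comes from pywb or cdxj-indexer.

```python
import functools
import random


def count_comparisons(data):
    """Sort a copy of data and return how many comparisons sorted() made."""
    counter = 0

    def cmp(a, b):
        nonlocal counter
        counter += 1
        return (a > b) - (a < b)

    sorted(data, key=functools.cmp_to_key(cmp))
    return counter


n = 10_000  # arbitrary size chosen for the illustration
rng = random.Random(0)

already_sorted = list(range(n))

# "Semi-sorted": a sorted list with a handful of random swaps applied.
semi_sorted = list(range(n))
for _ in range(20):
    i, j = rng.randrange(n), rng.randrange(n)
    semi_sorted[i], semi_sorted[j] = semi_sorted[j], semi_sorted[i]

shuffled = list(range(n))
rng.shuffle(shuffled)

sorted_cmps = count_comparisons(already_sorted)
semi_cmps = count_comparisons(semi_sorted)
shuffled_cmps = count_comparisons(shuffled)

print("sorted:  ", sorted_cmps)
print("semi:    ", semi_cmps)
print("shuffled:", shuffled_cmps)
```

The sorted and nearly sorted inputs should need far fewer comparisons than the shuffled one, which is the behaviour that makes sorting an almost-ordered index cheap.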
So after playing around with the much simplified implementation in https://github.com/webrecorder/cdxj-indexer I'm very confident that the problem exists in the underlying warcio library, not in the pywb wrapping layers.
I don't know, I'll have to look at it when I've got more time.
So the base warcio does not have this issue, but cdxj-indexer does, regardless of the use of `sort`.  
On copy-on-write filesystems like btrfs it would be nice if we could use btrfs's reflinks.
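As a rough sketch of what that could look like from Python: on Linux, a reflink can be requested with the `FICLONE` ioctl, falling back to an ordinary copy when the filesystem does not support it. The function name `reflink_or_copy` is hypothetical, and the hard-coded request number assumes a common Linux architecture such as x86 or ARM. This is not taken from any existing project code.

```python
import fcntl
import shutil

# Linux FICLONE ioctl request number (assumption: x86/ARM encoding of
# _IOW(0x94, 9, int); some other architectures encode it differently).
FICLONE = 0x40049409


def reflink_or_copy(src, dst):
    """Clone src to dst via reflink if possible, else do a normal copy.

    Returns True if the reflink succeeded, False if a regular copy was made.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the filesystem to make dst share src's extents.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            # Not supported here (e.g. ext4, tmpfs, cross-device): fall through.
            pass
    shutil.copyfile(src, dst)
    return False
```

Either way the destination ends up with the same contents. On btrfs the reflinked copy should share data blocks with the source until one of them is modified.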
This is a pretty big chunk of code. More than I've got the time to read through right now, especially with the "machine learning" factor. Can you give me a...
Having problems that look like this in https://github.com/traverseda/mycroft-skill-unitconversion Are you saying there should be guaranteed perfect matches on entities? I've also had it fail to return values that weren't optional....
One thing that I've done with sakura on my own projects is convert it to use a flexbox column layout instead of `margin:auto` based centering. The big advantage of this...