traverseda comments

Results 247 comments of traverseda

Faster indexing

```
time cdx-indexer --sort ./0.cdx collections/wiki/archive/0.warc collections/wiki/archive/1.warc
7382.09user 33.70system 2:05:19elapsed 98%CPU (0avgtext+0avgdata 6278660maxresident)k
```

40 minutes for one file, 120 minutes for two files.
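As a rough check of how far from linear those two timings are, the growth exponent can be estimated from them. This is only a sketch: `growth_exponent` is a name made up for illustration, it assumes the two WARC files are about the same size, and two data points are far too few to fit a real complexity curve.

```python
import math


def growth_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ n**k from two (input size, run time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Timings quoted in the comment: 40 minutes for one file, 120 minutes for two.
# Assumption: both files are roughly the same size, so "size" is the file count.
k = growth_exponent(1, 40, 2, 120)
print(f"estimated exponent: {k:.2f}")  # 1.0 would be linear, 2.0 quadratic
```

An exponent well above 1 on inputs this small is consistent with the indexer doing more than linear work per record, though it does not say where.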

Faster indexing

If I get this fixed in https://github.com/webrecorder/cdxj-indexer so that it runs in linear-ish time, is it reasonable for pywb to depend on it? I think the only thing cdxj-indexer is missing...

Faster indexing

I'd consider `O(n log(n))` to be linear-ish ;p Python's timsort can even do better than that on partially sorted input, somewhere between `O(n)` and `O(n log(n))`. Since my input is already semi-sorted,...
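The timsort claim is easy to check by counting comparisons. A minimal sketch, assuming "semi-sorted" means sorted data with a handful of out-of-place entries; `count_comparisons` is a helper written for this illustration, not part of pywb, warcio, or cdxj-indexer.

```python
import random
from functools import cmp_to_key


def count_comparisons(data):
    """Return how many element comparisons sorted() performs on data."""
    calls = 0

    def cmp(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)

    sorted(data, key=cmp_to_key(cmp))
    return calls


n = 10_000
rng = random.Random(0)  # fixed seed so runs are repeatable

already_sorted = list(range(n))

# Assumed model of "semi-sorted" input: sorted data with 20 random swaps.
semi_sorted = list(range(n))
for _ in range(20):
    i, j = rng.randrange(n), rng.randrange(n)
    semi_sorted[i], semi_sorted[j] = semi_sorted[j], semi_sorted[i]

shuffled = list(range(n))
rng.shuffle(shuffled)

sorted_count = count_comparisons(already_sorted)
semi_count = count_comparisons(semi_sorted)
shuffled_count = count_comparisons(shuffled)

for name, count in [("sorted", sorted_count),
                    ("semi-sorted", semi_count),
                    ("shuffled", shuffled_count)]:
    print(f"{name:12s} {count:8d} comparisons")
```

The sorted case needs about `n` comparisons, the shuffled case is on the order of `n log(n)`, and the semi-sorted case lands in between, which is why the sort step alone is unlikely to explain a large slowdown on mostly ordered records.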

Faster indexing

So after playing around with the much simplified implementation in https://github.com/webrecorder/cdxj-indexer I'm very confident that the problem exists in the underlying warcio library, not in the pywb wrapping layers.

Faster indexing

I don't know, I'll have to look at it when I've got more time.

Faster indexing

So the base warcio does not have this issue, but cdxj-indexer does, regardless of the use of `sort`. ![warcio svg](https://user-images.githubusercontent.com/2125828/75477891-e168e000-5994-11ea-8515-f6850d00d735.png) ![cdxjindexer svg](https://user-images.githubusercontent.com/2125828/75478072-386eb500-5995-11ea-988d-38981faed9e6.png)
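The comment does not say which profiler produced the graphs above. As one way to get a comparable picture with only the standard library, here is a sketch using `cProfile` and `pstats`. The functions `read_records` and `index_records_quadratic` are hypothetical stand-ins that deliberately contain a quadratic loop; they are not warcio or cdxj-indexer code, and `profile_top` is a helper named for this example.

```python
import cProfile
import io
import pstats


def profile_top(func, *args, limit=5):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, buf.getvalue()


# Hypothetical stand-in for reading records from an archive.
def read_records(n):
    return [f"record-{i}" for i in range(n)]


# Hypothetical stand-in for an indexer with a hidden O(n^2) step:
# membership tests on a list scan the whole list each time.
def index_records_quadratic(n):
    seen = []
    for rec in read_records(n):
        if rec not in seen:
            seen.append(rec)
    return len(seen)


count, report = profile_top(index_records_quadratic, 2000)
print(report)
```

Running the same helper against the plain reader and against the full indexer on the same input would show which layer the extra time belongs to, which is the comparison the two graphs make.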

traverseda

Faster indexing

do not copy files into archive

Support named-group entity matches

Entity matching more than it should

Roadmap or future plan