Sawood Alam
The in-file binary search I implemented for MementoMap is showing pretty good speed, about one millisecond per lookup on average. This measurement was performed on an index with...
The data I worked on is not the CDXJ data that IPWB's current code might expect. While I have large CDXJ files (not the ones with locator key...
* [Historical UK Web Archive CDX Index](http://data.webarchive.org.uk/opendata/ukwa.ds.2/cdx/)
Here are corresponding code blocks in MementoMap:
* https://github.com/oduwsdl/MementoMap/blob/555644c5e1eb56d3dbdf83ab2a18c1aab51845f7/mementomap/mementomap.py#L131-L169
* https://github.com/oduwsdl/MementoMap/blob/555644c5e1eb56d3dbdf83ab2a18c1aab51845f7/main.py#L45-L60
* https://github.com/oduwsdl/MementoMap/blob/555644c5e1eb56d3dbdf83ab2a18c1aab51845f7/main.py#L98-L101
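For readers who do not want to follow the links, here is a minimal sketch of the general in-file binary search idea. It is not the MementoMap code; the function names `lookup`, `_key`, and `_line_at_or_after` are hypothetical, and it assumes a text file sorted in byte order whose lookup key is the first space-separated field of each line.

```python
import os


def _key(line):
    # Assumption: the lookup key is the first space-separated field.
    return line.rstrip(b"\r\n").split(b" ", 1)[0]


def _line_at_or_after(f, offset):
    """Return the first complete line that starts at or after `offset`."""
    if offset > 0:
        # Step back one byte so that a line starting exactly at `offset`
        # is not skipped, then discard the rest of the preceding line.
        f.seek(offset - 1)
        f.readline()
    else:
        f.seek(0)
    return f.readline()  # b"" at end of file


def lookup(path, key):
    """Binary-search a byte-sorted file on disk; return the matching line or None."""
    target = key.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line has a key >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if not line or _key(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_at_or_after(f, lo)
    if line and _key(line) == target:
        return line.decode("utf-8").rstrip("\r\n")
    return None
```

Each probe costs one seek and at most two line reads, so the number of disk reads grows with the logarithm of the file size rather than with the number of lines, and nothing beyond the current line is held in memory.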
> To get realistic benchmarks, how about we generate some IPWB-style CDXJ TimeMaps?

We don't necessarily need to generate dummy data to test how it would behave on large indexes....
I just realized that a single binary lookup takes a fraction of a millisecond. In the case of MementoMap, the supplied URI is looked up many times iteratively with shorter and...
Also, you should see a significant rise in memory consumption.
In a generic file, you can't count the number of lines without reading the whole file. However, if you are looking for a rough estimate, you can calculate average number...
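One way to get such a rough estimate is sketched below. It assumes the average is taken over the bytes per line in a sample read from the start of the file, and that the sample is representative of the rest; the function name `estimate_line_count` and the `sample_bytes` parameter are hypothetical.

```python
import os


def estimate_line_count(path, sample_bytes=1 << 20):
    """Estimate the number of lines from the file size and a sampled line length."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    newlines = sample.count(b"\n")
    if newlines == 0:
        # No line break in the sample: no basis for an average, so fall
        # back to treating the file as a single line.
        return 1
    avg_line_bytes = len(sample) / newlines
    return round(size / avg_line_bytes)
```

The estimate is only as good as the sample: if line lengths vary a lot across the file, sampling from several offsets instead of only the beginning would likely give a better average.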
Thanks @ikreymer for the pointer to IndexedDB. I am sure we will find a good use-case for that. However, this ticket is about implementing binary search in CDXJ files on...
Thanks @anatoly-scherbakov for your interest in the project, contributions, and thoughts. We really appreciate it. Yes, you are right about the index being a key/value store problem which has been...