Sawood Alam comments

Results: 409 comments by Sawood Alam

Implement memory-efficient in-file binary search for CDXJ indexes

The in-file binary search I implemented for MementoMap shows good speed: about one millisecond per lookup on average. This measurement was performed on an index with...
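For readers unfamiliar with the technique, here is a minimal sketch of an in-file binary search over a sorted, line-oriented index. It is not the MementoMap implementation; the function names `lookup` and `_line_after` are hypothetical, and the sketch assumes the file is sorted in byte order (e.g. with `LC_ALL=C sort`), that every line begins with a non-empty key, and that the key is separated from the rest of the line by a space.

```python
import os


def _line_after(f, pos):
    """Return the first complete line found from byte offset `pos`.

    At offset 0 this is the first line of the file; at any other offset
    the (possibly partial) line under the cursor is discarded first.
    Returns b"" at end of file.
    """
    f.seek(pos)
    if pos:
        f.readline()  # skip the partial line we landed in
    return f.readline()


def lookup(path, key, sep=b" "):
    """Return the first line whose key equals `key`, or None if absent.

    Only O(log n) lines are read, so memory use stays constant no
    matter how large the index file is.
    """
    target = key.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        # Find the smallest offset whose following line has key >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_after(f, mid)
            if line and line.rstrip(b"\r\n").split(sep, 1)[0] < target:
                lo = mid + 1
            else:
                hi = mid
        line = _line_after(f, lo).rstrip(b"\r\n")
    if line and line.split(sep, 1)[0] == target:
        return line.decode("utf-8")
    return None
```

Calling `lookup("index.cdxj", "com,example)/")` (both arguments are placeholders) would return the matching line as a string, or `None` when the key is not in the file. Searching over byte offsets rather than line numbers is what avoids loading or pre-scanning the file.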

Implement memory-efficient in-file binary search for CDXJ indexes

The data I worked on is not the kind of CDXJ data that IPWB's current code might expect. While I have large CDXJ files (not the ones with locator key...

Implement memory-efficient in-file binary search for CDXJ indexes

* [Historical UK Web Archive CDX Index](http://data.webarchive.org.uk/opendata/ukwa.ds.2/cdx/)

Implement memory-efficient in-file binary search for CDXJ indexes

Here are the corresponding code blocks in MementoMap:

* https://github.com/oduwsdl/MementoMap/blob/555644c5e1eb56d3dbdf83ab2a18c1aab51845f7/mementomap/mementomap.py#L131-L169
* https://github.com/oduwsdl/MementoMap/blob/555644c5e1eb56d3dbdf83ab2a18c1aab51845f7/main.py#L45-L60
* https://github.com/oduwsdl/MementoMap/blob/555644c5e1eb56d3dbdf83ab2a18c1aab51845f7/main.py#L98-L101

Implement memory-efficient in-file binary search for CDXJ indexes

> To get realistic benchmarks, how about we generate some IPWB-style CDXJ TimeMaps? We don't necessarily need to generate dummy data to test how it would behave on large indexes....

Implement memory-efficient in-file binary search for CDXJ indexes

I just realized that a single binary lookup takes a fraction of a millisecond. In the case of MementoMap, the supplied URI is looked up many times iteratively with shorter and...

Implement memory-efficient in-file binary search for CDXJ indexes

Also, you should see a significant rise in memory consumption.

Implement memory-efficient in-file binary search for CDXJ indexes

In a generic file, you can't count the number of lines without reading it all the way through. However, if you are looking for a rough estimate, you can calculate the average number...
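The comment is truncated, so the following is only one plausible reading of the suggestion: sample the start of the file, compute the average number of bytes per line in that sample, and divide the file size by it. The function name `estimate_line_count` and the `sample_bytes` parameter are hypothetical.

```python
import os


def estimate_line_count(path, sample_bytes=1 << 20):
    """Roughly estimate the number of lines without reading the whole file.

    Assumes the first `sample_bytes` bytes are representative of the
    average line length in the rest of the file.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0
    with open(path, "rb") as f:
        chunk = f.read(sample_bytes)
    newlines = chunk.count(b"\n")
    if len(chunk) == size:
        # The sample covered the entire file, so the count is exact.
        return newlines + (0 if chunk.endswith(b"\n") else 1)
    # Average bytes per line in the sample (treat a sample with no
    # newline as a single line), then extrapolate to the full size.
    avg_line_len = len(chunk) / max(newlines, 1)
    return round(size / avg_line_len)
```

The estimate is only as good as the sample: if line lengths vary a lot across the file, sampling at several offsets instead of only at the start would likely give a better figure.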

Implement memory-efficient in-file binary search for CDXJ indexes

Thanks @ikreymer for the pointer to IndexedDB. I am sure we will find a good use-case for that. However, this ticket is about implementing binary search in CDXJ files on...

Implement memory-efficient in-file binary search for CDXJ indexes

Thanks @anatoly-scherbakov for your interest in the project, contributions, and thoughts. We really appreciate it. Yes, you are right about the index being a key/value store problem which has been...