Ohad Raviv

Results 13 comments of Ohad Raviv

Hi, we just came across the need to run this package. Spark's native `BucketedRandomProjectionLSH` wasn't good enough for us (mainly because of the bucket skew issue), and this library worked perfectly....

Well, in my company (PayPal), we work with our private accounts in the public GitHub space, so we would have permissions also in the future. Maybe you still know someone...

I was also surprised that it just worked, but what happens is just that the parent blocks are saved in memory, and every time you need to read the next...

good point. I actually thought about that and forgot to check. but it looks like we're good. the sorted-iterator only has one pointer to [Node next](https://github.com/paypal/dione/blob/b102569cad81c4bc4e735e6e558cf472e2bfd27f/dione-hadoop/src/main/java/com/paypal/dione/avro/hadoop/file/AvroBtreeFile.java#L223) and the Node object...

ok.. so after looking at `get(key)`, I saw that we could hop backwards there if we called `get()` multiple times. So I added an assertion to allow only bigger...

yeah.. it turned out to be relatively easy to implement. this one is a prerequisite for the cache (#72) to work now. and I tested #72 locally against s3...

not sure, currently the only place it is used (caching) is in `joinWithIndex`. are you using it? I mainly want to verify that we're not getting OOM during index file...

@eyala / @shay1bz - if you want to look at it.. we get this error `org.apache.avro.AvroRuntimeException: java.io.IOException: Block read partially, the data may be corrupt` when I run the test...

@eyala - any update here?

@eyala - can we close this issue?