
reader now supports .send() syntax for specifying index

Open relh opened this issue 1 year ago • 2 comments

reader now supports .send() syntax for specifying index, MapDataset sets up iterator and uses it with .send()

retry of https://github.com/vahidk/tfrecord/pull/96

I added .send() functionality to reader so it can seek to a position in the index before returning a value. I also added a TFRecordMapDataset that creates the generator, calls an initial next() on it to get one value, and is then capable of setting indices.

The dataset requires that a .tfindex has already been built.
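
For readers unfamiliar with the pattern, here is a minimal sketch of a generator that accepts an index via .send(). It is not the code from this PR: the length-prefixed file layout, `write_records`, and `seekable_reader` are hypothetical stand-ins, and the real TFRecord format and .tfindex file differ.

```python
import io
import struct


def write_records(payloads):
    """Build a toy length-prefixed file in memory (NOT the real TFRecord layout)."""
    buf = io.BytesIO()
    offsets = []
    for payload in payloads:
        offsets.append(buf.tell())                  # byte offset of this record
        buf.write(struct.pack("<Q", len(payload)))  # 8-byte little-endian length
        buf.write(payload)
    return buf, offsets


def seekable_reader(file, offsets):
    """Yield records in order; .send(i) makes record i the next one yielded."""
    position = 0
    while position < len(offsets):
        file.seek(offsets[position])
        (length,) = struct.unpack("<Q", file.read(8))
        record = file.read(length)
        requested = yield record
        # next() resumes with None (advance by one); send(i) resumes with i (jump).
        position = position + 1 if requested is None else requested


buf, offsets = write_records([b"a", b"bb", b"ccc", b"dddd"])
reader = seekable_reader(buf, offsets)
first = next(reader)    # priming call: a generator must be started before send()
third = reader.send(2)  # seek to record 2 and return it
fourth = next(reader)   # sequential reading continues from there
```

The priming `next()` is required by Python itself: calling `.send()` with a non-None value on a generator that has not started raises `TypeError`.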

relh avatar Aug 19 '24 19:08 relh

I don't really like this because we have to call next() on the iterator once and basically throw away the first value before being able to index, but it's a pretty minimal change to reader. Maybe I can refactor things to fix this dummy call.

relh avatar Aug 19 '24 20:08 relh

I added a new flag, map_access, everywhere. Also, TFRecordMapDataset now has a try: block around the first next(iter), though I think we don't need it because I changed the yields so they no longer call the pb2 parsing.

This is a lot more verbose than the original commit, but it's about 150% faster than the first commit for me. But I'm also loading ~7000 TFRecordMapDatasets and indexing randomly into them, so the dummy iter overhead is real.

relh avatar Aug 19 '24 21:08 relh

This was broken, but now it can actually send to the inner iterator.

relh avatar Sep 05 '24 18:09 relh

I don't believe the right strategy is to hack random access into the iterator. I think the right way is to refactor the code to use a class instead of an iterator. Given the magnitude of the change, I'll close this PR. I might take a stab at it if I get a chance. Note also that random access for tfrecord is not encouraged: the whole point of this library is that it's fast because it does sequential access.
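
As a rough illustration of the class-based shape described above (not the maintainer's design and not the tfrecord API), random access could be exposed through `__getitem__` rather than a generator. `IndexedRecordFile` and the toy length-prefixed layout below are assumptions made for the sketch.

```python
import io
import struct


class IndexedRecordFile:
    """Random access over a toy length-prefixed file (hypothetical, not tfrecord)."""

    def __init__(self, file, offsets):
        self._file = file
        self._offsets = list(offsets)  # byte offset of each record

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        if not -len(self) <= i < len(self):
            raise IndexError(i)
        self._file.seek(self._offsets[i])
        (length,) = struct.unpack("<Q", self._file.read(8))
        return self._file.read(length)


# Build a small in-memory file: 8-byte little-endian length, then the payload.
buf = io.BytesIO()
offsets = []
for payload in [b"a", b"bb", b"ccc"]:
    offsets.append(buf.tell())
    buf.write(struct.pack("<Q", len(payload)))
    buf.write(payload)

records = IndexedRecordFile(buf, offsets)
last = records[2]   # no priming call or discarded first value is needed
first = records[0]
```

Each lookup costs a seek, which is the sequential-access trade-off mentioned above.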

vahidk avatar Sep 05 '24 18:09 vahidk

Yeah, agreed, what I had suggested was really ugly in a round peg -> square hole kind of way.

I have a situation where I am creating pairs from samples that are offset from each other in the same record. Maybe there's a more elegant queue-like system I can just implement in my own code. Anyway, thanks for following up!
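
One way such a queue-like system might look, sketched under the assumption that the pairs are (sample i, sample i + offset) and that samples arrive from any sequential iterator; `offset_pairs` is a hypothetical helper, not part of tfrecord:

```python
from collections import deque


def offset_pairs(samples, offset):
    """Yield (samples[i], samples[i + offset]) in a single sequential pass."""
    if offset < 1:
        raise ValueError("offset must be >= 1")
    # The bounded deque keeps only the last offset + 1 samples in memory.
    window = deque(maxlen=offset + 1)
    for sample in samples:
        window.append(sample)
        if len(window) == offset + 1:
            yield window[0], window[-1]


# Any iterator works as input; range() stands in for a sequential record reader.
pairs = list(offset_pairs(iter(range(6)), offset=2))
```

This keeps the read pattern strictly sequential, so no index file or seeking is involved.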

relh avatar Sep 05 '24 18:09 relh