hail icon indicating copy to clipboard operation
hail copied to clipboard

[compiler] make readers support uid generation

Open patrick-schultz opened this issue 2 years ago • 1 comments

High level changes:

  • TableRead and MatrixRead text representation change: where before the requested type could be None, it can now also be DropRowUIDs, or for MatrixRead DropColUIDs or DropRowColUIDs. That way in the common case of not needing the read to produce uids, we don't need to pollute the printed IR with large types.
  • hl.read_table gets an option _create_row_uids, to allow for testing uids in python, and similarly for hl.read_matrix_table
  • There are globally fixed default field names TableReader.uidFieldName, MatrixReader.rowUIDFieldName, and MatrixReader.colUIDFieldName. The full type of any TableReader/MatrixReader must contain these fields. If a consumer doesn't want uids, it just doesn't include them in the requested type. If it wants different field names, it must use a TableRename/MatrixRename node. This design ensures that the field pruner doesn't need any awareness of uids.
    • An exception to this rule is if the written data already contains any of these special fields, in which case they are just read as usual. This ensures that a write/read in the middle of a pipeline can't change uid fields. We're making the assumption that these reserved field names are never used in user data, so if written data contains one of these fields, it must have been created by us, and so has the correct uid semantics. (Note that this was a late change, and I may have missed converting some readers to handle this case.)
    • The uids fields always come last in the row/col struct. Note that this requires some care when lowering MatrixTable, to make sure the row uid field comes after the entries field.
  • PartitionReaders, on the other hand, must specify the name of their uid field. If this field is in the requested type, it will always be generated by the reader, even if the field already existed in the written data. It is now the responsibility of the consumer to choose the uid field name so as not to clobber an existing field.
  • Added a trait CountedIterator, for iterators which keep track of a row index or file offset. The method getCurIdx should be called after next(), to get the corresponding index. This avoids having to allocate tuples.

patrick-schultz avatar Jul 14 '22 19:07 patrick-schultz

@tpoterba I'm leaving the WIP label on this for now, because I'd like to make a careful pass over everything now that it's done, to make sure late bugfixes are handled consistently throughout. But tests are passing, and should be ready for you to start digging in.

patrick-schultz avatar Aug 10 '22 19:08 patrick-schultz