hail
[compiler] make readers support uid generation
High level changes:
- `TableRead` and `MatrixRead` text representation change: where before the requested type could be `None`, it can now also be `DropRowUIDs`, or for `MatrixRead`, `DropColUIDs` or `DropRowColUIDs`. That way, in the common case of not needing the read to produce uids, we don't need to pollute the printed IR with large types.
- `hl.read_table` gets an option `_create_row_uids`, to allow for testing uids in Python, and similarly for `hl.read_matrix_table`.
- There are globally fixed default field names `TableReader.uidFieldName`, `MatrixReader.rowUIDFieldName`, and `MatrixReader.colUIDFieldName`. The full type of any `TableReader`/`MatrixReader` must contain these fields. If a consumer doesn't want uids, it just doesn't include them in the requested type. If it wants different field names, it must use a `TableRename`/`MatrixRename` node. This design ensures that the field pruner doesn't need any awareness of uids.
  - An exception to this rule is if the written data already contains any of these special fields, in which case they are just read as usual. This ensures that a write/read in the middle of a pipeline can't change uid fields. We're making the assumption that these reserved field names are never used in user data, so if written data contains one of these fields, it must have been created by us, and so has the correct uid semantics. (Note that this was a late change, and I may have missed converting some readers to handle this case.)
- The uid fields always come last in the row/col struct. Note that this requires some care when lowering `MatrixTable`, to make sure the row uid field comes after the entries field.
- `PartitionReader`s, on the other hand, must specify the name of their uid field. If this field is in the requested type, it will always be generated by the reader, even if the field already existed in the written data. It is now the responsibility of the consumer to choose the uid field name so as not to clobber an existing field.
- Added a trait `CountedIterator`, for iterators which keep track of a row index or file offset. The method `getCurIdx` should be called after `next()`, to get the corresponding index. This avoids having to allocate tuples.
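To illustrate the `CountedIterator` calling convention, here is a minimal Python sketch (the actual trait is in Scala and is not reproduced here). The class name `CountedLineIterator`, the `start_idx` parameter, and the snake-case `get_cur_idx` are hypothetical stand-ins chosen for this example; only the "call `next()`, then ask for the index" pattern comes from the description above.

```python
class CountedLineIterator:
    """Illustrative iterator that remembers the index of the last returned item.

    Hypothetical analogue of the CountedIterator idea: the consumer gets the
    value from next() and the index from get_cur_idx(), so no (index, value)
    tuple has to be allocated per element.
    """

    def __init__(self, items, start_idx=0):
        self._it = iter(items)
        self._next_idx = start_idx  # index the next returned item will have
        self._cur_idx = None        # index of the most recently returned item

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self._it)      # propagates StopIteration when exhausted
        self._cur_idx = self._next_idx
        self._next_idx += 1
        return value

    def get_cur_idx(self):
        # Only meaningful after next() has returned at least one item.
        if self._cur_idx is None:
            raise RuntimeError("get_cur_idx() called before next()")
        return self._cur_idx


# Usage: pair each row with its index without the iterator yielding tuples.
rows = CountedLineIterator(["x", "y", "z"], start_idx=10)
seen = {}
for row in rows:
    seen[row] = rows.get_cur_idx()
```

A real implementation could track a file offset instead of a row counter; the contract for the consumer would be the same.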
@tpoterba I'm leaving the WIP label on this for now, because I'd like to make a careful pass over everything now that it's done, to make sure late bugfixes are handled consistently throughout. But tests are passing, and it should be ready for you to start digging in.
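For reviewers, the uid field rules in the list above can be summarized as a small model. This is a rough Python sketch over lists of field names, not the actual Scala implementation: the constant `UID_FIELD`, its value `"__uid"`, and all three function names are hypothetical, and real readers work with full struct types rather than name lists.

```python
# Hypothetical placeholder; the real value of TableReader.uidFieldName is not stated here.
UID_FIELD = "__uid"


def table_reader_full_fields(written_fields):
    """Full row fields of a table reader: the written fields, with the uid field last."""
    if UID_FIELD in written_fields:
        # The written data already carries the reserved field: read it as usual.
        return list(written_fields)
    return list(written_fields) + [UID_FIELD]


def table_reader_plan(written_fields, requested_fields):
    """Return (fields_to_read, generate_uid) for a requested subset of the full type."""
    full = table_reader_full_fields(written_fields)
    unknown = [f for f in requested_fields if f not in full]
    if unknown:
        raise ValueError(f"requested fields not in full type: {unknown}")
    # Generate uids only if they were requested and are not already stored.
    generate = UID_FIELD in requested_fields and UID_FIELD not in written_fields
    to_read = [f for f in requested_fields if f in written_fields]
    return to_read, generate


def partition_reader_plan(written_fields, requested_fields, uid_field_name):
    """PartitionReader rule: a requested uid field is always generated by the reader."""
    generate = uid_field_name in requested_fields
    # A written field with the same name is not read; the generated one replaces it.
    to_read = [
        f for f in requested_fields
        if f in written_fields and f != uid_field_name
    ]
    return to_read, generate
```

Under this model, a consumer that leaves the uid field out of the requested type gets no uid generation at all, which is why the pruner needs no special handling.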