fastparquet
Record row group size (in memory) and re-use if available in `iter_row_group`
Hi,
Do you think we could record in the metadata the in-memory size of the row groups when writing them?
This data can be obtained when the dataframe is split into row groups at write time, thanks to df.memory_usage().sum().
Going further, it could even be stored for each column independently.
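For illustration, measuring this at write time could look like the sketch below; the custom_metadata argument (present in recent fastparquet versions) and the "pandas_rg_mem_sizes" key are only assumptions for the example, not an agreed storage location:

```python
import json
import fastparquet

def write_with_sizes(path, df, row_group_offsets):
    """Write df with fastparquet, recording each row group's in-memory size."""
    # row_group_offsets is assumed to be a list of starting row indices;
    # pandas reports the memory footprint of each slice that becomes a row
    # group, and deep=True also counts the Python string objects
    offsets = list(row_group_offsets) + [len(df)]
    sizes = [
        int(df.iloc[start:stop].memory_usage(deep=True).sum())
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]
    fastparquet.write(
        path,
        df,
        row_group_offsets=row_group_offsets,
        # assumption: store the sizes as a key/value entry in the file footer
        custom_metadata={"pandas_rg_mem_sizes": json.dumps(sizes)},
    )
    return sizes
```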
When iterating over the row groups to load the data back, instead of yielding them one by one, the idea would be to yield them in batches, each batch staying below a target memory size specified by the user. The batches of row groups can be predefined since their respective sizes are known.
pf.iter_row_groups(target_size='8MB') # for instance
The motivation for doing this within fastparquet is to re-use the to_pandas() capability of pre-allocating the dataframe, then filling it with the loaded row groups.
This feature could also be implemented as an outer layer: use iter_row_groups(), store the dataframes in a list while tracking their memory footprint, and once target_size is reached, concatenate all loaded row groups into a single dataframe.
This second approach requires a concatenation step that the first does not, and is hence more CPU intensive.
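For illustration, a minimal sketch of this outer-layer approach (the target_size name and the byte-based threshold are just placeholders):

```python
import pandas as pd
from fastparquet import ParquetFile

def iter_batched(path, target_size=8 * 2**20, **kwargs):
    """Yield concatenated batches of row groups, each under target_size bytes."""
    pf = ParquetFile(path)
    batch, batch_bytes = [], 0
    # kwargs (columns, categories, filters, ...) are passed through to
    # iter_row_groups unchanged
    for df in pf.iter_row_groups(**kwargs):
        size = df.memory_usage(deep=True).sum()
        if batch and batch_bytes + size > target_size:
            yield pd.concat(batch)
            batch, batch_bytes = [], 0
        batch.append(df)
        batch_bytes += size
    if batch:
        yield pd.concat(batch)
```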
Could there be an interest in this feature? I obviously have some, but if there is none from the fastparquet team, should I then favour the second approach?
Thanks for your feedback. Best,
There are two places where arbitrary metadata can be stored: in each column chunk, or in the dataset global top level. Since columns can be selectively loaded, only the former makes sense. In principle, storing the in-memory footprint, as determined by pandas, does not seem like a bad idea. Of course, no other framework does this, so there is no template for it, and the optimisations you have in mind would have no effect for data coming from anything but fastparquet. Given this, I would say I am mildly positive on the idea.
A couple of notes:
- the column chunks already say how big their buffer is after decompression (ColumnMetaData.total_uncompressed_size)
- we also know, for a given dtype and number of rows, how much memory it should take (e.g., 8 bytes per timestamp; dtype.itemsize) - NULL cells in a pandas nullable column still take the same memory as valid cells
- strings are the odd one out, since the array contains (8-byte) pointers, but we can already know their size too, since nrows*4 bytes of the decompressed buffer encode the lengths: all we need to know is the extra memory the python string object needs.
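For illustration, the fixed-width part of such an estimate is straightforward; the string-object overhead constant below is an assumption about CPython, not something fastparquet records:

```python
import numpy as np

# approximate per-object overhead of a small CPython str (assumption)
STRING_OBJECT_OVERHEAD = 49

def estimate_column_memory(dtype, nrows, total_char_bytes=0):
    """Rough in-memory size of a column of nrows values of the given dtype.

    For strings, total_char_bytes is the decoded character data, which the
    lengths stored in the decompressed buffer would let us compute.
    """
    dtype = np.dtype(dtype)
    if dtype.kind == "O":  # strings held as Python objects
        return nrows * (8 + STRING_OBJECT_OVERHEAD) + total_char_bytes
    # fixed-width dtypes: ints, floats, bools, datetimes, ...
    return nrows * dtype.itemsize
```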
Thanks Martin for the feedback.
> Of course, no other framework does this, so there is no template for it, and the optimisations you have in mind would have no effect for data coming from anything but fastparquet. Given this, I would say I am mildly positive on the idea.
This complicates things, but when the info is unavailable, we could have the second approach as a fallback mode:
- either try to estimate from one of the notes you added (I may not be understanding clearly, but it seems ColumnMetaData.total_uncompressed_size is already what we should use, isn't it?)
- if no information on size is available, iter_row_groups could embed the second approach described above: store dataframes in a list, measure their size as they are loaded, and derive a mean size per row. While the total size of the loaded dataframes is still below the target size, a new row group is only loaded if its number of rows times the estimated size per row keeps the total below the target; if not, the accumulated list is concatenated and yielded (sketched below).
This second approach is not as performant as the first (because of the concatenation), but it makes the feature available for parquet files coming from other frameworks.
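A rough sketch of that fallback, assuming no filters are used (so iter_row_groups yields one dataframe per entry of pf.row_groups, in order) and that the row counts come from the parquet metadata (rg.num_rows) before the corresponding dataframe is pulled from the iterator:

```python
import pandas as pd
from fastparquet import ParquetFile

def iter_batched_predictive(path, target_size=8 * 2**20, **kwargs):
    """Batch row groups, predicting sizes from a running mean size per row."""
    pf = ParquetFile(path)
    dfs = pf.iter_row_groups(**kwargs)
    batch, batch_bytes, batch_rows = [], 0, 0
    for rg in pf.row_groups:
        if batch:
            # predict the next row group's footprint from the mean size per
            # row of what has been loaded so far
            predicted = rg.num_rows * batch_bytes / batch_rows
            if batch_bytes + predicted > target_size:
                yield pd.concat(batch)
                batch, batch_bytes, batch_rows = [], 0, 0
        df = next(dfs)  # the row group is only loaded at this point
        batch.append(df)
        batch_bytes += df.memory_usage(deep=True).sum()
        batch_rows += len(df)
    if batch:
        yield pd.concat(batch)
```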
PS: I will not work on this feature right away, but I am enquiring.
> ColumnMetaData.total_uncompressed_size is already what we should use
Not quite. That value includes the size of headers (if not V2 pages) and might still be encoded with RLE, DELTA, etc. For strings we need it, but for everything else, we should know the size from the dtype and nrows alone.
> deriving a mean size per row
This is a bad idea - some dtypes have an exact, known size, so no need to measure. Others (string) have variable size, and an average might not be a good indicator. For categories, the total size of the codes might be smaller than the categories list itself.
So to summarise:
- having a function to return the assumed size per row group would be useful and could be done right away. The function would need to take the same arguments as the read functions (column selection, nullables, categories); see the sketch after this list. The choice of how to load row groups based on this information could be left to the external application.
- The in-memory size of each column chunk could be stored at write time. Since it's not immediately obvious that this value is useful, given the bullet above, I'm not sure I would recommend working on it.
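For the first bullet, a hypothetical sketch of such a function, built only from metadata fastparquet already exposes (pf.dtypes, rg.num_rows, total_uncompressed_size); the signature and the string fallback are assumptions, not a proposed API:

```python
def estimated_row_group_memory(pf, rg, columns=None):
    """Estimate the in-memory size of one row group for a column selection.

    pf is a fastparquet ParquetFile, rg one entry of pf.row_groups.
    """
    dtypes = pf.dtypes  # column name -> dtype, as fastparquet would load it
    total = 0
    for chunk in rg.columns:
        name = ".".join(chunk.meta_data.path_in_schema)
        if columns is not None and name not in columns:
            continue
        dtype = dtypes.get(name)
        if dtype is None or getattr(dtype, "kind", "O") == "O":
            # strings/objects: use the decompressed buffer size as a rough proxy
            total += chunk.meta_data.total_uncompressed_size
        else:
            # fixed-width dtypes: rows * bytes per value
            total += rg.num_rows * dtype.itemsize
    return total

# usage (assuming pf = fastparquet.ParquetFile("data.parquet")):
# sizes = [estimated_row_group_memory(pf, rg) for rg in pf.row_groups]
```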