cyavro
cyavro copied to clipboard
Assign into preallocated array
The current method is to create numpy arrays (or lists) for a given chunk of a given block.
The creation of a pandas dataframe from numpy arrays, and the concatenation of dataframes are memory and time inefficient. Would be much better to allocate a dataframe up front as is done in fastparquet and assign into it. The dtypes come from the parsed global file header. Any nested fields would be Object type (although non-repeated structures could be flattened, also implemented in fastparquet).
An avro block states how many records and bytes it has at the head.