cyavro

Assign into preallocated array

Open martindurant opened this issue 8 years ago • 0 comments

The current method is to create numpy arrays (or lists) for a given chunk of a given block.

Creating a pandas dataframe from numpy arrays and then concatenating dataframes is inefficient in both memory and time. It would be much better to allocate a dataframe up front, as is done in fastparquet, and assign into it. The dtypes come from the parsed global file header. Any nested fields would be object dtype (although non-repeated structures could be flattened, which is also implemented in fastparquet).
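A minimal sketch of the idea, using plain numpy columns rather than cyavro's or fastparquet's actual code. The type mapping `AVRO_TO_NUMPY` and the helpers `preallocate` and `assign_chunk` are hypothetical names for illustration, and the schema and chunk data are made up:

```python
import numpy as np

# Hypothetical mapping from Avro primitive type names to numpy dtypes.
# Anything not listed (strings, records, arrays, maps, unions) becomes object.
AVRO_TO_NUMPY = {
    "int": "int32",
    "long": "int64",
    "float": "float32",
    "double": "float64",
    "boolean": "bool",
}


def preallocate(fields, n_rows):
    """Allocate one uninitialised column per schema field, sized for all rows."""
    columns = {}
    for field in fields:
        avro_type = field["type"]
        # Nested types are dicts or lists in the schema, so they are not str.
        is_primitive = isinstance(avro_type, str)
        dtype = AVRO_TO_NUMPY.get(avro_type, "object") if is_primitive else "object"
        columns[field["name"]] = np.empty(n_rows, dtype=dtype)
    return columns


def assign_chunk(columns, start, chunk):
    """Write one decoded chunk into rows [start, start + n) and return the new offset."""
    n = len(next(iter(chunk.values())))
    for name, values in chunk.items():
        col = columns[name]
        if col.dtype == object:
            # Element-wise assignment keeps nested values intact.
            for i, value in enumerate(values):
                col[start + i] = value
        else:
            col[start:start + n] = values
    return start + n


# Made-up schema fields, as they might appear in a parsed file header.
fields = [
    {"name": "id", "type": "long"},
    {"name": "score", "type": "double"},
    {"name": "meta", "type": {"type": "record", "name": "Meta", "fields": []}},
]

columns = preallocate(fields, 5)
offset = 0
offset = assign_chunk(columns, offset, {
    "id": [1, 2, 3], "score": [0.5, 1.5, 2.5], "meta": [{"a": 1}, {"a": 2}, {"a": 3}],
})
offset = assign_chunk(columns, offset, {
    "id": [4, 5], "score": [3.5, 4.5], "meta": [{"a": 4}, {"a": 5}],
})
```

Each chunk is written once into its final position, so no intermediate dataframes are built or concatenated. The resulting dict of columns could then be wrapped in a dataframe at the end.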

Each avro block states at its head how many records and bytes it contains, so the number of rows to assign is known before the block is decoded.
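As a rough illustration of how those counts could size the allocation: in the Avro container format each block starts with two zigzag varint longs (record count, then byte size), followed by the payload and a 16-byte sync marker. The sketch below scans only the block heads and is not cyavro's implementation. `read_long`, `write_long`, and `block_row_counts` are hypothetical helpers, and the stream is assumed to be seekable and already positioned just past the file header:

```python
import io

SYNC_SIZE = 16  # length of the sync marker that ends every block


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag encoded)."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a varint")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def write_long(n):
    """Encode one Avro long; only used to build the demo data below."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def block_row_counts(stream):
    """Return the record count of every block without decoding any payload."""
    counts = []
    while True:
        if not stream.read(1):
            break  # clean end of file
        stream.seek(-1, io.SEEK_CUR)
        n_records = read_long(stream)
        n_bytes = read_long(stream)
        # Skip the serialized records and the trailing sync marker.
        stream.seek(n_bytes + SYNC_SIZE, io.SEEK_CUR)
        counts.append(n_records)
    return counts


# Two fake blocks with dummy payloads: 3 records in 10 bytes, 2 records in 4 bytes.
sync = b"\x00" * SYNC_SIZE
data = (write_long(3) + write_long(10) + b"x" * 10 + sync
        + write_long(2) + write_long(4) + b"y" * 4 + sync)

counts = block_row_counts(io.BytesIO(data))
total_rows = sum(counts)
# Starting row of each block in the preallocated output.
starts = [sum(counts[:i]) for i in range(len(counts))]
```

With `total_rows` known, the output can be allocated once, and each block can be decoded directly into the rows beginning at its entry in `starts`.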

martindurant, Aug 23 '17 16:08