
[QST] Which file format should I use for efficient schema evolution, Parquet or Avro?

Open PeterLappo opened this issue 3 years ago • 2 comments

Which file format should I use for efficient schema evolution, Parquet or Avro?

PeterLappo, May 15 '22 07:05

@PeterLappo Do you have more details on what kind of schema evolution you're looking for? In general, dask (and by extension dask-sql) expects all partitions in a dataframe to have a similar schema, so a dataset whose files have slightly different schemas might not work today.

Also tagging @rjzamora for viz.

ayushdg, May 19 '22 18:05

Well, typically I would add columns rather than remove or change them. With Parquet, I'd need to rebuild all previous files to add the new columns with some default value. Avro can provide more seamless evolution: the schema specifies how the data evolves and is applied to the underlying data when it is read, so you don't need to rebuild old files. However, Avro can have nested structures, which can make it incompatible with tabular data. My question is really: which file format provides the most efficient way to evolve a schema?
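To make the "add a column with a default" case concrete, here is a minimal pure-Python sketch of the idea of applying a newer reader schema to older records at read time. It is only an illustration: `resolve`, `reader_schema_v2`, and the sample rows are hypothetical names, and this is not the Avro library's API or how dask reads files.

```python
# Illustrative sketch only: mimics reader-schema resolution with defaults.
# All names here are hypothetical and not part of Avro, Parquet, or dask.

def resolve(record, reader_schema):
    """Project a stored record onto the reader schema, filling in defaults."""
    out = {}
    for field in reader_schema:
        name = field["name"]
        if name in record:
            # The field was present when the record was written.
            out[name] = record[name]
        elif "default" in field:
            # The field was added later: use the default from the reader schema.
            out[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out


# Records written before the "country" column existed.
old_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
# Records written after the column was added.
new_rows = [{"id": 3, "name": "c", "country": "NZ"}]

# Newer schema: one added column that carries a default value.
reader_schema_v2 = [
    {"name": "id"},
    {"name": "name"},
    {"name": "country", "default": "unknown"},
]

# Old and new records come out with the same columns, without rewriting old data.
rows = [resolve(r, reader_schema_v2) for r in old_rows + new_rows]
```

With this approach the old records are left as they were stored and the default is filled in on every read, whereas rebuilding the files pays that cost once at write time.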

PeterLappo, May 19 '22 21:05