
Improvement: abandon internal patches of parquet2

Open dantengsky opened this issue 2 years ago • 2 comments

Summary

We have two internal patches of parquet2, which mainly address the following requirement:

  • acquire the parquet file meta right after the parquet file has been written, without re-reading the file

It works, but awkwardly: each time we sync with upstream (the official parquet2), there is extra work to do (rebasing, resolving potential conflicts, ...).

Among the new features that parquet2 has introduced recently, the following two seem able to satisfy the above requirement:

  • https://github.com/jorgecarleitao/parquet2/pull/147
  • https://github.com/jorgecarleitao/parquet2/pull/148

Thus,

  • we should replace our own internal patches with the new APIs that parquet2 exposes,
  • and pin the parquet2 cargo dependency to a rev (commit) of the official parquet2 repository.
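
For the pinning step, a Cargo manifest entry might look roughly like the sketch below. The repository URL is the upstream one referenced by the PR links above; the `rev` value is a placeholder, not a real commit, and would need to be replaced with whichever upstream commit includes both PRs.

```toml
[dependencies]
# Assumption: <upstream-commit-sha> is a placeholder for the chosen upstream commit.
parquet2 = { git = "https://github.com/jorgecarleitao/parquet2", rev = "<upstream-commit-sha>" }
```

Pinning by `rev` rather than by branch keeps builds reproducible, since the dependency only changes when the manifest is edited.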

dantengsky, Jun 20 '22 05:06

Let's go upstream first!

Xuanwo, Jul 04 '22 14:07

The internal parquet2 patches are not totally abandoned yet (for data format backward compatibility). After all the old data has been migrated, we should switch to the upstream parquet2.

dantengsky, Aug 01 '22 03:08

Is it time to remove https://github.com/datafuselabs/parquet2 and https://github.com/datafuse-extras/parquet2?

BohuTANG, Sep 06 '22 00:09