
optimize(fuse): record scalar column in meta file (or parquet meta)?

youngsofun opened this issue 1 year ago • 3 comments

Summary

For a row in a large wide table, many (even most) columns may be null or set to their default values. For example, the table might be loaded with SQL commands like COPY INTO wide_table(c1, c100) FROM ..., populating only two columns while wide_table itself may contain 1000 columns.

In memory, these unused columns are represented as Value::Scalar in DataBlock, which speeds up computation significantly. However, when we translate a DataBlock into an Arrow RowBatch, each scalar gets flattened into a full column. This results in:

  1. Slower loading.
  2. When the data is read back, these columns are represented as Value::Column instead of Value::Scalar.
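To make the flattening cost concrete, here is a minimal Rust sketch. The Value enum and flatten function are simplified stand-ins for Databend's real Value/DataBlock types (the actual types live in the databend-common-expression crate), not the actual API:

```rust
// Hypothetical, simplified model of a scalar-vs-column value; names
// are illustrative only, not Databend's real types.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    // A single scalar standing in for every row of the block: O(1) storage.
    Scalar(i64),
    // A fully materialized column: O(n) storage.
    Column(Vec<i64>),
}

impl Value {
    // "Flattening": what happens today when a block is converted to an
    // Arrow batch -- the scalar is physically repeated once per row.
    fn flatten(&self, num_rows: usize) -> Vec<i64> {
        match self {
            Value::Scalar(v) => vec![*v; num_rows],
            Value::Column(c) => c.clone(),
        }
    }
}

fn main() {
    // A defaulted column costs one value in memory...
    let default_col = Value::Scalar(0);
    // ...but flattening a 1_000_000-row block materializes a million copies.
    let flattened = default_col.flatten(1_000_000);
    assert_eq!(flattened.len(), 1_000_000);
    assert!(flattened.iter().all(|&v| v == 0));

    // A column that was actually loaded is passed through unchanged.
    let real_col = Value::Column(vec![1, 2, 3]);
    assert_eq!(real_col.flatten(3), vec![1, 2, 3]);
    println!("flattened {} rows", flattened.len());
}
```

The asymmetry is the point of the issue: the in-memory form is O(1) per unused column, but both the write path (flatten to Arrow) and the read path (Value::Column) pay O(rows).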

Impact

  • The flattening process during the conversion to Arrow RowBatch introduces performance overhead, causing slower load times.
  • The conversion of unused columns from Value::Scalar to Value::Column during read-back operations can negatively impact performance and resource usage.
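One possible shape for what "record scalar column in meta file" could mean, sketched in Rust. The ColumnMeta struct and its constant_value field are hypothetical illustrations, not Databend's actual segment/block meta format:

```rust
// Hypothetical per-column block metadata; field names are assumptions.
#[derive(Debug)]
struct ColumnMeta {
    column_id: u32,
    // If Some, every row of this column in the block equals this value,
    // so no column data needs to be written to or read from storage.
    constant_value: Option<String>, // serialized scalar, for illustration
}

fn read_column(meta: &ColumnMeta, num_rows: usize) -> String {
    match &meta.constant_value {
        // Reconstruct a scalar value directly from the meta: O(1),
        // no storage I/O and no materialized column.
        Some(v) => format!("Scalar({v}) x {num_rows} rows"),
        // Otherwise fall back to reading the real column data.
        None => format!("read column {} from storage", meta.column_id),
    }
}

fn main() {
    let constant = ColumnMeta { column_id: 1, constant_value: Some("0".into()) };
    let loaded = ColumnMeta { column_id: 5, constant_value: None };
    println!("{}", read_column(&constant, 1_000_000));
    println!("{}", read_column(&loaded, 1_000_000));
}
```

Under this sketch, unused columns stay O(1) end to end: they are neither flattened at write time nor rebuilt as Value::Column at read time.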

youngsofun avatar Jun 27 '24 07:06 youngsofun

cc @dantengsky @zhyass

youngsofun avatar Jun 28 '24 03:06 youngsofun

> For a row in a large wide table, many (even most) columns may be null or set to their default values. This table might be loaded using SQL commands like COPY INTO wide_table(c1, c100) FROM ...,

This looks like the 'alter table t add column c int' or 'alter table t add column c int default 1' case.

Maybe we don't need to "materialize" those columns at all?

dantengsky avatar Jun 28 '24 04:06 dantengsky

yes

youngsofun avatar Jun 28 '24 06:06 youngsofun