When pyiceberg loads Iceberg table data into memory, dictionary encoding is not applied to columns

Open learningkeeda opened this issue 1 year ago • 0 comments

Apache Iceberg version

0.7.0

Please describe the bug 🐞

When I load a large dataset as shown below, string column data is held in memory in its original, uncompressed form. This increases memory usage and leads to out-of-memory errors.

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog("default")
table = catalog.load_table("test.table1")
table.scan(
    selected_fields=("id", "string_cols1", "string_cols2"),
).to_arrow()

To avoid these issues, it's best to use dictionary encoding, especially for columns with low cardinality, where values are repeated frequently.

Sep 25 '24 03:09 learningkeeda