iceberg-python
When pyiceberg loads Iceberg table data into memory, dictionary encoding is not applied to string columns
Apache Iceberg version
0.7.0
Please describe the bug 🐞
When I load a large dataset as shown below, string column data is held in memory in its original, uncompressed form. This increases memory usage and leads to out-of-memory errors.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog("default")
table = catalog.load_table("test.table1")
table.scan(
    selected_fields=("id", "string_cols1", "string_cols2"),
).to_arrow()
To avoid these issues, it would be best to apply dictionary encoding when loading, especially for low-cardinality columns where values repeat frequently.