Polars Taking too long to write parquet (using gzip compression)
What language are you using?
Python
Have you tried latest version of polars?
Yes
What version of polars are you using?
polars 0.13.59
What operating system are you using polars on?
Ubuntu on an EC2 instance Linux - 5.15.0-1015-aws
What language version are you using
Python 3.9
Describe your bug.
I've been seeing much longer times than usual when writing DataFrames to parquet. I was using gzip compression; switching to lz4 was a quick fix, but gzip compression now takes a lot longer than it used to.
What are the steps to reproduce the behavior?
import polars as pl
my_dataframe = pl.DataFrame(...)
my_dataframe.write_parquet(nome_extensao, compression="gzip")
What is the actual behavior?
It took 394 seconds (over six minutes) to write a 14 GB dataset to parquet.
What is the expected behavior?
On earlier versions of polars this operation used to take only about a tenth of the time above.
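For context, here is a self-contained timing sketch along the lines of the repro steps above. The real 14 GB frame is not shareable, so the frame below is a synthetic stand-in (column names and sizes are made up); only the write_parquet(..., compression=...) calls mirror the original snippet.
import time

import polars as pl

# synthetic stand-in: 10 Float64 columns, 1 million rows
df = pl.DataFrame({f"col_{i}": [float(j) for j in range(1_000_000)] for i in range(10)})

for codec in ("gzip", "lz4"):
    start = time.perf_counter()
    df.write_parquet(f"/tmp/bench_{codec}.parquet", compression=codec)
    print(codec, round(time.perf_counter() - start, 2), "s")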
And currently with lz4?
@igmriegel could you create a function that creates the same schema as your data? I could do some benchmarking on it.
It is a really large dataset but I'll try to provide at least the schema soon.
Yeah, I was thinking of a function that takes n_rows as argument.
And currently with lz4?
With lz4 it performs the operation in 30 seconds or less.
I see a similar issue with gzip compression. The same dataframe (33 million rows, 13 columns) takes 56.9 seconds to write to parquet using gzip compression but 6.57s using lz4. Of the remaining compression options, none of those that worked took more than 12.1 seconds. Note that 'lzo' is suggested as a valid option but reports that it is not supported when used.
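A hedged sketch of the codec sweep described above: time each compression option on the same synthetic frame and report which ones fail. The codec list here is an assumption based on what the write_parquet docstring advertised at the time and may differ between versions; 'lzo' is included deliberately to surface the "not supported" error.
import time

import polars as pl

df = pl.DataFrame({f"col_{i}": [float(j) for j in range(1_000_000)] for i in range(13)})

for codec in ("uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "zstd"):
    try:
        start = time.perf_counter()
        df.write_parquet(f"/tmp/bench_{codec}.parquet", compression=codec)
        print(f"{codec}: {time.perf_counter() - start:.2f} s")
    except Exception as exc:  # e.g. lzo reporting that it is not supported
        print(f"{codec}: failed ({exc})")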
@igmriegel could you create a function that creates the same schema as your data? I could do some benchmarking on it.
Sorry for taking so long to reply to this.
Below is the schema of the data I mentioned. Something I didn't notice before is that my dataset has a lot of NaNs; I don't know whether that is relevant to this issue.
data_schema = {
'column_1': polars.datatypes.Float64,
'column_2': polars.datatypes.Float64,
'column_3': polars.datatypes.Float64,
'column_4': polars.datatypes.Float64,
'column_5': polars.datatypes.Float64,
'column_6': polars.datatypes.Float64,
'column_7': polars.datatypes.Float64,
'column_8': polars.datatypes.Float64,
'column_9': polars.datatypes.Float64,
'column_10': polars.datatypes.Float64,
'column_11': polars.datatypes.Categorical,
'column_12': polars.datatypes.Categorical,
'column_13': polars.datatypes.Categorical,
'column_14': polars.datatypes.Categorical,
'column_15': polars.datatypes.Utf8,
'column_16': polars.datatypes.Utf8,
'column_17': polars.datatypes.Categorical,
'column_18': polars.datatypes.Datetime,
'column_19': polars.datatypes.Float64,
'column_20': polars.datatypes.Float64,
'column_21': polars.datatypes.Float64,
'column_22': polars.datatypes.Float64,
'column_23': polars.datatypes.Float64,
'column_24': polars.datatypes.Float64,
'column_25': polars.datatypes.Float64,
'column_26': polars.datatypes.Float64,
'column_27': polars.datatypes.Float64,
'column_28': polars.datatypes.Float64,
'column_29': polars.datatypes.Float64,
'column_30': polars.datatypes.Float64,
'column_31': polars.datatypes.Float64,
'column_32': polars.datatypes.Float64,
'column_33': polars.datatypes.Float64,
'column_34': polars.datatypes.Float64,
'column_35': polars.datatypes.Int64,
'column_36': polars.datatypes.Float64,
'column_37': polars.datatypes.Float64,
'column_38': polars.datatypes.Float64,
'column_39': polars.datatypes.Float64,
'column_40': polars.datatypes.Float64,
'column_41': polars.datatypes.Float64,
'column_42': polars.datatypes.Float64,
'column_43': polars.datatypes.Float64,
'column_44': polars.datatypes.Utf8,
'column_45': polars.datatypes.Utf8,
'column_46': polars.datatypes.Utf8,
'column_47': polars.datatypes.Utf8,
'column_48': polars.datatypes.Utf8,
'column_49': polars.datatypes.Utf8,
'column_50': polars.datatypes.Utf8,
'column_51': polars.datatypes.Utf8
}
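Since the request above was for a generator taking n_rows, here is a minimal sketch that builds a frame with the same column types as data_schema (a representative subset of the 51 columns, not all of them). make_synthetic_frame, its made-up values, and the roughly one-third null ratio are assumptions for benchmarking, not the real data.
import random
from datetime import datetime, timedelta

import polars as pl


def make_synthetic_frame(n_rows: int, seed: int = 0) -> pl.DataFrame:
    rng = random.Random(seed)

    def float_col():
        # roughly a third of the values are null, mimicking the sparse real data
        return [rng.random() if rng.random() > 0.33 else None for _ in range(n_rows)]

    base = datetime(2022, 1, 1)
    data = {f"column_{i}": float_col() for i in range(1, 11)}                    # Float64 with nulls
    data["column_11"] = pl.Series(
        [rng.choice(["a", "b", "c"]) for _ in range(n_rows)]
    ).cast(pl.Categorical)                                                       # Categorical
    data["column_15"] = [f"text_{rng.randint(0, 9999)}" for _ in range(n_rows)]  # Utf8
    data["column_18"] = [base + timedelta(seconds=i) for i in range(n_rows)]     # Datetime
    data["column_35"] = [rng.randint(0, 10_000) for _ in range(n_rows)]          # Int64
    return pl.DataFrame(data)

Calling make_synthetic_frame(1_000_000).write_parquet("/tmp/synthetic.parquet", compression="gzip") then gives something concrete to time; extending the dict with the remaining Float64/Utf8/Categorical columns from the schema is mechanical.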
My data also has significant null content. Testing showed that, if anything, it took longer to write to parquet with compression after filling nulls with 0.
lz4: 6 -> 9.5 secs
gzip: 67 -> 91 secs
The schema of my DataFrame is:
{'column_1': polars.datatypes.Int32,
'column_2': polars.datatypes.Int32,
'column_3': polars.datatypes.Int32,
'column_4': polars.datatypes.Float64,
'column_5': polars.datatypes.Date,
'column_6': polars.datatypes.Int32,
'column_7': polars.datatypes.Int32,
'column_8': polars.datatypes.Utf8,
'column_9': polars.datatypes.Float32,
'column_10': polars.datatypes.Float32,
'column_11': polars.datatypes.Float32,
'column_12': polars.datatypes.Float32,
'column_13': polars.datatypes.Float32,
'column_14': polars.datatypes.Float32}
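To check the null-fill observation above (the lz4/gzip timings a few lines up) in isolation, a minimal sketch: write the same frame once with nulls and once with nulls filled with a literal 0, and compare gzip timings. The stand-in frame and the /tmp path are placeholders, and fill_null(0) is the value-filling form of the API (older releases may need an expression form such as df.select(pl.all().fill_null(0)) instead).
import time

import polars as pl

# stand-in frame: one sparse Float64 column; the real frames have many more columns
df = pl.DataFrame({"a": [1.0, None, 3.0, None] * 250_000})

for label, frame in [("with nulls", df), ("nulls filled", df.fill_null(0))]:
    start = time.perf_counter()
    frame.write_parquet("/tmp/nullfill_bench.parquet", compression="gzip")
    print(f"{label}: {time.perf_counter() - start:.2f} s")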
Could it be the performance of gzip on large blocks? Does changing the block size help?
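One way to probe the block-size question without relying on a polars-side knob (it is unclear whether write_parquet exposed one in 0.13.x) is to round-trip through pyarrow, whose write_table does take row_group_size. This is only a sketch under that assumption; the sizes tried and the /tmp path are arbitrary.
import time

import polars as pl
import pyarrow.parquet as pq


def time_gzip_write(df: pl.DataFrame, path: str, row_group_size: int) -> float:
    table = df.to_arrow()  # hand the data to pyarrow, which exposes row_group_size directly
    start = time.perf_counter()
    pq.write_table(table, path, compression="gzip", row_group_size=row_group_size)
    return time.perf_counter() - start


df = pl.DataFrame({f"col_{i}": [float(j) for j in range(1_000_000)] for i in range(10)})
for size in (64_000, 256_000, 1_000_000):
    print(size, round(time_gzip_write(df, "/tmp/rowgroup_bench.parquet", size), 2), "s")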
Closing due to the lack of reproducible example. Please open a new bug report if this is still an issue.