Polars Taking too long to write parquet (using gzip compression)
What language are you using?
Python
Have you tried latest version of polars?
Yes
What version of polars are you using?
polars 0.13.59
What operating system are you using polars on?
Ubuntu on an EC2 instance Linux - 5.15.0-1015-aws
What language version are you using
Python 3.9
Describe your bug.
I've been seeing much longer times than usual when writing DataFrames to parquet. I was using gzip compression; switching to lz4 was a quick fix, but gzip compression now takes a lot longer than it used to.
What are the steps to reproduce the behavior?
import polars as pl
my_dataframe = pl.DataFrame(...)
my_dataframe.write_parquet(nome_extensao, compression="gzip")
What is the actual behavior?
It took 394 seconds (over six minutes) to write a 14 GB dataset to parquet.
What is the expected behavior?
On earlier versions of polars this operation used to take only about a tenth of the time above.
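For context, here is a self-contained timing sketch along the lines of the repro steps above. The real 14 GB frame is not shareable, so the frame below is a synthetic stand-in (column names and sizes are made up); only the write_parquet(..., compression=...) calls mirror the original snippet.
import time

import polars as pl

# synthetic stand-in: 10 Float64 columns, 1 million rows
df = pl.DataFrame({f"col_{i}": [float(j) for j in range(1_000_000)] for i in range(10)})

for codec in ("gzip", "lz4"):
    start = time.perf_counter()
    df.write_parquet(f"/tmp/bench_{codec}.parquet", compression=codec)
    print(codec, round(time.perf_counter() - start, 2), "s")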
And currently with lz4?
@igmriegel could you create a function that creates the same schema as your data? I could do some benchmarking on it.
It is a really large dataset but I'll try to provide at least the schema soon.
Yeah, I was thinking of a function that takes n_rows as argument.
And currently with lz4?
With lz4 it performs the operation in 30 seconds or less.
I see a similar issue with gzip compression. The same dataframe (33 million rows, 13 columns) takes 56.9 seconds to write to parquet using gzip compression but 6.57s using lz4. Of the remaining compression options, none of those that worked took more than 12.1 seconds. Note that 'lzo' is suggested as a valid option but reports that it is not supported when used.
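A hedged sketch of the codec sweep described above: time each compression option on the same synthetic frame and report which ones fail. The codec list here is an assumption based on what the write_parquet docstring advertised at the time and may differ between versions; 'lzo' is included deliberately to surface the "not supported" error.
import time

import polars as pl

df = pl.DataFrame({f"col_{i}": [float(j) for j in range(1_000_000)] for i in range(13)})

for codec in ("uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "zstd"):
    try:
        start = time.perf_counter()
        df.write_parquet(f"/tmp/bench_{codec}.parquet", compression=codec)
        print(f"{codec}: {time.perf_counter() - start:.2f} s")
    except Exception as exc:  # e.g. lzo reporting that it is not supported
        print(f"{codec}: failed ({exc})")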
@igmriegel could you create a function that creates the same schema as your data? I could do some benchmarking on it.
Sorry for taking so long to reply to this.
Below is the schema of the data I mentioned. Something I didn't notice before is that my dataset has a lot of NaNs; I don't know whether that is relevant to this issue.
data_schema = {
'column_1': polars.datatypes.Float64,
'column_2': polars.datatypes.Float64,
'column_3': polars.datatypes.Float64,
'column_4': polars.datatypes.Float64,
'column_5': polars.datatypes.Float64,
'column_6': polars.datatypes.Float64,
'column_7': polars.datatypes.Float64,
'column_8': polars.datatypes.Float64,
'column_9': polars.datatypes.Float64,
'column_10': polars.datatypes.Float64,
'column_11': polars.datatypes.Categorical,
'column_12': polars.datatypes.Categorical,
'column_13': polars.datatypes.Categorical,
'column_14': polars.datatypes.Categorical,
'column_15': polars.datatypes.Utf8,
'column_16': polars.datatypes.Utf8,
'column_17': polars.datatypes.Categorical,
'column_18': polars.datatypes.Datetime,
'column_19': polars.datatypes.Float64,
'column_20': polars.datatypes.Float64,
'column_21': polars.datatypes.Float64,
'column_22': polars.datatypes.Float64,
'column_23': polars.datatypes.Float64,
'column_24': polars.datatypes.Float64,
'column_25': polars.datatypes.Float64,
'column_26': polars.datatypes.Float64,
'column_27': polars.datatypes.Float64,
'column_28': polars.datatypes.Float64,
'column_29': polars.datatypes.Float64,
'column_30': polars.datatypes.Float64,
'column_31': polars.datatypes.Float64,
'column_32': polars.datatypes.Float64,
'column_33': polars.datatypes.Float64,
'column_34': polars.datatypes.Float64,
'column_35': polars.datatypes.Int64,
'column_36': polars.datatypes.Float64,
'column_37': polars.datatypes.Float64,
'column_38': polars.datatypes.Float64,
'column_39': polars.datatypes.Float64,
'column_40': polars.datatypes.Float64,
'column_41': polars.datatypes.Float64,
'column_42': polars.datatypes.Float64,
'column_43': polars.datatypes.Float64,
'column_44': polars.datatypes.Utf8,
'column_45': polars.datatypes.Utf8,
'column_46': polars.datatypes.Utf8,
'column_47': polars.datatypes.Utf8,
'column_48': polars.datatypes.Utf8,
'column_49': polars.datatypes.Utf8,
'column_50': polars.datatypes.Utf8,
'column_51': polars.datatypes.Utf8
}
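Since the request above was for a generator taking n_rows, here is a minimal sketch that builds a frame with the same column types as data_schema (a representative subset of the 51 columns, not all of them). make_synthetic_frame, its made-up values, and the roughly one-third null ratio are assumptions for benchmarking, not the real data.
import random
from datetime import datetime, timedelta

import polars as pl


def make_synthetic_frame(n_rows: int, seed: int = 0) -> pl.DataFrame:
    rng = random.Random(seed)

    def float_col():
        # roughly a third of the values are null, mimicking the sparse real data
        return [rng.random() if rng.random() > 0.33 else None for _ in range(n_rows)]

    base = datetime(2022, 1, 1)
    data = {f"column_{i}": float_col() for i in range(1, 11)}                    # Float64 with nulls
    data["column_11"] = pl.Series(
        [rng.choice(["a", "b", "c"]) for _ in range(n_rows)]
    ).cast(pl.Categorical)                                                       # Categorical
    data["column_15"] = [f"text_{rng.randint(0, 9999)}" for _ in range(n_rows)]  # Utf8
    data["column_18"] = [base + timedelta(seconds=i) for i in range(n_rows)]     # Datetime
    data["column_35"] = [rng.randint(0, 10_000) for _ in range(n_rows)]          # Int64
    return pl.DataFrame(data)

Calling make_synthetic_frame(1_000_000).write_parquet("/tmp/synthetic.parquet", compression="gzip") then gives something concrete to time; extending the dict with the remaining Float64/Utf8/Categorical columns from the schema is mechanical.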
My data also has significant null content. Testing showed that, if anything, it took longer to write to parquet with compression after filling nulls with 0.
lz4: 6 -> 9.5 secs
gzip: 67 -> 91 secs
The schema of my DataFrame is:
{'column_1': polars.datatypes.Int32,
'column_2': polars.datatypes.Int32,
'column_3': polars.datatypes.Int32,
'column_4': polars.datatypes.Float64,
'column_5': polars.datatypes.Date,
'column_6': polars.datatypes.Int32,
'column_7': polars.datatypes.Int32,
'column_8': polars.datatypes.Utf8,
'column_9': polars.datatypes.Float32,
'column_10': polars.datatypes.Float32,
'column_11': polars.datatypes.Float32,
'column_12': polars.datatypes.Float32,
'column_13': polars.datatypes.Float32,
'column_14': polars.datatypes.Float32}
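To check the null-fill observation above (the lz4/gzip timings a few lines up) in isolation, a minimal sketch: write the same frame once with nulls and once with nulls filled with a literal 0, and compare gzip timings. The stand-in frame and the /tmp path are placeholders, and fill_null(0) is the value-filling form of the API (older releases may need an expression form such as df.select(pl.all().fill_null(0)) instead).
import time

import polars as pl

# stand-in frame: one sparse Float64 column; the real frames have many more columns
df = pl.DataFrame({"a": [1.0, None, 3.0, None] * 250_000})

for label, frame in [("with nulls", df), ("nulls filled", df.fill_null(0))]:
    start = time.perf_counter()
    frame.write_parquet("/tmp/nullfill_bench.parquet", compression="gzip")
    print(f"{label}: {time.perf_counter() - start:.2f} s")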
Could it be the performance of gzip on large blocks? Does changing the block size help?
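One way to probe the block-size question without relying on a polars-side knob (it is unclear whether write_parquet exposed one in 0.13.x) is to round-trip through pyarrow, whose write_table does take row_group_size. This is only a sketch under that assumption; the sizes tried and the /tmp path are arbitrary.
import time

import polars as pl
import pyarrow.parquet as pq


def time_gzip_write(df: pl.DataFrame, path: str, row_group_size: int) -> float:
    table = df.to_arrow()  # hand the data to pyarrow, which exposes row_group_size directly
    start = time.perf_counter()
    pq.write_table(table, path, compression="gzip", row_group_size=row_group_size)
    return time.perf_counter() - start


df = pl.DataFrame({f"col_{i}": [float(j) for j in range(1_000_000)] for i in range(10)})
for size in (64_000, 256_000, 1_000_000):
    print(size, round(time_gzip_write(df, "/tmp/rowgroup_bench.parquet", size), 2), "s")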
Closing due to the lack of reproducible example. Please open a new bug report if this is still an issue.