
Polars Taking too long to write parquet (using gzip compression)

Open igmriegel opened this issue 3 years ago • 9 comments

What language are you using?

Python

Have you tried the latest version of polars?

  • yes

What version of polars are you using?

polars 0.13.59

What operating system are you using polars on?

Ubuntu on an EC2 instance Linux - 5.15.0-1015-aws

What language version are you using?

Python 3.9

Describe your bug.

I've experienced longer write times than usual when writing DataFrames to parquet. I was using the gzip codec for compression; switching to lz4 was a quick fix, but gzip compression is taking a lot longer than it did before.

What are the steps to reproduce the behavior?

import polars as pl

my_dataframe = pl.DataFrame(...)  # DataFrame construction elided; schema posted in a later comment

my_dataframe.write_parquet(nome_extensao, compression="gzip")  # nome_extensao: output file path

What is the actual behavior?

It took 394 seconds (about six and a half minutes) to write a 14 GB dataset to parquet.

What is the expected behavior?

In earlier versions of polars I was used to seeing this operation take only about a tenth of the time above.
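
For reference, a minimal timing sketch of the comparison described in this report (the input file name is a placeholder, not from the original issue):

import time

import polars as pl

df = pl.read_parquet("input.parquet")  # placeholder for the ~14 GB dataset

for codec in ("gzip", "lz4"):
    start = time.perf_counter()
    df.write_parquet(f"output_{codec}.parquet", compression=codec)
    print(f"{codec}: {time.perf_counter() - start:.1f} s")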

igmriegel avatar Aug 04 '22 15:08 igmriegel

And currently with lz4?

ritchie46 avatar Aug 04 '22 15:08 ritchie46

@igmriegel could you create a function that creates the same schema as your data? I could do some benchmarking on it.

ritchie46 avatar Aug 04 '22 15:08 ritchie46

@igmriegel could you create a function that creates the same schema as your data? I could do some benchmarking on it.

It is a really large dataset but I'll try to provide at least the schema soon.

igmriegel avatar Aug 04 '22 15:08 igmriegel

@igmriegel could you create a function that creates the same schema as your data? I could do some benchmarking on it.

It is a really large dataset but I'll try to provide at least the schema soon.

Yeah, I was thinking of a function that takes n_rows as argument.

ritchie46 avatar Aug 04 '22 16:08 ritchie46

And currently with lz4?

With lz4 it performs the operation in 30 seconds or less.

igmriegel avatar Aug 04 '22 18:08 igmriegel

I see a similar issue with gzip compression. The same dataframe (33 million rows, 13 columns) takes 56.9 seconds to write to parquet using gzip compression but 6.57s using lz4. Of the remaining compression options, none of those that worked took more than 12.1 seconds. Note that 'lzo' is suggested as a valid option but reports that it is not supported when used.

tikkanz avatar Sep 09 '22 01:09 tikkanz

@igmriegel could you create a function that creates the same schema as your data? I could do some benchmarking on it.

Sorry for taking so long to reply to this.

Below you can see the schema of the data I mentioned. Something I didn't notice before is that my dataset had a lot of NaNs; I don't know if that is relevant to this issue.

data_schema = {
	'column_1': polars.datatypes.Float64,
	'column_2': polars.datatypes.Float64,
	'column_3': polars.datatypes.Float64,
	'column_4': polars.datatypes.Float64,
	'column_5': polars.datatypes.Float64,
	'column_6': polars.datatypes.Float64,
	'column_7': polars.datatypes.Float64,
	'column_8': polars.datatypes.Float64,
	'column_9': polars.datatypes.Float64,
	'column_10': polars.datatypes.Float64,
	'column_11': polars.datatypes.Categorical,
	'column_12': polars.datatypes.Categorical,
	'column_13': polars.datatypes.Categorical,
	'column_14': polars.datatypes.Categorical,
	'column_15': polars.datatypes.Utf8,
	'column_16': polars.datatypes.Utf8,
	'column_17': polars.datatypes.Categorical,
	'column_18': polars.datatypes.Datetime,
	'column_19': polars.datatypes.Float64,
	'column_20': polars.datatypes.Float64,
	'column_21': polars.datatypes.Float64,
	'column_22': polars.datatypes.Float64,
	'column_23': polars.datatypes.Float64,
	'column_24': polars.datatypes.Float64,
	'column_25': polars.datatypes.Float64,
	'column_26': polars.datatypes.Float64,
	'column_27': polars.datatypes.Float64,
	'column_28': polars.datatypes.Float64,
	'column_29': polars.datatypes.Float64,
	'column_30': polars.datatypes.Float64,
	'column_31': polars.datatypes.Float64,
	'column_32': polars.datatypes.Float64,
	'column_33': polars.datatypes.Float64,
	'column_34': polars.datatypes.Float64,
	'column_35': polars.datatypes.Int64,
	'column_36': polars.datatypes.Float64,
	'column_37': polars.datatypes.Float64,
	'column_38': polars.datatypes.Float64,
	'column_39': polars.datatypes.Float64,
	'column_40': polars.datatypes.Float64,
	'column_41': polars.datatypes.Float64,
	'column_42': polars.datatypes.Float64,
	'column_43': polars.datatypes.Float64,
	'column_44': polars.datatypes.Utf8,
	'column_45': polars.datatypes.Utf8,
	'column_46': polars.datatypes.Utf8,
	'column_47': polars.datatypes.Utf8,
	'column_48': polars.datatypes.Utf8,
	'column_49': polars.datatypes.Utf8,
	'column_50': polars.datatypes.Utf8,
	'column_51': polars.datatypes.Utf8
}
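
A rough sketch of the kind of generator function asked for above, built from this schema and taking n_rows as an argument (the value distributions, null fraction, and function name are assumptions, not properties of the original data):

import random
from datetime import datetime, timedelta

import polars as pl

def make_sample(n_rows: int) -> pl.DataFrame:
    # Floats with roughly 30% nulls, mirroring the mostly-Float64 columns above
    def floats():
        return [random.random() if random.random() > 0.3 else None for _ in range(n_rows)]

    def strings(n_unique):
        return [f"val_{random.randrange(n_unique)}" for _ in range(n_rows)]

    float_cols = list(range(1, 11)) + list(range(19, 35)) + list(range(36, 44))
    data = {f"column_{i}": floats() for i in float_cols}
    data.update({f"column_{i}": strings(20) for i in (11, 12, 13, 14, 17)})          # Categorical
    data.update({f"column_{i}": strings(10_000) for i in (15, 16, *range(44, 52))})  # Utf8
    data["column_18"] = [datetime(2022, 1, 1) + timedelta(seconds=i) for i in range(n_rows)]
    data["column_35"] = [random.randrange(1_000_000) for _ in range(n_rows)]

    # Column order differs from the original schema, which should not matter for benchmarking
    df = pl.DataFrame(data)
    return df.with_columns([pl.col(f"column_{i}").cast(pl.Categorical) for i in (11, 12, 13, 14, 17)])

Calling make_sample(1_000_000) and writing the result with compression="gzip" versus "lz4" should then show whether the slowdown reproduces on synthetic data.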

igmriegel avatar Sep 09 '22 14:09 igmriegel

My data also has significant null content. Testing showed that, if anything, it took longer to write to parquet with compression after filling nulls with 0.

  • lz4 6 -> 9.5 secs
  • gzip 67 -> 91 secs

The schema of my DataFrame is:

{'column_1': polars.datatypes.Int32,
 'column_2': polars.datatypes.Int32,
 'column_3': polars.datatypes.Int32,
 'column_4': polars.datatypes.Float64,
 'column_5': polars.datatypes.Date,
 'column_6': polars.datatypes.Int32,
 'column_7': polars.datatypes.Int32,
 'column_8': polars.datatypes.Utf8,
 'column_9': polars.datatypes.Float32,
 'column_10': polars.datatypes.Float32,
 'column_11': polars.datatypes.Float32,
 'column_12': polars.datatypes.Float32,
 'column_13': polars.datatypes.Float32,
 'column_14': polars.datatypes.Float32}
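
For context, the null-fill comparison described above could be reproduced with something like the following sketch (file names are placeholders; only numeric columns are filled to keep the example valid):

import time

import polars as pl

df = pl.read_parquet("data.parquet")  # placeholder input

# Fill nulls with 0 in the numeric columns only
numeric = [c for c, dtype in df.schema.items() if dtype in (pl.Int32, pl.Float32, pl.Float64)]
filled = df.with_columns([pl.col(c).fill_null(0) for c in numeric])

for label, frame in (("with nulls", df), ("nulls filled", filled)):
    start = time.perf_counter()
    frame.write_parquet("out.parquet", compression="gzip")
    print(f"gzip, {label}: {time.perf_counter() - start:.1f} s")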

tikkanz avatar Sep 10 '22 02:09 tikkanz

Could it be that this is just the performance of gzip on large blocks? Does changing the block size help?
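
If block size turns out to be the factor, one way to test it (assuming a polars version whose write_parquet exposes a row_group_size parameter) would be:

import polars as pl

df = pl.read_parquet("data.parquet")  # placeholder input

# Smaller row groups give gzip smaller blocks to compress at a time
for rg_size in (64_000, 256_000, 1_024_000):
    df.write_parquet(f"out_rg{rg_size}.parquet", compression="gzip", row_group_size=rg_size)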

ritchie46 avatar Oct 21 '22 18:10 ritchie46

Closing due to the lack of a reproducible example. Please open a new bug report if this is still an issue.

stinodego avatar Jan 16 '24 13:01 stinodego