CSV: build categoricals directly
Describe your feature request
Try out different ways to create Categoricals. Currently, creating categoricals can be very slow.
import gzip
from typing import Literal, Union

import pyarrow as pa
import pyarrow.csv
import polars as pl


def read_fragments_from_file(
    fragments_bed_filename: str,
    engine: Union[str, Literal["polars"], Literal["pyarrow"]] = "polars",
) -> pl.DataFrame:
    bed_column_names = (
        "Chromosome",
        "Start",
        "End",
        "Name",
        "Score",
    )

    # Set the correct open function, depending on whether the fragments BED
    # file is gzip compressed or not.
    open_fn = gzip.open if fragments_bed_filename.endswith(".gz") else open

    skip_rows = 0
    nbr_columns = 5

    if engine == "polars":
        # Read fragments BED file with polars and build categoricals in polars.
        df = pl.read_csv(
            fragments_bed_filename,
            has_headers=False,
            skip_rows=skip_rows,
            sep="\t",
            use_pyarrow=False,
            new_columns=bed_column_names[:nbr_columns],
            dtypes={
                "Chromosome": pl.Categorical,
                "Start": pl.Int32,
                "End": pl.Int32,
                "Name": pl.Categorical,
                "Strand": pl.Categorical,
            },
        )
    elif engine == "pyarrow":
        # Read fragments BED file with pyarrow, create categoricals
        # (dictionaries) with pyarrow and convert to a polars DataFrame.
        df = pl.from_arrow(
            pa.csv.read_csv(
                fragments_bed_filename,
                read_options=pa.csv.ReadOptions(
                    use_threads=True,
                    skip_rows=skip_rows,
                    column_names=bed_column_names[:nbr_columns],
                ),
                parse_options=pa.csv.ParseOptions(
                    delimiter="\t",
                    quote_char=False,
                    escape_char=False,
                    newlines_in_values=False,
                ),
                convert_options=pa.csv.ConvertOptions(
                    column_types={
                        "Chromosome": pa.dictionary(pa.int32(), pa.large_string()),
                        "Start": pa.int32(),
                        "End": pa.int32(),
                        "Name": pa.dictionary(pa.int32(), pa.large_string()),
                        "Strand": pa.dictionary(pa.int32(), pa.large_string()),
                    },
                ),
            )
        )

    return df
Creating categoricals directly in polars takes 6-7 seconds, while creating them with pyarrow is almost instantaneous (and converting those dictionaries to polars categoricals is very fast). Both dataframes give the same results as far as I can tell (physical values).
# Read TSV file with polars and create categoricals with polars.
In [8]: %time df_pl = read_fragments_from_file('test.tsv', engine='polars')
CPU times: user 52.6 s, sys: 10.2 s, total: 1min 2s
Wall time: 17.5 s
# Read TSV file with pyarrow, and create categoricals with pyarrow and convert them to polars dataframe.
In [9]: %time df_pa_pl = read_fragments_from_file('test.tsv', engine='pyarrow')
CPU times: user 1min, sys: 9.78 s, total: 1min 10s
Wall time: 11.5 s
It would be worth trying to create categoricals in a similar way to pyarrow and to provide methods for creating a global string cache from local string caches.
categorical_one = pl.Series("categorical_one", ["c", "a", "b", "c", "b"], dtype=pl.Categorical)
categorical_two = pl.Series("categorical_two", ["b", "e", "a", "b", "c"], dtype=pl.Categorical)
categorical_three = pl.Series("categorical_three", ["e", "c", "b", "e", "a"], dtype=pl.Categorical)
Create all local string caches in parallel:
- mapping from int (categorical index) to string
local string cache "categorical_one"
[0]: c: 0
[1]: a: 1
[2]: b: 2
local string cache "categorical_two"
[0]: b: 0
[1]: e: 1
[2]: a: 2
[3]: c: 3
local string cache "categorical_three"
[0]: e: 0
[1]: c: 1
[2]: b: 2
[3]: a: 3
# Create global string cache from local ones:
# - append all local string caches and add offset for each local string cache
global string cache:
c: 0
a: 1
b: 2
# Add size of categorical_one to array offsets of categorical_two.
b: 0 + 3 = 3
e: 1 + 3 = 4
a: 2 + 3 = 5
c: 3 + 3 = 6
# Add size of categorical_one and categorical_two to array offsets of categorical_three.
e: 0 + 3 + 4 = 7
c: 1 + 3 + 4 = 8
b: 2 + 3 + 4 = 9
a: 3 + 3 + 4 = 10
# Store in an array:
[0]: c: 0
[1]: a: 1
[2]: b: 2
[3]: b: 3
[4]: e: 4
[5]: a: 5
[6]: c: 6
[7]: e: 7
[8]: c: 8
[9]: b: 9
[10]: a: 10
# Get the lowest index for each string, which gives the
# old-index-to-new-index mapping:
[0]: c: 0
[1]: a: 1
[2]: b: 2
[3]: b: 2
[4]: e: 4
[5]: a: 1
[6]: c: 0
[7]: e: 4
[8]: c: 0
[9]: b: 2
[10]: a: 1
# Create the global string cache with int-to-string mappings, derived from the old-index-to-new-index mapping array:
c: 0
a: 1
b: 2
e: 4
# Remap all physical categorical columns with the old-index-to-new-index mapping array to update the integers:
[0]: 0
[1]: 1
[2]: 2
[3]: 2
[4]: 4
[5]: 1
[6]: 0
[7]: 4
[8]: 0
[9]: 2
[10]: 1
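To make this concrete, here is a plain-Python sketch of the merge described above (illustration only, not actual or proposed polars internals; merge_local_caches and remap_physical are made-up names):

def merge_local_caches(local_caches):
    # local_caches: one list of strings per column (categorical index -> string).
    # 1. Append all local string caches; the offset of each cache is implicit
    #    in its position within the appended list.
    appended = [string for cache in local_caches for string in cache]
    # 2. The lowest appended index of each string becomes its global index,
    #    which gives the old-index-to-new-index mapping array.
    first_seen = {}
    old_to_new = []
    for old_index, string in enumerate(appended):
        first_seen.setdefault(string, old_index)
        old_to_new.append(first_seen[string])
    # 3. first_seen is now the global string cache (string -> global index).
    return first_seen, old_to_new

def remap_physical(physical, cache_offset, old_to_new):
    # Remap one column's physical (integer) values to global indices.
    return [old_to_new[cache_offset + i] for i in physical]

local_caches = [["c", "a", "b"], ["b", "e", "a", "c"], ["e", "c", "b", "a"]]
global_cache, old_to_new = merge_local_caches(local_caches)
# global_cache == {"c": 0, "a": 1, "b": 2, "e": 4}
# old_to_new   == [0, 1, 2, 2, 4, 1, 0, 4, 0, 2, 1]

# categorical_two ("b", "e", "a", "b", "c") has physical values [0, 1, 2, 0, 3]
# and its local cache starts at offset 3 (after categorical_one's 3 entries):
remapped = remap_physical([0, 1, 2, 0, 3], 3, old_to_new)
# remapped == [2, 4, 1, 2, 0]  ->  b, e, a, b, c in the global string cache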
@ghuls do you have an example file or some synthetic data that you benchmarked this on? I am curious whether you are talking about a local or a global string cache. The local string cache has had performance improvements lately. The global one has a lot of contention and I agree we should try to improve that.
See: #4078
Here is an example file in the same format:
curl https://temp.aertslab.org/.tsv/atac_fragments.head40000000.tsv.gz | zcat | tail -n +52 > fragments.head40000000.tsv
The test above was with local string cache.
Below are the timings with local and global string cache (larger file than the example file).
# Read file with polars (pl.Utf8, no pl.Categorical):
In [53]: %time df_pl_strings = read_fragments_from_file_as_strings(fragments_bed_filename, engine='polars')
CPU times: user 41.8 s, sys: 7.7 s, total: 49.5 s
Wall time: 6.71 s
# Read file with pyarrow (pa.large_string, no pa.dictionary) and convert to polars dataframe:
In [54]: %time df_pa_pl_strings = read_fragments_from_file_as_strings(fragments_bed_filename, engine='pyarrow')
CPU times: user 39.4 s, sys: 10.4 s, total: 49.8 s
Wall time: 8.54 s
# Read file with polars (pl.Utf8 columns as pl.Categorical) with local string cache:
In [8]: %time df_pl = read_fragments_from_file(fragments_bed_filename, engine='polars')
CPU times: user 51.9 s, sys: 10.1 s, total: 1min 2s
Wall time: 17.4 s
# Read file with polars (pl.Utf8 columns as pl.Categorical) with global string cache:
In [9]: %%time
...: with pl.StringCache():
...: df_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='polars')
...:
CPU times: user 56 s, sys: 8.64 s, total: 1min 4s
Wall time: 22.7 s
# Pyarrow dictionaries to polars Categorical with local string cache:
In [10]: %time df_pa_pl = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
CPU times: user 1min 1s, sys: 9.84 s, total: 1min 11s
Wall time: 11.7 s
# Pyarrow dictionaries to polars Categorical with global string cache:
In [11]: %%time
...: with pl.StringCache():
...: df_pa_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
...:
CPU times: user 1min 17s, sys: 10.1 s, total: 1min 27s
Wall time: 27.7 s
Could it be that pyarrow converts the categorical upon reading, whereas we first read a string column and then convert it?
It looks like it might indeed create the categorical upon reading.
Creating a dictionary with pyarrow from the pl.Utf8 columns takes close to 9 seconds.
In [92]: %time chrom_pa_dict = df_pl_strings.get_column("Chromosome").to_arrow().dictionary_encode()
CPU times: user 2.08 s, sys: 128 ms, total: 2.2 s
Wall time: 2.2 s
In [93]: %time name_pa_dict = df_pl_strings.get_column("Name").to_arrow().dictionary_encode()
CPU times: user 6.53 s, sys: 167 ms, total: 6.7 s
Wall time: 6.7 s
Yeap, that makes more sense to me as the local builders seem pretty optimized.
Still quite surprised it takes so long to create categoricals, basically longer than reading the whole file.
Yes, you need to hash those strings and store them in a hashmap. That's expensive.
>>> %%time
>>> df["pickup_datetime"].cast(pl.Categorical)
CPU times: user 1.91 s, sys: 162 ms, total: 2.08 s
Wall time: 2.06 s
>>> %%time
>>> df["pickup_datetime"].to_arrow().dictionary_encode()
CPU times: user 2.08 s, sys: 163 ms, total: 2.24 s
Wall time: 2.23 s
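For intuition, in plain Python dictionary encoding boils down to something like the sketch below (illustration only, not how polars or pyarrow actually implement it): every value is hashed and looked up in a hashmap, and the resulting index becomes the physical value.

def dictionary_encode(values):
    string_to_index = {}  # local string cache: string -> categorical index
    physical = []         # physical (integer) representation of the column
    for value in values:
        # Hash the string and look it up; add it to the cache if unseen.
        index = string_to_index.setdefault(value, len(string_to_index))
        physical.append(index)
    return physical, list(string_to_index)

physical, categories = dictionary_encode(["c", "a", "b", "c", "b"])
# physical == [0, 1, 2, 0, 2]; categories == ["c", "a", "b"]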
Maybe we should allow a Categorical builder in the csv reader...
Global string cache is way faster now for the case above (after #4087):
In [4]: %time df_pl = read_fragments_from_file(fragments_bed_filename, engine='polars')
CPU times: user 53.7 s, sys: 8.24 s, total: 1min 1s
Wall time: 17.2 s
In [3]: %time df_pa_pl = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
CPU times: user 1min, sys: 9.16 s, total: 1min 9s
Wall time: 11.3 s
In [5]: %%time
...: with pl.StringCache():
...: df_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='polars')
...:
CPU times: user 53.2 s, sys: 8.33 s, total: 1min 1s
Wall time: 17.2 s
In [6]: %%time
...: with pl.StringCache():
...: df_pa_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
...:
CPU times: user 1min 9s, sys: 9.69 s, total: 1min 19s
Wall time: 21 s
Was closed by the wrong PR.
Wow there is almost no overhead of the global string cache in your use case! :tada:
I am positive the difference is now only due to creating the categoricals upon csv reading.
Thanks for the great improvement. It is getting to acceptable levels :-P. A fast global string cache is great for my case, although the code I have in mind might not strictly need a global string cache.
The "Name" column needs to be a categorical, and at a certain point this column gets filtered by a series with a few thousand elements. So the possibility of creating a categorical series with the same "local" cache as the "Name" column would be good enough.
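For example, something along these lines (made-up example data; depending on the polars version, comparing categoricals like this requires both series to be built under the same string cache, here via pl.StringCache):

import polars as pl

with pl.StringCache():
    # Categorical "Name" column (in practice read from the fragments file).
    names = pl.Series("Name", ["AAAC-1", "TTTG-1", "GGGA-1", "AAAC-1"], dtype=pl.Categorical)
    # Small categorical filter series built under the same string cache.
    selected = pl.Series("selected", ["AAAC-1", "GGGA-1"], dtype=pl.Categorical)
    mask = names.is_in(selected)
# mask == [True, False, True, True]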
Building categoricals in the CSV reader was implemented in: https://github.com/pola-rs/polars/pull/4933