
CSV: build categoricals directly

Open ghuls opened this issue 3 years ago • 11 comments

Describe your feature request

Try out different ways to create Categoricals.

Currently, creating categoricals can be very slow.

import gzip
from typing import Literal

import pyarrow as pa
import pyarrow.csv
import polars as pl


def read_fragments_from_file(
    fragments_bed_filename: str, engine: Literal["polars", "pyarrow"] = "polars"
) -> pl.DataFrame:

    bed_column_names = (
        "Chromosome",
        "Start",
        "End",
        "Name",
        "Score",
    )

    # Set the correct open function depending if the fragments BED file is gzip compressed or not.
    open_fn = gzip.open if fragments_bed_filename.endswith(".gz") else open

    skip_rows = 0
    nbr_columns = 5

    if engine == "polars":
        import polars as pl

        # Read fragments BED file with polars.
        df = (
            pl.read_csv(
                fragments_bed_filename,
                has_headers=False,
                skip_rows=skip_rows,
                sep="\t",
                use_pyarrow=False,
                new_columns=bed_column_names[:nbr_columns],
                dtypes={
                    "Chromosome": pl.Categorical,
                    "Start": pl.Int32,
                    "End": pl.Int32,
                    "Name": pl.Categorical,
                    "Strand": pl.Categorical,
                },
            )
        )
    elif engine == "pyarrow":
        # Read fragments BED file with pyarrow, and create categoricals with pyarrow and convert them to polars dataframe.
        df = pl.from_arrow(
            pa.csv.read_csv(
                fragments_bed_filename,
                read_options=pa.csv.ReadOptions(
                    use_threads=True,
                    skip_rows=0,
                    column_names=bed_column_names[:nbr_columns],
                ),
                parse_options=pa.csv.ParseOptions(
                    delimiter="\t",
                    quote_char=False,
                    escape_char=False,
                    newlines_in_values=False,
                ),
                convert_options=pa.csv.ConvertOptions(
                    column_types={
                        "Chromosome": pa.dictionary(pa.int32(), pa.large_string()),
                        "Start": pa.int32(),
                        "End": pa.int32(),
                        "Name": pa.dictionary(pa.int32(), pa.large_string()),
                        "Strand": pa.dictionary(pa.int32(), pa.large_string()),
                    },
                )
            )
        )

    return df

Creating categoricals directly in polars takes 6-7 seconds, while creating them with pyarrow is almost instantaneous (and converting those dictionaries to polars categoricals is very fast). Both dataframes give the same results as far as I can tell (physical values).

# Read TSV file with polars and create categoricals with polars.
In [8]: %time df_pl = read_fragments_from_file('test.tsv', engine='polars')
CPU times: user 52.6 s, sys: 10.2 s, total: 1min 2s
Wall time: 17.5 s

# Read TSV file with pyarrow, and create categoricals with pyarrow and convert them to polars dataframe.
In [9]: %time df_pa_pl = read_fragments_from_file('test.tsv', engine='pyarrow')
CPU times: user 1min, sys: 9.78 s, total: 1min 10s
Wall time: 11.5 s

It would be worth a try to create categoricals in a similar way as pyarrow and provide methods to create a global string cache from local string caches.

categorical_one = pl.Series("categorical_one", ["c", "a", "b", "c", "b"], dtype=pl.Categorical)
categorical_two = pl.Series("categorical_two", ["b", "e", "a", "b", "c"], dtype=pl.Categorical)
categorical_three = pl.Series("categorical_three", ["e", "c", "b", "e", "a"], dtype=pl.Categorical)

Create all local string caches in parallel:
  - mappings from int (categorical index) to string

local string cache "categorical_one"
[0]: c: 0
[1]: a: 1
[2]: b: 2

local string cache "categorical_two"
[0]: b: 0
[1]: e: 1
[2]: a: 2
[3]: c: 3

local string cache "categorical_three"
[0]: e: 0
[1]: c: 1
[2]: b: 2
[3]: a: 3

# Create global string cache from local ones:
#   - append all local string caches and add offset for each local string cache
global string cache:

c: 0
a: 1
b: 2

# Add size of categorical_one to array offsets of categorical_two.
b: 0 + 3 = 3
e: 1 + 3 = 4
a: 2 + 3 = 5
c: 3 + 3 = 6

# Add size of categorical_one and categorical_two to array offsets of categorical_three.
e: 0 + 3 +  4 = 7
c: 1 + 3 +  4 = 8
b: 2 + 3 +  4 = 9
a: 3 + 3 +  4 = 10

# Store in an array:
[0]: c: 0
[1]: a: 1
[2]: b: 2
[3]: b: 3
[4]: e: 4
[5]: a: 5
[6]: c: 6
[7]: e: 7
[8]: c: 8
[9]: b: 9
[10]: a: 10

# Get lowest index for each string so you have a:
# old index to new index mapping.
[0]: c: 0
[1]: a: 1
[2]: b: 2
[3]: b: 2
[4]: e: 4
[5]: a: 1
[6]: c: 0
[7]: e: 4
[8]: c: 0
[9]: b: 2
[10]: a: 1

# Create global string cache with int to string mappings and reverse from old index to new index mapping array.
c: 0
a: 1
b: 2
e: 4

Remap all physical categorical columns with the old index to new index mapping array to update the integers
[0]: 0
[1]: 1
[2]: 2
[3]: 2
[4]: 4
[5]: 1
[6]: 0
[7]: 4
[8]: 0
[9]: 2
[10]: 1
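The local-to-global merge proposed above can be sketched in plain Python. This is a hypothetical illustration (helper names `build_local_cache` and `merge_local_caches` are made up, not polars internals), but it reproduces the exact walkthrough above: offsets per local cache, lowest-index deduplication, and remapping of the physical columns.

```python
def build_local_cache(values):
    """Map each distinct string to its first-seen index (a local string cache),
    and return the column's physical (integer) representation."""
    cache = {}
    physical = []
    for v in values:
        physical.append(cache.setdefault(v, len(cache)))
    return cache, physical


def merge_local_caches(caches):
    """Append the local caches with offsets, keep the lowest global index per
    string, and return an old-index-to-new-index mapping per local cache."""
    global_cache = {}  # string -> lowest offset index
    old_to_new = []
    offset = 0
    for cache in caches:
        # Iteration follows insertion order, i.e. local index order.
        mapping = [
            global_cache.setdefault(s, offset + local_idx)
            for s, local_idx in cache.items()
        ]
        old_to_new.append(mapping)
        offset += len(cache)
    return global_cache, old_to_new


cache1, phys1 = build_local_cache(["c", "a", "b", "c", "b"])
cache2, phys2 = build_local_cache(["b", "e", "a", "b", "c"])
cache3, phys3 = build_local_cache(["e", "c", "b", "e", "a"])

global_cache, old_to_new = merge_local_caches([cache1, cache2, cache3])

# Remap the physical columns so all integer codes refer to the global cache.
phys2_global = [old_to_new[1][code] for code in phys2]
phys3_global = [old_to_new[2][code] for code in phys3]
```

As in the walkthrough, the merged cache ends up as `{c: 0, a: 1, b: 2, e: 4}`; only the second and third columns need remapping, since the first local cache's indices are already global.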

ghuls avatar Jul 18 '22 17:07 ghuls

@ghuls do you have an example file or some synthetic data that you benchmarked this on? I am curious whether you are talking about a local or a global string cache. The local string cache has had performance improvements lately. The global one has a lot of contention and I agree we should try to improve that.

ritchie46 avatar Jul 19 '22 06:07 ritchie46

See: #4078

ritchie46 avatar Jul 19 '22 09:07 ritchie46

Here you have an example file in the same format:

curl https://temp.aertslab.org/.tsv/atac_fragments.head40000000.tsv.gz | zcat | tail -n +52 > fragments.head40000000.tsv

The test above was with local string cache.

Below are the timings with local and global string cache (a larger file than the example file).

# Read file with polars (pl.Utf8, no pl.Categorical):
In [53]: %time df_pl_strings = read_fragments_from_file_as_strings(fragments_bed_filename, engine='polars')
CPU times: user 41.8 s, sys: 7.7 s, total: 49.5 s
Wall time: 6.71 s

# Read file with pyarrow (pa.large_string, no pa.dictionary) and convert to polars dataframe:
In [54]: %time df_pa_pl_strings = read_fragments_from_file_as_strings(fragments_bed_filename, engine='pyarrow')
CPU times: user 39.4 s, sys: 10.4 s, total: 49.8 s
Wall time: 8.54 s


# Read file with polars (pl.Utf8 columns as pl.Categorical) with local string cache:
In [8]: %time df_pl = read_fragments_from_file(fragments_bed_filename, engine='polars')
CPU times: user 51.9 s, sys: 10.1 s, total: 1min 2s
Wall time: 17.4 s

# Read file with polars (pl.Utf8 columns as pl.Categorical) with global string cache:
In [9]: %%time
   ...: with pl.StringCache():
   ...:     df_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='polars')
   ...: 
CPU times: user 56 s, sys: 8.64 s, total: 1min 4s
Wall time: 22.7 s


# Pyarrow dictionaries to polars Categorical with local string cache:
In [10]: %time df_pa_pl = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
CPU times: user 1min 1s, sys: 9.84 s, total: 1min 11s
Wall time: 11.7 s

# Pyarrow dictionaries to polars Categorical with global string cache:
In [11]: %%time
    ...: with pl.StringCache():
    ...:     df_pa_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
    ...: 
CPU times: user 1min 17s, sys: 10.1 s, total: 1min 27s
Wall time: 27.7 s

ghuls avatar Jul 19 '22 11:07 ghuls

Could it be that pyarrow converts to categorical upon reading, whereas we first read a string column and then convert?

ritchie46 avatar Jul 19 '22 11:07 ritchie46

It looks like it might indeed create the categorical upon reading.

Creating a dictionary with pyarrow from the pl.Utf8 columns takes close to 9 seconds.

In [92]: %time chrom_pa_dict = df_pl_strings.get_column("Chromosome").to_arrow().dictionary_encode()
CPU times: user 2.08 s, sys: 128 ms, total: 2.2 s
Wall time: 2.2 s

In [93]: %time name_pa_dict = df_pl_strings.get_column("Name").to_arrow().dictionary_encode()
CPU times: user 6.53 s, sys: 167 ms, total: 6.7 s
Wall time: 6.7 s

ghuls avatar Jul 19 '22 12:07 ghuls

Yeap, that makes more sense to me as the local builders seem pretty optimized.

ritchie46 avatar Jul 19 '22 13:07 ritchie46

Still quite surprised it takes so long to create categoricals, basically longer than reading the whole file.

ghuls avatar Jul 19 '22 14:07 ghuls

Yes, you need to hash those strings and store them in a hashmap. That's expensive.

>>> %%time
>>> df["pickup_datetime"].cast(pl.Categorical)
CPU times: user 1.91 s, sys: 162 ms, total: 2.08 s
Wall time: 2.06 s
>>> %%time
>>> df["pickup_datetime"].to_arrow().dictionary_encode()
CPU times: user 2.08 s, sys: 163 ms, total: 2.24 s
Wall time: 2.23 s

Maybe we should allow a Categorical builder in the csv reader...
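The idea of a categorical builder in the CSV reader can be illustrated in pure Python (a hypothetical sketch, not the polars implementation; the function name and TSV data are made up): fields of categorical columns are encoded against a local cache as they are parsed, so no intermediate string column is materialized and re-hashed afterwards.

```python
import csv
import io


def read_csv_with_categorical_builder(text, cat_columns, delimiter="\t"):
    """Parse delimited text; encode columns named in `cat_columns` row by row
    into integer codes, building each column's local string cache on the fly."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    caches = {name: {} for name in cat_columns}  # string -> local index
    columns = {name: [] for name in header}
    for row in reader:
        for name, field in zip(header, row):
            if name in caches:
                cache = caches[name]
                columns[name].append(cache.setdefault(field, len(cache)))
            else:
                columns[name].append(field)
    # Return physical codes plus each categorical column's categories.
    return columns, {name: list(cache) for name, cache in caches.items()}


tsv = "Chromosome\tStart\nchr1\t10\nchr2\t20\nchr1\t30\n"
columns, categories = read_csv_with_categorical_builder(tsv, {"Chromosome"})
```

Here `columns["Chromosome"]` comes back as integer codes (`[0, 1, 0]`) with categories `["chr1", "chr2"]`, without ever holding the column as strings.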

ritchie46 avatar Jul 19 '22 15:07 ritchie46

Global string cache is way faster now for the case above (after #4087):

In [4]: %time df_pl = read_fragments_from_file(fragments_bed_filename, engine='polars')
CPU times: user 53.7 s, sys: 8.24 s, total: 1min 1s
Wall time: 17.2 s

In [3]: %time df_pa_pl = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
CPU times: user 1min, sys: 9.16 s, total: 1min 9s
Wall time: 11.3 s


In [5]: %%time
   ...: with pl.StringCache():
   ...:     df_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='polars')
   ...: 
CPU times: user 53.2 s, sys: 8.33 s, total: 1min 1s
Wall time: 17.2 s

In [6]: %%time
   ...: with pl.StringCache():
   ...:     df_pa_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
   ...: 
CPU times: user 1min 9s, sys: 9.69 s, total: 1min 19s
Wall time: 21 s

ghuls avatar Jul 20 '22 09:07 ghuls

Was closed by the wrong PR.

Global string cache is way faster now for the case above (after https://github.com/pola-rs/polars/pull/4087):

Wow there is almost no overhead of the global string cache in your use case! :tada:

I am positive the difference is now only due to creating the categoricals upon csv reading.

ritchie46 avatar Jul 20 '22 10:07 ritchie46

Wow there is almost no overhead of the global string cache in your use case! :tada:

Thanks for the great improvement. It is getting to acceptable levels :-P. A fast global string cache is great for my case, although the code I have in mind might not strictly need a global string cache.

The "Name" column needs to be categorical, and at a certain point it gets filtered by a series with a few thousand elements. So the possibility of creating a categorical series with the same "local" cache as the "Name" column would be good enough.
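Filtering against a shared local cache could look like this (a hypothetical pure-Python sketch; the barcode strings and helper name are made up): because the filter values are encoded with the same cache as the "Name" column, the membership test compares cheap integer codes instead of re-hashing strings.

```python
def encode_with_cache(values, cache):
    """Encode filter values against an existing local cache; strings absent
    from the cache cannot match any row, so they are simply dropped."""
    return {cache[v] for v in values if v in cache}


# The "Name" column's local string cache and its physical (integer) codes.
name_cache = {"AAAC": 0, "AAAG": 1, "AAAT": 2}
name_physical = [0, 2, 1, 0, 2, 2]

# Encode the filter series with the same cache, then filter on codes only.
wanted = encode_with_cache(["AAAG", "AAAT", "TTTT"], name_cache)
mask = [code in wanted for code in name_physical]
```

No global cache is needed: both sides of the `is_in`-style comparison agree on the meaning of each integer because they share one local cache.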

ghuls avatar Jul 20 '22 11:07 ghuls

Building categoricals in CSV reader was implemented in: https://github.com/pola-rs/polars/pull/4933

ghuls avatar Oct 21 '22 21:10 ghuls