
CSV: build categoricals directly

Open ghuls opened this issue 3 years ago • 11 comments

Describe your feature request

Try out different ways to create Categoricals.

Currently, creating categoricals can be very slow.

import gzip
from typing import Literal

import pyarrow as pa
import pyarrow.csv
import polars as pl


def read_fragments_from_file(
    fragments_bed_filename: str, engine: Literal["polars", "pyarrow"] = "polars"
) -> pl.DataFrame:

    bed_column_names = (
        "Chromosome",
        "Start",
        "End",
        "Name",
        "Score",
    )

    # Set the correct open function depending if the fragments BED file is gzip compressed or not.
    open_fn = gzip.open if fragments_bed_filename.endswith(".gz") else open

    skip_rows = 0
    nbr_columns = 5

    if engine == "polars":
        import polars as pl

        # Read fragments BED file with polars.
        df = (
            pl.read_csv(
                fragments_bed_filename,
                has_headers=False,
                skip_rows=skip_rows,
                sep="\t",
                use_pyarrow=False,
                new_columns=bed_column_names[:nbr_columns],
                dtypes={
                    "Chromosome": pl.Categorical,
                    "Start": pl.Int32,
                    "End": pl.Int32,
                    "Name": pl.Categorical,
                    "Strand": pl.Categorical,
                },
            )
        )
    elif engine == "pyarrow":
        # Read fragments BED file with pyarrow, and create categoricals with pyarrow and convert them to polars dataframe.
        df = pl.from_arrow(
            pa.csv.read_csv(
                fragments_bed_filename,
                read_options=pa.csv.ReadOptions(
                    use_threads=True,
                    skip_rows=0,
                    column_names=bed_column_names[:nbr_columns],
                ),
                parse_options=pa.csv.ParseOptions(
                    delimiter="\t",
                    quote_char=False,
                    escape_char=False,
                    newlines_in_values=False,
                ),
                convert_options=pa.csv.ConvertOptions(
                    column_types={
                        "Chromosome": pa.dictionary(pa.int32(), pa.large_string()),
                        "Start": pa.int32(),
                        "End": pa.int32(),
                        "Name": pa.dictionary(pa.int32(), pa.large_string()),
                        "Strand": pa.dictionary(pa.int32(), pa.large_string()),
                    },
                )
            )
        )

    return df

Creating categoricals directly in polars takes 6-7 seconds, while creating them with pyarrow is almost instantaneous (and converting those dictionaries to polars categoricals is very fast). Both dataframes give the same results as far as I can tell (physical values).

# Read TSV file with polars and create categoricals with polars.
In [8]: %time df_pl = read_fragments_from_file('test.tsv', engine='polars')
CPU times: user 52.6 s, sys: 10.2 s, total: 1min 2s
Wall time: 17.5 s

# Read TSV file with pyarrow, and create categoricals with pyarrow and convert them to polars dataframe.
In [9]: %time df_pa_pl = read_fragments_from_file('test.tsv', engine='pyarrow')
CPU times: user 1min, sys: 9.78 s, total: 1min 10s
Wall time: 11.5 s

It would be worth a try to create categoricals in a similar way as pyarrow and provide methods to create a global string cache from local string caches.

categorical_one = pl.Series("categorical_one", ["c", "a", "b", "c", "b"], dtype=pl.Categorical)
categorical_two = pl.Series("categorical_two", ["b", "e", "a", "b", "c"], dtype=pl.Categorical)
categorical_three = pl.Series("categorical_three", ["e", "c", "b", "e", "a"], dtype=pl.Categorical)

Create all local string caches in parallel:
  - mappings from int (categorical index) to string

local string cache "categorical_one"
[0]: c: 0
[1]: a: 1
[2]: b: 2

local string cache "categorical_two"
[0]: b: 0
[1]: e: 1
[2]: a: 2
[3]: c: 3

local string cache "categorical_three"
[0]: e: 0
[1]: c: 1
[2]: b: 2
[3]: a: 3

# Create global string cache from local ones:
#   - append all local string caches and add offset for each local string cache
global string cache:

c: 0
a: 1
b: 2

# Add size of categorical_one to array offsets of categorical_two.
b: 0 + 3 = 3
e: 1 + 3 = 4
a: 2 + 3 = 5
c: 3 + 3 = 6

# Add size of categorical_one and categorical_two to array offsets of categorical_three.
e: 0 + 3 +  4 = 7
c: 1 + 3 +  4 = 8
b: 2 + 3 +  4 = 9
a: 3 + 3 +  4 = 10

# Store in an array:
[0]: c: 0
[1]: a: 1
[2]: b: 2
[3]: b: 3
[4]: e: 4
[5]: a: 5
[6]: c: 6
[7]: e: 7
[8]: c: 8
[9]: b: 9
[10]: a: 10

# Get lowest index for each string so you have a:
# old index to new index mapping.
[0]: c: 0
[1]: a: 1
[2]: b: 2
[3]: b: 2
[4]: e: 4
[5]: a: 1
[6]: c: 0
[7]: e: 4
[8]: c: 0
[9]: b: 2
[10]: a: 1

# Create global string cache with int to string mappings and reverse from old index to new index mapping array.
c: 0
a: 1
b: 2
e: 4

Remap all physical categorical columns with the old index to new index mapping array to update the integers
[0]: 0
[1]: 1
[2]: 2
[3]: 2
[4]: 4
[5]: 1
[6]: 0
[7]: 4
[8]: 0
[9]: 2
[10]: 1
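The local-to-global merge proposed above can be sketched in plain Python. This is a hypothetical illustration (helper names `build_local_cache` and `merge_local_caches` are made up, not polars internals), but it reproduces the exact walkthrough above: offsets per local cache, lowest-index deduplication, and remapping of the physical columns.

```python
def build_local_cache(values):
    """Map each distinct string to its first-seen index (a local string cache),
    and return the column's physical (integer) representation."""
    cache = {}
    physical = []
    for v in values:
        physical.append(cache.setdefault(v, len(cache)))
    return cache, physical


def merge_local_caches(caches):
    """Append the local caches with offsets, keep the lowest global index per
    string, and return an old-index-to-new-index mapping per local cache."""
    global_cache = {}  # string -> lowest offset index
    old_to_new = []
    offset = 0
    for cache in caches:
        # Iteration follows insertion order, i.e. local index order.
        mapping = [
            global_cache.setdefault(s, offset + local_idx)
            for s, local_idx in cache.items()
        ]
        old_to_new.append(mapping)
        offset += len(cache)
    return global_cache, old_to_new


cache1, phys1 = build_local_cache(["c", "a", "b", "c", "b"])
cache2, phys2 = build_local_cache(["b", "e", "a", "b", "c"])
cache3, phys3 = build_local_cache(["e", "c", "b", "e", "a"])

global_cache, old_to_new = merge_local_caches([cache1, cache2, cache3])

# Remap the physical columns so all integer codes refer to the global cache.
phys2_global = [old_to_new[1][code] for code in phys2]
phys3_global = [old_to_new[2][code] for code in phys3]
```

As in the walkthrough, the merged cache ends up as `{c: 0, a: 1, b: 2, e: 4}`; only the second and third columns need remapping, since the first local cache's indices are already global.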

ghuls avatar Jul 18 '22 17:07 ghuls

@ghuls do you have an example file or some synthetic data that you benchmarked this on? I am curious whether you are talking about a local or a global string cache. The local string cache has had performance improvements lately. The global one has a lot of contention and I agree we should try to improve that.

ritchie46 avatar Jul 19 '22 06:07 ritchie46

See: #4078

ritchie46 avatar Jul 19 '22 09:07 ritchie46

Here you have an example file in the same format:

curl https://temp.aertslab.org/.tsv/atac_fragments.head40000000.tsv.gz | zcat | tail -n +52 > fragments.head40000000.tsv

The test above was with local string cache.

Below are the timings with local and global string cache (a larger file than the example file).

# Read file with polars (pl.Utf8, no pl.Categorical):
In [53]: %time df_pl_strings = read_fragments_from_file_as_strings(fragments_bed_filename, engine='polars')
CPU times: user 41.8 s, sys: 7.7 s, total: 49.5 s
Wall time: 6.71 s

# Read file with pyarrow (pa.large_string, no pa.dictionary) and convert to polars dataframe:
In [54]: %time df_pa_pl_strings = read_fragments_from_file_as_strings(fragments_bed_filename, engine='pyarrow')
CPU times: user 39.4 s, sys: 10.4 s, total: 49.8 s
Wall time: 8.54 s


# Read file with polars (pl.Utf8 columns as pl.Categorical) with local string cache:
In [8]: %time df_pl = read_fragments_from_file(fragments_bed_filename, engine='polars')
CPU times: user 51.9 s, sys: 10.1 s, total: 1min 2s
Wall time: 17.4 s

# Read file with polars (pl.Utf8 columns as pl.Categorical) with global string cache:
In [9]: %%time
   ...: with pl.StringCache():
   ...:     df_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='polars')
   ...: 
CPU times: user 56 s, sys: 8.64 s, total: 1min 4s
Wall time: 22.7 s


# Pyarrow dictionaries to polars Categorical with local string cache:
In [10]: %time df_pa_pl = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
CPU times: user 1min 1s, sys: 9.84 s, total: 1min 11s
Wall time: 11.7 s

# Pyarrow dictionaries to polars Categorical with global string cache:
In [11]: %%time
    ...: with pl.StringCache():
    ...:     df_pa_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
    ...: 
CPU times: user 1min 17s, sys: 10.1 s, total: 1min 27s
Wall time: 27.7 s

ghuls avatar Jul 19 '22 11:07 ghuls

Could it be that pyarrow converts to categorical upon reading, whereas we first read a string column and then convert?

ritchie46 avatar Jul 19 '22 11:07 ritchie46

It looks like it might indeed create the categorical upon reading.

Creating a dictionary with pyarrow from the pl.Utf8 columns takes close to 9 seconds.

In [92]: %time chrom_pa_dict = df_pl_strings.get_column("Chromosome").to_arrow().dictionary_encode()
CPU times: user 2.08 s, sys: 128 ms, total: 2.2 s
Wall time: 2.2 s

In [93]: %time name_pa_dict = df_pl_strings.get_column("Name").to_arrow().dictionary_encode()
CPU times: user 6.53 s, sys: 167 ms, total: 6.7 s
Wall time: 6.7 s

ghuls avatar Jul 19 '22 12:07 ghuls

Yeap, that makes more sense to me as the local builders seem pretty optimized.

ritchie46 avatar Jul 19 '22 13:07 ritchie46

Still quite surprised it takes so long to create categoricals, basically longer than reading the whole file.

ghuls avatar Jul 19 '22 14:07 ghuls

Yes, you need to hash those strings and store them in a hashmap. That's expensive.

>>> %%time
>>> df["pickup_datetime"].cast(pl.Categorical)
CPU times: user 1.91 s, sys: 162 ms, total: 2.08 s
Wall time: 2.06 s
>>> %%time
>>> df["pickup_datetime"].to_arrow().dictionary_encode()
CPU times: user 2.08 s, sys: 163 ms, total: 2.24 s
Wall time: 2.23 s

Maybe we should allow a Categorical builder in the csv reader...
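The idea of a categorical builder in the CSV reader can be illustrated in pure Python (a hypothetical sketch, not the polars implementation; the function name and TSV data are made up): fields of categorical columns are encoded against a local cache as they are parsed, so no intermediate string column is materialized and re-hashed afterwards.

```python
import csv
import io


def read_csv_with_categorical_builder(text, cat_columns, delimiter="\t"):
    """Parse delimited text; encode columns named in `cat_columns` row by row
    into integer codes, building each column's local string cache on the fly."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    caches = {name: {} for name in cat_columns}  # string -> local index
    columns = {name: [] for name in header}
    for row in reader:
        for name, field in zip(header, row):
            if name in caches:
                cache = caches[name]
                columns[name].append(cache.setdefault(field, len(cache)))
            else:
                columns[name].append(field)
    # Return physical codes plus each categorical column's categories.
    return columns, {name: list(cache) for name, cache in caches.items()}


tsv = "Chromosome\tStart\nchr1\t10\nchr2\t20\nchr1\t30\n"
columns, categories = read_csv_with_categorical_builder(tsv, {"Chromosome"})
```

Here `columns["Chromosome"]` comes back as integer codes (`[0, 1, 0]`) with categories `["chr1", "chr2"]`, without ever holding the column as strings.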

ritchie46 avatar Jul 19 '22 15:07 ritchie46

Global string cache is way faster now for the case above (after #4087):

In [4]: %time df_pl = read_fragments_from_file(fragments_bed_filename, engine='polars')
CPU times: user 53.7 s, sys: 8.24 s, total: 1min 1s
Wall time: 17.2 s

In [3]: %time df_pa_pl = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
CPU times: user 1min, sys: 9.16 s, total: 1min 9s
Wall time: 11.3 s


In [5]: %%time
   ...: with pl.StringCache():
   ...:     df_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='polars')
   ...: 
CPU times: user 53.2 s, sys: 8.33 s, total: 1min 1s
Wall time: 17.2 s

In [6]: %%time
   ...: with pl.StringCache():
   ...:     df_pa_pl_string_cache = read_fragments_from_file(fragments_bed_filename, engine='pyarrow')
   ...: 
CPU times: user 1min 9s, sys: 9.69 s, total: 1min 19s
Wall time: 21 s

ghuls avatar Jul 20 '22 09:07 ghuls

Was closed by the wrong PR.

Global string cache is way faster now for the case above (after https://github.com/pola-rs/polars/pull/4087):

Wow there is almost no overhead of the global string cache in your use case! :tada:

I am positive the difference is now only due to creating the categoricals upon csv reading.

ritchie46 avatar Jul 20 '22 10:07 ritchie46

Wow there is almost no overhead of the global string cache in your use case! :tada:

Thanks for the great improvement. It is getting to acceptable levels :-P. A fast global string cache is great for my case, although the code I have in mind might not strictly need a global string cache.

The "Name" column needs to be categorical, and at a certain point it gets filtered by a series with a few thousand elements. So the possibility of creating a categorical series with the same "local" cache as the "Name" column would be good enough.
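Filtering against a shared local cache could look like this (a hypothetical pure-Python sketch; the barcode strings and helper name are made up): because the filter values are encoded with the same cache as the "Name" column, the membership test compares cheap integer codes instead of re-hashing strings.

```python
def encode_with_cache(values, cache):
    """Encode filter values against an existing local cache; strings absent
    from the cache cannot match any row, so they are simply dropped."""
    return {cache[v] for v in values if v in cache}


# The "Name" column's local string cache and its physical (integer) codes.
name_cache = {"AAAC": 0, "AAAG": 1, "AAAT": 2}
name_physical = [0, 2, 1, 0, 2, 2]

# Encode the filter series with the same cache, then filter on codes only.
wanted = encode_with_cache(["AAAG", "AAAT", "TTTT"], name_cache)
mask = [code in wanted for code in name_physical]
```

No global cache is needed: both sides of the `is_in`-style comparison agree on the meaning of each integer because they share one local cache.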

ghuls avatar Jul 20 '22 11:07 ghuls

Building categoricals in CSV reader was implemented in: https://github.com/pola-rs/polars/pull/4933

ghuls avatar Oct 21 '22 21:10 ghuls