parquet-compression for lgbm
Initial attempt at a Parquet-based compression implementation.
This is a draft!
Core idea:
- create a dataframe / table-like structure from the tree information
- encode it with a custom pyarrow schema
- write the Arrow Table into a Parquet file, piggybacking (🐷) off of pyarrow/parquet's compression tools and efficient data format
- instead of writing to an actual file, write into a bytes object that then gets put into the pickle (see the sketch below)
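A minimal sketch of that last step, assuming nothing about the real schema (the `tree_idx`/`threshold` columns below are hypothetical stand-ins for the actual tree attributes, not the slim_trees implementation):

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq


def table_to_parquet_bytes(table: pa.Table) -> bytes:
    buffer = io.BytesIO()
    pq.write_table(table, buffer)  # Parquet's own encoding/compression applies here
    return buffer.getvalue()


def parquet_bytes_to_table(data: bytes) -> pa.Table:
    return pq.read_table(io.BytesIO(data))


# hypothetical example columns; the real code stores per-node tree attributes
nodes = pa.table(
    {
        "tree_idx": pa.array([0, 0, 1], type=pa.int32()),
        "threshold": pa.array([0.5, 1.25, -3.0], type=pa.float64()),
    }
)
raw = table_to_parquet_bytes(nodes)  # these bytes are what ends up inside the pickle
assert parquet_bytes_to_table(raw).equals(nodes)
```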
todos:
- [x] implement dumping
- [x] implement loading
- [ ] remove pandas inter-dependency
- [ ] fix tests
- [ ] validate the effect of compression on the table itself (i.e. not on the pickle, though that should help as well)
- [ ] speeeeeeed
(benchmark 4547923027 / attempt 1) Each cell: base results / our results / change
| Model | Size | Dump Time | Load Time |
|---|---|---|---|
| sklearn rf 20M | 20.8 MiB / 3.0 MiB / 6.87 x | 0.02 s / 0.05 s / 2.22 x | 0.02 s / 0.04 s / 1.99 x |
| sklearn rf 20M lzma | 6.5 MiB / 2.0 MiB / 3.26 x | 13.29 s / 1.37 s / 0.10 x | 0.63 s / 0.23 s / 0.36 x |
| sklearn rf 200M | 212.3 MiB / 30.6 MiB / 6.94 x | 0.18 s / 0.41 s / 2.21 x | 0.21 s / 0.40 s / 1.91 x |
| sklearn rf 200M lzma | 47.4 MiB / 14.6 MiB / 3.24 x | 112.79 s / 21.01 s / 0.19 x | 4.52 s / 1.62 s / 0.36 x |
| sklearn rf 1G | 1157.5 MiB / 166.8 MiB / 6.94 x | 1.28 s / 2.06 s / 1.61 x | 1.25 s / 1.82 s / 1.46 x |
| sklearn rf 1G lzma | 258.1 MiB / 98.1 MiB / 2.63 x | 587.17 s / 130.46 s / 0.22 x | 26.33 s / 9.61 s / 0.36 x |
| sklearn gb 2M | 2.2 MiB / 1.1 MiB / 2.08 x | 0.04 s / 0.47 s / 11.31 x | 0.07 s / 0.22 s / 3.25 x |
| sklearn gb 2M lzma | 0.6 MiB / 0.2 MiB / 3.81 x | 1.14 s / 0.64 s / 0.56 x | 0.13 s / 0.28 s / 2.18 x |
| lgbm gbdt 2M | 2.6 MiB / 0.5 MiB / 5.02 x | 0.12 s / 0.47 s / 3.95 x | 0.01 s / 0.50 s / 34.10 x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.72 x | 1.77 s / 0.49 s / 0.28 x | 0.09 s / 0.37 s / 4.23 x |
| lgbm gbdt 5M | 5.3 MiB / 1.0 MiB / 5.37 x | 0.22 s / 0.74 s / 3.33 x | 0.03 s / 0.69 s / 25.00 x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.9 MiB / 1.81 x | 4.33 s / 0.97 s / 0.22 x | 0.17 s / 0.74 s / 4.46 x |
| lgbm gbdt 20M | 22.7 MiB / 3.2 MiB / 7.06 x | 0.90 s / 2.78 s / 3.07 x | 0.12 s / 2.71 s / 22.19 x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.1 MiB / 2.05 x | 23.85 s / 3.83 s / 0.16 x | 0.67 s / 2.66 s / 3.95 x |
| lgbm gbdt 100M | 101.1 MiB / 10.5 MiB / 9.61 x | 3.96 s / 12.34 s / 3.12 x | 0.65 s / 11.50 s / 17.75 x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.0 MiB / 2.55 x | 107.94 s / 17.34 s / 0.16 x | 2.81 s / 12.62 s / 4.49 x |
| lgbm rf 10M | 10.9 MiB / 0.9 MiB / 12.56 x | 0.45 s / 1.10 s / 2.42 x | 0.05 s / 0.69 s / 13.89 x |
| lgbm rf 10M lzma | 0.7 MiB / 0.8 MiB / 0.88 x | 2.45 s / 1.34 s / 0.55 x | 0.14 s / 0.72 s / 5.22 x |
Current state of load performance (`line_profiler` output for `_decompress_booster_handle`):
```
Timer unit: 1e-06 s

Total time: 4.75943 s
File: /Users/ytatar/projects/2302-pickle-compression/slim_trees/lgbm_booster.py
Function: _decompress_booster_handle at line 178

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   178                                           # @profile
   179                                           def _decompress_booster_handle(
   180                                               compressed_state: Tuple[str, bytes, bytes, bytes, str]
   181                                           ) -> str:
   182        10          4.0      0.4      0.0      (
   183        10          0.0      0.0      0.0          front_str,
   184        10          0.0      0.0      0.0          trees_df_bytes,
   185        10          0.0      0.0      0.0          nodes_df_bytes,
   186        10          0.0      0.0      0.0          leaf_value_bytes,
   187        10          0.0      0.0      0.0          back_str,
   188        10          2.0      0.2      0.0      ) = compressed_state
   189        10          4.0      0.4      0.0      assert type(front_str) == str
   190        10          4.0      0.4      0.0      assert type(back_str) == str
   191
   192        10      34970.0   3497.0      0.7      trees_df = pq_bytes_to_df(trees_df_bytes)
   193        10    2020509.0 202050.9     42.5      nodes_df = pq_bytes_to_df(nodes_df_bytes).groupby("tree_idx").agg(lambda x: list(x))
   194        10          2.0      0.2      0.0      leaf_values_df = (
   195        10     357533.0  35753.3      7.5          pq_bytes_to_df(leaf_value_bytes).groupby("tree_idx")["leaf_value"].apply(list)
   196                                               )
   197                                               # merge trees_df, nodes_df, and leaf_values_df on tree_idx
   198        10      11741.0   1174.1      0.2      trees_df = trees_df.merge(nodes_df, on="tree_idx")
   199        10       9704.0    970.4      0.2      trees_df = trees_df.merge(leaf_values_df, on="tree_idx")
   200
   201                                               # handle = front_str
   202
   203        10          3.0      0.3      0.0      tree_strings = [front_str]
   204
   205                                               # TODO: directly go over trees and nodes
   206     20000     620494.0     31.0     13.0      for i, tree in trees_df.iterrows():
   222
   223     20000      79986.0      4.0      1.7          num_leaves = int(tree["num_leaves"])
   224     20000       2578.0      0.1      0.1          num_nodes = num_leaves - 1
   225
   226     20000      26030.0      1.3      0.5          tree_strings.append(f"""Tree={i}
   227     20000      61575.0      3.1      1.3  num_leaves={int(tree["num_leaves"])}
   228     20000      59082.0      3.0      1.2  num_cat={tree['num_cat']}
   229     20000     148784.0      7.4      3.1  split_feature={' '.join([str(x) for x in tree["split_feature"]])}
   230     20000       4304.0      0.2      0.1  split_gain={("0" * num_nodes)[:-1]}
   231     20000     357808.0     17.9      7.5  threshold={' '.join([str(x) for x in tree['threshold']])}
   232     20000     149148.0      7.5      3.1  decision_type={' '.join([str(x) for x in tree["decision_type"]])}
   233     20000     151402.0      7.6      3.2  left_child={" ".join([str(x) for x in tree["left_child"]])}
   234     20000     151562.0      7.6      3.2  right_child={" ".join([str(x) for x in tree["right_child"]])}
   235     20000     369298.0     18.5      7.8  leaf_value={" ".join([str(x) for x in tree["leaf_value"]])}
   236     20000       4519.0      0.2      0.1  leaf_weight={("0 " * num_leaves)[:-1]}
   237     20000       4008.0      0.2      0.1  leaf_count={("0 " * num_leaves)[:-1]}
   238     20000       3483.0      0.2      0.1  internal_value={("0 " * num_nodes)[:-1]}
   239     20000       3363.0      0.2      0.1  internal_weight={("0 " * num_nodes)[:-1]}
   240     20000       3289.0      0.2      0.1  internal_count={("0 " * num_nodes)[:-1]}
   241     20000      61646.0      3.1      1.3  is_linear={tree['is_linear']}
   242     20000      57860.0      2.9      1.2  shrinkage={tree['shrinkage']}
   243
   244
   276
   277        10          1.0      0.1      0.0      tree_strings.append(back_str)
   278
   279                                               # handle += back_str
   280        10       4732.0    473.2      0.1      return "".join(tree_strings)
```
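To make the hotspot concrete: ~42% of the time goes into `groupby("tree_idx").agg(lambda x: list(x))`. Below is a minimal sketch of that pattern and one possible faster variant, assuming the rows arrive sorted by `tree_idx` (the small DataFrame is made up; only the column names follow the profile):

```python
import numpy as np
import pandas as pd

# made-up stand-in for the real per-node table
nodes_df = pd.DataFrame(
    {
        "tree_idx": [0, 0, 0, 1, 1],
        "threshold": [0.5, 1.0, 2.0, 0.25, 0.75],
    }
)

# current approach: a Python-level lambda is called per group and column (slow)
per_tree_lists = nodes_df.groupby("tree_idx").agg(lambda x: list(x))

# possible alternative (assumes rows are already sorted by tree_idx): find the
# group boundaries once and split the underlying numpy arrays without lambdas
boundaries = np.flatnonzero(np.diff(nodes_df["tree_idx"].to_numpy())) + 1
thresholds_per_tree = np.split(nodes_df["threshold"].to_numpy(), boundaries)
```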
I seem to be missing some values that are required for inference (e.g. num_cat is still missing), hence all the 0.0 prediction values. Otherwise, this should work fine.
Fun fact: float printing performance seems to scale with the number of digits:
```
l = [i/17. for i in range(1_000)]

In [2]: %timeit ", ".join(str(x) for x in l)
364 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit ", ".join(map(str, l))
344 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [5]: %timeit ", ".join(f"{x:.20f}" for x in l)
407 µs ± 292 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit ", ".join(f"{x:.15f}" for x in l)
338 µs ± 349 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [7]: %timeit ", ".join(f"{x:.12f}" for x in l)
248 µs ± 364 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit ", ".join(f"{x:.5f}" for x in l)
184 µs ± 286 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
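Applied to the handle reconstruction, that would look roughly like the following; the `.12f` precision is an arbitrary assumption for illustration, and whether a lossy fixed-precision format is acceptable ties into the float32 question below:

```python
thresholds = [i / 17.0 for i in range(1_000)]

# exact round-trip: str() gives the shortest representation that parses back
# to the same float, but is slower for values with long mantissas
exact = " ".join(map(str, thresholds))

# faster but potentially lossy: fixed, shorter precision (".12f" is only an
# example value, not what the PR uses)
approx = " ".join(f"{x:.12f}" for x in thresholds)
```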
Interestingly, combining Parquet with LZMA does not improve the compression much; only the large LGBM gbdt LZMA case shows a clear gain.
Do we want to use float32 for the values in lgbm, for consistency with sklearn?
> Do we want to use float32 for the values in lgbm, for consistency with sklearn?

Should work, I guess. We have tests for prediction performance anyway, though they currently fail because of some parsing issue 😬
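For illustration, restricting e.g. the leaf values to float32 in the Arrow data could look like this (hypothetical values; float32 keeps roughly 7 significant decimal digits, at half the storage size of float64):

```python
import pyarrow as pa

# hypothetical leaf values stored as float64
leaf_values = pa.array([0.123456789012345, -1.5e-07], type=pa.float64())

# halves the column size, but rounds to float32 precision (~7 significant digits)
leaf_values_f32 = leaf_values.cast(pa.float32())
```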
LGBM gbdt large: 3.05x -> 7.15x
LGBM gbdt large LZMA: 2.29x -> 2.54x
While this is a great compression improvement when no LZMA is applied at the end, the gain is much smaller when LZMA is used anyway. Maybe it makes sense to add an option to fall back to the old implementation instead of the pyarrow path when pyarrow is not installed?
But I suggest moving this discussion to when the PR is almost ready to merge.
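For later reference, a minimal sketch of such a fallback, with hypothetical function names for the two code paths (not the actual slim_trees API):

```python
from typing import Any

try:
    import pyarrow  # noqa: F401

    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


def _compress_parquet(state: Any) -> bytes:  # hypothetical new Parquet-based path
    raise NotImplementedError


def _compress_strings(state: Any) -> bytes:  # hypothetical old string-based path
    raise NotImplementedError


def compress_booster_state(state: Any) -> bytes:
    """Use the Parquet path when pyarrow is available, otherwise fall back."""
    return _compress_parquet(state) if HAS_PYARROW else _compress_strings(state)
```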
The table suggests that the compression factor improves especially for larger models. Unfortunately, we have lost some of that again; I seem to remember it was partially up to 11x.
Old (each cell: base results / our results / change):
| Model | Size | Dump Time | Load Time |
|---|---|---|---|
| lgbm gbdt 2M | 2.6 MiB / 1.0 MiB / 2.78 x | 0.09 s / 0.25 s / 2.78 x | 0.01 s / 0.15 s / 12.14 x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.90 x | 1.31 s / 0.51 s / 0.39 x | 0.08 s / 0.22 s / 2.73 x |
| lgbm gbdt 5M | 5.3 MiB / 1.9 MiB / 2.81 x | 0.18 s / 0.51 s / 2.82 x | 0.03 s / 0.32 s / 12.54 x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.8 MiB / 1.96 x | 3.12 s / 1.07 s / 0.34 x | 0.15 s / 0.38 s / 2.55 x |
| lgbm gbdt 20M | 22.7 MiB / 7.6 MiB / 3.00 x | 0.72 s / 2.11 s / 2.95 x | 0.11 s / 1.32 s / 11.83 x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.0 MiB / 2.09 x | 16.68 s / 4.86 s / 0.29 x | 0.59 s / 1.54 s / 2.61 x |
| lgbm gbdt 100M | 101.1 MiB / 33.0 MiB / 3.06 x | 3.23 s / 9.39 s / 2.91 x | 0.53 s / 116.98 s / 219.18 x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.6 MiB / 2.41 x | 83.14 s / 23.69 s / 0.28 x | 2.44 s / 98.06 s / 40.16 x |
| lgbm rf 10M | 10.9 MiB / 3.2 MiB / 3.46 x | 0.36 s / 0.69 s / 1.92 x | 0.04 s / 0.56 s / 12.54 x |
| lgbm rf 10M lzma | 0.7 MiB / 0.4 MiB / 1.85 x | 1.87 s / 0.95 s / 0.51 x | 0.12 s / 0.58 s / 4.84 x |
New (each cell: base results / our results / change):
| Model | Size | Dump Time | Load Time |
|---|---|---|---|
| lgbm gbdt 2M | 2.6 MiB / 0.5 MiB / 5.02 x | 0.09 s / 0.37 s / 4.08 x | 0.01 s / 0.30 s / 24.32 x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.72 x | 1.33 s / 0.41 s / 0.31 x | 0.08 s / 0.26 s / 3.31 x |
| lgbm gbdt 5M | 5.3 MiB / 1.0 MiB / 5.37 x | 0.18 s / 0.56 s / 3.15 x | 0.02 s / 0.45 s / 17.93 x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.9 MiB / 1.81 x | 3.26 s / 0.78 s / 0.24 x | 0.15 s / 0.50 s / 3.38 x |
| lgbm gbdt 20M | 22.7 MiB / 3.2 MiB / 7.06 x | 0.71 s / 2.16 s / 3.03 x | 0.11 s / 1.84 s / 16.68 x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.1 MiB / 2.05 x | 17.84 s / 3.02 s / 0.17 x | 0.59 s / 1.89 s / 3.19 x |
| lgbm gbdt 100M | 101.1 MiB / 10.5 MiB / 9.61 x | 3.17 s / 9.45 s / 2.98 x | 0.51 s / 7.94 s / 15.59 x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.0 MiB / 2.55 x | 81.52 s / 13.03 s / 0.16 x | 2.43 s / 8.11 s / 3.34 x |
| lgbm rf 10M | 10.9 MiB / 0.9 MiB / 12.56 x | 0.36 s / 0.88 s / 2.45 x | 0.04 s / 0.46 s / 10.66 x |
| lgbm rf 10M lzma | 0.7 MiB / 0.8 MiB / 0.88 x | 1.91 s / 1.09 s / 0.57 x | 0.12 s / 0.51 s / 4.25 x |
As discussed with @pavelzw today, how about splitting the non-Parquet parts into a separate PR? It's annoying that Parquet doesn't improve the compression ratio despite the effort you have put into this, @YYYasin19. But let's make the best of it by keeping the other parts!