parquet-compression for lgbm
Initial attempt at a Parquet-based compression implementation.
This is a draft!
Core idea:
- create a dataframe / table-like structure from the tree information
- encode it with a custom pyarrow schema
- write the Arrow Table into a Parquet file, piggybacking (🐷) off of pyarrow/parquet's compression tools and efficient data format
- instead of writing to an actual file, write into a bytes object that then gets put into the pickle (see the sketch below)
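A minimal sketch of that last step, assuming nothing about the real schema (the `tree_idx`/`threshold` columns below are hypothetical stand-ins for the actual tree attributes, not the slim_trees implementation):

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq


def table_to_parquet_bytes(table: pa.Table) -> bytes:
    buffer = io.BytesIO()
    pq.write_table(table, buffer)  # Parquet's own encoding/compression applies here
    return buffer.getvalue()


def parquet_bytes_to_table(data: bytes) -> pa.Table:
    return pq.read_table(io.BytesIO(data))


# hypothetical example columns; the real code stores per-node tree attributes
nodes = pa.table(
    {
        "tree_idx": pa.array([0, 0, 1], type=pa.int32()),
        "threshold": pa.array([0.5, 1.25, -3.0], type=pa.float64()),
    }
)
raw = table_to_parquet_bytes(nodes)  # these bytes are what ends up inside the pickle
assert parquet_bytes_to_table(raw).equals(nodes)
```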
todos:
- [x] implement dumping
- [x] implement loading
- [ ] remove pandas inter-dependency
- [ ] fix tests
- [ ] validate the effect of compression on the table itself (i.e. not on the pickle, though that should help as well)
- [ ] speeeeeeed
(benchmark 4547923027 / attempt 1) Each cell: base results / our results / change
| Model | Size | Dump Time | Load Time |
|---|---|---|---|
| sklearn rf 20M | 20.8 MiB / 3.0 MiB / 6.87 x | 0.02 s / 0.05 s / 2.22 x | 0.02 s / 0.04 s / 1.99 x |
| sklearn rf 20M lzma | 6.5 MiB / 2.0 MiB / 3.26 x | 13.29 s / 1.37 s / 0.10 x | 0.63 s / 0.23 s / 0.36 x |
| sklearn rf 200M | 212.3 MiB / 30.6 MiB / 6.94 x | 0.18 s / 0.41 s / 2.21 x | 0.21 s / 0.40 s / 1.91 x |
| sklearn rf 200M lzma | 47.4 MiB / 14.6 MiB / 3.24 x | 112.79 s / 21.01 s / 0.19 x | 4.52 s / 1.62 s / 0.36 x |
| sklearn rf 1G | 1157.5 MiB / 166.8 MiB / 6.94 x | 1.28 s / 2.06 s / 1.61 x | 1.25 s / 1.82 s / 1.46 x |
| sklearn rf 1G lzma | 258.1 MiB / 98.1 MiB / 2.63 x | 587.17 s / 130.46 s / 0.22 x | 26.33 s / 9.61 s / 0.36 x |
| sklearn gb 2M | 2.2 MiB / 1.1 MiB / 2.08 x | 0.04 s / 0.47 s / 11.31 x | 0.07 s / 0.22 s / 3.25 x |
| sklearn gb 2M lzma | 0.6 MiB / 0.2 MiB / 3.81 x | 1.14 s / 0.64 s / 0.56 x | 0.13 s / 0.28 s / 2.18 x |
| lgbm gbdt 2M | 2.6 MiB / 0.5 MiB / 5.02 x | 0.12 s / 0.47 s / 3.95 x | 0.01 s / 0.50 s / 34.10 x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.72 x | 1.77 s / 0.49 s / 0.28 x | 0.09 s / 0.37 s / 4.23 x |
| lgbm gbdt 5M | 5.3 MiB / 1.0 MiB / 5.37 x | 0.22 s / 0.74 s / 3.33 x | 0.03 s / 0.69 s / 25.00 x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.9 MiB / 1.81 x | 4.33 s / 0.97 s / 0.22 x | 0.17 s / 0.74 s / 4.46 x |
| lgbm gbdt 20M | 22.7 MiB / 3.2 MiB / 7.06 x | 0.90 s / 2.78 s / 3.07 x | 0.12 s / 2.71 s / 22.19 x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.1 MiB / 2.05 x | 23.85 s / 3.83 s / 0.16 x | 0.67 s / 2.66 s / 3.95 x |
| lgbm gbdt 100M | 101.1 MiB / 10.5 MiB / 9.61 x | 3.96 s / 12.34 s / 3.12 x | 0.65 s / 11.50 s / 17.75 x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.0 MiB / 2.55 x | 107.94 s / 17.34 s / 0.16 x | 2.81 s / 12.62 s / 4.49 x |
| lgbm rf 10M | 10.9 MiB / 0.9 MiB / 12.56 x | 0.45 s / 1.10 s / 2.42 x | 0.05 s / 0.69 s / 13.89 x |
| lgbm rf 10M lzma | 0.7 MiB / 0.8 MiB / 0.88 x | 2.45 s / 1.34 s / 0.55 x | 0.14 s / 0.72 s / 5.22 x |
Current state of load performance (`line_profiler` output for `_decompress_booster_handle`):
```
Timer unit: 1e-06 s

Total time: 4.75943 s
File: /Users/ytatar/projects/2302-pickle-compression/slim_trees/lgbm_booster.py
Function: _decompress_booster_handle at line 178

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   178                                           # @profile
   179                                           def _decompress_booster_handle(
   180                                               compressed_state: Tuple[str, bytes, bytes, bytes, str]
   181                                           ) -> str:
   182        10          4.0      0.4      0.0      (
   183        10          0.0      0.0      0.0          front_str,
   184        10          0.0      0.0      0.0          trees_df_bytes,
   185        10          0.0      0.0      0.0          nodes_df_bytes,
   186        10          0.0      0.0      0.0          leaf_value_bytes,
   187        10          0.0      0.0      0.0          back_str,
   188        10          2.0      0.2      0.0      ) = compressed_state
   189        10          4.0      0.4      0.0      assert type(front_str) == str
   190        10          4.0      0.4      0.0      assert type(back_str) == str
   191
   192        10      34970.0   3497.0      0.7      trees_df = pq_bytes_to_df(trees_df_bytes)
   193        10    2020509.0 202050.9     42.5      nodes_df = pq_bytes_to_df(nodes_df_bytes).groupby("tree_idx").agg(lambda x: list(x))
   194        10          2.0      0.2      0.0      leaf_values_df = (
   195        10     357533.0  35753.3      7.5          pq_bytes_to_df(leaf_value_bytes).groupby("tree_idx")["leaf_value"].apply(list)
   196                                               )
   197                                               # merge trees_df, nodes_df, and leaf_values_df on tree_idx
   198        10      11741.0   1174.1      0.2      trees_df = trees_df.merge(nodes_df, on="tree_idx")
   199        10       9704.0    970.4      0.2      trees_df = trees_df.merge(leaf_values_df, on="tree_idx")
   200
   201                                               # handle = front_str
   202
   203        10          3.0      0.3      0.0      tree_strings = [front_str]
   204
   205                                               # TODO: directly go over trees and nodes
   206     20000     620494.0     31.0     13.0      for i, tree in trees_df.iterrows():
   222
   223     20000      79986.0      4.0      1.7          num_leaves = int(tree["num_leaves"])
   224     20000       2578.0      0.1      0.1          num_nodes = num_leaves - 1
   225
   226     20000      26030.0      1.3      0.5          tree_strings.append(f"""Tree={i}
   227     20000      61575.0      3.1      1.3  num_leaves={int(tree["num_leaves"])}
   228     20000      59082.0      3.0      1.2  num_cat={tree['num_cat']}
   229     20000     148784.0      7.4      3.1  split_feature={' '.join([str(x) for x in tree["split_feature"]])}
   230     20000       4304.0      0.2      0.1  split_gain={("0" * num_nodes)[:-1]}
   231     20000     357808.0     17.9      7.5  threshold={' '.join([str(x) for x in tree['threshold']])}
   232     20000     149148.0      7.5      3.1  decision_type={' '.join([str(x) for x in tree["decision_type"]])}
   233     20000     151402.0      7.6      3.2  left_child={" ".join([str(x) for x in tree["left_child"]])}
   234     20000     151562.0      7.6      3.2  right_child={" ".join([str(x) for x in tree["right_child"]])}
   235     20000     369298.0     18.5      7.8  leaf_value={" ".join([str(x) for x in tree["leaf_value"]])}
   236     20000       4519.0      0.2      0.1  leaf_weight={("0 " * num_leaves)[:-1]}
   237     20000       4008.0      0.2      0.1  leaf_count={("0 " * num_leaves)[:-1]}
   238     20000       3483.0      0.2      0.1  internal_value={("0 " * num_nodes)[:-1]}
   239     20000       3363.0      0.2      0.1  internal_weight={("0 " * num_nodes)[:-1]}
   240     20000       3289.0      0.2      0.1  internal_count={("0 " * num_nodes)[:-1]}
   241     20000      61646.0      3.1      1.3  is_linear={tree['is_linear']}
   242     20000      57860.0      2.9      1.2  shrinkage={tree['shrinkage']}
   243
   244
   276
   277        10          1.0      0.1      0.0      tree_strings.append(back_str)
   278
   279                                               # handle += back_str
   280        10       4732.0    473.2      0.1      return "".join(tree_strings)
```
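To make the hotspot concrete: ~42% of the time goes into `groupby("tree_idx").agg(lambda x: list(x))`. Below is a minimal sketch of that pattern and one possible faster variant, assuming the rows arrive sorted by `tree_idx` (the small DataFrame is made up; only the column names follow the profile):

```python
import numpy as np
import pandas as pd

# made-up stand-in for the real per-node table
nodes_df = pd.DataFrame(
    {
        "tree_idx": [0, 0, 0, 1, 1],
        "threshold": [0.5, 1.0, 2.0, 0.25, 0.75],
    }
)

# current approach: a Python-level lambda is called per group and column (slow)
per_tree_lists = nodes_df.groupby("tree_idx").agg(lambda x: list(x))

# possible alternative (assumes rows are already sorted by tree_idx): find the
# group boundaries once and split the underlying numpy arrays without lambdas
boundaries = np.flatnonzero(np.diff(nodes_df["tree_idx"].to_numpy())) + 1
thresholds_per_tree = np.split(nodes_df["threshold"].to_numpy(), boundaries)
```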
I seem to be missing some values that are required for inference (e.g. num_cat is still missing), hence all the 0.0 prediction values. Otherwise, this should work fine.
Fun fact: float printing performance seems to scale with the number of digits:
```
l = [i/17. for i in range(1_000)]

In [2]: %timeit ", ".join(str(x) for x in l)
364 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit ", ".join(map(str, l))
344 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [5]: %timeit ", ".join(f"{x:.20f}" for x in l)
407 µs ± 292 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit ", ".join(f"{x:.15f}" for x in l)
338 µs ± 349 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [7]: %timeit ", ".join(f"{x:.12f}" for x in l)
248 µs ± 364 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit ", ".join(f"{x:.5f}" for x in l)
184 µs ± 286 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
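Applied to the handle reconstruction, that would look roughly like the following; the `.12f` precision is an arbitrary assumption for illustration, and whether a lossy fixed-precision format is acceptable ties into the float32 question below:

```python
thresholds = [i / 17.0 for i in range(1_000)]

# exact round-trip: str() gives the shortest representation that parses back
# to the same float, but is slower for values with long mantissas
exact = " ".join(map(str, thresholds))

# faster but potentially lossy: fixed, shorter precision (".12f" is only an
# example value, not what the PR uses)
approx = " ".join(f"{x:.12f}" for x in thresholds)
```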
Interestingly, combining Parquet with LZMA does not improve the compression much; only the large LGBM gbdt LZMA case shows a clear gain.
Do we want to use float32 for the values in lgbm, for consistency with sklearn?
> Do we want to use float32 for the values in lgbm, for consistency with sklearn?

Should work, I guess. We have tests for prediction performance anyway, though they currently fail because of some parsing issue 😬
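For illustration, restricting e.g. the leaf values to float32 in the Arrow data could look like this (hypothetical values; float32 keeps roughly 7 significant decimal digits, at half the storage size of float64):

```python
import pyarrow as pa

# hypothetical leaf values stored as float64
leaf_values = pa.array([0.123456789012345, -1.5e-07], type=pa.float64())

# halves the column size, but rounds to float32 precision (~7 significant digits)
leaf_values_f32 = leaf_values.cast(pa.float32())
```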
LGBM gbdt large: 3.05x -> 7.15x
LGBM gbdt large LZMA: 2.29x -> 2.54x
While this is a great compression improvement when no LZMA is applied at the end, the gain is much smaller when LZMA is used anyway. Maybe it makes sense to add an option to fall back to the old implementation instead of the pyarrow path when pyarrow is not installed?
But I suggest moving this discussion to when the PR is almost ready to merge.
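For later reference, a minimal sketch of such a fallback, with hypothetical function names for the two code paths (not the actual slim_trees API):

```python
from typing import Any

try:
    import pyarrow  # noqa: F401

    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


def _compress_parquet(state: Any) -> bytes:  # hypothetical new Parquet-based path
    raise NotImplementedError


def _compress_strings(state: Any) -> bytes:  # hypothetical old string-based path
    raise NotImplementedError


def compress_booster_state(state: Any) -> bytes:
    """Use the Parquet path when pyarrow is available, otherwise fall back."""
    return _compress_parquet(state) if HAS_PYARROW else _compress_strings(state)
```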
The table suggests that the compression factor improves especially for larger models. Unfortunately, we have lost some of that again; I seem to remember it was partially up to 11x.
Old (each cell: base results / our results / change):
| Model | Size | Dump Time | Load Time |
|---|---|---|---|
| lgbm gbdt 2M | 2.6 MiB / 1.0 MiB / 2.78 x | 0.09 s / 0.25 s / 2.78 x | 0.01 s / 0.15 s / 12.14 x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.90 x | 1.31 s / 0.51 s / 0.39 x | 0.08 s / 0.22 s / 2.73 x |
| lgbm gbdt 5M | 5.3 MiB / 1.9 MiB / 2.81 x | 0.18 s / 0.51 s / 2.82 x | 0.03 s / 0.32 s / 12.54 x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.8 MiB / 1.96 x | 3.12 s / 1.07 s / 0.34 x | 0.15 s / 0.38 s / 2.55 x |
| lgbm gbdt 20M | 22.7 MiB / 7.6 MiB / 3.00 x | 0.72 s / 2.11 s / 2.95 x | 0.11 s / 1.32 s / 11.83 x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.0 MiB / 2.09 x | 16.68 s / 4.86 s / 0.29 x | 0.59 s / 1.54 s / 2.61 x |
| lgbm gbdt 100M | 101.1 MiB / 33.0 MiB / 3.06 x | 3.23 s / 9.39 s / 2.91 x | 0.53 s / 116.98 s / 219.18 x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.6 MiB / 2.41 x | 83.14 s / 23.69 s / 0.28 x | 2.44 s / 98.06 s / 40.16 x |
| lgbm rf 10M | 10.9 MiB / 3.2 MiB / 3.46 x | 0.36 s / 0.69 s / 1.92 x | 0.04 s / 0.56 s / 12.54 x |
| lgbm rf 10M lzma | 0.7 MiB / 0.4 MiB / 1.85 x | 1.87 s / 0.95 s / 0.51 x | 0.12 s / 0.58 s / 4.84 x |
New (each cell: base results / our results / change):
| Model | Size | Dump Time | Load Time |
|---|---|---|---|
| lgbm gbdt 2M | 2.6 MiB / 0.5 MiB / 5.02 x | 0.09 s / 0.37 s / 4.08 x | 0.01 s / 0.30 s / 24.32 x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.72 x | 1.33 s / 0.41 s / 0.31 x | 0.08 s / 0.26 s / 3.31 x |
| lgbm gbdt 5M | 5.3 MiB / 1.0 MiB / 5.37 x | 0.18 s / 0.56 s / 3.15 x | 0.02 s / 0.45 s / 17.93 x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.9 MiB / 1.81 x | 3.26 s / 0.78 s / 0.24 x | 0.15 s / 0.50 s / 3.38 x |
| lgbm gbdt 20M | 22.7 MiB / 3.2 MiB / 7.06 x | 0.71 s / 2.16 s / 3.03 x | 0.11 s / 1.84 s / 16.68 x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.1 MiB / 2.05 x | 17.84 s / 3.02 s / 0.17 x | 0.59 s / 1.89 s / 3.19 x |
| lgbm gbdt 100M | 101.1 MiB / 10.5 MiB / 9.61 x | 3.17 s / 9.45 s / 2.98 x | 0.51 s / 7.94 s / 15.59 x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.0 MiB / 2.55 x | 81.52 s / 13.03 s / 0.16 x | 2.43 s / 8.11 s / 3.34 x |
| lgbm rf 10M | 10.9 MiB / 0.9 MiB / 12.56 x | 0.36 s / 0.88 s / 2.45 x | 0.04 s / 0.46 s / 10.66 x |
| lgbm rf 10M lzma | 0.7 MiB / 0.8 MiB / 0.88 x | 1.91 s / 1.09 s / 0.57 x | 0.12 s / 0.51 s / 4.25 x |
As discussed with @pavelzw today, how about splitting the non-Parquet parts into a separate PR? It's annoying that Parquet doesn't improve the compression ratio despite the effort you have put into this, @YYYasin19. But let's make the best of it by keeping the other parts!