dask-snowflake
specifying a partition size breaks down with larger datasets
In #39, one of the tests I added uses a 12-month dataset instead of a 1-month dataset. When we try to fetch the data in 2 MiB partitions, we're generally successful with the smaller dataset. But the partition sizes are consistently wrong in both directions with the larger dataset.
Copy/pasted from here:
N.B. -- the check that we perform is comparing actual partition sizes to 2x the requested partition size.
(Pdb) from dask.utils import format_bytes
(Pdb) partition_sizes.map(format_bytes).to_frame("result").assign(expected="2 MiB")
        result  expected
0     1.60 MiB     2 MiB
1     1.71 MiB     2 MiB
2     2.18 MiB     2 MiB
3     3.51 MiB     2 MiB
4     1.60 MiB     2 MiB
5     1.71 MiB     2 MiB
6     2.18 MiB     2 MiB
7     4.36 MiB     2 MiB
8     1.39 MiB     2 MiB
9   875.77 kiB     2 MiB
10    1.71 MiB     2 MiB
11    2.18 MiB     2 MiB
12    3.72 MiB     2 MiB
13    1.60 MiB     2 MiB
14    1.70 MiB     2 MiB
15    2.18 MiB     2 MiB
16    1.69 MiB     2 MiB
17    1.28 MiB     2 MiB
18    1.70 MiB     2 MiB
19    2.18 MiB     2 MiB
20    4.37 MiB     2 MiB
21    1.82 MiB     2 MiB
22    1.70 MiB     2 MiB
23    2.18 MiB     2 MiB
24    3.30 MiB     2 MiB
25    1.60 MiB     2 MiB
26    1.71 MiB     2 MiB
27    2.18 MiB     2 MiB
28    4.37 MiB     2 MiB
29    2.79 MiB     2 MiB
30    1.60 MiB     2 MiB
31    1.71 MiB     2 MiB
32    2.18 MiB     2 MiB
33    4.37 MiB     2 MiB
34    1.29 MiB     2 MiB
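For anyone reading along, the check mentioned in the N.B. can be sketched roughly as below. This is a minimal illustration, not the actual test code: the helper name `oversized_partitions` and the `factor` parameter are hypothetical, and it assumes the check is a simple upper bound at 2x the requested size. The sample sizes are the first eight rows of the output above, converted back from the rounded MiB values. In the real test the sizes would come from the Dask DataFrame itself (possibly via something like `memory_usage_per_partition`, though that is a guess on my part).

```python
MIB = 2**20  # bytes in one mebibyte


def oversized_partitions(sizes_bytes, requested_bytes, factor=2):
    """Return indices of partitions larger than factor * requested_bytes.

    Hypothetical helper: assumes the check is only an upper bound.
    """
    limit = factor * requested_bytes
    return [i for i, size in enumerate(sizes_bytes) if size > limit]


# First eight partition sizes from the output above (rounded, in MiB).
sample_sizes_mib = [1.60, 1.71, 2.18, 3.51, 1.60, 1.71, 2.18, 4.36]
sample_sizes = [int(s * MIB) for s in sample_sizes_mib]

# Requested partition size is 2 MiB, so the limit is 4 MiB.
flagged = oversized_partitions(sample_sizes, requested_bytes=2 * MIB)
print(flagged)
```

With a 2x tolerance, only partitions above 4 MiB are flagged, so a partition like 3.51 MiB passes even though it is well over the requested 2 MiB.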
I'll start a PR to investigate this further.