fastparquet
Philosophical question regarding `row_group_offsets` evaluation in `write`
Hello,
I am wondering about the code that derives row group offsets from a row group size in `write`:
```python
l = len(data)                                        #1
nparts = max((l - 1) // row_group_offsets + 1, 1)    #2
chunksize = max(min((l - 1) // nparts + 1, l), 1)    #3
row_group_offsets = list(range(0, l, chunksize))
```
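For illustration, here is a small worked example with assumed values (a 10-row frame and a requested row-group size of 3), showing what these lines compute:

```python
# Worked example with assumed values: 10 rows, requested row-group size of 3.
l = 10
row_group_offsets = 3                                     # requested size (an int)
nparts = max((l - 1) // row_group_offsets + 1, 1)         # -> 4
chunksize = max(min((l - 1) // nparts + 1, l), 1)         # -> 3
row_group_offsets = list(range(0, l, chunksize))          # -> [0, 3, 6, 9]
```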
`data` necessarily being at least 1 row:
- if it is 1 row, line 2 reads `max(1, 1)`;
- whatever the number of rows in `data`, and whatever `row_group_offsets`, the minimum value of `(l - 1) // row_group_offsets + 1` is necessarily 1, as far as I can figure out.

So why a `max` at line 2? Line 2 could simply read: `nparts = (l - 1) // row_group_offsets + 1`
I see the same situation with line 3: `min((l - 1) // nparts + 1, l)` is necessarily at least 1.
So why a `max`?
It seems line 3 could directly read:
`chunksize = min((l - 1) // nparts + 1, l)`
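A quick brute-force check (a sketch only, not fastparquet code; the bounds are chosen arbitrarily) seems to confirm both claims for any non-empty frame:

```python
# Sketch: verify that, for l >= 1, the two max(..., 1) calls are no-ops.
# "size" plays the role of the integer row_group_offsets argument.
for l in range(1, 200):
    for size in range(1, 200):
        nparts = (l - 1) // size + 1                # line 2 without the max
        assert nparts >= 1                          # so max(nparts, 1) == nparts
        chunksize = min((l - 1) // nparts + 1, l)   # line 3 without the max
        assert chunksize >= 1                       # so max(chunksize, 1) == chunksize
```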
If I am missing something, I am sorry. Thanks for your feedback. Best,
I don't know, this code is very old. You could put it in a helper function with dedicated tests, to ensure that it behaves as expected with a range of inputs, with or without the possibly redundant `max`. On the other hand, it's not causing any problems as it is, right? Maybe there was a wish to deal with zero-length dataframes?
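Something along these lines, perhaps (a sketch only; the name `_row_group_offsets` and the test values are hypothetical, not fastparquet's actual API):

```python
def _row_group_offsets(n_rows, row_group_size):
    """Hypothetical helper wrapping the snippet above: compute row-group
    start offsets for a frame of n_rows and an integer row-group size."""
    nparts = max((n_rows - 1) // row_group_size + 1, 1)
    chunksize = max(min((n_rows - 1) // nparts + 1, n_rows), 1)
    return list(range(0, n_rows, chunksize))


def test_row_group_offsets():
    assert _row_group_offsets(1, 50_000_000) == [0]    # single row -> one group
    assert _row_group_offsets(10, 3) == [0, 3, 6, 9]   # uneven split
    assert _row_group_offsets(0, 50_000_000) == []     # zero-length frame -> no groups
```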
> You could put it in a helper function with dedicated tests, to ensure that it behaves as expected with a range of inputs, with or without the possibly redundant `max`.
Yes, will do after we are done with PR #712.
In this PR, this part of the code has indeed been moved into a separate function (`iter_dataframe()`). If integrated, it will indeed be easy to check how the dataframe is split for different cases.
> On the other hand, it's not causing any problems as it is, right? Maybe there was a wish to deal with zero-length dataframes?
As far as I can see, no, it is not causing any problems, but it makes the code more complex to read. I was trying to understand it simply by reasoning it through in my head, and it is not so direct; in the end, I do think that these `max` calls do not bring anything, even for the case of a zero-length dataframe, actually.
As said, I am leaving this open and will check after we are done with PR #712.