
Philosophical question regarding `row_group_offsets` evaluation in `write`

Open yohplala opened this issue 3 years ago • 2 comments

Hello,

I am wondering about the code in `write` that computes the row group offsets when `row_group_offsets` is passed as an integer row group size:

            l = len(data)                                            #1
            nparts = max((l - 1) // row_group_offsets + 1, 1)        #2
            chunksize = max(min((l - 1) // nparts + 1, l), 1)        #3
            row_group_offsets = list(range(0, l, chunksize))
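
For concreteness, here is how those lines behave on a small made-up input (10 rows, requested row group size of 3; the values are only for illustration):

    l = 10                 # rows in data
    row_group_offsets = 3  # requested row group size
    nparts = max((l - 1) // row_group_offsets + 1, 1)  # max(9 // 3 + 1, 1) -> 4
    chunksize = max(min((l - 1) // nparts + 1, l), 1)  # max(min(9 // 4 + 1, 10), 1) -> 3
    list(range(0, l, chunksize))                       # [0, 3, 6, 9]: groups of 3, 3, 3, 1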

Given that data necessarily has at least 1 row:

  • if data has exactly 1 row, line #2 evaluates max(1, 1);
  • whatever the number of rows in data and whatever the value of row_group_offsets, the minimum value of (l - 1) // row_group_offsets + 1 is necessarily 1 (with l >= 1, l - 1 >= 0, so the floor division yields at least 0 and the + 1 makes the result at least 1). So why the max at line #2? It could simply read: nparts = (l - 1) // row_group_offsets + 1

I see the same situation at line #3: min((l - 1) // nparts + 1, l) is necessarily at least 1, so the outer max looks redundant there too, and the line could read directly: chunksize = min((l - 1) // nparts + 1, l). A quick empirical check of both claims is sketched below.
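
A minimal sketch of such a check, assuming only the premise above (at least 1 row and a strictly positive integer row_group_offsets):

    for l in range(1, 1000):
        for rgo in range(1, 60):
            # line #2 without the max yields the same nparts whenever l >= 1
            nparts = (l - 1) // rgo + 1
            assert nparts == max((l - 1) // rgo + 1, 1)
            # line #3 without the max yields the same chunksize as well
            assert min((l - 1) // nparts + 1, l) == max(min((l - 1) // nparts + 1, l), 1)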

If I am missing something, I apologize. Thanks for your feedback. Best regards

yohplala avatar Dec 18 '21 18:12 yohplala

I don't know; this code is very old. You could put it in a helper function with dedicated tests, to ensure that it behaves as expected over a range of inputs, with or without the possibly redundant max. On the other hand, it's not causing any problems as it is, right? Maybe there was a wish to deal with zero-length dataframes?

martindurant avatar Dec 20 '21 14:12 martindurant

You could put it in a helper function with dedicated tests, to ensure that it behaves as expected over a range of inputs, with or without the possibly redundant max.

Yes, I will do that after we are done with PR #712. In that PR, this part of the code has indeed been moved into a separate function (iter_dataframe()). If integrated, it will indeed be easy to check how the dataframe is split in the different cases.
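
For illustration, a sketch of what such a helper with dedicated tests could look like (the name compute_row_group_offsets is hypothetical, not the actual iter_dataframe() from PR #712):

    def compute_row_group_offsets(nrows, row_group_size):
        """Hypothetical helper replicating the logic quoted above."""
        nparts = max((nrows - 1) // row_group_size + 1, 1)
        chunksize = max(min((nrows - 1) // nparts + 1, nrows), 1)
        return list(range(0, nrows, chunksize))

    # Dedicated tests could then pin down the edge cases directly:
    assert compute_row_group_offsets(0, 50) == []         # zero-length dataframe
    assert compute_row_group_offsets(1, 50) == [0]        # single row
    assert compute_row_group_offsets(100, 50) == [0, 50]  # exact multiple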

On the other hand, it's not causing any problems as it is, right? Maybe there was a wish to deal with zero-length dataframes?

As far as I can see, no, it is not causing any problem, but it makes the code harder to read. I was trying to understand it simply by reasoning it through, and it is not so direct; in the end, I do think these max calls do not bring anything for the at-least-one-row case. As said, leaving this open, and I will check after we are done with PR #712.
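
That said, a quick check of the zero-length case in plain Python suggests the max at line #2 is precisely what keeps l = 0 from crashing, so a dedicated test should cover it:

    l, rgo = 0, 50_000_000
    nparts_no_max = (l - 1) // rgo + 1  # -1 // 50_000_000 + 1 == 0
    # with nparts == 0, line #3 would raise ZeroDivisionError; and without
    # the max at line #3, chunksize would be 0 and range(0, l, 0) would
    # raise ValueError.
    nparts = max(nparts_no_max, 1)                     # == 1
    chunksize = max(min((l - 1) // nparts + 1, l), 1)  # == 1
    assert list(range(0, l, chunksize)) == []          # no row groups to write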

yohplala avatar Dec 26 '21 17:12 yohplala