
Add support for writing offsets array for Parquet file writing

Open bmcdonald3 opened this issue 2 years ago • 3 comments

HDF5 has a save_offsets argument that, when writing a strings array, saves the offsets array as well; the saved offsets can then be loaded later to make string reading faster, which is enabled by the calc_strings_offset argument for reading.

This is not supported for Parquet string reading/writing today. However, calculating the offsets array is not nearly as taxing for Parquet: since we have to calculate the byte sizes prior to reading the file anyway, the offsets array costs very little extra execution time to compute.
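To make the "costs very little" point concrete, here is a minimal sketch (not Arkouda's actual implementation) of how offsets fall out of the per-string byte sizes that a Parquet reader must compute anyway:

```python
from itertools import accumulate

# Hypothetical illustration: given the per-string byte lengths a Parquet
# reader already knows, the offsets array is just an exclusive prefix
# sum -- a single cheap O(n) pass over the lengths.
lengths = [5, 3, 7, 4]  # byte length of each string (example values)
offsets = [0] + list(accumulate(lengths[:-1]))
print(offsets)  # [0, 5, 8, 15]
```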

For those reasons, this is being put off, but it could be implemented if a request comes in (or if we just want greater parity with the HDF5 IO).

bmcdonald3 avatar Mar 18 '22 17:03 bmcdonald3

@bmcdonald3 I think it's ok to not save the offsets for Parquet. We saved them initially for HDF5 and maybe we didn't really need to... what do you think @reuster986?

mhmerrill avatar May 02 '22 16:05 mhmerrill

Hmm, I thought that we were saving the offsets in HDF5...? I'm extremely uncomfortable with relying on nulls to mark the boundaries between strings, because user data could in principle contain nulls and break everything. But I'm probably just paranoid.
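The concern above can be shown in a few lines. This is an illustrative sketch (not Arkouda's parsing code) of what goes wrong if string boundaries are recovered by splitting a flat byte buffer on null bytes:

```python
# If boundaries are inferred from null delimiters, a value that itself
# contains a null byte silently changes the number of recovered strings.
clean = b"foo\x00bar\x00"
dirty = b"fo\x00o\x00bar\x00"  # first value contains an embedded null

print(clean.split(b"\x00")[:-1])  # [b'foo', b'bar']       -- 2 strings, correct
print(dirty.split(b"\x00")[:-1])  # [b'fo', b'o', b'bar']  -- 3 strings, wrong
```

With an explicit offsets column, the second case would still round-trip correctly, since boundaries no longer depend on the data's contents.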

reuster986 avatar May 02 '22 20:05 reuster986

@reuster986 I believe we are saving offsets in HDF5, but I may be wrong. Are you worried about escaped null characters in UTF-8 strings? I believed a null UTF-8 character signals the end of a string... but now I think I am wrong ;-)

I found this: "In all modern character sets, the null character has a code point value of zero. In most encodings, this is translated to a single code unit with a zero value. For instance, in UTF-8 it is a single zero byte. However, in Modified UTF-8 the null character is encoded as two bytes: 0xC0, 0x80."

"The code 0x0000 is the Unicode string terminator for a null-terminated string. A single null byte is not sufficient for this code, because many Unicode characters contain null bytes as either the high or the low byte."

I am confused now...
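For what it's worth, the property in question can be checked directly: in standard UTF-8 (unlike UTF-16, which the second quote is describing, and unlike Modified UTF-8), no code point other than U+0000 ever encodes to a sequence containing a zero byte, so a zero byte in a UTF-8 buffer is unambiguous:

```python
# Verify that standard UTF-8 produces a zero byte only when encoding
# U+0000 itself. We skip the surrogate range U+D800..U+DFFF, which is
# not encodable as UTF-8.
assert all(
    0 not in chr(cp).encode("utf-8")
    for cp in range(1, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF
)
print("no zero bytes in any non-NUL UTF-8 encoding")
```

The reuster986 concern still stands, though: user data that legitimately contains U+0000 would break null-delimited boundaries.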

mhmerrill avatar May 03 '22 13:05 mhmerrill

I believe we intend to remove null termination moving forward, although this may present a lot of issues with old data. Based on user requests for Parquet to move away from functioning like HDF5, I am not sure this is something we should do. I have also looked at mimicking the idea of an HDF5 group using directories, but I am a little worried that this would add unnecessary complexity. @pierce314159 - do you have any thoughts on this?

Ethan-DeBandi99 avatar Nov 21 '22 18:11 Ethan-DeBandi99

Since we are writing the values as a column of ByteArrays, we could simply write another column to the file containing the offsets. This would add some complexity to the naming or metadata configuration, because we would need to be able to identify the values and segments columns and ensure that we can access both. If this is still something we want to do, it should not be a problem, but it would add a step that requires writing and reading two columns.
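A rough sketch of the two-column idea, with a naming convention used purely for illustration (these names are not Arkouda's actual schema):

```python
# Hypothetical layout: one flat values column plus a companion offsets
# column, paired by a "_values"/"_offsets" suffix convention.
columns = {
    "strings_values": b"foobarbaz",
    "strings_offsets": [0, 3, 6],
}

def reconstruct(cols, name):
    """Recover the individual strings by pairing a values column with
    its companion offsets column."""
    data = cols[name + "_values"]
    bounds = cols[name + "_offsets"] + [len(data)]
    return [data[start:end] for start, end in zip(bounds, bounds[1:])]

print(reconstruct(columns, "strings"))  # [b'foo', b'bar', b'baz']
```

The main cost, as noted above, is that the reader must know the pairing convention and fetch both columns together.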

I am still not 100% sure the added complexity is worth it unless the performance gain is significant, since we can already determine the offsets from the data. Additionally, the same applies to SegArrays in Parquet, and we should be sure to maintain the same abilities for both.

Ethan-DeBandi99 avatar Apr 07 '23 19:04 Ethan-DeBandi99