cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[BUG] `CompactProtocolFieldWriter` does not write empty value string in key-value pair

Open ttnghia opened this issue 2 years ago • 7 comments

During debugging a Parquet issue, I found that the Parquet metadata is not correctly written by cudf. In particular, in a round trip test, a pair of metadata {"key_str", ""} is written then read as {"key_str", null} in Spark. This is due to the CompactProtocolFieldWriter does not write empty value string:

size_t CompactProtocolWriter::write(KeyValue const& k)
{
  CompactProtocolFieldWriter c(*this);
  c.field_string(1, k.key);
  if (not k.value.empty()) { c.field_string(2, k.value); }                 <================ here
  return c.value();
}

We should write all value strings even for empty value.

However, I'm not sure if this is the expected behavior of compact protocol?

ttnghia avatar Aug 31 '23 18:08 ttnghia

The value in KeyValue is an optional field in the spec, so it's correct to write nothing if it isn't set. In #14000 I've started using std::optional for optional fields, to distinguish from merely empty but present fields. Perhaps that could be done throughout the Parquet thrift objects.

etseidl avatar Aug 31 '23 20:08 etseidl

@etseidl I assume that the key-value pair is optional, but is it okay to include the key without the value?

vuule avatar Aug 31 '23 20:08 vuule

@etseidl I assume that the key-value pair is optional, but is it okay to include the key without the value?

According to the thrift idl, yes.

 struct KeyValue {
  1: required string key
  2: optional string value
}

Also optional in FileMetaData

struct FileMetaData {
  ...
  /** Optional key/value metadata **/
  5: optional list<KeyValue> key_value_metadata
}

etseidl avatar Aug 31 '23 20:08 etseidl

That's interesting. Then such "optional" should mean the user can specify it as an optional, instead of the current behavior (the library ignores writing empty string without the user's knowledge). So #14000 should be the way to go instead.

ttnghia avatar Aug 31 '23 22:08 ttnghia

That's interesting. Then such "optional" should mean the user can specify it as an optional, instead of the current behavior (the library ignores writing empty string without the user's knowledge). So #14000 should be the way to go instead.

Yes, I think making more use of optional would make things clearer (although I didn't notice someone beat me to it :sweat_smile: @PointKernel apparently) https://github.com/rapidsai/cudf/blob/ad9fa501192332ca8ce310ffe967473ec0945a97/cpp/src/io/parquet/compact_protocol_reader.hpp#L265

etseidl avatar Aug 31 '23 23:08 etseidl

Yeah, using std::optional to properly handle optional field reading/writing was on my TODO list but then got distracted by other tasks.

PointKernel avatar Sep 05 '23 21:09 PointKernel

This issue has been at least partially addressed. We need to make another pass over CompactProtocolReader::read functions to verify that all optional fields are treated as such.

vuule avatar Feb 16 '24 00:02 vuule