[BUG] Parquet column selection by name with schemas including list<struct<X, Y>> does not work.

Open nvdbaranec opened this issue 2 years ago • 5 comments

If you have a schema that contains a list-of-struct, selecting a subset of the inner columns doesn't work. Example:

list<struct<int, float>>

If the schema for this column was

A           (list)
   B        (struct)
       C    (int)
       D    (float)

Attempting to select "A.B.C" would not work. I believe this is being caused by some schema preprocessing that we are doing that is injecting fake schema elements to ease schema interpretation. Essentially we see a schema that looks like this:

A            (list)
  list       (the fake element)
     B       (struct)
        C    (int)
        D    (float)

So "A.B.C" doesn't actually exist, only "A.list.B.C" and the code returns 0 columns.
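To make the failure mode concrete, here is a minimal sketch of name-by-name path resolution against the two-level view of the schema. The nested-dict model and the `resolve` helper are hypothetical illustrations, not cudf's actual data structures or code.

```python
# Hypothetical in-memory model of the schema tree (not cudf's real structures).
SCHEMA = {
    "A": {                  # list
        "list": {           # intermediate element between the list and its struct
            "B": {          # struct
                "C": "int",
                "D": "float",
            }
        }
    }
}


def resolve(schema, dotted_path):
    """Walk the tree one name at a time; return None as soon as a name is missing."""
    node = schema
    for name in dotted_path.split("."):
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


print(resolve(SCHEMA, "A.B.C"))       # None: "B" is not a direct child of "A"
print(resolve(SCHEMA, "A.list.B.C"))  # int
```

With strict matching, every level of the tree has to be spelled out, so a selection that omits the intermediate element matches nothing.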

nvdbaranec avatar Nov 30 '23 21:11 nvdbaranec

Actually, upon further review, this mystery "list" element is in the parquet file itself (it's one of the odd ways in which the spec allows you to specify list columns). The question here, though, is what a user would expect the correct way to select these columns to be. For Pandas or Spark, would you expect to have to put "list" in there when selecting a subset of columns? @jlowe @shwina

nvdbaranec avatar Nov 30 '23 22:11 nvdbaranec

The schema for this part of the file is

  optional group field_id=-1 func_params (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 item {
        optional int32 field_id=-1 order;
        optional int32 field_id=-1 size;
        optional binary field_id=-1 type (String);
      }
    }
  }

hyperbolic2346 avatar Nov 30 '23 22:11 hyperbolic2346

Unfortunately, unless you can normalize the schema, it is not clear, because there are multiple ways to encode a list in the schema and no single one is "required".

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

Ideally the repeated group is called "list" but

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules

gives a lot of other options
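One way to picture "normalizing" here is a resolver that does not depend on what the repeated group is called. The sketch below is an assumption-laden illustration: the `Node` class and `resolve_lenient` are hypothetical names, and the logic only skips a single repeated child of a LIST-annotated group. It does not implement the full backward-compatibility rules from the spec.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical schema node; not a real Parquet or cudf type."""
    name: str
    repetition: str = "optional"          # "required", "optional" or "repeated"
    logical_type: Optional[str] = None    # e.g. "LIST"
    children: List["Node"] = field(default_factory=list)

    def child(self, name):
        return next((c for c in self.children if c.name == name), None)


def resolve_lenient(root, dotted_path):
    """Resolve a path, letting the caller omit the repeated group under a LIST node."""
    node = root
    for name in dotted_path.split("."):
        nxt = node.child(name)
        if nxt is None and node.logical_type == "LIST" and len(node.children) == 1:
            wrapper = node.children[0]
            if wrapper.repetition == "repeated":
                # Look through the wrapper regardless of its name.
                nxt = wrapper.child(name)
        if nxt is None:
            return None
        node = nxt
    return node


# The func_params schema quoted earlier in the thread, rebuilt by hand.
item = Node("item", children=[Node("order"), Node("size"), Node("type")])
func_params = Node(
    "func_params",
    logical_type="LIST",
    children=[Node("list", repetition="repeated", children=[item])],
)
ROOT = Node("schema", repetition="required", children=[func_params])

print(resolve_lenient(ROOT, "func_params.item.order").name)       # order
print(resolve_lenient(ROOT, "func_params.list.item.order").name)  # order
```

Both spellings resolve to the same leaf, so a user would not need to know how the writer chose to name the repeated group. Whether that leniency is the desired behavior is exactly the open question in this thread.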

revans2 avatar Nov 30 '23 22:11 revans2

Right, that's the question: are these details something you'd expect the end user to know or care about, or would they just expect "A.B.C"? Maybe this is a what-would-Pandas-do question.

nvdbaranec avatar Nov 30 '23 22:11 nvdbaranec

@nvdbaranec It's been a few years, but I believe the way to query in the above situation is to use explode to convert the list to separate rows. If there were another column at the top of the hierarchy ('X'), then the value for 'X' would be repeated for each new row that the list 'A' was exploded into. Here's a pyspark query I did years ago against the data @hyperbolic2346 quoted above:

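# Assumes `spark` is an active SparkSession and `df` is the DataFrame read from the file.
df.createOrReplaceTempView("asm")

# Explode basic_blocks into one row per block, keep only the 'no_scrub' sources,
# then flatten the resulting asm arrays.
sql = """
select func_name, func_addr_start, blk_addr_start, blk_id, flatten(sources.asm) as asm from (
  select func_name, func_addr_start, bb.blk_addr_start, bb.blk_id, filter(bb.sources,x->x.asm_scrub_type = 'no_scrub') as sources
    from (select func_name, func_addr_start, explode(basic_blocks) as bb from asm))
where func_name='introduce'
"""

result = spark.sql(sql)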
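For readers unfamiliar with explode, the row-repetition behavior described above can be sketched in plain Python, without Spark. The `explode` helper and the sample rows are made up for illustration. Like Spark's `explode`, this version produces no output row for an empty list.

```python
def explode(rows, list_column):
    """Emit one output row per list element, repeating the other columns."""
    out = []
    for row in rows:
        for element in row[list_column]:
            new_row = dict(row)          # copy the sibling columns, e.g. "X"
            new_row[list_column] = element
            out.append(new_row)
    return out


# Made-up data shaped like the A / B / C / D example: "A" is a list of structs.
rows = [
    {"X": "x0", "A": [{"C": 1, "D": 1.5}, {"C": 2, "D": 2.5}]},
    {"X": "x1", "A": [{"C": 3, "D": 3.5}]},
]

exploded = explode(rows, "A")

# After exploding, the struct field is reachable without naming the list wrapper.
selected = [{"X": r["X"], "C": r["A"]["C"]} for r in exploded]
print(selected)
# [{'X': 'x0', 'C': 1}, {'X': 'x0', 'C': 2}, {'X': 'x1', 'C': 3}]
```

Note that this changes the row count of the result, so it answers a different question than selecting "A.B.C" as a nested column at read time.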

etseidl avatar Dec 05 '23 00:12 etseidl