parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

Open asfimport opened this issue 6 years ago • 13 comments

When using the Avro schema below to write a parquet(1.8.1) file and then read back by using parquet 1.10.1 without passing any schema, the reading throws an exception "XXX is not a group" . Reading through parquet 1.8.1 is fine. 

           {

              "name": "phones",

              "type": [

                "null",

                {

                  "type": "array",

                  "items": {

                    "type": "record",

                    "name": "phones_items",

                    "fields": [

                      

{                         "name": "phone_number",                         "type": [                           "null",                           "string"                         ],                         "default": null                       }

                    ]

                  }

                }

              ],

              "default": null

            }

The code to read is as below 

     val reader = AvroParquetReader.builderSomeRecordType.withConf(new   Configuration).build()

    reader.read()

PARQUET-651 changed the method isElementType() by relying on Avro's checkReaderWriterCompatibility() to check the compatibility. However, checkReaderWriterCompatibility() consider the ParquetSchema and the AvroSchema(converted from File schema) as not compatible(the name in avro schema is ‘phones_items’, but the name is ‘array’ in Parquet schema, hence not compatible) . Hence return false and caused the “phone_number” field in the above schema to be considered as group type which is not true. Then the exception throws as .asGroupType(). 

I didn’t try writing via parquet 1.10.1 would reproduce the same problem or not. But it could because the translation of Avro schema to Parquet schema is not changed(didn’t verify yet). 

 I hesitate to revert PARQUET-651 because it solved several problems. I would like to hear the community's thoughts on it. 

Reporter: Xinli Shang / @shangxinli Assignee: Xinli Shang / @shangxinli

Related issues:

Note: This issue was originally created as PARQUET-1681. Please see the migration documentation for further details.

asfimport avatar Oct 18 '19 19:10 asfimport

Gabor Szadovszky / @gszadovszky: I am removing the target 1.11.0 as this is an improvement and not a regression since 1.10.0.

asfimport avatar Nov 07 '19 08:11 asfimport

Xinli Shang / @shangxinli: Hi @rdblue, do you still remember or document it somewhere what are the several problems that  PARQUET-651 solved? Is SPARK-16344 the only problem sovled? I am evaluating should we roll back PARQUET-651  or roll forward. 

asfimport avatar Nov 07 '19 18:11 asfimport

Ryan Blue / @rdblue: The Avro check should ignore record names if the record is the root. Has this check changed in Avro recently?

asfimport avatar Nov 07 '19 18:11 asfimport

Ryan Blue / @rdblue: I think we should be able to work around this instead of reverting PARQUET-651. If the compatibility check requires that the name matches, then we should be able to ensure that the name matches when converting the Parquet schema to Avro.

asfimport avatar Nov 07 '19 18:11 asfimport

Ryan Blue / @rdblue: Looks like it might be https://issues.apache.org/jira/browse/AVRO-2400.

asfimport avatar Nov 07 '19 18:11 asfimport

Fokko Driesprong / @Fokko: @rdblue AVRO-2400 was a huge regression bug indeed. But this is with Avro 1.9.x which hasn't been released with Parquet. Parquet 1.10.1 still runs on Avro 1.8.2. Parquet 1.11.0 is the first version that targets Avro 1.9.1.

This ticket states that 1.9.1 also is affected, but this bug has been resolved in there. [[email protected]] Did you check with 1.9.1 as well? Would it be possible to deduct a unit test? So we can reproduce the bug in the CI

asfimport avatar Nov 08 '19 07:11 asfimport

Xinli Shang / @shangxinli: @rdblue, I verified that this issue is not related with avro-2400 by reverting the avro version to 1.8.2 which still reproduced. 

@Fokko, I checked the Parquet-1.9.1. It seems PARQUET-651  has been there in 1.9.1. So it should be affected but not verified it yet. 

I just wrote a small unit test that can reproduce the issue and I also came up with a change(might need to be polished, just for proof of concept now) that solved the issue https://github.com/shangxinli/parquet-mr/commit/f80469f55b83404ea334ee4019f658ecdb5ac575. I have explanation below for this fix. But this change won't fix the parquet data files that were already written without this fix. Reverting -PARQUET-651-  (with a small fix(https://github.com/shangxinli/parquet-mr/commit/10875f5b634466e1a58a615bb21f9af426275247), which I believe it is a bug in 1.8.1) can solve it ultimately.

So I would like to hear the community's thoughts on which route we want to go? 

Here are explanations of the issue and the fix. 

Root cause: org.apache.parquet.avro.AvroSchemaConverter() is a lossy conversion. Consider the avro schema below. 

{"type":"record","name":"myrecord","namespace":"org.apache.parquet.avro","fields":[ {"name":"flatrecord","type":"int"},{"name":"nestedrecord","type":{"type":"record","name":"ignored","namespace":"","fields":[{"name":"nestedint1","type":"int"},{"name":"nestedint2","type":"int"}]}}]}

When I convert this avro schema to parquet schema and then convert it back avro schema, you can see "ignored" is lost and replaced with "nestedrecord" which is it's outlayer name. This is why checkReaderWriterCompatibility() return false and caused the issue I mentioned originally. This is the avro schema converted back from parquet schema. 

{"type":"record","name":"myrecord","namespace":"org.apache.parquet.avro","fields":[{"name":"flatrecord","type":"int"},{"name":"nestedrecord","type":{"type":"record","name":"nestedrecord","namespace":"","fields": [{"name":"nestedint1","type":"int"},{"name":"nestedint2","type":"int"}]}}]}

The change I mentioned above was to extend the parquet schema with a metadata field. This metadat field can be used for general purpose. In this case we use it to contain the inner layer and restore it when the schema is convert it back to avro schema. 

 

With the change, the converted avro schema is exactly as the original one. 

{"type":"record","name":"myrecord","namespace":"org.apache.parquet.avro","fields":[{"name":"flatrecord","type":"int"},{"name":"nestedrecord","type":{"type":"record","name":"ignored","namespace":"","fields":[{"name":"nestedint1","type":"int"},{"name":"nestedint2","type":"int"}]}}]}

 

This bug is pretty severe that it caused the data cannot be read when it is hit. I don't know why nobody reported this issue in the industry, maybe because this is a corner case or Parquet-1.10.1 is not used widely yet in production. But once it is hit, it is severe enough and blocks the business. 

 

 

asfimport avatar Dec 18 '19 18:12 asfimport

Sean Colgan: Thanks to all of you for producing great software! :D

I'd just like to comment on this ticket that this appears to be the issue forcing us to exclude some integration test cases from our parquet writer. We have a system that consumes Avro, transforms it, and uses this library to emit Parquet.  It has some tests that read in that Parquet back to Avro, to validate the transformations.  We currently exclude some test cases since the parquet -> avro reader blows up with this error.

asfimport avatar Aug 18 '20 00:08 asfimport

Andreas Hailu / @hailuand: Hi folks, we're running in to this issue as well. Bob Smith was able to provide a unit test in 2018 that reproduces this issue in PARQUET-1254.

I personally like [[email protected]]'s idea of having some sort of 'metadata' field to store information like the list type rather than mangling the actual schema, so that the conversion Avro <-> Parquet schema is always compatible as to not create any unforeseen pitfalls.

asfimport avatar Jun 14 '21 14:06 asfimport

Xinli Shang / @shangxinli: We chose to revert the behavior back to 1.8.1. It runs fine for a year or so. We will port the changes soon.

asfimport avatar Jul 01 '21 15:07 asfimport

Timothy Miller / @theosib-amazon: Is this related to PARQUET-2069? It looks like it might be.

asfimport avatar Apr 08 '22 16:04 asfimport

Xinli Shang / @shangxinli: @theosib-amazon It seems different.

asfimport avatar Apr 08 '22 16:04 asfimport

Timothy Miller / @theosib-amazon: Have a look at my further analysis of PARQUET-2069. I don't think 1681 will be fixed by what I did to fix 2069, but the problem seems to be the same KIND of problem, happening in the same place. Basically, isElementType is returning the wrong thing because there are types that Avro should think are compatible, but it has not been properly informed about that. In the case of 2069, "list" and "array" were being considered incompatible, so I fixed it by hacking isElementType to inform Avro of that compatibility. It may be that we need to completely rethink isElementType to be smarter about detecting reader/writer compatibility. In the case of 1681, evidently "phones_items" is some kind of list/array type, so it should test positive as compatible with another list/array type simply by virtue of them both containing multiple members.

asfimport avatar Apr 21 '22 15:04 asfimport