
Exception "cannot find data type handler to create model schema for [n: CONTACT_ID, t: BYTE_ARRAY, ct: DECIMAL, rt: OPTIONAL, c: 0]"

Open jchristn opened this issue 3 years ago • 1 comments

Version: Parquet.Net v3.9.1

Runtime Version: .Net 5.0

OS: Windows 10

Expected behavior

I would expect Parquet.Net to identify the DataTypeHandler appropriately for the column in question.

Actual behavior

Parquet.Net is throwing the aforementioned exception when I access parquetReader.Schema.

From ParquetViewer, the column in question is:

  "Schema": [
    {
      "Field_id": 0,
      "Name": "CONTACT_ID",
      "Type": "BYTE_ARRAY",
      "Type_length": 0,
      "LogicalType": null,
      "Scale": 4,
      "Precision": 23,
      "Repetition_type": "OPTIONAL",
      "Converted_type": "DECIMAL"
    },

Steps to reproduce the behavior

  1. Open a ParquetReader against the file.
  2. Access ParquetReader.Schema when one of the columns is as described above.

Code snippet reproducing the behavior

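using System.IO;
using Parquet;
using Parquet.Data;

using (Stream fileStream = System.IO.File.OpenRead(fileName))
{
   using (var parquetReader = new ParquetReader(fileStream))
   {
      // Accessing Schema is what throws the exception quoted in the title.
      Schema schema = parquetReader.Schema;
   }
}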

I am unfortunately unable to share a copy of the file in question. If there is a tool that will allow me to extract a subset easily, I could likely share a file with just this column.

jchristn avatar Jun 02 '22 21:06 jchristn

Sorry to pester, is this a known issue? Am I perhaps doing something wrong? Cheers

jchristn avatar Jun 21 '22 18:06 jchristn

This seems like a newer Parquet format addition. Your column uses a variable-size byte array (BYTE_ARRAY) to represent decimals. The specification allows four physical representations for DECIMAL (INT32, INT64, FIXED_LEN_BYTE_ARRAY, and BYTE_ARRAY), and Parquet.Net implements the first three.
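For readers unfamiliar with the fourth representation, here is a minimal sketch of how such a value can be decoded. It assumes the layout given in the Parquet specification: the bytes hold the unscaled integer in big-endian two's complement, and the value is that integer times 10^(-scale). The names `ByteArrayDecimal` and `Decode` are hypothetical, and this is not Parquet.Net's actual implementation.

```csharp
using System;
using System.Numerics;

static class ByteArrayDecimal
{
    // Hypothetical helper: decodes one BYTE_ARRAY-encoded DECIMAL value.
    // bigEndian holds the unscaled integer in big-endian two's complement.
    public static decimal Decode(byte[] bigEndian, int scale)
    {
        // BigInteger(byte[]) expects little-endian two's complement,
        // so reverse a copy and leave the caller's buffer untouched.
        byte[] littleEndian = (byte[])bigEndian.Clone();
        Array.Reverse(littleEndian);
        var unscaled = new BigInteger(littleEndian);

        // System.Decimal is limited to roughly 28 significant digits; the
        // explicit conversion throws OverflowException beyond that range.
        decimal result = (decimal)unscaled;

        // Apply the scale by shifting the decimal point left 'scale' places.
        for (int i = 0; i < scale; i++)
            result /= 10m;
        return result;
    }
}
```

With the scale of 4 reported for CONTACT_ID, `ByteArrayDecimal.Decode(new byte[] { 0x01, 0xE2, 0x40 }, 4)` gives 12.3456, since 0x01E240 is 123456. A precision of 23 digits does not fit in INT64, which holds at most 18, so a writer has to choose one of the two byte-array forms for this column.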

No need to attach test files, but I'd appreciate validating the fix when you have a chance.

aloneguid avatar Jan 11 '23 11:01 aloneguid

Sorry for the delay. I updated to 4.2.2 and am still receiving the error.

  "Exception": {
    "ClassName": "System.InvalidOperationException",
    "Message": "cannot find data type handler to create model schema for [n: CONTACT_ID, t: BYTE_ARRAY, ct: DECIMAL, rt: OPTIONAL, c: 0]",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "   at Parquet.File.ThriftFooter.CreateModelSchema(FieldPath path, IList`1 container, Int32 childCount, Int32& si, ParquetOptions formatOptions)\r\n   at Parquet.File.ThriftFooter.CreateModelSchema(ParquetOptions formatOptions)\r\n   at Parquet.ParquetReader.get_Schema()\r\n   

jchristn avatar Jan 11 '23 18:01 jchristn

Not ready yet )

aloneguid avatar Jan 11 '23 19:01 aloneguid

My bad, I'll wait patiently :)

jchristn avatar Jan 11 '23 19:01 jchristn

Thanks @jchristn. This is supported in 4.2.3; I'd appreciate it if you could validate and confirm. I used one of the official test data files from the Parquet repo to validate this, but I couldn't force Spark to write a file like that, so I'm wondering which system produced data in this format.

aloneguid avatar Jan 16 '23 11:01 aloneguid

Hi @aloneguid, I'm not sure what system was used to create the file :( When I try with v4.2.3, this is printed to the console: RUNTIME::: win10-x64 SEARCHPATH::: and after that, it seems to work.

jchristn avatar Jan 17 '23 15:01 jchristn

Some debug logging to remove )

aloneguid avatar Jan 17 '23 17:01 aloneguid

There still seems to be an issue with one file in particular, any way I could PM you or email with more details?

jchristn avatar Jan 17 '23 18:01 jchristn

Email me [email protected].

aloneguid avatar Jan 17 '23 18:01 aloneguid

Issue resolved in v4.3.3

jchristn avatar Jan 21 '23 16:01 jchristn