[BUG] Error when opening file with list field (System.InvalidCastException)

Open 0xSheller opened this issue 1 year ago • 8 comments

Parquet Viewer Version Latest (3.1.0)

Where was the parquet file created? pyarrow

Sample File Generic Array

Describe the bug Upon loading a file I get hit with:

---------------------------
Specified cast is not valid.
---------------------------
Something went wrong (CTRL+C to copy):

System.InvalidCastException: Specified cast is not valid.

   at ParquetViewer.Engine.ParquetEngine.ReadListField(DataTableLite dataTable, ParquetRowGroupReader groupReader, Int32 rowBeginIndex, ParquetSchemaElement itemField, Int32 fieldIndex, Int64 skipRecords, Int64 readRecords, Boolean isFirstColumn, CancellationToken cancellationToken, IProgress`1 progress)

   at ParquetViewer.Engine.ParquetEngine.ProcessRowGroup(DataTableLite dataTable, ParquetRowGroupReader groupReader, Int64 skipRecords, Int64 readRecords, CancellationToken cancellationToken, IProgress`1 progress)

   at ParquetViewer.Engine.ParquetEngine.PopulateDataTable(DataTableLite dataTable, ParquetReader parquetReader, Int64 offset, Int64 recordCount, CancellationToken cancellationToken, IProgress`1 progress)

   at ParquetViewer.Engine.ParquetEngine.ReadRowsAsync(List`1 selectedFields, Int32 offset, Int32 recordCount, CancellationToken cancellationToken, IProgress`1 progress)

   at ParquetViewer.MainForm.<>c__DisplayClass33_0.<<LoadFileToGridview>b__1>d.MoveNext()

--- End of stack trace from previous location ---

   at ParquetViewer.MainForm.LoadFileToGridview()

   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_0(Object state)

   at InvokeStub_SendOrPostCallback.Invoke(Object, Span`1)

   at System.Reflection.MethodBaseInvoker.InvokeWithOneArg(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
---------------------------
OK   
---------------------------

0xSheller avatar Sep 23 '24 21:09 0xSheller

Can you share a sample file please? You can zip and upload it here. Otherwise I might not be able to help much.

Also give version 3.2.0.0 a try. I doubt it has addressed your issue, but I wanted to suggest it just in case 🤞🏼.

mukunku avatar Dec 23 '24 22:12 mukunku

I believe this is due to list fields with nested struct values. Implementing it has proven challenging.

mukunku avatar Jan 09 '25 16:01 mukunku

Maybe related. I'm getting the following error. And yes, it is a parquet file with a nested struct.

---------------------------
Something went wrong
---------------------------
Could not load parquet file.

If the problem persists please consider opening a bug ticket in the project repo: Help → About

ParquetViewer.Engine.Exceptions.FileReadException: Encountered an error reading file.
 ---> System.InvalidOperationException: don't know how to skip type Set
   at Parquet.Meta.Proto.ThriftCompactProtocolReader.SkipField(CompactType compactType)
   at Parquet.Meta.ColumnChunk.Read(ThriftCompactProtocolReader proto)
   at Parquet.Meta.RowGroup.Read(ThriftCompactProtocolReader proto)
   at Parquet.Meta.FileMetaData.Read(ThriftCompactProtocolReader proto)
   at Parquet.ParquetActor.ReadMetadataAsync(CancellationToken cancellationToken)
   at Parquet.ParquetReader.InitialiseAsync(CancellationToken cancellationToken)
   at Parquet.ParquetReader.CreateAsync(String filePath, ParquetOptions parquetOptions, CancellationToken cancellationToken)
   at ParquetViewer.Engine.ParquetEngine.OpenFileAsync(String parquetFilePath, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at ParquetViewer.Engine.ParquetEngine.OpenFileAsync(String parquetFilePath, CancellationToken cancellationToken)
   at ParquetViewer.MainForm.OpenFieldSelectionDialog(Boolean forceOpenDialog)

toomyem avatar Feb 04 '25 14:02 toomyem

@toomyem Which version of the app are you using? Have you tried v3.2.1.0?

mukunku avatar Feb 04 '25 15:02 mukunku

Nope, still on 3.2.0 as it is the latest release.

toomyem avatar Feb 04 '25 15:02 toomyem

I just checked 3.2.1.0 - works much better and opens my file correctly 👍 Thank you.

toomyem avatar Feb 04 '25 15:02 toomyem

After some testing, it looks like it is not handled 100% correctly.

Example schema:

message test-msg {
  required int32 data_required32;
  required int64 data_required64;
  optional binary data_optional_missing (UTF8);
  optional binary data_optional_existing (UTF8);
  repeated int32 data_repeated;
  repeated group data_group {
    required binary nested (UTF8);
  }

  required binary uuid (UTF8);

  required group data_list (LIST) {
    repeated group list {
      required binary element (UTF8);
    }
  }
}

Test parquet file: test.zip
Parquet file without repetition: test-no-repeat.zip

Error while opening:

---------------------------
Specified cast is not valid.
---------------------------
Something went wrong (CTRL+C to copy):

System.InvalidCastException: Specified cast is not valid.
   at System.Data.Common.Int32Storage.Set(Int32 record, Object value)
   at System.Data.DataColumn.set_Item(Int32 record, Object value)

The problem occurs when there is more than one repetition of the field repeated int32 data_repeated:

Works:

 record.add("data_repeated", 10);

Fails:

 record.add("data_repeated", 10);
 record.add("data_repeated", 20);

Moreover, required group data_list (LIST) is being read as an empty list ([]).
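One plausible reading of the data_repeated failure, not confirmed anywhere in this thread: with a single repetition the reader ends up with one int for the cell, while with two repetitions it ends up with a collection, and storing a collection in an Int32-typed column is what throws. The sketch below reproduces that shape in plain Java. The class and method names (RepeatedFieldSketch, readAsScalar, readAsList) are hypothetical and are not part of ParquetViewer, parquet-dotnet, or parquet-mr; Java's ClassCastException stands in for .NET's InvalidCastException.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of one record: field name -> every value added under that name. */
final class RepeatedFieldSketch {

    static Map<String, List<Object>> newRecord() {
        return new LinkedHashMap<>();
    }

    /** Appends a value, mirroring repeated calls to record.add("data_repeated", ...). */
    static void add(Map<String, List<Object>> record, String field, Object value) {
        record.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    /** Naive reader (assumed failure mode): treats the cell as a single int. */
    static int readAsScalar(Map<String, List<Object>> record, String field) {
        List<Object> values = record.get(field);
        // One repetition collapses to the bare value; several stay a list.
        Object cell = values.size() == 1 ? values.get(0) : values;
        return (Integer) cell; // throws ClassCastException when cell is a List
    }

    /** Defensive reader: always returns all repetitions as a list. */
    static List<Integer> readAsList(Map<String, List<Object>> record, String field) {
        List<Integer> out = new ArrayList<>();
        for (Object v : record.getOrDefault(field, List.of())) {
            out.add((Integer) v);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> record = newRecord();
        add(record, "data_repeated", 10);
        add(record, "data_repeated", 20);
        System.out.println(readAsList(record, "data_repeated"));
        try {
            readAsScalar(record, "data_repeated");
        } catch (ClassCastException e) {
            System.out.println("scalar read failed: " + e.getMessage());
        }
    }
}
```

If this reading is right, the fix on the viewer side would be to type the column as a list whenever the schema marks the field as repeated, regardless of how many values a given row holds.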

Snippet used to create parquet file:

// Load the schema text shown above from the classpath and parse it.
try (InputStream inputStream = MessageTypeProvider.class.getResourceAsStream("/test-msg.message")) {
    String schema = new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
    MessageType msgType = MessageTypeParser.parseMessageType(schema);
    try (ParquetWriter<Group> writer = new ParquetWriterBuilder(Path.of("test.parquet"), msgType).withWriteMode(OVERWRITE).build()) {
        Group record = new SimpleGroupFactory(msgType).newGroup();
        record.add("uuid", UUID.randomUUID().toString());
        record.add("data_required32", 32);
        record.add("data_required64", 64L);
        record.add("data_optional_existing", "hello");
        // Two values for the repeated field: this is the case that fails to open.
        record.add("data_repeated", 10);
        record.add("data_repeated", 20);
        Group data = record.addGroup("data_group");
        data.add("nested", "nested!");
        // LIST-annotated group with two elements.
        Group list = record.addGroup("data_list");
        Group el1 = list.addGroup("list");
        el1.add("element", "element1");
        Group el2 = list.addGroup("list");
        el2.add("element", "element2");
        writer.write(record);
    }
}

toomyem avatar Feb 05 '25 13:02 toomyem

I'm using v3.2.1 and getting this error:

---------------------------
Unable to cast object of type 'ParquetViewer.Engine.Types.StructValue' to type 'ParquetViewer.Engine.Types.ListValue'.
---------------------------
Something went wrong (CTRL+C to copy):

System.InvalidCastException: Unable to cast object of type 'ParquetViewer.Engine.Types.StructValue' to type 'ParquetViewer.Engine.Types.ListValue'.

   at ParquetViewer.Engine.ParquetEngine.ReadListField(DataTableLite dataTable, ParquetRowGroupReader groupReader, Int32 rowBeginIndex, ParquetSchemaElement itemField, Int32 fieldIndex, Int64 skipRecords, Int64 readRecords, Boolean isFirstColumn, CancellationToken cancellationToken, IProgress`1 progress)

   at ParquetViewer.Engine.ParquetEngine.ProcessRowGroup(DataTableLite dataTable, ParquetRowGroupReader groupReader, Int64 skipRecords, Int64 readRecords, CancellationToken cancellationToken, IProgress`1 progress)

   at ParquetViewer.Engine.ParquetEngine.PopulateDataTable(DataTableLite dataTable, ParquetReader parquetReader, Int64 offset, Int64 recordCount, CancellationToken cancellationToken, IProgress`1 progress)

   at ParquetViewer.Engine.ParquetEngine.ReadRowsAsync(List`1 selectedFields, Int32 offset, Int32 recordCount, CancellationToken cancellationToken, IProgress`1 progress)

   at ParquetViewer.MainForm.<>c__DisplayClass33_0.<<LoadFileToGridview>b__1>d.MoveNext()

--- End of stack trace from previous location ---

   at ParquetViewer.MainForm.LoadFileToGridview()

   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_0(Object state)

   at InvokeStub_SendOrPostCallback.Invoke(Object, Span`1)

   at System.Reflection.MethodBaseInvoker.InvokeWithOneArg(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
---------------------------
OK   
---------------------------

Opening a parquet file with the following schema:

 |--...
 |-- experience: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- duration: string (nullable = true)
 |    |    |-- end_time: string (nullable = true)
 |    |    |-- job_description: string (nullable = true)
 |    |    |-- location: string (nullable = true)
 |    |    |-- position: string (nullable = true)
 |    |    |-- start_time: string (nullable = true)
 |    |    |-- company: struct (nullable = false)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- image_url: string (nullable = true)
 |    |    |    |-- social_url: string (nullable = true)
 |-- ...

The file does open when removing the nested struct.
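The cast error above reads as if the reader expected every value under experience to be a list, while the nested company field produced a struct instead. A minimal sketch of the alternative, in plain Java: inspect each value's runtime type and recurse, so a struct inside a list element is handled like any other value. Everything here is an assumption for illustration. NestedValueSketch and render are hypothetical names, structs are modelled as Map and lists as List, and none of this reflects how ParquetViewer's StructValue or ListValue types are actually implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical renderer for nested values; not ParquetViewer code. */
final class NestedValueSketch {

    /** Renders scalars, lists, and structs recursively by checking the runtime type. */
    static String render(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof List<?>) {
            StringJoiner joiner = new StringJoiner(",", "[", "]");
            for (Object item : (List<?>) value) {
                joiner.add(render(item)); // an item may itself be a struct or a list
            }
            return joiner.toString();
        }
        if (value instanceof Map<?, ?>) {
            StringJoiner joiner = new StringJoiner(",", "{", "}");
            for (Map.Entry<?, ?> field : ((Map<?, ?>) value).entrySet()) {
                joiner.add(field.getKey() + ":" + render(field.getValue()));
            }
            return joiner.toString();
        }
        return String.valueOf(value); // plain scalar
    }
}
```

Calling render on a list that holds one struct, which in turn holds a company struct, walks all three levels without ever casting a value to a type it was not checked against.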

kdeclerck avatar Mar 13 '25 14:03 kdeclerck

Test parquet file: test.zip
Parquet file without repetition: test-no-repeat.zip

I was able to address the issues with the data_list and data_repeated fields in v3.5.0. However, the issue with data_group needs more investigation. I opened this issue in the parquet-dotnet repo to track it: https://github.com/aloneguid/parquet-dotnet/issues/681

mukunku avatar Nov 23 '25 04:11 mukunku

https://github.com/aloneguid/parquet-dotnet/issues/681 was fixed and released 🥳 So I updated v3.5.0 with the latest pre-release which addresses the remaining issue.

With all test file issues addressed I'm going to close this ticket out but feel free to open a new one, folks, with a new test file 🙏🏾

mukunku avatar Nov 27 '25 13:11 mukunku