parquet-dotnet
parquet-dotnet copied to clipboard
Class deserializer should not require ParquetRequired attribute for required strings
Library Version
4.17
OS
macOS, Windows
OS Architecture
64 bit
How to reproduce?
I'm trying to use class deserialization, but it seems like there is a bug with regards to required strings.
The schema contains a required string field, lets call it MY_STR_FIELD
for ease. And no matter if I define my class with a string
or a string?
(or even System.String
) it fails with the same error:
System.IO.InvalidDataException
property 'MY_STR_FIELD' is declared as 'MY_STR_FIELD (System.String?)' but source data has it as 'MY_STR_FIELD (System.String)'
at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetRowGroupReader rg, ParquetSchema schema, Assembler`1 asm, ICollection`1 result, CancellationToken cancellationToken)
at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetReader reader, Int32 rgi, Assembler`1 asm, ICollection`1 result, CancellationToken cancellationToken)
at Parquet.Serialization.ParquetSerializer.DeserializeAsync[T](Stream source, ParquetOptions options, CancellationToken cancellationToken)
Which I find strange, because surely a string
fits into a string?
right?
I tried adjusting the compare method in DataField.cs
to take nullability into account and at least continue on with parsing, but that results in other errors being thrown when decoding.
The parquet file contains sensitive info so I'm not able to share it. I'm working on creating a dummy file that exhibits the same behaviour. For reference, the ParquetViewer Online is able to read this parquet and list the schema. There it became clear that MY_STR_FIELD is the only string field with Definition Level 0 in the schema.
I'm assuming this is synonymous to being 'required'. But again, this doesn't mean that my class MUST define non-nullable properties. And even if I did, the same error is raised.
Failing test
No response
I'm facing exactly the same problem. I have a class with all the properties and types of this parquet file. I started first setting them as non-mandatory (nullable) and then the serialization failed with the same message:
System.IO.InvalidDataException: property 'ForecastTimestamp' is declared as 'ForecastTimestamp (System.String?)' but source data has it as 'ForecastTimestamp (System.String)
Afterwards, I set all the properties of my class as mandatory (non-null) but I got the same message again.
Any clues on this?
@yolave my current workaround is to use the Table and Rows objects instead. Iterate over the table, for each row you can call .ToString() to get the JSON representation, and then simply JSON deserialize that.
Here's a snippet:
using var expectedReader = await ParquetReader.CreateAsync(expectedStream);
var expectedTable = await expectedReader.ReadAsTableAsync();
var expectedItems = expectedTable
.Select(x => x.ToString())
.Select(JsonConvert.DeserializeObject<T>)
.ToDictionary(x => x.GetPrimaryKey());
Thanks for sharing this workaround @rachied . I finally solved my problem after changing the schema of my parquet file. The original schema had all the properties as mandatory, so I changed to optional and then the Parquet library could deserialize it.
Maybe there's a problem in the library when trying to serialize a column and it's declared as mandatory in the parquet, because even declaring all the properties as non-null in a class, the library rises this exception up
You need to mark class field with [ParquetRequired]
attribute, as in .NET strings can have null values by default. For some reason this was not added to the documentation, but only to release notes.
Can we support such type of schema mismatch without the need of this attribute? Would you welcome PR that facilitates this?
We process parquet files from external parties using a common schema but the nullability can sometimes vary on those fields. The fact that .NET strings can always have null values should not matter when deserializing as I understand it. You'd only need an explicit schema when serialising to parquet format.
Fair enough, for deserialisation it shouldn't matter. Feel free to create a PR of course, will merge it asap.