parquet-dotnet
Parquet.NET reads wrong values from file generated with Athena
Version: Parquet.Net v3.9.1 (also reproduced in 3.8.5 and 3.6.0)
Runtime Version: .NET Core 3.1
OS: Windows
Expected behavior
The file I am trying to read was generated with AWS Athena. One of the columns is of type int? and contains only the values 0, 1, 3 and NULL. I expect reading the column with Parquet.NET to return exactly the same values that I see when querying the file with Athena.
Actual behavior
Our tool, which uses Parquet.NET, returns different values for this column (and only this column) from what Athena reports. We cross-tested reading the same file with pyarrow + pandas in Python, parquetjs-lite on Node.js and ParquetSharp on .NET Core 3.1. All of them report values identical to Athena's.
Steps to reproduce the behavior
I have attached a zip file. It contains a parquet file and an XUnit test that demonstrates the behavior. ReproduceParquetIssue.zip
Code snippet reproducing the behavior
See code + parquet files attached
Apologies in advance
Not returning the correct data seems like such a big issue in a library like this that I still cannot really believe our findings. I'm sure we must be doing something wrong. If so, can you please point out where we are doing it wrong?
Some extra info that may help exclude some possible causes: the file I attached contains only a subset of the columns in the original file (PII and other irrelevant data removed). The columns were also renamed. These steps did not make the problem go away. All other columns (int, int?, bool?, string) worked just fine.
I think this may be a duplicate of #164
One more clue: the data in this column is stored in pages of 20,000 values. The first 20,000 values come out just fine. Differences start to appear immediately after the start of the second page. I have tried to debug, but I just don't know enough about the underlying formats. I have been stepping through the code and comparing it with the same actions in parquetjs-lite, and haven't been able to find a difference yet.
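For anyone who wants to check the page-boundary observation on their own data, here is a minimal sketch of the comparison, assuming the column has already been read into two plain Python lists (one from a reference reader, one from the reader under test). The helper names first_mismatch and locate_in_pages are made up for this illustration and are not part of any Parquet library, and the sample data is synthetic, not taken from the attached file.

```python
from itertools import zip_longest

PAGE_SIZE = 20_000  # values per data page, as observed for this column

_MISSING = object()  # sentinel so that a length difference counts as a mismatch


def first_mismatch(expected, actual):
    """Return the index of the first differing value, or None if both match."""
    pairs = zip_longest(expected, actual, fillvalue=_MISSING)
    for i, (e, a) in enumerate(pairs):
        # NULLs are represented as None and compare equal to each other.
        if e is not a and e != a:
            return i
    return None


def locate_in_pages(index, page_size=PAGE_SIZE):
    """Map a flat value index to (page number, offset in page), zero-based."""
    return divmod(index, page_size)


if __name__ == "__main__":
    # Synthetic stand-in data: 40,000 values, i.e. two full pages.
    expected = [0, 1, None, 3] * 10_000
    actual = list(expected)
    actual[20_000] = 7  # corrupt the first value of the second page

    idx = first_mismatch(expected, actual)
    page, offset = locate_in_pages(idx)
    print(f"first mismatch at index {idx}: page {page}, offset {offset}")
```

A first mismatch at offset 0 of page 1 would match what I described: the first page is fine and the values go wrong right where the second page starts.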
Closing due to no activity
@aloneguid I am a bit surprised that you closed this issue. It is a real bug, and the PR I provided is waiting for your approval to run the tests. The first run found an issue specific to OSX, which I fixed. I pushed those changes 19 days ago and they were approved by user kzrodlowski.