[BUG] Empty or null lists result in scrambled data
Parquet Viewer Version 2.10.1.1
Where was the parquet file created? Parquet.NET
Description The code that parses lists/arrays appears to mishandle empty and null values. If a column is a list/array type and some rows hold either an empty list (i.e., 0 elements) or null in that column, ParquetViewer shows the data mixed up across rows. Examples:
In all examples below, assume the following schema:
internal class TestRow
{
    public string Column1 { get; set; }
    public List<double> Column2 { get; set; }
    public TestRow(string column1, List<double> column2)
    {
        Column1 = column1;
        Column2 = column2;
    }
}
Example 1: This has no nulls or empty values and works as expected:
List<TestRow> data1 = new List<TestRow>
{
    new TestRow("Row 1", new List<double> { 1, 2, 3, 4, 5 }),
    new TestRow("Row 2", new List<double> { 6, 7, 8, 9, 10 }),
    new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
};
ParquetSerializer.SerializeAsync(data1, @"sample1.parquet").Wait();
Example 2: This has an empty list in row 1 and results in scrambled data in rows 1-3:
List<TestRow> data2 = new List<TestRow>
{
    new TestRow("Row 1", new List<double>()),
    new TestRow("Row 2", new List<double> { 6, 7, 8, 9, 10 }),
    new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
};
ParquetSerializer.SerializeAsync(data2, @"sample2.parquet").Wait();
Example 3: This has an empty list in row 2 and results in scrambled data in rows 2-3:
List<TestRow> data3 = new List<TestRow>
{
    new TestRow("Row 1", new List<double> { 1, 2, 3, 4, 5 }),
    new TestRow("Row 2", new List<double>()),
    new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
};
ParquetSerializer.SerializeAsync(data3, @"sample3.parquet").Wait();
Sample files sample_parquets.zip
Here is another example:
import pyarrow as pa
import pyarrow.parquet as pq
arr = pa.array([["dog", "cat"], [], None], type=pa.list_(pa.string()))
tbl = pa.table([arr], names=['animals'])
pq.write_table(tbl, "animals.parquet")
print(pq.read_table("animals.parquet").to_pandas())
In ParquetViewer, the None row displays as [].
I really appreciate the detailed examples, sample code, and sample files! They made solving this issue much easier. Please try out v3.2.0.0, which should handle null/empty lists correctly.
It appears that different Parquet writers write the data slightly differently, so I had to adjust the code to accommodate them.
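For anyone curious why null and empty lists are easy to mix up, here is a minimal sketch of how a list column can be rebuilt from Parquet's repetition and definition levels. This is not ParquetViewer's code. The function name assemble_lists and the element_nullable flag are hypothetical, the level arrays are written by hand rather than read from the sample files, and the sketch assumes an optional list column using the standard 3-level LIST layout. Whether elements are declared optional or required is just one way two files can differ, and it shifts which definition level means "value present".

```python
def assemble_lists(rep_levels, def_levels, values, element_nullable=True):
    """Rebuild rows of an optional list column from repetition/definition levels.

    Assumed meaning of definition levels (3-level LIST layout, optional list):
      0   -> the list itself is null
      1   -> the list exists but is empty
      2   -> a null element (only when elements are optional)
      max -> a present element, which consumes one entry from `values`
    """
    max_def = 3 if element_nullable else 2
    rows = []
    value_iter = iter(values)
    for rep, definition in zip(rep_levels, def_levels):
        if rep == 0:
            # Repetition level 0 starts a new row.
            if definition == 0:
                rows.append(None)   # null list: no value is stored
                continue
            rows.append([])
            if definition == 1:
                continue            # empty list: no value is stored either
        # Otherwise this entry is an element of the current row.
        if definition == max_def:
            rows[-1].append(next(value_iter))
        else:
            rows[-1].append(None)   # null element inside the list
    return rows


# Hand-written levels for [["dog", "cat"], [], None] with optional elements.
animals = assemble_lists(
    rep_levels=[0, 1, 0, 0],
    def_levels=[3, 3, 1, 0],
    values=["dog", "cat"],
)
print(animals)

# Hand-written levels for Example 2 above, assuming required elements.
sample2 = assemble_lists(
    rep_levels=[0] + [0, 1, 1, 1, 1] * 2,
    def_levels=[1] + [2] * 10,
    values=[6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    element_nullable=False,
)
print(sample2)
```

Note that null and empty rows each take up a level entry but store no value, so a reader that pairs levels with values one-to-one would shift every later value into the wrong row.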