[BUG] Empty or null list(s) results in scrambled data

Open chris-branch opened this issue 1 year ago • 1 comments

Parquet Viewer Version: 2.10.1.1

Where was the parquet file created? Parquet.NET

Description: The code that parses list/array columns mishandles empty and null lists. If a column is a list/array type and some rows hold either an empty list (i.e., 0 elements) or null in that column, ParquetViewer shows the data mixed up across rows.

All examples below assume the following schema:

    internal class TestRow
    {
        public string Column1 { get; set; }
        public List<double> Column2 { get; set; }

        public TestRow(string column1, List<double> column2)
        {
            Column1 = column1;
            Column2 = column2;
        }
    }

Example 1: This has no nulls or empty values and works as expected:

    List<TestRow> data1 = new List<TestRow>
    {
        new TestRow("Row 1", new List<double> { 1, 2, 3, 4, 5 }),
        new TestRow("Row 2", new List<double> { 6, 7, 8, 9, 10 }),
        new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
    };
    ParquetSerializer.SerializeAsync(data1, @"sample1.parquet").Wait();

[Screenshot: sample1]

Example 2: This has an empty list in row 1 and results in scrambled data in rows 1-3:

    List<TestRow> data2 = new List<TestRow>
    {
        new TestRow("Row 1", new List<double>()),
        new TestRow("Row 2", new List<double> { 6, 7, 8, 9, 10 }),
        new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
    };
    ParquetSerializer.SerializeAsync(data2, @"sample2.parquet").Wait();

[Screenshot: sample2]

Example 3: This has an empty list in row 2 and results in scrambled data in rows 2-3:

    List<TestRow> data3 = new List<TestRow>
    {
        new TestRow("Row 1", new List<double> { 1, 2, 3, 4, 5 }),
        new TestRow("Row 2", new List<double>()),
        new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
    };
    ParquetSerializer.SerializeAsync(data3, @"sample3.parquet").Wait();

[Screenshot: sample3]

Sample files: sample_parquets.zip
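
For context on how this kind of scrambling can arise, here is a minimal Python sketch. It is not ParquetViewer's or Parquet.NET's actual code. It assumes a simplified version of the Parquet list encoding in which each row's list is flattened into (repetition level, definition level, value) entries, and an empty or null list still occupies one placeholder entry. The names `shred`, `assemble`, and `assemble_ignoring_placeholders` are hypothetical, and whether this matches the real cause of the bug is not confirmed.

```python
from typing import List, Optional, Tuple

# One flattened entry: (repetition level, definition level, value).
Entry = Tuple[int, int, Optional[float]]
Row = Optional[List[float]]

# Simplified definition levels (assumption: elements are never null).
NULL_LIST, EMPTY_LIST, ELEMENT = 0, 1, 2


def shred(rows: List[Row]) -> List[Entry]:
    """Flatten rows; null and empty lists each emit one placeholder entry."""
    entries: List[Entry] = []
    for row in rows:
        if row is None:
            entries.append((0, NULL_LIST, None))
        elif not row:
            entries.append((0, EMPTY_LIST, None))
        else:
            for i, value in enumerate(row):
                # Repetition level 0 starts a new row, 1 continues the list.
                entries.append((0 if i == 0 else 1, ELEMENT, value))
    return entries


def assemble(entries: List[Entry]) -> List[Row]:
    """Rebuild rows, honoring placeholders for null and empty lists."""
    rows: List[Row] = []
    for rep, definition, value in entries:
        if rep == 0:
            rows.append(None if definition == NULL_LIST else [])
        if definition == ELEMENT:
            rows[-1].append(value)
    return rows


def assemble_ignoring_placeholders(entries: List[Entry], row_count: int) -> List[Row]:
    """Buggy variant: skips placeholders, so later rows shift up by one."""
    rows: List[Row] = []
    for rep, definition, value in entries:
        if definition != ELEMENT:
            continue  # the slot for the null/empty row is lost here
        if rep == 0:
            rows.append([])
        rows[-1].append(value)
    # Pad with empty lists so the result still has row_count rows.
    rows += [[] for _ in range(row_count - len(rows))]
    return rows


# Column2 values from Example 2: an empty list in row 1.
rows = [[], [6.0, 7.0, 8.0, 9.0, 10.0], [11.0, 12.0, 13.0, 14.0, 15.0]]
entries = shred(rows)
print(assemble(entries))
print(assemble_ignoring_placeholders(entries, len(rows)))
```

With the Example 2 data, `assemble` returns the original three rows, while the skipping variant puts the values of rows 2 and 3 into rows 1 and 2 and leaves row 3 empty.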

chris-branch, Sep 30 '24 15:09

Here is another example:

    import pyarrow as pa
    import pyarrow.parquet as pq

    arr = pa.array([["dog", "cat"], [], None], type=pa.list_(pa.string()))
    tbl = pa.table([arr], names=['animals'])

    pq.write_table(tbl, "animals.parquet")
    print(pq.read_table("animals.parquet").to_pandas())

[Screenshot]

None displays as [] in ParquetViewer: [screenshot]
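
In the Parquet encoding, the difference between [] and None is carried by the definition level, not by the values. The sketch below is illustrative only. It assumes the usual three-level Parquet list layout with a nullable list and nullable elements (maximum definition level 3), hand-writes the levels for the three rows above instead of reading them from animals.parquet, and uses a hypothetical `read_rows` helper. A reader that does not check for definition level 0 reports the null row as an empty list.

```python
from typing import List, Optional, Tuple

# (repetition level, definition level, value), written by hand for
# [["dog", "cat"], [], None] under the assumptions stated above.
LEVELS: List[Tuple[int, int, Optional[str]]] = [
    (0, 3, "dog"),  # row 0, first element
    (1, 3, "cat"),  # row 0, next element
    (0, 1, None),   # row 1: list is present but empty
    (0, 0, None),   # row 2: the list itself is null
]


def read_rows(entries, null_aware: bool = True):
    """Rebuild rows; with null_aware=False a null list is read as []."""
    rows = []
    for rep, definition, value in entries:
        if rep == 0:  # repetition level 0 starts a new row
            is_null = definition == 0
            rows.append(None if (is_null and null_aware) else [])
        if definition >= 2:  # 2 = null element, 3 = present element
            rows[-1].append(value if definition == 3 else None)
    return rows


print(read_rows(LEVELS))
print(read_rows(LEVELS, null_aware=False))
```

The first call keeps the third row as None. The second call returns [] for it, which matches what the screenshot above shows.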

AndreiYachmeneu, Oct 03 '24 13:10

I really appreciate the detailed examples, sample code, and sample files! They made solving this issue much easier. Please try out v3.2.0.0, which should handle null/empty lists correctly.

It appears that different Parquet writers write this data slightly differently, so I had to adjust the code to accommodate them.

mukunku, Dec 23 '24 22:12