parquet-go Different types returned in same scenario by Writer and Buffer

In a simple scenario where I create a schema with a single bytes column, read the written page and attempt to create a dictionary from the observed type in the column index, I will get a panic depending on which type of writer I used to write the data. My expectation is that regardless of the writer used, I will observe the same behavior.

Here is a test illustrating what I mean:

func TestWriter(t *testing.T) {
	node := parquet.String()
	node = parquet.Encoded(node, &parquet.RLEDictionary)
	g := parquet.Group{}
	g["mystring"] = node
	schema := parquet.NewSchema("test", g)

	rows := []parquet.Row{[]parquet.Value{parquet.ValueOf("hello").Level(0, 0, 0)}}

	var storage bytes.Buffer

	tests := []struct {
		name        string
		getRowGroup func(t *testing.T) parquet.RowGroup
	}{
		{
			name: "NormalWriter",
			getRowGroup: func(t *testing.T) parquet.RowGroup {
				t.Helper()

				w := parquet.NewWriter(&storage, schema)
				_, err := w.WriteRows(rows)
				require.NoError(t, err)
				require.NoError(t, w.Close())

				r := bytes.NewReader(storage.Bytes())
				f, err := parquet.OpenFile(r, int64(storage.Len()))
				require.NoError(t, err)
				return f.RowGroups()[0]
			},
		},
		{
			name: "Buffer",
			getRowGroup: func(t *testing.T) parquet.RowGroup {
				t.Helper()

				b := parquet.NewBuffer(schema)
				_, err := b.WriteRows(rows)
				require.NoError(t, err)
				return b
			},
		},
	}

	for _, testCase := range tests {
		t.Run(testCase.name, func(t *testing.T) {
			rowGroup := testCase.getRowGroup(t)

			chunk := rowGroup.ColumnChunks()[0]
			idx := chunk.ColumnIndex()
			val := idx.MinValue(0)
			columnType := chunk.Type()

			_ = columnType.NewDictionary(0, 1, columnType.NewValues(val.Bytes(), []uint32{0, uint32(len(val.Bytes()))}))
		})
	}
}

In the Buffer case, the test panics because NewValues creates int32 values since columnType is actually an indexedType (https://github.com/segmentio/parquet-go/blob/7efc157d28afda607e07e1f003e3c2c6922932df/dictionary.go#L1229), not a vanilla stringType as is the case with the Writer. However, NewDictionary passes through to the underlying string type, causing a type assertion error (again, only in the Buffer case):

panic: cannot convert values of type INT32 to type BYTE_ARRAY [recovered]
        panic: cannot convert values of type INT32 to type BYTE_ARRAY

goroutine 21 [running]:
testing.tRunner.func1.2({0x10524bfc0, 0x140002f8310})
        /opt/homebrew/opt/go/libexec/src/testing/testing.go:1396 +0x1c8
testing.tRunner.func1()
        /opt/homebrew/opt/go/libexec/src/testing/testing.go:1399 +0x378
panic({0x10524bfc0, 0x140002f8310})
        /opt/homebrew/opt/go/libexec/src/runtime/panic.go:884 +0x204
github.com/segmentio/parquet-go/encoding.(*Values).assertKind(...)
        /Users/asubiotto/go/pkg/mod/github.com/segmentio/[email protected]/encoding/values.go:56
github.com/segmentio/parquet-go/encoding.(*Values).ByteArray(0x14000080d28?)
        /Users/asubiotto/go/pkg/mod/github.com/segmentio/[email protected]/encoding/values.go:109 +0xb4
github.com/segmentio/parquet-go.newByteArrayDictionary({0x1053250c0?, 0x1057d2778}, 0x0, 0x0?, {0x2, 0x0, {0x1400019ff80, 0x5, 0x8}, {0x0, ...}})
        /Users/asubiotto/go/pkg/mod/github.com/segmentio/[email protected]/dictionary.go:680 +0x50
github.com/segmentio/parquet-go.(*stringType).NewDictionary(0x0?, 0x0, 0x1, {0x2, 0x0, {0x1400019ff80, 0x5, 0x8}, {0x0, 0x0, ...}})
        /Users/asubiotto/go/pkg/mod/github.com/segmentio/[email protected]/type.go:915 +0xb8

Aug 18 '22 13:08 asubiotto

Hi @asubiotto ! Thanks for opening this issue. I believe you have the right expectation and we should ensure the same behavior between the two type of writers.

Let us know if that is something you want to help us fix ! Otherwise we will try to work on it soon (next week if I have to guess).

Aug 18 '22 23:08 Pryz

Hi @Pryz, definitely happy to take a stab at it since it should be a good way to learn more about the parquet library (thanks for writing + maintaining btw). What would be most helpful to me would be to:

Understand the why behind indexedType. Why is it used and only in the Buffer case? Is it used anywhere else that I should be aware of? This is to help me understand whether any modifications I make here are within spec/desired behavior.
Where it makes sense to put a regression test for this bug.
If this issue is straightforward for you, what fix you would write for this bug based on your understanding of what's going on.

Thanks!

Aug 19 '22 06:08 asubiotto

Friendly ping

Aug 23 '22 08:08 asubiotto

parquet-go parquet-go copied to clipboard

Different types returned in same scenario by Writer and Buffer

parquet-go
parquet-go copied to clipboard