opteryx icon indicating copy to clipboard operation
opteryx copied to clipboard

✨ concat tables to create null columns

Open joocer opened this issue 9 months ago • 0 comments

If you concatenate a table with 100 rows and 3 columns with a table with 0 rows and 5 columns using pyarrow.concat_tables, the result will be as follows:

What Happens:

  1. Column Matching by Name:

Columns are matched by name across both tables.

If a column is present in one table but not the other, it is filled with null values in the table that lacks it.

  1. Schema Unification:

The resulting table will have columns from both tables.

Columns unique to the second table (0 rows) will be added to the first table with null values.

  1. Row Count:

The resulting table will have the total number of rows from all concatenated tables, which in this case is 100 rows (since the second table has 0 rows).

Example:

import pyarrow as pa

Table with 100 rows and 3 columns

table1 = pa.table({ 'a': list(range(100)), 'b': [x * 2 for x in range(100)], 'c': [x % 2 for x in range(100)] })

Table with 0 rows and 5 columns

table2 = pa.table({ 'a': pa.array([], type=pa.int32()), 'b': pa.array([], type=pa.int32()), 'c': pa.array([], type=pa.int32()), 'd': pa.array([], type=pa.string()), 'e': pa.array([], type=pa.float32()) })

Concatenate tables

result = pa.concat_tables([table1, table2])

print(result.schema) print(result)

Output:

a: int64 b: int64 c: int64 d: string e: float32

a: [[0, 1, 2, ..., 99]] b: [[0, 2, 4, ..., 198]] c: [[0, 1, 0, ..., 1]] d: [[null, null, ..., null]] e: [[null, null, ..., null]]

Key Observations:

  1. Schema Resolution:

The resulting schema is a union of the schemas from both tables.

Columns d and e are added from table2, even though they don't exist in table1.

  1. Filling with Nulls:

Columns d and e are filled with null values in table1's rows.

Columns a, b, and c are unchanged and retain their values.

  1. Row Count:

The resulting table has 100 rows, matching the non-empty table's row count.

Why This Happens:

pyarrow.concat_tables aligns the columns by name and fills in null values where columns are missing.

This allows you to concatenate tables with different schemas without losing any columns.

Practical Use Case:

This behavior is useful when:

You have sparse data where not all columns are present in every partition.

You're combining data from multiple sources with slightly different schemas.

Would you like more details or examples?

joocer avatar Feb 24 '25 17:02 joocer