TileDB-Py

Fix stacked sparse CSV ingestion

Open johnkerl opened this issue 6 months ago • 0 comments

https://linear.app/tiledb/issue/CORE-276/tiledbfrom-csv-ignores-row-start-idx-when-sparse

This PR fixes the bug shown below. Also, it adds heretofore-missing unit-test coverage for sparse tiledb.from_csv with row-indexing (i.e., index_col=None).

Repro data:

$ cat a.csv
a,b,c
1,2,3

$ cat b.csv
a,b,c
4,5,6
7,8,9

Repro script:

#!/usr/bin/env python

import tiledb
import os
import shutil

uri = 'dense-no-dupes'
if os.path.exists(uri):
    shutil.rmtree(uri)
tiledb.from_csv(uri, 'a.csv', sparse=False, mode='schema_only', full_domain=True, allows_duplicates=False)
tiledb.from_csv(uri, 'a.csv', mode='append', row_start_idx=0)
tiledb.from_csv(uri, 'b.csv', mode='append', row_start_idx=1)
with tiledb.open(uri) as A:
    print()
    print(f"URI={uri}")
    print(A.df[:])

uri = 'sparse-no-dupes'
if os.path.exists(uri):
    shutil.rmtree(uri)
tiledb.from_csv(uri, 'a.csv', sparse=True, mode='schema_only', full_domain=True, allows_duplicates=False)
tiledb.from_csv(uri, 'a.csv', mode='append', row_start_idx=0)
tiledb.from_csv(uri, 'b.csv', mode='append', row_start_idx=1)
with tiledb.open(uri) as A:
    print()
    print(f"URI={uri}")
    print(A.df[:])

uri = 'sparse-dupes'
if os.path.exists(uri):
    shutil.rmtree(uri)
tiledb.from_csv(uri, 'a.csv', sparse=True, mode='schema_only', full_domain=True, allows_duplicates=True)
tiledb.from_csv(uri, 'a.csv', mode='append')
tiledb.from_csv(uri, 'b.csv', mode='append')
with tiledb.open(uri) as A:
    print()
    print(f"URI={uri}")
    print(A.df[:])

Output before this PR:

URI=dense-no-dupes
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

URI=sparse-no-dupes
   a  b  c
0  4  5  6
1  7  8  9

URI=sparse-dupes
   a  b  c
0  1  2  3
0  4  5  6
1  7  8  9

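The issue is that row_start_idx is ignored in the sparse case: both appends write starting at row index 0, so in the sparse-no-dupes example the first row of b.csv replaces the row from a.csv.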
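To make the before/after difference concrete, here is a toy model in plain Python. It is only a sketch: a dict keyed by row index stands in for a sparse array without duplicates, and `append_rows` and its `honor_start` flag are hypothetical names invented for illustration, not part of TileDB-Py.

```python
# Toy model only, not TileDB-Py code. A dict keyed by row index stands in for
# a sparse array with allows_duplicates=False: a later write to the same
# coordinate replaces the earlier one.

def append_rows(cells, rows, row_start_idx=0, honor_start=True):
    """Write rows at consecutive row indices and return the updated dict.

    honor_start=False mimics the reported bug (row_start_idx is ignored).
    """
    start = row_start_idx if honor_start else 0
    for offset, row in enumerate(rows):
        cells[start + offset] = row
    return cells

a_rows = [(1, 2, 3)]             # contents of a.csv
b_rows = [(4, 5, 6), (7, 8, 9)]  # contents of b.csv

# Before the fix: both appends start at row 0, so b.csv clobbers a.csv.
buggy = {}
append_rows(buggy, a_rows, row_start_idx=0, honor_start=False)
append_rows(buggy, b_rows, row_start_idx=1, honor_start=False)

# After the fix: the second append starts at row 1.
fixed = {}
append_rows(fixed, a_rows, row_start_idx=0)
append_rows(fixed, b_rows, row_start_idx=1)

print(sorted(buggy.items()))
print(sorted(fixed.items()))
```

The `buggy` dict ends up with the same two rows as the sparse-no-dupes output before this PR, and `fixed` with the three rows shown after it.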

The workaround of setting allows_duplicates=True is unsatisfactory since it rules out replacement updates. For example, writing a=20,b=30,c=40 at row index 0 in the sparse-dupes example would add a third cell at row index 0 rather than replacing any existing values.
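The difference between the two settings can be sketched with another toy model. Again this is plain Python with a hypothetical helper (`write_cell`), assuming only the behavior described above: with duplicates allowed a write at an occupied row index adds a cell, and without duplicates it replaces the existing one.

```python
# Toy model only, not TileDB-Py code. Cells are (row_index, values) pairs.

def write_cell(cells, row_idx, values, allows_duplicates):
    """Return a new cell list after writing `values` at `row_idx`."""
    if allows_duplicates:
        # Duplicates allowed: the new cell sits alongside the existing ones.
        return cells + [(row_idx, values)]
    # No duplicates: drop any cell at the same index, then add the new one.
    kept = [(i, v) for i, v in cells if i != row_idx]
    return kept + [(row_idx, values)]

# State of the sparse-dupes example: two cells already share row index 0.
dupes = [(0, (1, 2, 3)), (0, (4, 5, 6)), (1, (7, 8, 9))]
after_dupes = write_cell(dupes, 0, (20, 30, 40), allows_duplicates=True)

# State of the fixed sparse-no-dupes example: one cell per row index.
no_dupes = [(0, (1, 2, 3)), (1, (4, 5, 6)), (2, (7, 8, 9))]
after_no_dupes = write_cell(no_dupes, 0, (20, 30, 40), allows_duplicates=False)

print(sorted(after_dupes))
print(sorted(after_no_dupes))
```

In the first case row index 0 now holds three cells and nothing was replaced. In the second, row 0 holds only the new values.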

Output with this PR:

URI=dense-no-dupes
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

URI=sparse-no-dupes <---------- fixed
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

URI=sparse-dupes
   a  b  c
0  1  2  3
0  4  5  6
1  7  8  9

johnkerl, Jun 25 '25 16:06