TileDB-Py
Fix stacked sparse CSV ingestion
https://linear.app/tiledb/issue/CORE-276/tiledbfrom-csv-ignores-row-start-idx-when-sparse
This PR fixes the bug shown below. It also adds previously missing unit-test coverage for sparse tiledb.from_csv with row indexing (i.e., index_col=None).
Repro data:
$ cat a.csv
a,b,c
1,2,3
$ cat b.csv
a,b,c
4,5,6
7,8,9
Repro script:
#!/usr/bin/env python
import tiledb
import os
import shutil
uri = 'dense-no-dupes'
if os.path.exists(uri):
    shutil.rmtree(uri)
tiledb.from_csv(uri, 'a.csv', sparse=False, mode='schema_only', full_domain=True, allows_duplicates=False)
tiledb.from_csv(uri, 'a.csv', mode='append', row_start_idx=0)
tiledb.from_csv(uri, 'b.csv', mode='append', row_start_idx=1)
with tiledb.open(uri) as A:
    print()
    print(f"URI={uri}")
    print(A.df[:])
uri = 'sparse-no-dupes'
if os.path.exists(uri):
    shutil.rmtree(uri)
tiledb.from_csv(uri, 'a.csv', sparse=True, mode='schema_only', full_domain=True, allows_duplicates=False)
tiledb.from_csv(uri, 'a.csv', mode='append', row_start_idx=0)
tiledb.from_csv(uri, 'b.csv', mode='append', row_start_idx=1)
with tiledb.open(uri) as A:
    print()
    print(f"URI={uri}")
    print(A.df[:])
uri = 'sparse-dupes'
if os.path.exists(uri):
    shutil.rmtree(uri)
tiledb.from_csv(uri, 'a.csv', sparse=True, mode='schema_only', full_domain=True, allows_duplicates=True)
tiledb.from_csv(uri, 'a.csv', mode='append')
tiledb.from_csv(uri, 'b.csv', mode='append')
with tiledb.open(uri) as A:
    print()
    print(f"URI={uri}")
    print(A.df[:])
Output before this PR:
URI=dense-no-dupes
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
URI=sparse-no-dupes
   a  b  c
0  4  5  6
1  7  8  9
URI=sparse-dupes
   a  b  c
0  1  2  3
0  4  5  6
1  7  8  9
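The issue is that row_start_idx is ignored in the sparse case, so each append starts writing at row 0. With allows_duplicates=False, the rows from b.csv therefore overwrite the row from a.csv instead of landing after it.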
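To make the failure mode concrete, here is a toy model of the no-duplicates sparse case. It is only a sketch: the dict and the append_rows helper (including its honor_start_idx flag) are hypothetical illustrations and are not part of TileDB-Py or its implementation. It assumes a later write to an existing coordinate replaces the earlier cell.

```python
# Toy model of a sparse array with allows_duplicates=False:
# a dict keyed by row coordinate, where a later write to the same
# coordinate replaces the earlier one. Not TileDB's implementation.

def append_rows(cells, rows, row_start_idx=0, honor_start_idx=True):
    """Write rows at consecutive coordinates, starting at row_start_idx.

    honor_start_idx=False mimics the bug: the offset is ignored and
    writing always starts at coordinate 0.
    """
    start = row_start_idx if honor_start_idx else 0
    for offset, row in enumerate(rows):
        cells[start + offset] = row
    return cells

a_rows = [(1, 2, 3)]             # contents of a.csv
b_rows = [(4, 5, 6), (7, 8, 9)]  # contents of b.csv

# Before the fix: row_start_idx is ignored, so b.csv lands on row 0.
before = {}
append_rows(before, a_rows, row_start_idx=0, honor_start_idx=False)
append_rows(before, b_rows, row_start_idx=1, honor_start_idx=False)

# With the fix: b.csv starts at row 1 and a.csv's row survives.
after = {}
append_rows(after, a_rows, row_start_idx=0)
append_rows(after, b_rows, row_start_idx=1)

print(sorted(before.items()))
print(sorted(after.items()))
```

The two dicts line up with the sparse-no-dupes output before and with this PR, respectively.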
The workaround of setting allows_duplicates=True is unsatisfactory because it rules out replacement updates. For example, writing a=20,b=30,c=40 at row index 0 in the sparse-dupes example would add a third cell at row 0 rather than replacing any existing values.
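The difference between the two modes can be sketched with plain Python containers. This is an illustrative model only: the list-of-cells and dict representations below are assumptions made for the example, not how TileDB stores data.

```python
# Toy comparison of an update at row 0 under the two settings.
# Illustrative only; these containers are not TileDB data structures.

# allows_duplicates=True: every write adds a cell, even at an
# already-occupied coordinate (state taken from the sparse-dupes output).
dupes = [(0, (1, 2, 3)), (0, (4, 5, 6)), (1, (7, 8, 9))]
dupes.append((0, (20, 30, 40)))
cells_at_row_0 = [row for coord, row in dupes if coord == 0]

# allows_duplicates=False: a write to an occupied coordinate replaces
# the existing cell (state taken from the fixed sparse-no-dupes output).
no_dupes = {0: (1, 2, 3), 1: (4, 5, 6), 2: (7, 8, 9)}
no_dupes[0] = (20, 30, 40)

print(cells_at_row_0)
print(sorted(no_dupes.items()))
```

In the duplicates model, row 0 ends up holding three cells, while the no-duplicates model still has exactly one cell per row with row 0 updated in place.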
Output with this PR:
URI=dense-no-dupes
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
URI=sparse-no-dupes <---------- fixed
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
URI=sparse-dupes
   a  b  c
0  1  2  3
0  4  5  6
1  7  8  9