Question about the size of tdb files
Hi TileDB, Would you please shed some light on how to understand the size of tdb files?
The example below creates a sparse array with int32 for both the attribute and the coordinates. The code writes three points, but the sizes of the output files are much smaller than I expected. With int32, all the files should have a size of 96, right? Is there a default compression method or some other magic inside?
Thanks. Bin
tiledb_attribute_alloc(ctx, "a", TILEDB_INT32, &a);
tiledb_dimension_alloc(ctx, "rows", TILEDB_INT32, &dim_domain[0], &tile_extents[0], &d1);
tiledb_dimension_alloc(ctx, "cols", TILEDB_INT32, &dim_domain[2], &tile_extents[1], &d2);
int coords_rows[] = {1, 2, 2};
int coords_cols[] = {1, 4, 3};
int data[] = {1, 2, 3};
Resulting file sizes:
a0.tdb 32B
d0.tdb 57B
d1.tdb 57B
Hi Bin,
Thanks for the note! The example you show here seems to come straight from e.g. quickstart_sparse.cc. Exactly how it is laid out on disk depends on a few more things (tile/extent size? number of values per cell? nullable?) which the schema information contains. We have pretty-printers in the tiledb command-line tool, the underlying TileDB Python package, the R package, and other places. Here is a print from R:
tiledb_array_schema(
    domain=tiledb_domain(c(
        tiledb_dim(name="rows", domain=c(1L,4L), tile=4L, type="INT32"),
        tiledb_dim(name="cols", domain=c(1L,4L), tile=4L, type="INT32")
    )),
    attrs=c(
        tiledb_attr(name="a", type="INT32", ncells=1, nullable=FALSE)
    ),
    cell_order="COL_MAJOR", tile_order="COL_MAJOR", capacity=10000, sparse=TRUE, allows_dups=FALSE,
    coords_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
    offsets_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
    validity_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("RLE"),"COMPRESSION_LEVEL",-1)))
)
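(The C++ API can print the same information; a minimal sketch, assuming the quickstart array exists on disk at "quickstart_sparse_array":)

#include <cstdio>
#include <tiledb/tiledb>

int main()
{
    tiledb::Context ctx;
    // Load the schema stored alongside the array and pretty-print it.
    tiledb::ArraySchema schema(ctx, "quickstart_sparse_array");
    schema.dump(stdout);
    return 0;
}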
Note that we use default Zstd compression (on 'coords', on 'offsets' (none here), and on 'validity' (ditto)) unless something else is specified to override the defaults. There are several other dedicated compressors available and documented, and compression could also be added for the attribute.
The actual definition of the underlying data structure is also fully public, in the GitHub repo. While the minimal write of three int32_t values would be 12 bytes, we have 32 bytes here per the tile spec. As the API is fully documented and open source, you can take advantage of it to read the data and consume it in the proper data structure.
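For instance, to opt out of the default coordinate compression, you can set empty filter lists explicitly at creation time -- a minimal sketch on the quickstart 4x4 schema (the array name here is made up):

#include <tiledb/tiledb>

int main()
{
    tiledb::Context ctx;
    tiledb::Domain domain(ctx);
    domain.add_dimension(tiledb::Dimension::create<int>(ctx, "rows", {{1, 4}}, 4))
        .add_dimension(tiledb::Dimension::create<int>(ctx, "cols", {{1, 4}}, 4));
    tiledb::ArraySchema schema(ctx, TILEDB_SPARSE);
    schema.set_domain(domain);
    schema.add_attribute(tiledb::Attribute::create<int>(ctx, "a"));
    // Empty filter lists override the default Zstd coordinate compression.
    tiledb::FilterList empty_filters(ctx);
    schema.set_coords_filter_list(empty_filters);
    schema.set_offsets_filter_list(empty_filters);
    tiledb::Array::create("uncompressed_sparse_array", schema);
    return 0;
}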
Please let us know if you have any more questions. There is also our open community Slack for even more informal conversations.
Hi @BinDong314 -- there are indeed default compressors, AKA filters -- here are a couple of starting links (I just saw @eddelbuettel's post seconds ago):
- https://docs.tiledb.com/main/background/key-concepts-and-data-format#data-layout
- https://docs.tiledb.com/main/background/key-concepts-and-data-format#tile-filtering
Thanks @eddelbuettel @johnkerl for the detailed explanation of the code and results.
-
Weird question: is there any API to disable the compression on coordinates?
-
No default compression on the attributes (values), right?
-
Out of curiosity, what is 'validity' (ditto)?
BTW: this is the first time I've heard of the TileDB command-line tool; it should be very useful. Is it this one? https://github.com/TileDB-Inc/TileDB-CLI I tried the Python version, but unfortunately, it does not work.
Traceback (most recent call last):
File "/usr/local/bin/tiledb", line 5, in <module>
from tiledb_cli.root import root
File "/usr/local/lib/python3.9/site-packages/tiledb_cli/__init__.py", line 1, in <module>
from .root import root
File "/usr/local/lib/python3.9/site-packages/tiledb_cli/root.py", line 3, in <module>
from .cloud import cloud
File "/usr/local/lib/python3.9/site-packages/tiledb_cli/cloud.py", line 1, in <module>
import tiledb
File "/usr/local/lib/python3.9/site-packages/tiledb/__init__.py", line 36, in <module>
from .cc import TileDBError
ImportError: dlopen(/usr/local/lib/python3.9/site-packages/tiledb/cc.cpython-39-darwin.so, 0x0002): Symbol not found: _tiledb_array_open_at_with_key
Referenced from: /usr/local/lib/python3.9/site-packages/tiledb/cc.cpython-39-darwin.so
Expected in: /Users/dbin/work/sparse/soft/TileDB/build/install/lib/libtiledb.dylib
The "/Users/dbin/work/sparse/soft/TileDB/build/install/lib/libtiledb.dylib" is the installed library from source on my MacOS.
- I can get you an example
- Ditto
- Validity involves nullable attributes -- the data buffer holds the numeric/string/whatever value for each cell, and the validity buffer holds booleans indicating non-null/null for each cell (a short sketch follows below, after the CLI notes)
- Re the CLI --
tiledb_array_open_at_with_key is a recent deprecation, and core 2.16 and TileDB-Py 0.22 have both been released recently -- I suspect you have one and not the other. You can check with:
>>> import tiledb, tiledb.libtiledb
>>> tiledb.version
VersionHelper(version='0.22.0', version_tuple=(0, 22, 0))
>>> tiledb.libtiledb.version()
(2, 16, 0)
Core 2.15.* goes with TileDB-Py 0.21.*, and core 2.16.* goes with TileDB-Py 0.22.*.
If you pip install tiledb or pip install -U tiledb, without any local pre-existing core install, we handle this for you. However, if you use a library installed from source (as many developers do -- myself included! -- we offer that flexibility), then in this more manual/custom setup the developer needs to make sure that the checked-out core and TileDB-Py versions are compatible.
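Back to the validity question above: here is a minimal sketch of writing a nullable attribute together with its validity buffer. The array name "nullable_array" is made up, and this assumes the attribute was created with Attribute::create<int>(ctx, "a").set_nullable(true):

#include <tiledb/tiledb>
#include <cstdint>
#include <vector>

// Sketch: for a nullable attribute, the data buffer carries the values and
// a parallel validity buffer marks each cell as non-null (1) or null (0).
void write_nullable(tiledb::Context &ctx)
{
    std::vector<int> rows = {1, 2, 3};
    std::vector<int> cols = {1, 2, 3};
    std::vector<int> data = {10, 0, 30};        // the value for the null cell is ignored
    std::vector<uint8_t> validity = {1, 0, 1};  // cell (2, 2) is null

    tiledb::Array array(ctx, "nullable_array", TILEDB_WRITE);
    tiledb::Query query(ctx, array, TILEDB_WRITE);
    query.set_layout(TILEDB_UNORDERED)
        .set_data_buffer("rows", rows)
        .set_data_buffer("cols", cols)
        .set_data_buffer("a", data)
        .set_validity_buffer("a", validity);
    query.submit();
    array.close();
}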
@BinDong314 re 1 & 2:
Starting from https://github.com/TileDB-Inc/TileDB/blob/dev/examples/cpp_api/filters.cc, edit it to leave all filter lists empty -- comment out the lines that add filters -- this way you are explicitly setting empty filter lists.
Then check:
import tiledb

A = tiledb.open('./filters_array')
A.domain.dump()
A.attr('a1').dump()
A.attr('a2').dump()
-- there are no filters.
I hope that helps!
Thanks @johnkerl for pointing me to the code. Now I am testing basic functions of writing data to TileDB. I will try the filters later.
When I tried the code below, it reported the error shown. I might be doing something wrong here. Could you give some hints to make it right? Beyond this, is it possible to write a subarray (i.e., a contiguous region) to a sparse array without breaking it into cells? As you can see in my code, I might need to write the cells between region_start and region_end to the same sparse array.
The output:
Test system: TileDB
dim_name = d0
dim_name = d1
[8, 9, 10, 11, 12, 13, 14, 15]
dim_name = d0
[8, 9, 10, 11, 12, 13, 14, 15]
dim_name = d1
libc++abi: terminating with uncaught exception of type tiledb::TileDBError: [TileDB::Dimension] Error: Coordinate 105553157279872 is out of domain bounds [1, 16] on dimension 'd0'
I call the code below with the following (the other parameters can be randomly initialized):
array_size = {16, 16}
point_coordinate = {8, 8, 9, 9, ..., 15, 15}
Code:
#include <iostream>
#include <tiledb/tiledb>

using namespace tiledb;

// Name of array.
// std::string array_name("quickstart_sparse_array");

typedef uint64_t hsize_t;

void create_array(std::vector<hsize_t> &array_size, std::string array_name, std::string attribute_name)
{
    int dim = array_size.size();
    // Create a TileDB context.
    Context ctx;

    // The array will have `dim` dimensions d0, d1, ..., each with domain [1, array_size[i]].
    Domain domain(ctx);
    std::string dim_name;
    // std::vector<hsize_t> dim_domain;
    hsize_t dim_extent_size;
    std::array<hsize_t, 2> dim_domain;
    for (int i = 0; i < dim; i++)
    {
        dim_name = "d" + std::to_string(i);
        dim_domain[0] = 1;
        dim_domain[1] = array_size[i];
        dim_extent_size = array_size[i];
        domain = domain.add_dimension(Dimension::create<hsize_t>(ctx, dim_name, dim_domain, dim_extent_size));
        std::cout << "dim_name = " << dim_name << "\n";
    }
    // domain.add_dimension(Dimension::create<int>(ctx, "rows", {{1, 4}}, 4))
    //     .add_dimension(Dimension::create<int>(ctx, "cols", {{1, 4}}, 4));

    // The array will be sparse.
    ArraySchema schema(ctx, TILEDB_SPARSE);
    schema.set_domain(domain).set_order({{TILEDB_ROW_MAJOR, TILEDB_ROW_MAJOR}});
    FilterList fl(ctx);
    schema.set_coords_filter_list(fl);

    // Add a single attribute so each cell can store an integer.
    schema.add_attribute(Attribute::create<int>(ctx, attribute_name));

    // Create the (empty) array on disk.
    Array::create(array_name, schema);
}

std::vector<hsize_t> extract_per_dim(std::vector<hsize_t> &coo_flatted, int dim, int current_dim)
{
    std::vector<hsize_t> coo_per_dim;
    size_t coo_n = coo_flatted.size() / dim;
    coo_per_dim.resize(coo_n);
    for (size_t i = 0; i < coo_n; i++)
    {
        coo_per_dim[i] = coo_flatted[i * dim + current_dim];
    }
    return coo_per_dim;
}

void debug_print_vector(const std::vector<hsize_t> &vec)
{
    std::cout << "[";
    for (size_t i = 0; i < vec.size(); ++i)
    {
        std::cout << vec[i];
        if (i < vec.size() - 1)
        {
            std::cout << ", ";
        }
    }
    std::cout << "]" << std::endl;
}

void write_array(std::string file_name, std::string dataset_name, std::vector<hsize_t> &coo_flatted, std::vector<int> &point_data_buf_write, int dim, std::vector<hsize_t> &region_start, std::vector<hsize_t> &region_count, std::vector<int> &region_data_buf_write, int point_region_flag)
{
    // Write some simple data to cells (1, 1), (2, 4) and (2, 3).
    // std::vector<int> coords_rows = {1, 2, 2};
    // std::vector<int> coords_cols = {1, 4, 3};
    // std::vector<int> data = {1, 2, 3};

    // Open the array for writing and create the query.
    if (point_region_flag == 0 || point_region_flag == 2)
    {
        Context ctx;
        Array array(ctx, file_name, TILEDB_WRITE);
        Query query(ctx, array, TILEDB_WRITE);
        query = query.set_layout(TILEDB_UNORDERED);
        query = query.set_data_buffer(dataset_name, point_data_buf_write);
        std::string dim_name;
        std::vector<hsize_t> coo_flatted_per_dim;
        for (int i = 0; i < dim; i++)
        {
            coo_flatted_per_dim = extract_per_dim(coo_flatted, dim, i);
            dim_name = "d" + std::to_string(i);
            query = query.set_data_buffer(dim_name, coo_flatted_per_dim);
            debug_print_vector(coo_flatted_per_dim);
            std::cout << "dim_name = " << dim_name << "\n";
        }
        // .set_data_buffer("a", data)
        //     .set_data_buffer("rows", coords_rows)
        //     .set_data_buffer("cols", coords_cols);

        // Perform the write and close the array.
        query.submit();
        array.close();
        std::cout << "Write TileDB's point is done !\n";
    }
}

int write_tiledb(std::string file_name, std::vector<hsize_t> &array_size, std::vector<hsize_t> &region_start, std::vector<hsize_t> &region_count, std::vector<int> &region_data_buf_write, std::vector<hsize_t> &point_coordinate, std::vector<int> &point_data_buf_write, int point_region_flag, bool read_flag)
{
    Context ctx;
    if (Object::object(ctx, file_name).type() != Object::Type::Array)
    {
        create_array(array_size, file_name, "aa");
        write_array(file_name, "aa", point_coordinate, point_data_buf_write, array_size.size(), region_start, region_count, region_data_buf_write, point_region_flag);
    }
    return 0;
}
I might be doing something wrong here. Could you give some hints to make it right?
The coo_flatted_per_dim variable here:
coo_flatted_per_dim = extract_per_dim(coo_flatted, dim, i);
dim_name = "d" + std::to_string(i);
query = query.set_data_buffer(dim_name, coo_flatted_per_dim);
is reassigned on each iteration of the for loop (freeing the previous iteration's storage), so when you get to query.submit(), some of the backing buffers are no longer valid. The vector (or other backing data) needs to remain valid until after the query.submit() call returns.
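One way to fix it, keeping your structure (a sketch replacing the loop in write_array above): own the per-dimension vectors in a container declared before the loop, so they all stay alive through the submit:

// Sketch: keep every per-dimension coordinate buffer alive until after
// query.submit() by storing them in an outer, pre-sized container.
std::vector<std::vector<hsize_t>> coo_buffers(dim);
for (int i = 0; i < dim; i++)
{
    coo_buffers[i] = extract_per_dim(coo_flatted, dim, i);
    dim_name = "d" + std::to_string(i);
    query = query.set_data_buffer(dim_name, coo_buffers[i]);
}
query.submit();  // the buffers in coo_buffers are still valid here
array.close();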
Beyond this, is it possible to write a subarray (i.e., a contiguous region) to a sparse array without breaking it into cells?
Not currently; all dimension coordinates need to be supplied.
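To spell out what that means for your region case: you would expand the region into explicit per-dimension coordinate vectors before the write -- a small sketch with a hypothetical helper, in row-major order:

#include <cstdint>
#include <vector>

// Hypothetical helper: expand a 2-D region [row_start, row_end] x
// [col_start, col_end] (inclusive) into the per-dimension coordinate
// vectors that set_data_buffer() expects.
void expand_region(uint64_t row_start, uint64_t row_end,
                   uint64_t col_start, uint64_t col_end,
                   std::vector<uint64_t> &rows,
                   std::vector<uint64_t> &cols)
{
    for (uint64_t r = row_start; r <= row_end; r++)
    {
        for (uint64_t c = col_start; c <= col_end; c++)
        {
            rows.push_back(r);
            cols.push_back(c);
        }
    }
}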
@ihnorton thanks for identifying the error. Yeah, it works now.
Just to confirm that my understanding of "Not currently; all dimension coordinates need to be supplied." is right:
Say I have a sparse array of 16 by 16 and I need to write the region from [0, 0] to [2, 2]. I need to write [0, 0], [0, 1], [0, 2], ..., [2, 0], [2, 1], [2, 2] as:
d1cos = [0, 0, 0, ..., 2, 2, 2]
d2cos= [0, 1, 2, ..., 0, 1, 2]
query.set_data_buffer("rows", d1cos).set_data_buffer("cols", d2cos);
I cannot use a subarray:
const std::vector<int> subarray = {0, 0, 2, 2};
query.set_subarray(subarray)
Am I right?
Am I right?
Yes
Thanks @ihnorton for the confirmation.
Another question, regarding reading data by points.
I know users can use set_subarray to extract data:
// Prepare the query
Query query(ctx, array);
query.set_subarray(subarray)
.set_layout(layout)
.set_data_buffer("a", data)
.set_data_buffer("rows", rows_id)
.set_data_buffer("cols", cols_id);
Is it possible to read data by points like below? I tried it myself but always got wrong results in data.
Query query(ctx, array, TILEDB_READ);
std::vector<int> data;
data.resize(4);
// only read (1, 1), (2, 2), (3, 3), (4, 4)
std::vector<int> rows_id = {1, 2,3,4}, cols_id ={1, 2, 3,4};
query = query.set_data_buffer("a", data);
query = query.set_data_buffer("rows", rows_id);
query = query.set_data_buffer("cols", cols_id);
query.submit();
Hi @BinDong314
A constraint such as '// only read (1, 1), (2, 2), (3, 3), (4, 4)' cannot be expressed on fragments in a sparse array with row and col dims. If we select '1:4' for each of the dimensions, we get a projection of 4 x 4 values. "Extracting just the diagonal" is not an operator we have.
You could possibly force an accommodation, if row and col were attributes, by boolean-ORing four ANDed conditions of row=i, col=i using our query condition framework (which operates on attributes, not on dimensions). Or you could of course alter your schema ever so slightly...
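For illustration only, a rough sketch of that idea using the C++ QueryCondition API. Note the assumptions: a hypothetical schema where "row" and "col" are int32 attributes rather than dimensions, and a made-up array name "diag_array":

#include <tiledb/tiledb>

// Sketch: select cells where (row == i AND col == i) for i in 1..4 by
// OR-ing four AND-ed equality conditions on the two attributes.
void read_diagonal(tiledb::Context &ctx)
{
    tiledb::Array array(ctx, "diag_array", TILEDB_READ);
    tiledb::Query query(ctx, array, TILEDB_READ);

    tiledb::QueryCondition combined(ctx);
    for (int i = 1; i <= 4; i++)
    {
        auto qc = tiledb::QueryCondition::create<int>(ctx, "row", i, TILEDB_EQ)
                      .combine(tiledb::QueryCondition::create<int>(ctx, "col", i, TILEDB_EQ),
                               TILEDB_AND);
        combined = (i == 1) ? qc : combined.combine(qc, TILEDB_OR);
    }
    query.set_condition(combined);
    // ... set layout and data buffers, then query.submit() ...
}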
So the bigger question is what you really want to do with your data and what schema you want. We would love to help you here, but a slightly more complete sketch of your use case may help, instead of waltzing over one introductory example that is somewhat limited in scope because it is meant to be illustrative.
Thanks @eddelbuettel for the explanation. Here I just used the "diagonal" as an example by accident. The general idea is to read arbitrary points by their coordinates from sparse data.
Best, Bin
You can of course select by dimension. That is what they are for.
But as noted you get an 'outer product', not an inner product, because (informally speaking) the selections on multiple dimensions combine as a cross product rather than pairwise. I find it really is best to play with a local ad-hoc array in whichever scripting language you are most at ease with -- maybe Python for you; it is R for me.
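To see the outer product concretely, here is a small sketch with the C++ multi-range API (quickstart array assumed; buffer setup elided). Adding four point ranges on each dimension selects the whole 4 x 4 cross product of coordinates, not only the four diagonal cells:

#include <tiledb/tiledb>

// Sketch: ranges added per dimension combine as a cross product.
void read_outer_product(tiledb::Context &ctx)
{
    tiledb::Array array(ctx, "quickstart_sparse_array", TILEDB_READ);
    tiledb::Subarray subarray(ctx, array);
    for (int i = 1; i <= 4; i++)
    {
        subarray.add_range(0, i, i);  // point range on dimension "rows"
        subarray.add_range(1, i, i);  // point range on dimension "cols"
    }
    tiledb::Query query(ctx, array, TILEDB_READ);
    query.set_subarray(subarray);
    // ... set data buffers and submit; up to 16 cells may match ...
}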
Closing for now, please comment if further discussion needed and we can reopen (or post on https://forum.tiledb.com).