TileDB
optimize random read from dense matrix
The TileDB forum won't allow me to put more than two links in a post, so I'm posting here for advice.
We are currently extending our cytometry tool with TileDB support. Here is a performance comparison between h5 and TileDB: https://rpubs.com/wjiang2/607678
As shown, TileDB beats h5 on S3 out of the box, which is great! I am hoping to get better read speed on local storage as well.
I've followed some of the tips from https://docs.tiledb.com/developer/performance-tips, such as choosing a proper tile size and shape, but so far I haven't been able to match h5's read speed.
A little background on the cytometry data: it is typically a dense 2D matrix where each row represents a cell (or event) and each column represents one measurement (or channel) for that cell. The typical I/O access pattern is random slicing through rows and columns, which is what I benchmarked in that RPubs document.
Here is the code where the matrix is written https://github.com/RGLab/cytolib/blob/tiledb/inst/include/cytolib/CytoFrame.hpp#L195-L216
and here is where data is read by selecting certain entire columns https://github.com/RGLab/cytolib/blob/tiledb/inst/include/cytolib/TileCytoFrame.hpp#L418-L452
and here is for both col&row random indexing https://github.com/RGLab/cytolib/blob/tiledb/inst/include/cytolib/TileCytoFrame.hpp#L461-L483
For h5, I am not doing row indexing at the h5 level; I simply always read each entire column (the data is chunked by column) and then subset in memory.
The R frontend for cytolib is flowWorkspace package https://github.com/RGLab/flowWorkspace/tree/tile
Also, multithreading doesn't seem to affect my results: https://github.com/RGLab/cytolib/blob/tiledb/inst/include/cytolib/TileCytoFrame.hpp#L189
The R Markdown code should be reproducible, but the tiledb branches of both cytolib and flowWorkspace are still at an early stage of development, so let me know if you run into trouble building and running the example code.
I am using TileDB/dev
> tiledb::tiledb_version()
major minor patch
2 0 0
Thanks for the excellent work!
Thanks a lot for the detailed report. We'll certainly dig into your code and see if there is a config issue or an optimization that needs to be done in core. Please allow for a couple of days as we are preparing for the TileDB 2.0 release this week.
Another related question: how can I efficiently read a float array from TileDB into a double buffer? Right now I have to read it into a temporary float buffer
vector<float> buf(nrow * ncol);
query.set_buffer("mat", buf);
Then manually copy over to the double buffer
arma::Mat<double> data(nrow, ncol);
for(int i = 0; i < nrow * ncol; i++)
data.memptr()[i] = buf[i];
The copying is a significant overhead based on my profiling, and I wonder if there is a better way. By the way, libhdf5's read API supports reading h5 data into an arbitrary memory type,
dataset.read(data.memptr(), PredType::NATIVE_FLOAT ,memspace, dataspace);
and I believe it does the conversion internally, yet it is a lot faster than my own copying.
Yeah, we have discussed this internally. It'd be a great feature. Would you like to open a separate issue about this or post a feature request at https://tiledb.canny.io/? We'll be happy to add it to our roadmap.
Done, see #1633.
Thank you!