
optimize random read from dense matrix

Open mikejiang opened this issue 4 years ago • 5 comments

The TileDB forum won't allow more than two links in a post, so I am asking for advice here instead.

We are currently extending our cytometry tool with TileDB support. Here is a performance comparison between h5 and TileDB: https://rpubs.com/wjiang2/607678

As shown, TileDB beats h5 on S3 out of the box, which is great! I am hoping to get better read speed on local storage as well.

I've followed some tips from https://docs.tiledb.com/developer/performance-tips, such as choosing a proper tile size and shape, but so far I haven't been able to match h5's speed.

A little background on cytometry data: it is typically a dense 2D matrix in which each row represents a cell (or event) and each column represents one measurement (or channel) for that cell. The typical I/O access pattern is random slicing through rows and columns, which is what I benchmarked in that RPubs document.

Here is the code where the matrix is written https://github.com/RGLab/cytolib/blob/tiledb/inst/include/cytolib/CytoFrame.hpp#L195-L216

and here is where data is read by selecting certain entire columns https://github.com/RGLab/cytolib/blob/tiledb/inst/include/cytolib/TileCytoFrame.hpp#L418-L452

and here is for both col&row random indexing https://github.com/RGLab/cytolib/blob/tiledb/inst/include/cytolib/TileCytoFrame.hpp#L461-L483

For h5, I am not doing row indexing at the h5 level; I simply chunk and read each entire column, then subset the rows in memory.
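To make the h5 strategy concrete, here is a minimal sketch of the in-memory step, assuming the selected columns have already been read in full into one column-major buffer. The helper name `subset_rows` and the buffer layout are illustrative assumptions, not the actual cytolib code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of cytolib, TileDB, or HDF5).
// `cols` holds ncol complete columns of length `nrow`, stored column-major,
// i.e. exactly what is left after reading each selected column in full.
// Returns the requested rows of every column, again column-major.
std::vector<float> subset_rows(const std::vector<float>& cols,
                               std::size_t nrow,
                               const std::vector<std::size_t>& row_idx) {
    assert(nrow > 0 && cols.size() % nrow == 0);
    const std::size_t ncol = cols.size() / nrow;

    std::vector<float> out;
    out.reserve(row_idx.size() * ncol);

    for (std::size_t c = 0; c < ncol; ++c) {
        const float* col = cols.data() + c * nrow;  // start of column c
        for (std::size_t r : row_idx) {
            assert(r < nrow);
            out.push_back(col[r]);  // random row access within one column
        }
    }
    return out;
}
```

With this layout the storage layer only ever sees whole-column reads, and the random row access happens on data that is already in memory.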

The R frontend for cytolib is flowWorkspace package https://github.com/RGLab/flowWorkspace/tree/tile

Also, multithreading doesn't seem to affect my results: https://github.com/RGLab/cytolib/blob/tiledb/inst/include/cytolib/TileCytoFrame.hpp#L189

The R Markdown code should be reproducible, but the tiledb branches of both cytolib and flowWorkspace are still at an early stage of development, so let me know if you run into trouble building or running the example code.

I am using TileDB/dev

> tiledb::tiledb_version()
major minor patch 
    2     0     0 

Thanks for the excellent work!

mikejiang avatar May 02 '20 18:05 mikejiang

Thanks a lot for the detailed report. We'll certainly dig into your code and see if there is a config issue or an optimization that needs to be done in core. Please allow for a couple of days as we are preparing for the TileDB 2.0 release this week.

stavrospapadopoulos avatar May 02 '20 18:05 stavrospapadopoulos

Another related question: how can I efficiently read a float array from TileDB into a double buffer? Right now I have to read it into a temporary float buffer:

vector<float> buf(nrow * ncol);
query.set_buffer("mat", buf);

Then I manually copy it over to the double buffer:

arma::Mat<double> data(nrow, ncol);
for (int i = 0; i < nrow * ncol; i++)
    data.memptr()[i] = buf[i];

The copying is a significant overhead according to my profiling, so I wonder if there is a better way. By the way, libhdf5's read API supports reading h5 data into an arbitrary type,

dataset.read(data.memptr(), PredType::NATIVE_FLOAT ,memspace, dataspace);

and I believe it is doing the conversion internally, but it is a lot faster than my own copying.
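For reference, the manual widening step can also be written with a standard algorithm. This is only a sketch of the copy itself; `widen_to_double` is a hypothetical helper name, not a TileDB or HDF5 call, and whether it is any faster than the hand-written loop depends on the compiler and optimization flags.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a TileDB or HDF5 API).
// Widens the first `n` floats of `src` into the preallocated double
// buffer `dst`, e.g. the memory returned by arma::Mat<double>::memptr().
inline void widen_to_double(const std::vector<float>& src,
                            double* dst,
                            std::size_t n) {
    assert(src.size() >= n);
    // std::copy_n converts each float to double element by element.
    std::copy_n(src.begin(), n, dst);
}
```

Assuming the buffers from the snippets above, the call would be `widen_to_double(buf, data.memptr(), buf.size());`. It still needs the temporary float buffer, so it does not replace a conversion done inside the library's read path.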

mikejiang avatar May 04 '20 21:05 mikejiang

Yeah, we have discussed this internally. It'd be a great feature. Would you like to open a separate issue about this or post a feature request at https://tiledb.canny.io/? We'll be happy to add it to our roadmap.

stavrospapadopoulos avatar May 04 '20 21:05 stavrospapadopoulos

Done. See #1633.

mikejiang avatar May 04 '20 21:05 mikejiang

Thank you!

stavrospapadopoulos avatar May 04 '20 22:05 stavrospapadopoulos