TileDB-Py
Efficient 'stream'-processing
Hi, I'm currently evaluating TileDB for our project and I'm trying to find out how to efficiently read data. So let's say we create a DenseArray like this:
d_xy = tiledb.Dim(ctx, "xy", domain=(1, 256 * 256), tile=32, dtype="uint64")
d_u = tiledb.Dim(ctx, "u", domain=(1, 128), tile=128, dtype="uint64")
d_w = tiledb.Dim(ctx, "w", domain=(1, 128), tile=128, dtype="uint64")
domain = tiledb.Domain(ctx, d_xy, d_u, d_w)
a1 = tiledb.Attr(ctx, "a1", compressor=None, dtype="float64")
arr = tiledb.DenseArray(
    ctx,
    array_name,
    domain=domain,
    attrs=(a1,),
    cell_order='row-major',
    tile_order='row-major'
)
When reading from the array using the slicing interface, in slices that fit the d_xy tiling, for example arr[1:33], I noticed that a lot of time is spent copying data (I can provide a flamegraph if you are interested). So I'm trying to understand what is happening behind the scenes: in the Domain I created, the cells have a shape of (32, 128, 128), right? And are they saved linearly to disk?
I found the read_direct method, which should not involve a copy, but as it reads the whole array it won't work for huge arrays, and it won't be cache efficient. We would like to process the data in nice chunks that fit into the L3 cache, so we thought working cell-by-cell would be optimal.
Maybe using an interface like this:
iter = arr.cell_iterator(attrs=['a1'], ...)
for cell in iter:
    assert type(cell['a1']) == np.ndarray
    # ...do some work on this cell...
This way, workloads that process the whole array can be implemented such that TileDB can make sure the reading is done efficiently. If the processing is distributed across many machines, cell_iterator would need some way to specify which partition to return cells from. (The cell could also carry some information about its 'geometry', i.e. which part of the array it covers.)
As an alternative, maybe the read_direct interface could be augmented to allow reading only parts of the array, and err out if you cross a cell boundary. That way, TileDB users can build something like the above interface themselves.
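For what it's worth, something in this spirit can already be approximated from user code with tile-aligned slicing. The sketch below is only an illustration of that idea, not a proposed TileDB API: the helper name tile_chunks is made up, and the domain bounds and tile extent are hard-coded to match the array defined above.
def tile_chunks(arr, attr, xy_start=1, xy_max=256 * 256, xy_extent=32):
    """Hypothetical helper: yield tile-aligned chunks along the xy dimension.

    Assumes the 1-based domain and tile extent from the array above, so each
    chunk covers one row of space tiles with shape (32, 128, 128).
    """
    for start in range(xy_start, xy_max + 1, xy_extent):
        stop = min(start + xy_extent, xy_max + 1)
        data = arr[start:stop]
        # depending on the TileDB-Py version, slicing returns a dict keyed by
        # attribute name or a plain ndarray
        chunk = data[attr] if isinstance(data, dict) else data
        yield (start, stop), chunk

# usage, mirroring the cell_iterator sketch above:
# for (start, stop), cell in tile_chunks(arr, 'a1'):
#     ...do some work on this cell...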
I'm just brain dumping, so let me know if this kind of feedback is useful!
cc @uellue
Quickly testing with read_direct reveals that a memmove is still done:
This is a flamegraph acquired with linux perf of just using read_direct to read a whole 8GB dataset into memory. Something like this:
ctx = tiledb.Ctx()
arr = tiledb.DenseArray.load(ctx, "name")
arr.read_direct("a1")
Reading the file in smaller chunks would get rid of the page fault, but a large part of the time would still be spent just copying data.
Now my question would be: is this an inherent limitation of the design of TileDB, or can it be optimized to do 'zero copy'-ish I/O in the future? That is, read a contiguous cell of an uncompressed array from disk directly into a buffer / numpy array.
Btw. the testing was done using TileDB 1.2.2 and TileDB-py 46c3853
(apologies for the late reply)
Hello Alexander,
First of all, we would like to thank you for your interest in TileDB and for taking the time to write an excellent issue including a clear explanation and an informative flame graph!
I see two main issues here:
1. Need for a more efficient read_direct method
The C API supports a layout called TILEDB_GLOBAL_ORDER which has the same effect as the one you mention: there is no re-organization of the cells in main memory - the cells are just read as they are from disk and copied into the user buffers. The Python API reads in TILEDB_ROW_MAJOR by default, and this is why you observe the extra copying. However, as you observed already, given the way you define the domain and read the data, this should automatically reduce to TILEDB_GLOBAL_ORDER. This is something we should (and can) address internally in TileDB. I opened issue 527 in core TileDB - we will work on it very soon.
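As a sketch of what an explicit global-order read could look like from Python (the order='G' query parameter below is an assumption about a future Python API, not something the current bindings expose):
# hypothetical sketch: request a global-order read so cells come back
# as laid out on disk, avoiding the row-major re-organization copy
arr = tiledb.DenseArray.load(ctx, array_name)
data = arr.query(attrs=['a1'], order='G')[1:33]   # assumed 'order' parameter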
2. Use memmap for reading uncompressed arrays
This will achieve a zero-copy read from disk directly into your buffers. We used to have this functionality in early versions of TileDB. We removed it when we refactored our code to abstract the IO behind a Virtual Filesystem, as we started to target cloud storage backends (for which memmap is irrelevant). However, we have been meaning to bring back our memmap functionality. I opened issue 526 in core TileDB - once again, we are planning to address this very soon.
Some extra notes:
- The _memmove_avx_unaligned_... may need a little more investigation. We may have to memalign the buffers that our Python API creates internally for the numpy arrays that store the results (in the C API, this falls on the user upon creation of her buffers); see the sketch after this list.
- We are overhauling our dense/sparse read/write algorithms, significantly optimizing them. We will run more perf analysis after the refactoring is done. We are also constantly optimizing the Python API, so hopefully you will see a significant performance boost in the upcoming weeks.
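Sketch referenced in the first note above: one common way to hand the C layer an aligned result buffer from Python is to over-allocate and offset rather than calling memalign directly; the 64-byte alignment here is just an assumed cache-line-friendly choice.
import numpy as np

def aligned_empty(shape, dtype, alignment=64):
    """Return an uninitialized ndarray whose data pointer is aligned.

    Over-allocates a byte buffer and slices it at the next aligned address;
    the result can then be used as the destination buffer for a read.
    """
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-raw.ctypes.data) % alignment
    return raw[offset:offset + nbytes].view(dtype).reshape(shape)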
I am closing this issue for now, since I opened the related issues in the TileDB repo.
Thank you once again for your valuable feedback! We look forward to more! :)
I am re-opening this because it would be nice to track progress on these issues. Also, we should expose TileDB's GLOBAL_ORDER layout on the Python side, which would be necessary for a streaming operation with minimal copies.
Also it would be nice to expose a high-level iterator over the discrete tiles in the dense case.
@sk1p Another quick optimization would be to disable caching by setting the config param sm.tile_cache_size to 0. This will save you one full copy.
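For anyone else reading along, a minimal sketch of setting that parameter when creating the context:
import tiledb

# Disable the tile cache so reads skip the extra copy into the cache.
config = tiledb.Config({"sm.tile_cache_size": "0"})
ctx = tiledb.Ctx(config)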
Thank you for your response!
@stavrospapadopoulos disabling the cache already gave a nice speedup, thanks for the tip!
Looks like you are on the right track efficiency-wise! Two remarks:
- memmap could be very relevant for cloud storage backends, at least for HDFS! If you are looking at short-circuit reads, you can almost completely circumvent the datanode. Short-circuit reads are implemented using file descriptor passing over unix domain sockets, so you should be able to memory map them.
- While looking into memory alignment is important, I think it is not as important as completely eliminating the copy. According to the internet (see comments also), on recent hardware even unaligned AVX performs very well.

Regarding short-circuit reading, I think that may involve some more work - I don't know if libhdfs exposes a suitable interface. It is of course only relevant in combination with information on data locality, which according to your documentation is planned but not there yet.
Thanks for the notes @sk1p.
Btw, we are aware of short-circuit reads for HDFS (cc @npapa). This is in our roadmap and in fact we will address it along with posix/win mmap. Please stay tuned :).