datacube-core
find_datasets_lazy isn't actually lazy
Expected behaviour
- Expect datasets to stream out of the DB, with the first dataset arriving well before the query completes
- Expect memory usage to be bounded no matter the size of the result set
Actual behaviour
- Memory usage is proportional to the size of the result set
- The entire response from the DB is read and cached in memory before the first dataset becomes available
- Only the construction of `Dataset` objects from the DB response happens in a "lazy" fashion (see the streaming sketch after this list)
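
For contrast, bounded-memory streaming needs a server-side cursor so the database sends rows in batches. Below is a minimal sketch using SQLAlchemy's `stream_results` execution option; the connection URL, table and column names are placeholders, not datacube's actual internals:

```python
from sqlalchemy import create_engine, text

def stream_dataset_rows(db_url):
    # Placeholder sketch, not datacube's implementation.
    engine = create_engine(db_url)
    with engine.connect() as conn:
        # stream_results=True asks the driver for a server-side cursor,
        # so rows arrive in batches instead of being fetched all at once.
        result = conn.execution_options(stream_results=True).execute(
            text("SELECT id, metadata FROM agdc.dataset")  # placeholder query
        )
        for row in result:
            yield row  # first row available long before the query completes
```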
I ran some experiments on the NCI installation (using VDI):

```python
dss = dc.find_datasets_lazy(product='wofs_albers')
rr = ds_stream_test_func(dss, lambda ds: ds.id)
print(rr.text)
```
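
The `peak memory`/`increment` and `CPU times`/`Wall time` lines in the results below match the output formats of memory_profiler's `%memit` magic and IPython's `%%time` magic, so the harness was presumably something like the following, run as two separate IPython cells (an assumption; the actual invocation isn't shown in the issue):

```python
%load_ext memory_profiler
```

```python
%%time
%memit ds_stream_test_func(dss, lambda ds: ds.id)
```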
Where `ds_stream_test_func` iterates over all `dss` and computes the xor of all the uuids:
```python
from timeit import default_timer as timer

def ds_stream_test_func(dss, get_uuid):
    ## minimal setup; full version omitted
    t0, uu, count = None, 0, 0
    for ds in dss:
        t0 = t0 or timer()      # time at which the first dataset arrived (TTFB)
        uu ^= get_uuid(ds).int  # xor of all uuids
        count += 1
    ## compute stats, prepare printout (omitted)
```
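
A quick way to see what the function measures (a hypothetical sanity check, not part of the original experiment): feed it a deliberately slow generator; with a truly lazy source, TTFB should be roughly one item's delay rather than the whole set's.

```python
import time
import uuid
from types import SimpleNamespace

def slow_source(n, delay=0.01):
    # Yields one fake dataset at a time, simulating rows trickling from a DB.
    for _ in range(n):
        time.sleep(delay)
        yield SimpleNamespace(id=uuid.uuid4())

ds_stream_test_func(slow_source(100), lambda ds: ds.id)  # TTFB ≈ delay if lazy
```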
Results
Ran this on the two largest products we have: wofs and ls7 albers. In the case of wofs, which has 2.7 million datasets, it took 30 minutes and ~5.7 GiB of RAM before the first dataset came out of the "lazy" query; it then took another 5 minutes to iterate over all the datasets. RAM usage is about ~2 KB per dataset, which is consistent with the "serialised to JSON" dataset size (a quick arithmetic check follows the numbers below), so the entire query result is cached in RAM before being lazily parsed by the ORM into `Dataset` objects.
- `wofs_albers`

  ```
  Count: 2,704,343
  1269.4 per second
  Total: 2130.488 sec
  TTFB : 1801.541 sec
  .....: 2560530453B44E6E92E4ABEE92F4ED05
  ..
  peak memory: 5786.51 MiB, increment: 5648.92 MiB
  CPU times: user 5min 42s, sys: 35 s, total: 6min 17s
  Wall time: 35min 30s
  ```

- `ls7_nbar_albers`

  ```
  Count: 1,247,646
  2919.2 per second
  Total: 427.400 sec
  TTFB : 268.955 sec
  .....: 11EB56257F3B053339B395AC5E628A7A
  ..
  peak memory: 2950.53 MiB, increment: 2812.87 MiB
  CPU times: user 2min 43s, sys: 11.8 s, total: 2min 55s
  Wall time: 7min 7s
  ```
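
As a back-of-envelope check of the ~2 KB-per-dataset figure (my own arithmetic from the numbers above, not from the original report), dividing the measured memory increment by the dataset count for `wofs_albers` gives:

```python
count = 2_704_343
increment_bytes = 5648.92 * 2**20      # measured increment for wofs_albers
print(round(increment_bytes / count))  # ≈ 2190 bytes, i.e. ~2 KB per dataset
```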
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is really hard to do with the current implementation, but should be possible with the database changes coming for ODC 2.0.