datacube-core

find_datasets_lazy isn't actually lazy

Open Kirill888 opened this issue 6 years ago • 4 comments

Expected behaviour

  • Expect datasets to stream out of DB, first dataset arriving way before query completes
  • Expect memory usage to be bounded no matter the size of the result set

Actual behaviour

  • Memory usage is proportional to result set
  • The entire response from DB is read and cached in memory, before first dataset becomes available
  • Only construction of Dataset objects from DB response is happening in a "lazy" fashion
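The difference between the expected and actual behaviour can be illustrated with a small simulation. This is only a sketch, not datacube code: `CountingSource`, `buffered_then_lazy`, `streamed` and `rows_fetched_before_first` are hypothetical names, and the source object merely stands in for a database cursor so that the number of rows pulled before the first result can be counted.

```python
from itertools import islice


class CountingSource:
    """Hypothetical stand-in for a DB cursor; records how many rows were fetched."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.fetched = 0

    def __iter__(self):
        for i in range(self.n_rows):
            self.fetched += 1
            yield {"id": i}


def buffered_then_lazy(source):
    """Mimics the reported behaviour: read everything, then build objects lazily."""
    rows = list(source)  # the whole result set is held in memory here
    for row in rows:
        yield ("Dataset", row["id"])  # only object construction is deferred


def streamed(source, chunk_size=100):
    """Mimics the expected behaviour: fetch a bounded chunk at a time."""
    it = iter(source)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        for row in chunk:
            yield ("Dataset", row["id"])


def rows_fetched_before_first(make_iter, n_rows=10_000):
    """Number of rows pulled from the source before the first item is available."""
    source = CountingSource(n_rows)
    next(make_iter(source))
    return source.fetched


if __name__ == "__main__":
    print(rows_fetched_before_first(buffered_then_lazy))  # 10000
    print(rows_fetched_before_first(streamed))            # 100
```

In the buffered variant, time to first result and memory both grow with the size of the result set. In the streamed variant they are bounded by the chunk size.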

Kirill888 avatar Aug 24 '18 00:08 Kirill888

I ran some experiments on the NCI installation (using VDI):

dss = dc.find_datasets_lazy(product='wofs_albers')
rr = ds_stream_test_func(dss, lambda ds: ds.id)
print(rr.text)

where ds_stream_test_func iterates over all of dss and computes the XOR of all UUIDs:

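from timeit import default_timer as timer  # assumed: the original does not show where timer comes from

def ds_stream_test_func(dss, get_uuid):
    ## setup omitted in the original; minimal initialisation assumed here
    uu, count, t0 = 0, 0, None
    for ds in dss:
        t0 = t0 or timer()      # timestamp of the first dataset's arrival
        uu ^= get_uuid(ds).int  # XOR the 128-bit integer value of each UUID
        count += 1
    ## compute stats, prepare printout (omitted)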
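For anyone wanting to reproduce the measurement, here is one way the omitted parts might look. This is a guess at the shape of the harness, not the code that was actually run: `stream_stats` and `StreamStats` are hypothetical names, and the printout format is only modelled on the results in this issue. Peak memory is not measured here.

```python
from collections import namedtuple
from timeit import default_timer as timer
import uuid

# Hypothetical result container; .text holds a printable summary.
StreamStats = namedtuple("StreamStats", ["count", "total", "ttfb", "xor", "text"])


def stream_stats(dss, get_uuid):
    """Iterate over dss, recording time to first item, total time and XOR of UUIDs."""
    t_start = timer()
    t_first = None
    xor_acc = 0
    count = 0
    for ds in dss:
        if t_first is None:
            t_first = timer()  # first dataset has arrived
        xor_acc ^= get_uuid(ds).int
        count += 1
    total = timer() - t_start
    # With an empty stream there is no first item, so fall back to the total time.
    ttfb = (t_first - t_start) if t_first is not None else total
    rate = count / total if total > 0 else 0.0
    text = (
        f"Count: {count:,}\n"
        f"       {rate:.1f} per second\n"
        f"Total: {total:.3f} sec\n"
        f"TTFB : {ttfb:.3f} sec\n"
        f".....: {xor_acc:032X}\n"
    )
    return StreamStats(count, total, ttfb, xor_acc, text)


if __name__ == "__main__":
    ids = (uuid.uuid4() for _ in range(1000))
    print(stream_stats(ids, lambda u: u).text)
```

The XOR of the UUIDs is just a cheap checksum that forces every dataset to be touched. The number of interest is TTFB relative to the total time.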

Results

I ran this on the two largest products we have, wofs_albers and ls7_nbar_albers. For wofs_albers, which has 2.7 million datasets, it took 30 minutes and ~5.7 GB of RAM before the first dataset came out of the "lazy" query, and then another 5 minutes to iterate over all datasets. RAM usage is about 2 KB per dataset, which is consistent with the "serialised to JSON" dataset size, so the entire query result is cached in RAM before being lazily parsed into datasets by the ORM.

  • wofs_albers
Count: 2,704,343
       1269.4 per second
Total: 2130.488 sec
TTFB : 1801.541 sec
.....: 2560530453B44E6E92E4ABEE92F4ED05
..
peak memory: 5786.51 MiB, increment: 5648.92 MiB
CPU times: user 5min 42s, sys: 35 s, total: 6min 17s
Wall time: 35min 30s
  • ls7_nbar_albers
Count: 1,247,646
       2919.2 per second
Total: 427.400 sec
TTFB : 268.955 sec
.....: 11EB56257F3B053339B395AC5E628A7A
..
peak memory: 2950.53 MiB, increment: 2812.87 MiB
CPU times: user 2min 43s, sys: 11.8 s, total: 2min 55s
Wall time: 7min 7s
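The "about 2 KB per dataset" figure can be checked against the numbers above. The snippet below is just that arithmetic, using the reported memory increment and dataset count for each product. The names `runs` and `bytes_per_dataset` are only for illustration.

```python
MIB = 1024 * 1024  # bytes per MiB

# Dataset counts and memory increments as reported in the results above.
runs = {
    "wofs_albers": {"count": 2_704_343, "increment_mib": 5648.92},
    "ls7_nbar_albers": {"count": 1_247_646, "increment_mib": 2812.87},
}


def bytes_per_dataset(count, increment_mib):
    """Average memory increment per dataset, in bytes."""
    return increment_mib * MIB / count


if __name__ == "__main__":
    for name, run in runs.items():
        print(f"{name}: {bytes_per_dataset(**run):.0f} bytes per dataset")
```

Both products come out at roughly 2.2 to 2.4 kB per dataset, so the memory increment scales with the number of datasets in the result set.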

Kirill888 avatar Aug 27 '18 04:08 Kirill888

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 08 '20 07:08 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 07 '20 23:12 stale[bot]

This is really hard to do with the current implementation, but should be possible with the database changes coming for ODC 2.0.

omad avatar May 18 '23 03:05 omad