datacube-core
find_datasets_lazy isn't actually lazy
Expected behaviour
- Expect datasets to stream out of the DB, with the first dataset arriving well before the query completes
- Expect memory usage to be bounded no matter the size of the result set
Actual behaviour
- Memory usage is proportional to the size of the result set
- The entire response from the DB is read and cached in memory before the first dataset becomes available
- Only the construction of `Dataset` objects from the DB response happens in a "lazy" fashion (see the streaming sketch after this list)
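
For contrast, bounded-memory streaming needs a server-side cursor so the database sends rows in batches. Below is a minimal sketch using SQLAlchemy's `stream_results` execution option; the connection URL, table and column names are placeholders, not datacube's actual internals:

```python
from sqlalchemy import create_engine, text

def stream_dataset_rows(db_url):
    # Placeholder sketch, not datacube's implementation.
    engine = create_engine(db_url)
    with engine.connect() as conn:
        # stream_results=True asks the driver for a server-side cursor,
        # so rows arrive in batches instead of being fetched all at once.
        result = conn.execution_options(stream_results=True).execute(
            text("SELECT id, metadata FROM agdc.dataset")  # placeholder query
        )
        for row in result:
            yield row  # first row available long before the query completes
```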
I ran some experiments on the NCI installation (using VDI):

```python
dss = dc.find_datasets_lazy(product='wofs_albers')
rr = ds_stream_test_func(dss, lambda ds: ds.id)
print(rr.text)
```
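
The `peak memory`/`increment` and `CPU times`/`Wall time` lines in the results below match the output formats of memory_profiler's `%memit` magic and IPython's `%%time` magic, so the harness was presumably something like the following, run as two separate IPython cells (an assumption; the actual invocation isn't shown in the issue):

```python
%load_ext memory_profiler
```

```python
%%time
%memit ds_stream_test_func(dss, lambda ds: ds.id)
```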
Where `ds_stream_test_func` iterates over all `dss` and computes the xor of all the uuids:
```python
from timeit import default_timer as timer

def ds_stream_test_func(dss, get_uuid):
    ## minimal setup; full version omitted
    t0, uu, count = None, 0, 0
    for ds in dss:
        t0 = t0 or timer()      # time at which the first dataset arrived (TTFB)
        uu ^= get_uuid(ds).int  # xor of all uuids
        count += 1
    ## compute stats, prepare printout (omitted)
```
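
A quick way to see what the function measures (a hypothetical sanity check, not part of the original experiment): feed it a deliberately slow generator; with a truly lazy source, TTFB should be roughly one item's delay rather than the whole set's.

```python
import time
import uuid
from types import SimpleNamespace

def slow_source(n, delay=0.01):
    # Yields one fake dataset at a time, simulating rows trickling from a DB.
    for _ in range(n):
        time.sleep(delay)
        yield SimpleNamespace(id=uuid.uuid4())

ds_stream_test_func(slow_source(100), lambda ds: ds.id)  # TTFB ≈ delay if lazy
```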
Results
Ran this on the two largest products we have: wofs and ls7 albers. In the case of wofs, which has 2.7 million datasets, it took 30 minutes and ~5.7 GiB of RAM before the first dataset came out of the "lazy" query; it then took another 5 minutes to iterate over all the datasets. RAM usage is about ~2 KB per dataset, which is consistent with the "serialised to JSON" dataset size (a quick arithmetic check follows the numbers below), so the entire query result is cached in RAM before being lazily parsed by the ORM into `Dataset` objects.
- `wofs_albers`

  ```
  Count: 2,704,343
  1269.4 per second
  Total: 2130.488 sec
  TTFB : 1801.541 sec
  .....: 2560530453B44E6E92E4ABEE92F4ED05
  ..
  peak memory: 5786.51 MiB, increment: 5648.92 MiB
  CPU times: user 5min 42s, sys: 35 s, total: 6min 17s
  Wall time: 35min 30s
  ```

- `ls7_nbar_albers`

  ```
  Count: 1,247,646
  2919.2 per second
  Total: 427.400 sec
  TTFB : 268.955 sec
  .....: 11EB56257F3B053339B395AC5E628A7A
  ..
  peak memory: 2950.53 MiB, increment: 2812.87 MiB
  CPU times: user 2min 43s, sys: 11.8 s, total: 2min 55s
  Wall time: 7min 7s
  ```
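
As a back-of-envelope check of the ~2 KB-per-dataset figure (my own arithmetic from the numbers above, not from the original report), dividing the measured memory increment by the dataset count for `wofs_albers` gives:

```python
count = 2_704_343
increment_bytes = 5648.92 * 2**20      # measured increment for wofs_albers
print(round(increment_bytes / count))  # ≈ 2190 bytes, i.e. ~2 KB per dataset
```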
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is really hard to do with the current implementation, but should be possible with the database changes coming for ODC 2.0.