ArcticDB
Poor performance when reading as_of a date with many early versions deleted
Describe the bug
ArcticDB has a "tombstone all" feature that is meant to stop version search from walking the entire historical version list when all of the early versions have been deleted.
It works when as_of is a version number: the read returns quickly, having not found the version.
However, when as_of is a date, the read can be slow. This is much more apparent on AWS, where latency is higher.
With thousands of versions, the read can take several minutes.
Steps/Code to Reproduce
import arcticdb as adb
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

arctic = adb.Arctic(<AWS S3 uri>)
lib = arctic.get_library('adb_bugs', create_if_missing=True)

N = 3
df = pd.DataFrame(
    index=pd.date_range("20240101", periods=N),
    data={'col': np.arange(0., N)}
)

# write 500 versions
sym1 = 'asof_slow_read'
for i in range(500):
    lib.write(sym1, df)

# remove early versions
lib.delete(sym1)

# add one more version
lib.write(sym1, df)

# this is slow (12s in my test)
as_of = datetime.now() - timedelta(days=1)
lib.read(sym1, as_of=as_of)

# this is fast (171ms in my test)
lib.read(sym1, as_of=499)
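To compare the two reads on other storage backends, a small timing wrapper may help. This is only a sketch: `timed` is a hypothetical helper, not part of ArcticDB. It captures an exception instead of propagating it, on the assumption that a read may raise rather than return when no matching version exists.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (outcome, seconds).

    outcome is the return value, or the exception instance if fn raised.
    Hypothetical helper for measuring the reads above; not an ArcticDB API.
    """
    start = time.perf_counter()
    try:
        outcome = fn(*args, **kwargs)
    except Exception as exc:  # keep the timing even when the call fails
        outcome = exc
    return outcome, time.perf_counter() - start


# Intended usage with the objects from the reproduction steps:
#   outcome, secs = timed(lib.read, sym1, as_of=as_of)
#   outcome, secs = timed(lib.read, sym1, as_of=499)
```

Running each read several times and comparing the elapsed times should make the gap between the date lookup and the version-number lookup easier to see.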
Expected Results
Results are as expected. This is a performance issue.
OS, Python Version and ArcticDB Version
Python 3.10
Linux (Linux version 5.15.133.1-microsoft-standard-WSL2)
arcticdb 4.3.1
Backend storage used
AWS S3
Additional Context
This is possibly related to the issue linked below (failure to observe tombstones correctly). It may be easier to solve the two together.
https://github.com/man-group/ArcticDB/issues/1385
This is most likely a failure to short-circuit on fast-tombstone all keys, since the logic for a timestamp search is a bit more complex than when searching by exact version number.
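The suspected behaviour can be illustrated with a toy model. This is a sketch under stated assumptions, not ArcticDB's implementation: `Entry`, `build_chain` and `find_as_of_ts` are hypothetical names, the chain is a plain Python list ordered newest to oldest, and a "tombstone_all" entry is assumed to mean that every version at or below its id is deleted. Each visited entry stands in for one storage round trip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One entry in a toy version chain (hypothetical layout)."""
    kind: str            # "version" or "tombstone_all"
    version_id: int      # for "tombstone_all": all versions <= this id are deleted
    created_ts: int = 0  # creation time of a "version" entry, arbitrary units


def build_chain(n_deleted=500):
    """Newest-first chain: one live version, a tombstone-all, then the deleted versions."""
    deleted = [Entry("version", v, created_ts=v + 100)
               for v in range(n_deleted - 1, -1, -1)]
    return [Entry("version", n_deleted, created_ts=10_000),
            Entry("tombstone_all", n_deleted - 1)] + deleted


def find_as_of_ts(chain, as_of_ts, short_circuit=True):
    """Return (version_id or None, entries_visited) for a timestamp lookup."""
    visited = 0
    deleted_up_to = -1
    for entry in chain:
        visited += 1
        if entry.kind == "tombstone_all":
            deleted_up_to = max(deleted_up_to, entry.version_id)
            if short_circuit:
                # Everything older than this entry is deleted, so stop here.
                return None, visited
            continue
        if entry.version_id <= deleted_up_to:
            continue  # deleted version: cannot satisfy the lookup
        if entry.created_ts <= as_of_ts:
            return entry.version_id, visited
    return None, visited


chain = build_chain(500)
print(find_as_of_ts(chain, as_of_ts=50, short_circuit=True))   # (None, 2)
print(find_as_of_ts(chain, as_of_ts=50, short_circuit=False))  # (None, 502)
```

Both calls give the same answer, but without the short-circuit the lookup visits every deleted entry, which in this model matches the gap between the fast and slow reads reported above.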