Enhancement 1721: arbitrary clause ordering
To be rebased after #1834 is merged
#### Reference Issues/PRs

Closes #1721, closes #245
Performance:
Benchmarked using 8 cores, with mimalloc preloaded, and LMDB as the storage backend.

Data of the form:

```
                         tick type       bid       ask
2020-01-01 08:00:00.000        ASK       NaN  0.291217
2020-01-01 08:00:00.001        BID  0.271128       NaN
2020-01-01 08:00:00.002        ASK       NaN  0.664834
2020-01-01 08:00:00.003        ASK       NaN  0.098223
2020-01-01 08:00:00.004        BID  0.751502       NaN
```

i.e. `tick type` is a string column containing "BID" or "ASK" with equal probability, and the `bid` and `ask` columns contain random floats between 0 and 1 if the tick type matches the column name, or NaN otherwise.
- 1 tick every millisecond (60k ticks per minute)
- 24m ticks per day (8 hours)
- 6B ticks per year (250 days)
- ~100GB on disk (randomness and NaNs compress poorly; raw data is ~179GB)
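
For reference, data of roughly this shape can be generated along the following lines. This is a minimal sketch, not the actual benchmark generator; the `make_ticks` helper, the seed, and the tick count are made up for illustration:

```python
import numpy as np
import pandas as pd

def make_ticks(start: str, n_ticks: int) -> pd.DataFrame:
    """One tick per millisecond, BID or ASK with equal probability."""
    rng = np.random.default_rng(0)
    index = pd.date_range(start, periods=n_ticks, freq="ms")
    is_bid = rng.random(n_ticks) < 0.5
    prices = rng.random(n_ticks)
    return pd.DataFrame(
        {
            "tick type": np.where(is_bid, "BID", "ASK"),
            # The price lands in the column matching the tick type, NaN otherwise
            "bid": np.where(is_bid, prices, np.nan),
            "ask": np.where(~is_bid, prices, np.nan),
        },
        index=index,
    )

df = make_ticks("2020-01-01 08:00:00", 1_000_000)  # scale n_ticks up to build a full day/year
```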
Performance (with default 100k rows per segment):
- Reading (6B is all data, 3B is with half the date range)
  - Reading 6B ticks took 28.9s
  - Reading 3B ticks took 13.3s
  - i.e. scales linearly in the date range covered
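
The read benchmark corresponds to plain `read` calls, roughly as below. The LMDB URI, library and symbol names are placeholders, and the half-range dates are illustrative, not the ones used in the benchmark:

```python
from datetime import datetime

from arcticdb import Arctic

ac = Arctic("lmdb:///tmp/tick_benchmark")  # placeholder LMDB URI
lib = ac.get_library("ticks")              # placeholder library name

# Full read (~6B ticks)
all_ticks = lib.read("ticks").data

# Half of the date range (~3B ticks)
half_ticks = lib.read(
    "ticks",
    date_range=(datetime(2020, 1, 1), datetime(2020, 7, 1)),  # illustrative bounds
).data
```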
- Filtering on the `tick type` column to one of "BID" or "ASK"
  - Filtering 6B ticks took 42.7s
  - Filtering 3B ticks took 20.7s
  - i.e. scales linearly in the date range covered, ~50% slower than raw reading time
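
The filter benchmark corresponds to a single `QueryBuilder` filter clause, roughly as below (same placeholder library and symbol as above):

```python
from arcticdb import QueryBuilder

q = QueryBuilder()
q = q[q["tick type"] == "BID"]  # keep only BID ticks
bids = lib.read("ticks", query_builder=q).data
```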
- Resampling down to minute frequency, taking the max of the `bid` column
  - Resampling 6B ticks to 100,000 mins took ~19s
  - Resampling 3B ticks to 50,000 mins took 9.7s
  - i.e. scales linearly in the date range covered, ~33% faster than raw reading time
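
The resample benchmark corresponds to a single resampling clause, along these lines (minute buckets, `max` aggregation on `bid`; again a sketch against the placeholder library/symbol):

```python
from arcticdb import QueryBuilder

q = QueryBuilder()
q = q.resample("min").agg({"bid": "max"})  # minute buckets, max bid per bucket
minute_max_bid = lib.read("ticks", query_builder=q).data
```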
- Combination of the filter and resample described above
  - Filtering then resampling 6B ticks to 100,000 mins took 39.1s
  - Filtering then resampling 3B ticks to 50,000 mins took 19.3s
  - i.e. scales linearly in the date range covered, ~40% slower than raw reading time
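
This combined case is what arbitrary clause ordering is aimed at: a filter clause followed by a resample clause in one `QueryBuilder` chain, roughly:

```python
from arcticdb import QueryBuilder

q = QueryBuilder()
q = q[q["tick type"] == "BID"]             # filter clause first
q = q.resample("min").agg({"bid": "max"})  # then the resample clause
filtered_resampled = lib.read("ticks", query_builder=q).data
```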
Restructuring the data after the filter and before the resample takes ~100ms for 6B ticks (i.e. ~0.25% of the total time). The tail latency introduced by the restructuring's "stop the world" approach is ~2ms in this example (the time to filter one segment).
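
For intuition only, the "stop the world" restructuring between the filter and resample clauses can be pictured as in the sketch below. This is a simplified Python illustration of the scheduling idea (one filter task per segment, a synchronisation point, then one task per bucket), not the actual C++ processing pipeline:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def filter_then_resample(segments, keep_row, bucket_of, aggregate_bucket):
    """segments: iterable of lists of (timestamp, row) pairs."""
    with ThreadPoolExecutor() as pool:
        # Filter clause: one independent task per storage segment
        filtered = list(
            pool.map(lambda seg: [(ts, row) for ts, row in seg if keep_row(row)], segments)
        )

        # "Stop the world" restructuring: wait for every filter task to finish,
        # then regroup rows so each resample bucket is owned by exactly one task
        buckets = defaultdict(list)
        for seg in filtered:
            for ts, row in seg:
                buckets[bucket_of(ts)].append(row)

        # Resample clause: one independent task per bucket
        return {
            key: result
            for key, result in zip(buckets, pool.map(aggregate_bucket, buckets.values()))
        }
```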
Everything is ~10% faster with 1M rows per segment.