ArcticDB Enhancement 1721: arbitrary clause ordering

Open alexowens90 opened this issue 4 months ago • 0 comments

To be rebased after #1834 is merged

Reference Issues/PRs

Closes #1721 Closes #245

Performance:

Benchmarked using 8 cores, with mimalloc preloaded, and lmdb as the storage backend.

Data of the form:

                        tick type       bid       ask
2020-01-01 08:00:00.000       ASK       NaN  0.291217
2020-01-01 08:00:00.001       BID  0.271128       NaN
2020-01-01 08:00:00.002       ASK       NaN  0.664834
2020-01-01 08:00:00.003       ASK       NaN  0.098223
2020-01-01 08:00:00.004       BID  0.751502       NaN

i.e. tick type is a string column containing "BID" or "ASK" with equal probability, and the bid and ask columns contain random floats between 0 and 1 where the tick type matches the column name, and NaN otherwise.
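
For anyone wanting to reproduce data of this shape, here is a minimal pandas/NumPy sketch. The function name `make_ticks`, the seed, and the start timestamp are assumptions for illustration; the actual benchmark generator is not shown in this PR.

```python
import numpy as np
import pandas as pd


def make_ticks(n_rows, start="2020-01-01 08:00:00", seed=0):
    """Build a tick DataFrame shaped like the sample above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # One tick every millisecond.
    index = pd.date_range(start, periods=n_rows, freq="ms")
    # "BID" or "ASK" with equal probability.
    tick_type = rng.choice(["BID", "ASK"], size=n_rows)
    # Uniform random floats in [0, 1), placed in the column matching the tick type.
    values = rng.random(n_rows)
    return pd.DataFrame(
        {
            "tick type": tick_type,
            "bid": np.where(tick_type == "BID", values, np.nan),
            "ask": np.where(tick_type == "ASK", values, np.nan),
        },
        index=index,
    )


ticks = make_ticks(5)
print(ticks)
```

Each row has exactly one of bid or ask populated, which is what makes the filter on tick type equivalent to dropping the NaN rows of one price column.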

1 tick every millisecond (60k ticks per minute)
24m ticks per day (8 hours)
6B ticks per year (250 days)
~100GB on disk (randomness and NaNs compress poorly, raw data is ~179GB)
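
As a quick sanity check, the yearly tick count and the number of one-minute buckets quoted in the resampling results follow from the figures above. This is plain arithmetic on the stated numbers; the constant names are illustrative only.

```python
TICKS_PER_MINUTE = 60_000          # 1 tick every millisecond
TICKS_PER_DAY = 24_000_000         # as stated above
TRADING_DAYS_PER_YEAR = 250

# Total ticks in the full dataset (the "6B" figure).
ticks_per_year = TICKS_PER_DAY * TRADING_DAYS_PER_YEAR

# One-minute buckets produced by resampling the full and the half date range.
minute_buckets_full = ticks_per_year // TICKS_PER_MINUTE
minute_buckets_half = (ticks_per_year // 2) // TICKS_PER_MINUTE

print(ticks_per_year, minute_buckets_full, minute_buckets_half)
```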

Performance (with default 100k rows per segment):

Reading (6B is all data, 3B is with half the date range)
- Reading 6B ticks took 28.9s
- Reading 3B ticks took 13.3s
  - i.e. scales linearly in date range covered
Filtering on tick type column to one of "BID" or "ASK"
- Filtering 6B ticks took 42.7s
- Filtering 3B ticks took 20.7s
  - i.e. scales linearly in date range covered, ~50% slower than raw reading time
Resampling down to minute frequency, taking the max of the bid column
- Resampling 6B ticks to 100,000 mins took 19.s
- Resampling 3B ticks to 50,000 mins took 9.7s
  - i.e. scales linearly in date range covered, ~33% faster than raw reading time
Combination of filter and resample described above
- Filtering then resampling 6B ticks to 100,000 mins took 39.1s
- Filtering then resampling 3B ticks to 50,000 mins took 19.3s
  - i.e. scales linearly in date range covered, ~40% slower than raw reading time
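
To make the semantics of the combined query concrete, here is a pandas sketch of "filter on tick type, then resample to minute frequency taking the max of bid". It is a reference for the expected result only, not ArcticDB's implementation or API; `filter_then_resample` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def filter_then_resample(df, tick_type="BID", column="bid"):
    """Keep rows of one tick type, then take the per-minute max of one column."""
    filtered = df[df["tick type"] == tick_type]
    return filtered[column].resample("min").max()


# Tiny hand-built sample spanning two one-minute buckets.
sample = pd.DataFrame(
    {
        "tick type": ["BID", "ASK", "BID", "BID"],
        "bid": [0.2, np.nan, 0.7, 0.4],
        "ask": [np.nan, 0.5, np.nan, np.nan],
    },
    index=pd.to_datetime(
        [
            "2020-01-01 08:00:00.000",
            "2020-01-01 08:00:00.001",
            "2020-01-01 08:00:30.000",
            "2020-01-01 08:01:00.000",
        ]
    ),
)

per_minute_max = filter_then_resample(sample)
print(per_minute_max)
```

In the sample, the 08:00 bucket holds bids 0.2 and 0.7 and the 08:01 bucket holds 0.4, so the result has two rows with maxima 0.7 and 0.4.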

Restructuring after the filter and before the resample takes ~100ms for 6B ticks (i.e. 0.25% of the total time). Tail latency introduced by the restructuring "stop the world" approach is ~2ms in this example (the time to filter one segment).
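
The relative figures can be re-derived from the 6B-tick timings listed above. The sketch below only redoes that arithmetic; the variable and function names are illustrative.

```python
# Timings in seconds for the 6B-tick runs, taken from the results above.
READ_6B_S = 28.9
FILTER_6B_S = 42.7
FILTER_RESAMPLE_6B_S = 39.1
RESTRUCTURE_S = 0.1  # ~100ms restructuring cost


def percent_slower_than_read(seconds, baseline=READ_6B_S):
    """Percentage by which a run is slower than the plain read."""
    return (seconds / baseline - 1.0) * 100.0


filter_overhead = percent_slower_than_read(FILTER_6B_S)
combined_overhead = percent_slower_than_read(FILTER_RESAMPLE_6B_S)
# Restructuring cost as a share of the filter-then-resample run.
restructure_share = RESTRUCTURE_S / FILTER_RESAMPLE_6B_S * 100.0

print(f"filter: {filter_overhead:.1f}% slower than read")
print(f"filter + resample: {combined_overhead:.1f}% slower than read")
print(f"restructuring: {restructure_share:.2f}% of filter + resample time")
```

The restructuring share comes out at roughly a quarter of one percent of the 39.1s filter-then-resample run, in line with the figure quoted above.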

Everything ~10% faster with 1m rows per segment

Sep 30 '24 12:09 alexowens90
