Enhancement 1721: arbitrary clause ordering

Open alexowens90 opened this issue 4 months ago • 0 comments

To be rebased after #1834 is merged

Reference Issues/PRs

Closes #1721 Closes #245

Performance:

Benchmarked using 8 cores, with mimalloc preloaded, and LMDB as the storage backend.

Data of the form:

                        tick type       bid       ask
2020-01-01 08:00:00.000       ASK       NaN  0.291217
2020-01-01 08:00:00.001       BID  0.271128       NaN
2020-01-01 08:00:00.002       ASK       NaN  0.664834
2020-01-01 08:00:00.003       ASK       NaN  0.098223
2020-01-01 08:00:00.004       BID  0.751502       NaN

i.e. tick type is a string column containing "BID" or "ASK" with equal probability, and the bid and ask columns contain random floats between 0 and 1 if the tick type matches the column name, or NaN otherwise

  • 1 tick every millisecond (60k ticks per minute)
  • 24m ticks per day (8 hours)
  • 6B ticks per year (250 days)
  • ~100GB on disk (randomness and NaNs compress poorly, raw data is ~179GB)
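
A minimal sketch of how data of this shape could be generated with pandas and NumPy. The function name `make_ticks`, its parameters, and the seed handling are illustrative assumptions, not the script used for the benchmark:

```python
import numpy as np
import pandas as pd


def make_ticks(n_ticks: int, start: str = "2020-01-01 08:00", seed: int = 0) -> pd.DataFrame:
    """Generate n_ticks rows at 1 ms spacing in the shape described above."""
    rng = np.random.default_rng(seed)
    index = pd.date_range(start, periods=n_ticks, freq="ms")
    is_bid = rng.random(n_ticks) < 0.5  # "BID" or "ASK" with equal probability
    prices = rng.random(n_ticks)        # uniform floats in [0, 1)
    return pd.DataFrame(
        {
            "tick type": np.where(is_bid, "BID", "ASK"),
            # Only the column matching the tick type holds a value; the other is NaN.
            "bid": np.where(is_bid, prices, np.nan),
            "ask": np.where(is_bid, np.nan, prices),
        },
        index=index,
    )
```

Calling `make_ticks(60_000)` yields one minute of ticks; the full data set would have to be produced and written in chunks rather than in a single call.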

Performance (with default 100k rows per segment):

  • Reading (6B is all data, 3B is with half the date range)
    • Reading 6B ticks took 28.9s
    • Reading 3B ticks took 13.3s
      • i.e. scales linearly in date range covered
  • Filtering on tick type column to one of "BID" or "ASK"
    • Filtering 6B ticks took 42.7s
    • Filtering 3B ticks took 20.7s
      • i.e. scales linearly in date range covered, ~50% slower than raw reading time
  • Resampling down to minute frequency, taking the max of the bid column
    • Resampling 6B ticks to 100,000 mins took 19.s
    • Resampling 3B ticks to 50,000 mins took 9.7s
      • i.e. scales linearly in date range covered, ~33% faster than raw reading time
  • Combination of filter and resample described above
    • Filtering then resampling 6B ticks to 100,000 mins took 39.1s
    • Filtering then resampling 3B ticks to 50,000 mins took 19.3s
      • i.e. scales linearly in date range covered, ~40% slower than raw reading time
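
For reference, the combined filter-then-resample case computes the same result as the pandas operations below. This is a sketch of the semantics only, not ArcticDB code or its implementation; the function name `filter_then_resample` and the four sample rows are made up for illustration:

```python
import numpy as np
import pandas as pd


def filter_then_resample(ticks: pd.DataFrame, tick_type: str = "BID") -> pd.DataFrame:
    """Keep rows of one tick type, then take the per-minute max of the bid column."""
    filtered = ticks[ticks["tick type"] == tick_type]
    return filtered.resample("1min").agg({"bid": "max"})


# Tiny hand-made example spanning two minutes.
index = pd.to_datetime(
    [
        "2020-01-01 08:00:00.000",
        "2020-01-01 08:00:00.001",
        "2020-01-01 08:00:30.000",
        "2020-01-01 08:01:00.000",
    ]
)
ticks = pd.DataFrame(
    {
        "tick type": ["ASK", "BID", "BID", "BID"],
        "bid": [np.nan, 0.2, 0.7, 0.4],
        "ask": [0.3, np.nan, np.nan, np.nan],
    },
    index=index,
)

per_minute = filter_then_resample(ticks)
print(per_minute)  # one row per minute, holding the max bid in that minute
```

The filter changes how many rows each segment holds before the resample runs, so this is the ordering for which the rows have to be restructured between the two clauses.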

Restructuring after the filter and before the resample takes ~100ms for 6B ticks (i.e. ~0.25% of the total time). Tail latency introduced by the restructuring "stop the world" approach is ~2ms in this example (the time to filter one segment).
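
The relative figures quoted above can be re-derived from the raw timings. The short check below uses only numbers stated in this description; it assumes the ~100ms restructuring cost is measured against the 39.1s filter-then-resample run, and it leaves out the resampling-only case because its 6B timing is incomplete above:

```python
# Timings in seconds, copied from the lists above.
READ_S = {"6B": 28.9, "3B": 13.3}
TIMINGS_S = {
    "filter": {"6B": 42.7, "3B": 20.7},
    "filter_then_resample": {"6B": 39.1, "3B": 19.3},
}

# Fractional slowdown relative to a plain read of the same date range.
overhead = {
    op: {size: t / READ_S[size] - 1 for size, t in by_size.items()}
    for op, by_size in TIMINGS_S.items()
}

# Ratio of 6B to 3B time; a value near 2 means linear scaling in date range.
scaling = {"read": READ_S["6B"] / READ_S["3B"]}
scaling.update({op: t["6B"] / t["3B"] for op, t in TIMINGS_S.items()})

# Share of the filter-then-resample run spent restructuring (assumed baseline).
restructure_share = 0.1 / TIMINGS_S["filter_then_resample"]["6B"]

# 6B ticks at 60k ticks per minute gives the number of one-minute output rows.
output_minutes = 6_000_000_000 // 60_000

for op, by_size in overhead.items():
    print(op, {size: f"{frac:.0%}" for size, frac in by_size.items()})
print({op: round(ratio, 2) for op, ratio in scaling.items()})
print(f"restructuring share: {restructure_share:.2%}", output_minutes)
```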

Everything is ~10% faster with 1m rows per segment.

alexowens90 • Sep 30 '24 12:09