Non-chronological writes break TickStore lib.min_date and lib.max_date
Arctic Version
1.58.0
Arctic Store
TickStore
Platform and version
Ubuntu 16.04
Description of problem and/or code sample that reproduces the issue
I am experiencing issues where lib.min_date > lib.max_date. I'm still narrowing down the problem, but my guess is that it results from the manner in which I am performing writes - the issue does not occur after rewriting the symbol data to a new symbol. Additionally, when reading the full table, the warning TimeSeries data is out of order, sorting! is printed.
I am writing in chronological chunks, walking backwards in time. For example, the timestamps of consecutive writes have this form:
1. [t_{n+1}, ..., t_{n+m}]
2. [t_0, ..., t_n]
3. ...
EDIT: sorry, the above is not quite right. That is what a consecutive batch of writes to Kafka looks like; the timestamps of a typical Arctic write would be the concatenation of the above.
I will try to find a simple reproducible example.
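In the meantime, here is a minimal sketch of the write pattern I mean (it assumes a local MongoDB and a fresh library; the library and symbol names are illustrative, and I haven't confirmed this exact snippet reproduces the inverted range):

```python
import pandas as pd
from arctic import Arctic, TICK_STORE

store = Arctic('localhost')
store.initialize_library('test.ticks', lib_type=TICK_STORE)
lib = store['test.ticks']

# TickStore requires timezone-aware indices
late = pd.date_range('2018-01-02', periods=10, freq='S', tz='UTC')
early = pd.date_range('2018-01-01', periods=10, freq='S', tz='UTC')

# Write the later chunk first, then walk backwards in time
lib.write('SYM', pd.DataFrame({'price': range(10)}, index=late))
lib.write('SYM', pd.DataFrame({'price': range(10)}, index=early))

# With non-chronological writes the reported range can be inconsistent
print(lib.min_date('SYM'), lib.max_date('SYM'))
```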
I believe that the root cause can, at least in part, be traced to the implicit assumption that DataFrames passed to TickStore.write have monotonically increasing indices.
For instance, this assumption is made at https://github.com/manahl/arctic/blob/v1.58.0/arctic/tickstore/tickstore.py#L565
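To see why that assumption matters (this is not arctic's code, just the failure mode): if a chunk's start and end are taken from the first and last index entries, a non-monotonic index yields an inverted range:

```python
import pandas as pd

df = pd.DataFrame({'price': [1.0, 2.0]},
                  index=pd.to_datetime(['2018-01-02', '2018-01-01']))

start, end = df.index[0], df.index[-1]
assert start > end  # inverted chunk metadata, hence lib.min_date > lib.max_date
```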
Should be a simple fix, I'll submit a pull request.
TickStore definitely assumes all ticks are written in order, given that that's how market updates work in general. I'm curious why you are writing them out of order?
True, time does move forward in general... but isn't it better to explicitly check assumptions?
Inserting historical data... just one of those things where it happens to be more elegant to do it backwards.
As for the fix, I was envisioning adding the following function to arctic._utils:

```python
from warnings import warn


def as_sorted(dframe):
    # Return dframe with a monotonically increasing index, warning when a
    # sort (or reversal) is required.
    if dframe.index.is_monotonic_increasing:
        return dframe
    warn("DataFrame index is not monotonically increasing and must be sorted - may incur a performance penalty")
    if dframe.index.is_monotonic_decreasing:
        # A reversed frame only needs an O(n) flip, not a full sort
        return dframe.iloc[::-1]
    return dframe.sort_index()
```

And then calling it in arctic.tickstore.TickStore.write.
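For example, a frame written backwards in time is handled with a cheap reversal rather than a full sort (this assumes the as_sorted definition above):

```python
import pandas as pd

reversed_df = pd.DataFrame({'price': [2.0, 1.0]},
                           index=pd.to_datetime(['2018-01-02', '2018-01-01']))

sorted_df = as_sorted(reversed_df)  # warns, then reverses
assert sorted_df.index.is_monotonic_increasing
```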
I don't think this should cause significant overhead for current workloads, although I could not confirm whether pandas computes dframe.index.is_monotonic_increasing at DataFrame initialization.
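One way to check empirically (a quick sketch; whether the result is cached, and when, may vary across pandas versions, which is exactly the uncertainty here):

```python
import time

import numpy as np
import pandas as pd

idx = pd.Index(np.random.rand(5 * 10 ** 6))

t0 = time.perf_counter()
_ = idx.is_monotonic_increasing  # first access: O(n) scan
t1 = time.perf_counter()
_ = idx.is_monotonic_increasing  # second access: cheap if cached
t2 = time.perf_counter()

print('first: %.4fs, second: %.4fs' % (t1 - t0, t2 - t1))
```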
What do you think?
The calculation for it is different depending on the type of index. Here is one such example:
https://github.com/pandas-dev/pandas/blob/8cbee356da1161c56c64f6f89cb5548bcadc3e44/pandas/core/indexes/multi.py#L836
As you can see, it's not trivial (i.e. it isn't just set once).
I'm also experiencing a related issue where arctic.tickstore._assert_nonoverlapping_data incorrectly raises an OverlappingDataException. The root cause is the implicit, unchecked assumption that data passed to tickstore.write is contiguous in time.
This is another relatively easy fix, which I will include in my pull request.
As it seems that TickStore was designed primarily with real-time data in mind, I'm now wondering about the wisdom of adapting it to handle cases that are not encountered when processing real-time data. Perhaps it makes more sense to have a separate implementation for the general case?
Additionally, should I expect to encounter more problems in trying to use TickStore in this way? Any idea how 'bad' it would be to adapt it? (That is, how deeply is the assumption of real-time data baked into arctic's innards/design philosophy?)
I'm not sure what you mean by contiguous and/or real-time? There is absolutely an assumption in TickStore that the events are ordered in time within each write, and that the writes are non-overlapping. I don't think very much else is assumed. When we're loading historical data, we usually break it into well-defined chunks (i.e. days, or maybe weeks/months for lower-frequency data), which can then be written in any order.
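For concreteness, a sketch of that chunking pattern (the helper is hypothetical, not part of arctic; it assumes frame has a DatetimeIndex):

```python
import pandas as pd

def write_in_daily_chunks(lib, symbol, frame):
    # Each daily chunk is internally ordered and non-overlapping with the
    # others, so the chunks themselves can be written in any order.
    for _, chunk in frame.groupby(pd.Grouper(freq='D')):
        if not chunk.empty:
            lib.write(symbol, chunk.sort_index())
```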
We should really open-source our own kafka2mongo implementation...
My apologies, after further consideration:
- contiguity is not well-defined without additional constraints (namely, that the interval between ticks is fixed and known ahead of time)
- the fix is only easy with this added constraint; the general case is not so nice
By real-time, I mean that all data written is strictly later than any data already existing in the DB. A better term would be strictly chronological.
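In other words, strict chronology is a precondition of roughly this form (a sketch; new_frame and symbol are placeholders):

```python
# Every new write must begin after everything already stored for the symbol
assert new_frame.index.min() > lib.max_date(symbol)
```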
When we're loading historical data, we usually break it into well-defined chunks (i.e. days, or maybe weeks/months for lower-frequency data), which can then be written in any order.
I also try to do this... but the problem is that this does not play nice with Kafka at all. I can bypass Kafka for historical imports, but I don't see how such a guarantee could be provided for real-time data incoming to Arctic without sacrificing many of the benefits of using Kafka in the first place, or without offloading much of the DB logic to the Kafka consumer.
For instance, this adds a lot of complication in handling producer failures. The only assumption I make of topics is that they will never contradict existing data. There is certainly no guarantee of monotonicity.
We should really open-source our own kafka2mongo implementation...
Please do!!
At this point, we have decided it makes sense to roll our own data store, as our needs are quite modest and our use cases don't map onto Arctic all that well.
Nonetheless, this has been a fruitful adventure. I particularly like the idea of using Kafka to do the heavy lifting, allowing one to put a simple client library into production.
EDIT: didn't mean to hit close
Ok, understood. We don't use Kafka for loading historical data (typically it comes to us as compressed daily files, so it is a relatively simple exercise to decompress and load them in parallel from a cluster).
As a bit of background, when we collect live, streaming event data we guarantee chronological ordering of events written to each Kafka partition (which is easy as the events are timestamped in the client thread that received them and the recipient threads are mapped 1:1 to Kafka partitions). We have multiple event collector processes running in hot-standby, with the active collector holding a zookeeper lock while it writes to kafka. Failure of a collector process does mean a momentary data-loss while the failover to one of the standby processes happens, but that's acceptable for our use case.
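A rough sketch of that hot-standby pattern, using kazoo and kafka-python (illustrative only, not our actual implementation; hosts, paths, topic names, and the event source are placeholders):

```python
from kafka import KafkaProducer
from kazoo.client import KazooClient

zk = KazooClient(hosts='zookeeper:2181')
zk.start()
producer = KafkaProducer(bootstrap_servers='kafka:9092')

# Standby collectors block here; only the lock holder writes to Kafka
with zk.Lock('/collectors/active'):
    for partition, payload in receive_events():  # placeholder event source
        # Recipient threads map 1:1 to partitions, so ordering is preserved
        producer.send('ticks', value=payload, partition=partition)
```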
@shashank88 @jamesblackburn @richardbounds I personally vote we just close this. We can add more checks, but I don't think they add significant value - TickStore only works with ordered data. We could tweak the docs to explicitly mention this (https://github.com/manahl/arctic/blob/master/docs/tickstore.md), but I don't think this requires a code change. Let me know - planning on closing this in the coming days.