
Querying state at a given time


We'd like to be able to query the set of data points collected for a given entity type at a specified point in time. For example, we'd like to be able to tell

  • how many bikes are currently available for rent;
  • what was the average air quality yesterday at 11am;
  • how much waste was collected this morning at 9;
  • how many EV charging stations are currently busy charging vehicles.

Conceptual model

As in #164, we have some automaton and the data points we collect in Quantum Leap are related to automaton states---more accurately to state evolution, but we're not going to bother with all that here. Just one thing about state though. Some of the above questions manifestly have to do with resource state---e.g. bike available, station charging---while for others resource state seems to be irrelevant---e.g. air quality. We can still keep the automaton model for both situations since when the state isn't relevant we can think of there being just one state. (Technically, state = equivalence class of values, one state = whole set of admissible values.)

Now fix a time point p (e.g. 10:55 yesterday morning) and consider all the data points in Quantum Leap for a set entity type T (e.g. a waste container) that were collected at p. It may well be that even with hundreds of T entities you'll wind up with very few data points or even none simply because devices won't always send data at exactly the same point in time or data points are sparse. But it may also be the case that the last data point received for an entity is close to what it'd have been if we'd had an actual measurement at p---e.g. the weather forecast of 5 minutes ago is very likely to be still current. By the same token, a data point received soon after p may be a good approximation of the value at p.

So we need a more flexible way of querying data than just collecting whichever data points were received at p on the dot. Here's one solution then. Consider a time interval [p0, p1] and, among the data points in there, collect those that are closest in time to p, then run your query on this data set. The diagram below visualises the approach.

[Diagram: querying-state-at-time-point]

Here's a more formal recipe for how to collect those points into a data set U. For each entity, call d the corresponding sequence of data points---one of the lines parallel to the time axis in the diagram above. Then compute the set of indexes K

K = { k ∈ ℕ | time(d[k]) ∈ [p0, p1] ∧ |time(d[k]) - p| = min { |time(d[h]) - p| : time(d[h]) ∈ [p0, p1] } }

where time is the time point associated with a data point and h ranges over the indexes of d. Note the minimum is taken only over data points falling inside [p0, p1]; otherwise a closer point outside the selection interval could leave K empty even when the interval contains candidates. Define the set D as

D = ∅                      if K = ∅
D = { d[m], m = min K }    otherwise

and finally take U to be the union of all those Ds.
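For example, take an entity with data points at times 2, 3 and 5 (entity 2 in the proposed solution below), p = 4 and [p0, p1] = [0, 6]. The distances from p are 2, 1 and 1 respectively, so the minimum distance within the interval is 1 and K contains the indexes of the points at times 3 and 5; D then keeps just the point at time 3, the one with the smallest index.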

The selection interval [p0, p1] comes in handy: we can use it both to limit the data points to a subset of the whole table (= potentially more efficient query) and to encode assumptions about the data (a small SQL illustration follows the list):

  • p0 = p = p1: we're making no assumptions here as we're querying exactly the data received at p.
  • p0 < p = p1: we're assuming the last data point received for each entity is still current at p.
  • p0 = p < p1: for each entity, the first data point received after p is a good approximation of the actual value at p.
  • p0 < p < p1: for each entity any data point in [p0, p1] that's closest in time to p is a good approximation of the actual value at p. (Note. Technically, this works even if p is outside of [p0, p1] but I'm not sure how that'd be useful in real life...)
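In the queries sketched in the proposed solution below, these four configurations differ only in the interval bounds used to pick candidate rows. A minimal illustration, assuming the fictitious t_entity table defined later, p = 4 and a lower bound p0 = 1 for the "last value still current at p" case:

-- p0 < p = p1: only data points at or before p = 4 are candidates,
-- and none older than p0 = 1 (guards against stale data)
SELECT *
FROM t_entity
WHERE 1 <= time_ix AND time_ix <= 4;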

Are any of the above fair assumptions? Well, in general it depends on the circumstances, the kind of measurement, how far in time the last data point is from the time point of interest (stale data!), etc. But at least we're able to choose the interval width depending on the situation at hand. For example, it could be that the last value before p is a good approximation as long as that value was received no earlier than half an hour ago---think weather. Any value received before then, even if it's the latest we have, wouldn't be useful. Accordingly, we can choose the interval [p - 30min, p]. In general, having a selection lower bound mitigates the risk of bringing in stale data. Likewise, the upper bound can be used to separate the wheat from the chaff.

Note: Statistical analysis

What we're doing here really is rudimentary curve fitting, where we assume the curve is a step function. This may work well if the function represents state changes, but could totally break down for e.g. measurements. In general, it'd be better to recast all this into a statistical learning problem where we use the data points in the entity table as training data to approximate a certain function, subject to an irreducible error. But for now, let's keep the conceptual model we've fleshed out here since it's simpler and can be readily implemented in Quantum Leap.

Proposed Solution

The basic idea is to implement the collection and querying process outlined in the conceptual model with a parameterised SQL query whose parameter values get passed in from the Web API---i.e. pretty much the same setup as we have for the other queries. The hardest bit is the query, so that's what we'll flesh out here. The rest can be done through string manipulation as we do at the moment in the Crate translator, though a better option would be an AST/interpreter approach as done for the geo-queries.

Since the query is a bit involved, I think it's best explained with an example. So let's start by creating a fictitious t_entity table and populating it with the same data as that on the diagram in the conceptual model section.

create table t_entity (id integer, data_point varchar, time_ix integer);

insert into t_entity values (1, 'x1', 1);
insert into t_entity values (3, 'y1', 1);
insert into t_entity values (1, 'x2', 2);
insert into t_entity values (2, 'y2', 2);
insert into t_entity values (2, 'x3', 3);
insert into t_entity values (1, 'x4', 4);
insert into t_entity values (1, 'y5', 5);
insert into t_entity values (2, 'x5', 5);
insert into t_entity values (3, 'z6', 6);

(If you want to fiddle with the example as you read along, I suggest you copy and paste the statements into the aptly named SQL Fiddle online tool and select PostgreSQL from the database menu :-)

Just like in the conceptual model example, we'd like to tell how many entities were in the x state at time point p = 4, limiting query scope to the interval [p0, p1] = [0, 6]. As a first step, we collect those rows whose time index is in [p0, p1] and, while we're at it, we also compute the time distance of each data point from p as well as the minimum such distance by entity:

SELECT
  *,
  abs(time_ix - 4) AS _dist,                               -- distance from p = 4
  min(abs(time_ix - 4)) OVER entity_points AS _min_dist,   -- min distance per entity
  row_number() OVER (PARTITION BY id ORDER BY time_ix) AS _row_n
FROM t_entity
WHERE 0 <= time_ix AND time_ix <= 6                        -- [p0, p1] = [0, 6]
WINDOW entity_points AS (PARTITION BY id)

(Run the query to get a better grasp of what it does :-) In real life, the 0, 6, and 4 above would come from the parameters specified to the Web API. The row number will come in handy later, so keep reading :-) With those candidate data points under our belt, it's easy to figure out which ones are closest in time to p:

WITH candidate_points AS (
    SELECT
      *,
      abs(time_ix - 4) AS _dist,
      min(abs(time_ix - 4)) OVER entity_points AS _min_dist,
      row_number() OVER (PARTITION BY id ORDER BY time_ix) AS _row_n
    FROM t_entity
    WHERE 0 <= time_ix AND time_ix <= 6
    WINDOW entity_points AS (PARTITION BY id)
)
SELECT
  *,
  min(_row_n) OVER (PARTITION BY id) AS _min_row_n   -- row number of the earliest closest point
FROM candidate_points
WHERE _dist = _min_dist                              -- keep only the closest points

This brings back x4 for entity 1, x3 and x5 for entity 2, and z6 for entity 3---if you have the diagram handy, those are the points in bold. Which shows we have a loose end to tie up: equidistant points. In fact, we'd like to pick at most one point for each entity. As explained in the conceptual model, we can pick the point with the smallest index (hence the ORDER BY time_ix in the row numbering), which is why we've been threading that pesky row number through the queries. With that in hand, we can finally collect our target data points (the set U in the conceptual model) and then query them:

WITH candidate_points AS (
  SELECT
    *,
    abs(time_ix - 4) AS _dist,
    min(abs(time_ix - 4)) OVER entity_points AS _min_dist,
    row_number() OVER (PARTITION BY id ORDER BY time_ix) AS _row_n
  FROM t_entity
  WHERE 0 <= time_ix AND time_ix <= 6
  WINDOW entity_points AS (PARTITION BY id)
),
closest_points AS (
  SELECT
    *,
    min(_row_n) OVER (PARTITION BY id) AS _min_row_n
  FROM candidate_points
  WHERE _dist = _min_dist
),
target_points AS (
  -- at most one point per entity: break ties by earliest time index
  SELECT *
  FROM closest_points
  WHERE _row_n = _min_row_n
)
SELECT *
FROM target_points

This brings back x4 for entity 1, x3 for entity 2, and z6 for entity 3. Sweet. Now plonking in the query on the target points is child's play:

-- replaces the final SELECT of the WITH query above
SELECT count(*)
FROM target_points
WHERE data_point LIKE 'x%'

This returns 2 as expected. Of course we'll have to build the select attributes and the where clause from input parameters, but then again, that isn't uncharted territory.
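As a rough sketch of that parameterisation (placeholder syntax varies by driver; :p, :p0 and :p1 are just illustrative names for the Web API inputs), the candidate-points step would become something like:

SELECT
  *,
  abs(time_ix - :p) AS _dist,
  min(abs(time_ix - :p)) OVER entity_points AS _min_dist,
  row_number() OVER (PARTITION BY id ORDER BY time_ix) AS _row_n
FROM t_entity
WHERE :p0 <= time_ix AND time_ix <= :p1
WINDOW entity_points AS (PARTITION BY id)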

Note: Crate vs Timescale

While this is standard SQL, CTEs aren't supported by Crate. Also, the code could be simplified quite a bit if we were able to use window functions such as first_value and last_value, which are only available in Crate's enterprise edition. None of this would be a problem in Timescale.

Alternatives

#164 proposes a more general framework to deal with state and time but at the moment we don't have enough time to develop those ideas further. Another alternative that springs to mind is statistical analysis as mentioned in the note above. Again, time is of the essence at the moment, so we'll leave that discussion for another day :-)

c0c0n3 · May 16 '19

@c0c0n3 isn't the state at a given point in time the latest state of an entity before that point in time? You seem to consider it as a surrounding of a given point in time, which to me is not the answer to the question "what is the latest state before this point in time" (and introduces too much complexity).

So, for example, I would use this query for "now":

SELECT *
FROM t_entity t1
INNER JOIN (
  SELECT id, max(time_ix) AS time_index
  FROM t_entity
  GROUP BY id
) t2 ON t1.id = t2.id AND t1.time_ix = t2.time_index;

and this query for p = 4:

SELECT *
FROM t_entity t1
INNER JOIN (
  SELECT id, max(time_ix) AS time_index
  FROM t_entity
  WHERE time_ix <= 4
  GROUP BY id
) t2 ON t1.id = t2.id AND t1.time_ix = t2.time_index;

then to count the entities whose latest state at or before p = 4 is y1:

SELECT count(*)
FROM t_entity t1
INNER JOIN (
  SELECT id, max(time_ix) AS time_index
  FROM t_entity
  WHERE time_ix <= 4
  GROUP BY id
) t2 ON t1.id = t2.id AND t1.time_ix = t2.time_index
WHERE t1.data_point = 'y1';

chicco785 · May 29 '19

isn't the state at a given point in time the latest state of an entity before that point in time?

Not sure. For example, what if the latest data point we have is a month old? Perhaps the device was decommissioned or just stopped working. I think we should provide the means to deal with stale data, which is what the selection lower bound p0 does.

You seem to consider it as a surrounding of a given point in time,

Yep, that's right. Considering nearby points makes sense to me since it's more generic and caters for situations where you're not necessarily dealing with discrete state changes. For example, say we wanted to find out what the average air quality was at 3pm and we have a data point at 14:01 and another at 15:01. Would you still consider the latest data point before 3pm, i.e. the one that's nearly an hour old?! It stands to reason the 15:01 data point is a better approximation of the air quality at 15:00... In general, what makes sense depends on the situation at hand---air quality, parking lots, etc. So I reckon what we should have in Quantum Leap is a generic selection mechanism where we don't care about what the specific selection criteria actually mean. That logic lives in the query parameters (p0, p1, etc.); Quantum Leap just selects the data you ask for, regardless of what that selection means in the situation at hand.

which to me is not the answer to the question "what is the latest state before this point in time"

It is indeed. Just take the selection upper bound p1 to be the same as the point in time p you're interested in---i.e. p0 < p = p1. Sorry if this wasn't clear from my notes above!

introduces too much complexity

Yeah, I agree. But if we also agree to use a selection interval, then I don't think this can be simplified further without using Crate enterprise features. Also take into account that the "max time index" queries you suggest don't get rid of duplicates. Try e.g. running your "now" query on the same example data set plus this additional data point: insert into t_entity values (3, 'z6.1', 6);. To get rid of duplicates, I had to resort to that row number trick, which I must admit isn't the easiest thing in this world. Perhaps there's a better way of doing this. Either way, if we want to keep the selection interval and get rid of duplicates, the resulting query will have to be a bit more complex than the "max time index" query---well, at least that's my gut feeling.
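For convenience, here's that experiment spelled out; after the extra insert, the "now" query brings back two rows for entity 3 (z6 and z6.1, both at time index 6):

insert into t_entity values (3, 'z6.1', 6);

-- the "now" query from above: now returns duplicates for entity 3
SELECT *
FROM t_entity t1
INNER JOIN (
  SELECT id, max(time_ix) AS time_index
  FROM t_entity
  GROUP BY id
) t2 ON t1.id = t2.id AND t1.time_ix = t2.time_index;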

c0c0n3 · Jun 03 '19

Let's go for a simple and supported solution now, and later, eventually, extend it. In general, we need to be more pragmatic, or implementing simple things will consume time we don't have.

An entity should not have multiple updates at the same point in time (considering context broker throttling, I'm not even sure this is possible) and this should cover your case. A lower time bound can be introduced with my query, e.g. as sketched below.
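For example, a minimal sketch with a lower bound p0 = 1 added to the inner query, keeping p = 4 from the running example:

SELECT *
FROM t_entity t1
INNER JOIN (
  SELECT id, max(time_ix) AS time_index
  FROM t_entity
  WHERE 1 <= time_ix AND time_ix <= 4   -- [p0, p] = [1, 4]
  GROUP BY id
) t2 ON t1.id = t2.id AND t1.time_ix = t2.time_index;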

chicco785 · Jun 03 '19

Cool, but are we going to use your query for measurements too, e.g. air quality, and improve it later? Also, yes, perhaps we can assume duplicates are unlikely---even though I reckon they're possible given that we introduced custom time indexes. This would simplify the query. Anyhoo, below is the same query rewritten to be Crate-friendly, just in case we need it in the future :-)

SELECT *
FROM (
    SELECT
      *,
      min(_row_n) OVER (PARTITION BY id) AS _min_row_n
    FROM (
        SELECT
          *,
          abs(time_ix - 4) AS _dist,
          min(abs(time_ix - 4)) OVER entity_points AS _min_dist,
          row_number() OVER (PARTITION BY id ORDER BY time_ix) AS _row_n
        FROM t_entity
        WHERE 0 <= time_ix AND time_ix <= 6
        WINDOW entity_points AS (PARTITION BY id)
    ) candidate_points
    WHERE _dist = _min_dist
) closest_points
WHERE _row_n = _min_row_n
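And, mirroring the earlier count example, the final query wraps everything as a derived table; e.g. to count entities in the x state with the same example parameters:

SELECT count(*)
FROM (
    SELECT *
    FROM (
        SELECT
          *,
          min(_row_n) OVER (PARTITION BY id) AS _min_row_n
        FROM (
            SELECT
              *,
              abs(time_ix - 4) AS _dist,
              min(abs(time_ix - 4)) OVER entity_points AS _min_dist,
              row_number() OVER (PARTITION BY id ORDER BY time_ix) AS _row_n
            FROM t_entity
            WHERE 0 <= time_ix AND time_ix <= 6
            WINDOW entity_points AS (PARTITION BY id)
        ) candidate_points
        WHERE _dist = _min_dist
    ) closest_points
    WHERE _row_n = _min_row_n
) target_points
WHERE data_point LIKE 'x%'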

c0c0n3 · Jun 03 '19

The latest value of the air quality parameter should not be an issue, I suppose. Or was the question in reference to data duplication? I tend to believe that (entityId, time_index) should be seen as a key: you may update it, but not duplicate it. Feelings? @c0c0n3 @taliaga

chicco785 · Jun 03 '19

The latest value of the air quality parameter should not be an issue

I was thinking of that one example, where you have two data points, one at 14:01 and the other at 15:01, and you ask what the air quality was at 15:00. Wouldn't the 15:01 data point be a better approximation?

(entityId, time_index) should be seen as a key

I'd hope so too, at least for practical purposes, but I don't have enough experience with IoT devices in the wild to actually make that call :-) For example, is it possible to have time indexes with 1-second precision while devices send data multiple times a second?

c0c0n3 · Jun 03 '19

I was thinking of that one example, where you have two data points, one at 14:01 and the other at 15:01, and you ask what the air quality was at 15:00. Wouldn't the 15:01 data point be a better approximation?

While I agree that it would be a better approximation, it has a different semantics (and so a different query) than "give me the last value before time x". As said, let's start simple.

I'd hope so too, at least for practical purposes, but I don't have enough experience with IoT devices in the wild to actually make that call :-) For example, is it possible to have time indexes with 1-second precision while devices send data multiple times a second?

Usually msec is the resolution, and if different devices send data to Orion within a time frame shorter than the throttling, only a single notification is sent, grouping all the updates.

chicco785 · Jun 03 '19

it has a different semantics (and so a different query) than "give me the last value before time x"

Yes indeed, which is why I thought we should support something more flexible than "last value before time x" or "first value after x". In this kind of situation "nearest point" seems to be more appropriate? Anyway, the point here is that if we provide a more generic mechanism, then the query writer can decide what's the best thing to do for the situation at hand. For example, if we're querying parking bay state, we may want to use "last value before x" or even better "last value before x if that value isn't older than a day, nothing otherwise". On the other hand, if we're querying air quality, we may want to pick the closest value to x even if that's slightly past x. If we use the proposed mechanism we can do all of this, whereas if we use "last before" we're stuck with something that only works in certain situations but not in others.

let's start simple.

agreed :-)

Usually msec is the resolution

Then I don't think we'll ever have a problem in practice, unless all the entities in the notification batch happen to use a time index field with the same value---but I don't think that's likely, and if it does happen it's probably a bug? This is the nasty scenario I have in mind (sketched right after the list):

  1. device sends data every 500ms with timestamp t
  2. agent maps t to fieldT but value gets truncated to nearest second
  3. entity gets stored in Quantum Leap with custom time index = fieldT
  4. for each (entity_id, time_index) we may have two corresponding data points in Quantum Leap
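In t_entity terms, step 4 would leave us with something like the following (hypothetical values, purely to illustrate the duplicated pair):

-- two readings taken 500ms apart whose timestamps both get truncated
-- to time index 7, duplicating the (entity id, time index) pair (1, 7)
insert into t_entity values (1, 'x6', 7);
insert into t_entity values (1, 'x7', 7);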

If I understand correctly, this is either something that's never going to happen, or, if it does happen, it should be treated as a bug. So I guess that was a long way to say we don't need to worry about duplicates for now. Thanks for clarifying :-)

c0c0n3 · Jun 04 '19
