Nanocubes for time ranges?
Hi there! I'm grappling with an OLAP style problem and I'm hoping to apply nanocubes, but I'm not entirely sure how well my problem maps to this domain.
I've got an event stream representing changes to a set of entities. Something like 30 million entities, each of which might have a dozen dimensions. New events for each entity could arrive years or seconds apart. There is no spatial component to the data.
I mostly answer queries along the lines of 'at midnight every day between 2015-01-01 and 2015-07-31, how many entities had dimensions A = 1, B = 8, C = 3'. Maybe a colloquial way of stating the problem could be 'at midnight each day, how many people are watching netflix, eating popcorn, and wearing red socks'. My event stream only tells me when events change.
So in Postgres (after months of research into the validity of this approach) I end up building table partitions for each dimension, each row containing the entity id, the dimension's value, and the tsrange for which this fact was true. Then the problem reduces to intersecting time ranges, and building plain old macro scale cubes to cache aggregation results. But the bloat is staggering: ~6 GB of compressed data when unpacked this way and indexed tops 120 GB, and I'm not even considering all the possible dimensions yet. I feel like I'm forcing myself towards a big data problem I shouldn't have.
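For concreteness, one of those per-dimension partitions looks roughly like this (just a sketch, ignoring the actual partitioning; names are illustrative):
-- one row per fact: this entity had this value over this time range
create table dimension_a (
    entity_id   bigint  not null,
    value       text    not null,
    event_range tsrange not null  -- period during which the value held
);
-- range types have GiST/SP-GiST operator classes, so overlap and containment queries can use an index
create index dimension_a_range_idx on dimension_a using gist (event_range);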
How might one introduce the concept of an event with a duration into a nanocube? If you can point me in the right direction I'll be sure to contribute some sample code back to the repo :)
Hi,
That's honestly a hard problem. Given my limited (a few months) experience with nanocube, I can confidently claim that what you are asking for is not really possible with nanocube as it stands. The reason is that data points in nanocubes cannot have time intervals; each data point is one and only one occurrence of an event.
My naive suggestion, conceptually pretty similar to yours, is to join all the latest changes of each dimension of an entity with the entity itself and then filter based on your criteria. The contrast with your solution is that I am omitting explicit time intervals as a possible optimization: whatever happened last to dimension A of an entity before midnight of 2015-01-01 is the value of dimension A for that entity. I am really not sure this would make any difference, but I like to think it limits the size of your dataset to the number of changes, asymptotically, and it makes the problem simpler, since you are working on a time series.
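Roughly, assuming a single change-log table like changes(entity_id, dimension, value, changed_at) (my assumption, not your actual schema), the "latest value before midnight" lookup could be something like:
-- last known value of dimension 'A' for each entity as of midnight 2015-01-01
select distinct on (entity_id)
    entity_id,
    value
from changes
where dimension = 'A'
  and changed_at < timestamp '2015-01-01 00:00:00'
order by entity_id, changed_at desc;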
What I would do if I encountered such a problem is dump the aforementioned joined entity-change table to a file on an HDFS cluster and run a Spark task to filter out the rows of interest, i.e. the latest rows of entities before the time interval we are interested in.
That's my $.02, hope it helps.
@mehmetminanc, thanks for the comments. It is something of a hard problem :)
I do wonder whether nanocubes are truly unable to handle this. Thinking more, my instinct tells me that the spatial dimensions could be used/abused to model time. You could map the tsrange into latitude (leaving the longitude dimension at 0), then map the unique entity identifier to a time value. A query for spatial coordinates across the full dataset would normally give you the list of events that occurred in that region … in this model it would return you a list of entities that experienced changes in that time range.
I remembered this general approach from Goh, Cheng Hian, et al. "Indexing temporal data using existing B+-trees." Data & Knowledge Engineering 18.2 (1996): 147-165 (https://www.comp.nus.edu.sg/%7Eooibc/stbtree95.pdf). They suggest modeling time ranges as a (time, duration) tuple, which makes them mappable onto a coordinate system.
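As a rough sketch of that flattening (assuming a table with an entity_id and a tsrange column called event_range, like the partitions I described earlier), each range becomes a point:
-- flatten each validity range into a (start, duration) pair,
-- which can then be scaled into the cube's coordinate bins
select
    entity_id,
    extract(epoch from lower(event_range))                      as start_epoch,
    extract(epoch from upper(event_range) - lower(event_range)) as duration_secs
from dimension_a;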
I feel like the nanocube structure itself is quite generalized and my problem is likely a case of not knowing the right question to ask :)
Now, that is some proper abuse! I really liked the idea.
I have several concerns, though. The main one is that this is theoretically the same as indexing the tsrange, or in my solution the latest state. If you set aside the fact that nanocube is in memory and PostgreSQL is on disk (by default), they will both have more or less the same tree structure, and hence very similar (log N) computational complexity for retrieving an event, whether via tsrange, latest-state, or spatial lookup. If you think this is a practical problem of disk lookup vs. memory lookup, you can cache your PostgreSQL tables entirely in memory and still use the queries you've already come up with. There are definitely ways to load Postgres tables entirely into memory, even at the given size.
Another problem is that nanocube, almost by definition, will aggregate tsranges that occurred at the same or very close times. Even though you have quite a bit of precision (2^25) in the latitude dimension, it's still likely this will happen. The problem does not look insurmountable, since you've still got the longitude dimension ready to be abused, but it will make the mapping function non-trivial.
Good luck, and please keep us posted.
Abusing databases is a favorite pastime :)
Here's a little more detail that might make things more concrete (and sometimes ideas materialize during explanation). One note: I'm actually only interested in the aggregations across time; I can afford to dip back into the source tables to get a specific set of entities (as long as I know the dimensions that were used).
What kills me in Postgres with tsranges ends up being the joins. Once you are interested in four or five dimensions (not uncommon at all in my use case), you start to stress the query planner. Once you go down that path, you're starting to use CTEs and other optimization fences to influence the planner's decisions. It gets ugly.
[side note: I watched a talk by someone who 'solved' this problem quite creatively. Dispatch multiple combinations of the query with the CTE order shuffled around, then just use results from the quickest query and kill the rest! A distributed Darwinian query planner …]
A basic query might look like
select
    dim1.entity_id,
    -- '*' on ranges is intersection: the period during which both facts held
    (dim1.event_range * dim2.event_range) as event_range
from dimension_a dim1
join dimension_b dim2 using (entity_id)
where
    dim1.value = 'red'
    and dim2.value = 'netflix'
    -- drop pairs whose ranges never overlapped
    and not isempty(dim1.event_range * dim2.event_range)
After the joins and intersections are complete, I'm left with rows that give me the entity ID and the time range over which all the dimensions were true for this entity. The two ways I've dealt with this are:
- use generate_series to step over each bucket of interest, and count the number of matching rows in my intersected table. Basically:
select
    g.s, count(*)
from
    generate_series(now() - interval '1 year', now(), '1 day') g(s),
    intersected_ranges r
where
    -- @> tests whether the range contains the sample instant; the cast is needed because
    -- generate_series over now() yields timestamptz while tsrange holds plain timestamps
    r.event_range @> g.s::timestamp
group by g.s
order by g.s
You can speed this up in Postgres by using a transaction in which you build a "temporary on commit drop" intersected_ranges table (it won't be in memory, but it's not logged and will be dropped on commit), then slap an SP-GiST index on the event_range column and run the query above. It's crazy to build a temp table and an index mid-query, but this works very, very well.
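For what it's worth, that transaction looks roughly like this (a sketch; the dimension tables are the same hypothetical ones as in the query above):
begin;
-- not WAL-logged, and dropped automatically when the transaction commits
create temporary table intersected_ranges on commit drop as
select
    dim1.entity_id,
    dim1.event_range * dim2.event_range as event_range
from dimension_a dim1
join dimension_b dim2 using (entity_id)
where dim1.value = 'red'
  and dim2.value = 'netflix'
  and not isempty(dim1.event_range * dim2.event_range);
-- SP-GiST has an operator class for range types, so the @> lookups can use it
create index on intersected_ranges using spgist (event_range);
-- ... run the generate_series counting query shown above ...
commit;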
- build the same intersected_ranges table and then use unnest to translate each tsrange into start/stop events. So
['2015-01-01 12:34:56', '2015-02-02 12:12:12')
becomes rows like
| t | u |
|---|---|
| '2015-01-01 12:34:56' | 1 |
| '2015-02-02 12:12:12' | -1 |
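The translation itself can be done with a two-array unnest, something like this (a sketch; it assumes the intersected ranges are bounded on both ends):
-- each range becomes a +1 row at its start and a -1 row at its end
create table unnested_table as
select u.t, u.u
from intersected_ranges r,
     unnest(array[lower(r.event_range), upper(r.event_range)],
            array[1, -1]) as u(t, u);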
Now it's normally fast to do an aggregation and window query, something like:
select
    t::date as day,
    sum(u) as "daily total",
    sum(sum(u)) over (order by t::date) as "rolling total"
from
    unnested_table
group by 1 order by 1
| day | daily total | rolling total |
|---|---|---|
| '2015-01-01' | 5 | 5 |
| '2015-02-02' | 3 | 8 |
| '2015-02-02' | -2 | 6 |
The downside of this approach is that if there's a source data error (which can happen) that error will be carried throughout the rest of the calculation, whereas dipping into the events at intervals is somewhat more resilient.
In both cases you still have to wear the cost of the joins and the intersections, and then manage the growth of the roll-up tables and the requests for new intersections.
One potential nanocube approach would be to materialize the entire state of all the entities at some point in time, on every single day, and have it compute aggregates as if they were daily events. That could mean tens of millions of data points each day, with a dozen or so dimensions. These entities never go away or expire, either. I'm not sure if that starts to overwhelm things, and it would be expensive to compute, but I could experiment to see where the boundaries are.
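Just to sketch what that daily materialization could look like on the Postgres side before loading it anywhere (reusing the hypothetical per-dimension tables from above):
-- one row per entity per day, carrying the dimension values in force at midnight
select
    g.s::date  as snapshot_day,
    a.entity_id,
    a.value    as dim_a,
    b.value    as dim_b
from generate_series(timestamp '2015-01-01', timestamp '2015-07-31', interval '1 day') g(s)
join dimension_a a on a.event_range @> g.s
join dimension_b b on b.entity_id = a.entity_id and b.event_range @> g.s;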
Are you willing to hack around the source code? If you made a few changes to the TimeSeriesEntryType class to support signed integers, then maybe you could think of "started eating popcorn" as an event of type popcorn and value 1, and "ended eating popcorn" as an event of type popcorn and value -1.
Now you have your time series as usual, but to answer "between 12:00 and 13:00" you could look at the difference between a query at 13:00 and another at 12:00. EDIT: No, I'm wrong. A single query at 12:00 is enough to give you the number of "active events".
This is sorta kinda modeling things as Dirac deltas and letting the aggregation infrastructure "integrate" them into Heaviside steps.
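In plain SQL terms, using the unnested_table from earlier in the thread just to illustrate the reasoning (not the nanocube change itself), the count of active events at an instant is simply the sum of the signed events up to that instant:
-- number of events active at noon = running sum of +1/-1 events up to that moment
select sum(u) as active_at_noon
from unnested_table
where t <= timestamp '2015-01-01 12:00:00';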
Oh nice, let me experiment with that. I can totally hack something like that together. Thanks for the pointer!
This wouldn't necessarily get you the intersections for free though, would it? If I streamed in signed events this way, maybe for "eating popcorn" and "watching netflix", and I know those events came from the same entity via a unique ID, the dream would be to query for any set at a given time: people eating popcorn but not watching netflix, people doing both, people watching netflix without popcorn.
(yes I totally wrote "people eating netflix" as a set but caught it just in time ;)
This wouldn't necessarily get you the intersections for free though, would it?
Not for free, nor easily. And it's worse than that: if you want to keep the intersections around, you'll have to add them as categories by themselves: "netflix_and_popcorn", etc, and that means that if you query for the sum over all categories, you'll get the wrong result. So you'll have to be careful with the way the presentation layer will display the data as well.
And it's worse than that: if you want to keep the intersections around, you'll have to add them as categories by themselves
This isn't much of a regression from what I'm dealing with now. We have a giant roll-up table with all the interesting intersections materialized on a day by day basis. If I can get a much more compact and speedy version of that, I'm winning.
If we go right down to netflix_and_popcorn, that means you can drop the entity ID altogether and just accumulate unique identifiers that represent each possible combination. A single entity that had these attributes would generate the set {netflix, popcorn, netflix_and_popcorn}, and you could go off and count those. That makes the code slightly simpler, I guess.
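Purely as a sketch of that (with dimension_c standing in for a hypothetical "eating popcorn" dimension; it isn't one of the tables above), the combination stream could be generated straight from the range tables:
-- signed events for one materialized combination; single categories like
-- 'netflix' or 'popcorn' would come the same way from their own dimension tables
select
    'netflix_and_popcorn'                 as category,
    lower(b.event_range * c.event_range)  as t,
    1                                     as u
from dimension_b b
join dimension_c c using (entity_id)
where b.value = 'netflix'
  and c.value = 'popcorn'
  and not isempty(b.event_range * c.event_range)
union all
select
    'netflix_and_popcorn',
    upper(b.event_range * c.event_range),
    -1
from dimension_b b
join dimension_c c using (entity_id)
where b.value = 'netflix'
  and c.value = 'popcorn'
  and not isempty(b.event_range * c.event_range);
The caveat above still applies, though: summing across all categories would double-count entities that show up in more than one.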