tdigest icon indicating copy to clipboard operation
tdigest copied to clipboard

apply to big data?

Open maodouchen opened this issue 7 years ago • 1 comments

Does this apply to big data? 500 million?

maodouchen avatar Aug 16 '18 11:08 maodouchen

Consider how the internal data structure (an RB tree of centroids) will grow with N, M, where

N = number of input values M = range of values

If you look at the implementation or the paper, you'll see that at a simplified level, we can say that if the data point is within a certain distance of an existing centroid, it will add to its weight. Else we create a new centroid or possibly merge two centroids.

So basically the number of centroids scales with M. Given that M is fixed, you have constant storage. But data points are not individually stored.

You can always run a test to confirm your suspicions about a data structure being usable for large input or not ;)

It's generally a safe bet, as well, that online algorithms (processing streams of data) tend to be pursued exactly because storage scaling with input size is a problem at scale, and the online algorithm will have some space guarantees.

dt-rush avatar Nov 19 '18 02:11 dt-rush