
Cumulative sums are quadratic in time: is it possible to make them linear in time?

Open lebigot opened this issue 9 years ago • 14 comments

Sums are calculated in linear time, starting with version 3.0. But cumulative sums are calculated in quadratic time (2 minutes for 10,000 terms, on my machine—much faster than even a simple sum of 10,000 terms with version 2.x, but still quite slow). Could cumulative sums also be accelerated?

Note: time measurements must force the calculation of the standard deviation of each result. For instance, map(lambda x: x.std_dev, arr.cumsum()) with arr = unp.uarray(np.full(N, 1), np.full(N, 1)) and, e.g., N = 5000 is correct code for a timing. Timing str(arr.cumsum()) is not, as only a limited number of terms is printed, so not all standard deviations are actually calculated (they are lazily evaluated). Timing np.cumsum() alone is not a good measurement either (for the same reason).
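
For reference, a minimal timing sketch along these lines (the array size and the use of time.perf_counter are illustrative choices, not part of the original note):

import time

import numpy as np
from uncertainties import unumpy as unp

N = 5000
arr = unp.uarray(np.full(N, 1.0), np.full(N, 1.0))

start = time.perf_counter()
# Forcing the (lazy) calculation of every standard deviation is what
# actually exercises the quadratic behavior:
std_devs = [x.std_dev for x in arr.cumsum()]
print("%.1f s for N = %d" % (time.perf_counter() - start, N))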

lebigot avatar Aug 21 '16 17:08 lebigot

An idea would be to somehow reuse the previous term in the next term of the cumulative sum, without copying it (the copy is what creates the quadratic time). This can be done in principle in the specific case of the summation of independent quantities, since a sum of linear expansions of independent variables does not have to combine their linear coefficients. A generalization would be to still not copy the coefficients associated with each variable in each term, but to maintain an additional "correction" that ultimately gives access to the linear coefficient of common variables—thus avoiding a copy.

A difficulty arises when multiplying each term of the cumulative sum by a constant: this should not result in a quadratic-time calculation. (It currently does, since term #n is a specific copy of n variables associated with their linear coefficients, so multiplying term #n by a coefficient takes O(n) time, which results in an overall quadratic calculation time for the multiplication of all the terms of the cumulative sum.)

Another difficulty arises when calculating the standard deviation of each term: in principle, the standard deviation of one term could be reused when calculating that of the next term; as things stand, each term depends on a linearly increasing number of variables, so the version 3.0 calculation of all the standard deviations is in itself quadratic.

Another idea would be to keep copies (and so the quadratic time) but to make the whole process somehow much faster.

lebigot avatar Aug 21 '16 18:08 lebigot

Faster cumulative summation would be really, really useful as well.

andyfaff avatar Aug 22 '16 02:08 andyfaff

It's a challenge, as the cumulative sum of independent variables is made of terms with 1, 2, 3,… variables, whose total size grows quadratically. The code would need a data structure for linear combinations of linear combinations that is optimized (probably constant-time in the number of independent variables involved) for many operations (mostly multiplication by a factor, linear combination of a couple of linear combinations, and gathering of all the coefficients in front of a variable). This does not look obvious, but I can think about it.

lebigot avatar Aug 22 '16 06:08 lebigot

Notes to self with ideas about this quadratic time cumulative sum problem (below, "expressions" are essentially the linear combination that represents their linear expansion, so I use the term almost interchangeably with "linear combination"):

It would be nice to have a faster arr.cumsum(), but getting the uncertainty for expressions like the following should be fast too (this is not a given when calculations are done lazily, as they are in version 3.0.1): 2*arr.cumsum(), arr.cumsum()[::-1],… Thus, for example, it might be tempting to use some method that works for arr.cumsum() by taking advantage of the fact that a+b is probably calculated before (a+b)+c, etc., but this method would not work with arr.cumsum()[::-1], since ((a+b)+c)+… is probably calculated before a+b in this array.

Idea for speeding this up (while unfortunately keeping a quadratic time): when new dictionaries (typically the dictionary of derivatives) are built, not building them element by element but using dict.copy() as much as possible could speed things up.

Idea for conceptualizing the problem when handling uncertainties:

  • Variables are first created: they can be seen as leaves.
  • Variables are then combined progressively: this builds a Directed Acyclic Graph from the leaves up. Here, I call this the "expressions DAG". More precisely, the edges of this DAG are decorated with linear coefficients: each node represents a linear combination, the sum over its child nodes j of w_j * LinearCombo(node j). Example: 3*(1x + 2y) + 5*(2x + 3z).

These two properties can be built upon (see below).
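
As an illustration of this view, here is a hypothetical, minimal representation of such a DAG (the Leaf and LinearCombo classes are made up for the example and are not the internal structure of uncertainties):

class Leaf:
    """An independent variable (a leaf of the expressions DAG)."""
    def __init__(self, name):
        self.name = name

class LinearCombo:
    """A node: the linear combination sum_j w_j * child_j."""
    def __init__(self, *weighted_children):
        # (coefficient, child node) pairs: the coefficients decorate the edges.
        self.terms = list(weighted_children)

x, y, z = Leaf("x"), Leaf("y"), Leaf("z")
# The example from the text: 3*(1x + 2y) + 5*(2x + 3z).
expr = LinearCombo((3, LinearCombo((1, x), (2, y))),
                   (5, LinearCombo((2, x), (3, z))))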

Idea for speeding up calculations: I guess that calculations more often involve independent variables than correlated variables. This can be a guiding principle, where speeding up the uncertainty calculation algorithm is first explored on expressions involving independent variables.

Idea for speeding up calculations: since expressions are built as a DAG from the variables up, in the case where all variables involved are independent, the uncertainty could be calculated on the fly (non-lazily) in an efficient way: the variances of linear approximations directly sum (in a linear combination). This would handle arr.cumsum()[::-1] nicely, since all the variances would already be calculated in the arr.cumsum() step—this would at least work for cumsum() and some variations of it like arr.cumsum()[::-1].

Coherent set of ideas for handling correlated variables:

  • When the variances of the terms of the linear approximation cannot be summed directly because of correlations, a simple correction could be applied. This correction only needs to know the linear coefficients of the variables common to the terms (e.g., if we sum variances naively we get (1σ_X)^2 + (2σ_X)^2 where we should calculate [(1+2)*σ_X]^2: if we know that X has coefficients 1 and 2 in each term, we just have to add the correction [(1+2)^2 - (1^2 + 2^2)]*σ_X^2 to the sum of variances).

  • It is natural to separate the calculation of the set of variables of an expression from the calculation of the coefficients of the common variables.

  • The coefficients of the common variables are more efficiently calculated in a lazy way: for example, (t+t)*2*3 should not yield expressions that contain the successive linear combinations t+t, 2*t+2*t and 6*t+6*t, as each step requires 2 multiplications (many more, in the general case). Instead, the expression (t+t)*2*3 should be evaluated lazily: first 3*2 = 6, then a single multiplication of the linear combination by 6 (yielding 6*(2*t)). This is what is currently done (in version 3.0.1), but for all the variables of an expression instead of just the common variables between sub-expressions. Beyond accelerating the calculation of arr.cumsum(), this idea would speed up all calculations (essentially thanks to the idea of summing variances and then applying corrections for correlated variables).

  • Now, how do we get the common variables in two expressions in an efficient way (getting their coefficients is discussed below)? If we have the set of all the variables under each expression node, this is easy. However, a difficulty is that this set has a quadratic memory footprint for arr.cumsum() (the first array element has one term, the second two terms, etc.). I could not think of any general, efficient (i.e. linear in the number of nodes, both in memory and access time) representation of the set of variables under each node of a graph. However, an alternative would be to only calculate this set of variables with an appropriate discriminative caching mechanism (i.e., instead of keeping all the sets of variables in memory, only some sets are kept):

    • A simple (but imperfect) idea is to "steal" the set of variables of an expression when combining it with another expression, and to add to it the sets of the other expressions (the stolen set can usefully be the largest set of variables; see the sketch after this list). Thus, expression a+b would have the set {a, b} of variables. Expression (a+b)+c would steal the set {a, b} from expression a+b (which loses the knowledge of its set of variables) and add variable c to it. Thus, only one expression in a given expressions DAG contains its set of variables. This works with a calculation like arr.cumsum() but is not general: doing numpy.arange(10) + expression_with_many_variables would calculate 0 + expression_…, which would steal the (large) set of variables from expression_… and associate it with the new sum. This would be problematic for the next sum, 1 + expression_…, where the set of variables of expression_… would have to be calculated again (by going through its DAG down to the leaves).

    • I am not fully clear yet about a better, general algorithm for this discriminative caching, but I am thinking of using some kind of "dynamic caching": if the variables of an expression are required multiple times, then they are cached; essentially, this means that these variables could be stolen a few times (maybe once), but that future needed (re)calculations of their variables would store them. This would make arr.cumsum() (with independent variables in arr) fast (each term steals the variables of the previous term and adds one new variable). This would make the other example above (numpy.arange(10) + expression_with_many_variables) work too: the variables of expression_with_many_variables would be calculated only twice (instead of 10 times)—a first time when built, a second time when calculating 1 + expression_… after the set of variables has been stolen by 0 + expression_….

      One constraint is that we don't want many expressions to cache data when only the larger expression that contains them gets requests repeatedly. One solution might be to have two kinds of requests (for variables, or for variables with coefficients or variance): requests that only pass data (that can be modified upstream), and requests that can cache and send a copy (if they deem necessary: otherwise they act like the first type of requests). A possible such mechanism would be as follows:

      • A "top" expression whose variance needs to be calculated would allow sub-expressions at a certain depth to cache data (set of variables, maybe the [linear] coefficient of some variables). We probably do not want to cache in all sub-expressions data like the set of variables, because the data could be quadratic, for instance in the case of a cumulative sum (quadratic either in terms of memory if a full set of variables is stored in each sub-expression, and in terms of computations if each sub-expression only points to the variables of each of its sub-expressions, when calculating the common variables at each step of the cumulative sum). In fact, caching everything is essentially what currently slows down cumulative sums of independent variables. It is natural to only want to cache data at the level (in the expression tree) typically used during calculations (as defined by repeated requests for the variables of a certain sub-expression).

      • Sub-expressions within the caching depth would only cache data if repeatedly asked to pass it along (with a certain threshold number of data passing, like maybe 2–3).

  • But then, how do we cache the calculation of the coefficients of correlated variables in each expression, so that we can speed up subsequent calculations? Nodes (of the expressions DAG) where the number of paths leading to any leaf increases can be decorated with a mapping from each common variable to its coefficient in each of its child node expressions. Example: when mixing (1+t) with (2+t) in expression 3*(1+t) + 5*(2+t), this expression could be tagged with the mapping {t: [3, 5]}, so that the variances of 3*(1+t) and 5*(2+t) can be summed and then corrected by a correlation offset term. Another option might be to only use the list of coefficients for correctly calculating (correcting) the variance, and then to store the final coefficient ({t: 8}, in the example). Doing this only for nodes where a new correlation appears uses less memory and fewer calculations than decorating each parent node later with the correct coefficients (such a non-lazy approach would not scale well with the number of correlated variables). How this can be handled when implementing the calculation of uncertainties remains to be investigated.

  • When common variables are identified in the terms of a new linear combination, the coefficients of common variables can be calculated by an iterative method similar to the lazy calculation of derivatives found in version 3.0.1 (https://github.com/lebigot/uncertainties/blob/eeae9c894fa50ca0521aff31ce27fd2012f916a3/uncertainties/core.py#L1477). This calculation is quite fast, since other (independent) variables can be ignored.
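
As a hypothetical illustration of the set-stealing idea mentioned above (the Expression class and its attributes are made up; only the bookkeeping of the sets of variables is sketched, for the cumulative sum of independent variables):

class Expression:
    def __init__(self, variables=None):
        self.variables = variables  # Set of variables, or None once "stolen".

def cumsum_expressions(variable_names):
    """Build the terms of a cumulative sum over independent variables."""
    terms = []
    for name in variable_names:
        if not terms:
            new_set = {name}
        else:
            new_set = terms[-1].variables  # Steal the largest set (no copy).
            terms[-1].variables = None     # The previous term loses its set.
            new_set.add(name)              # Constant-time extension.
        terms.append(Expression(new_set))
    return terms

terms = cumsum_expressions(["a", "b", "c"])
# Only the last term still knows its set of variables: {'a', 'b', 'c'}.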

In conclusion:

  • This already gives at least one fast method for calculating all the example expressions mentioned in the first point (which include arr.cumsum()): immediate (non-lazy) summation of variances, caching of a single set that contains the variables of a given node (through stealing), correction of the sum of variances by looking for the coefficients of common variables (with a fast iterative algorithm similar to the one used for the lazy evaluation of derivatives in version 3.0.1).
  • The additional idea of discriminative caching allows more kinds of expressions to be calculated fast.
  • Open issues:
    • Some implementation details are left open (like how to really calculate and use the decoration of only some nodes with the coefficients of the variables common to some of their terms).
    • I do not yet have a clear picture of the cost of manipulating fully correlated expressions. This might be an interesting case to consider now, so as to guide the pieces of the implementation and algorithm that are not yet fully defined.

lebigot avatar Sep 15 '16 22:09 lebigot

@lebigot I have not followed everything involved in the cumulative sums issue, but mentioning Directed Acyclic Graphs makes me think that a lot of this might be implemented in Theano. I know it's another dependency, but if we want fast computations on arrays, it will be more efficient (and require less work) than re-implementing all these low-level optimizations from scratch. Theano also does gradient calculations through automatic differentiation, I believe. For instance, PyMC3, which might have some overlap with the uncertainties package, uses Theano as its backend. What do you think?

rth avatar Sep 16 '16 12:09 rth

I'll have a closer look at the Theano and PyMC3 features and implementations. Now, I understand that the Theano equivalent of the cumulative sum of many numbers with uncertainty—and more generally any operation involving many numbers with uncertainty—would be the creation of 100,000s of Theano expressions: I am not sure that uncertainties is often used this way (?).

Now, I must say that my discussion above is not restricted to arrays: its Directed Acyclic Graphs have nodes that represent more general linear combinations, which can arise in other situations (like a loop that sums or multiplies together various expressions).

lebigot avatar Sep 16 '16 14:09 lebigot

Note to self: to-do list:

  • Implement a consistent subset of the ideas above (maybe there are still a few unforeseen grey areas in the implementation?).
  • Look at Theano's features and see whether they could serve as the basis for general and efficient uncertainty calculations.

lebigot avatar Sep 16 '16 15:09 lebigot

An additional general idea is that lazy evaluations can allow calculations to be optimized more easily: in effect, lazy evaluations build a "calculation plan" (like Spark does) that can in principle be optimized. This idea should be explored more.

This calculation plan could be used for deciding which memory/time-consuming quantities should be passed along (instead of being stored, thus prompting an expensive copy later) to which numbers with uncertainties (set of variables, set of coefficients for correlated variables that must be used for correcting the naive "non-correlated standard deviation", etc.).

Another point is that it is basically impossible to guess what usage will be made of existing expressions: there is no need to cache too many quantities, as they might never be used and take time to build.

In summary, the optimization strategy for calculating uncertainties might be: build expressions lazily as much as possible, so as to capture part of the calculation plan, and be ready to respond to a variety of uncertainty calculation requests efficiently (for example, a+b+c+…, cumsum([a, b, c,…]) and cumsum([a, b, c,…])[::-1] build the exact same tree of expressions, but will prompt different calculations: the algorithm should handle all of them efficiently). This is not obvious to do but would be great to have.

lebigot avatar Feb 14 '17 12:02 lebigot

Introduction

I spent 2–3 full days thinking more about how to optimize calculations in uncertainties, including cumulative sums, so as to dig deeper into the ideas above. The result looks like a solution to this issue of a quadratic calculation time for cumulative sums, and possibly, more generally, an acceleration of many of the calculations done by uncertainties. Here is a report.

Context

Cumulative sums in uncertainties 3.0.1 are calculated in quadratic time. The reason is that for variables a, b, c, etc., cumsum([a, b, c,…]) = [a, a+b, a+b+c,…] contains a quadratic total number of variables (1 variable in the first term, 2 in the second one, etc.): simply building the set of variables in the result takes a quadratic time. This fundamental fact makes speeding up the calculation of cumsum() challenging.

Proposed optimized uncertainty calculation algorithm

Result obtained

I now have a theoretical (i.e. neither implemented nor tested) way of making cumsum() linear in time (and therefore in memory, because building a quadratic memory structure takes quadratic time), and also of speeding up more general calculations (complicated expressions, but also calculations like cumsum(…)[::-1], even though its terms appear in reverse order compared to cumsum(…), which makes some optimization schemes fail).

Overview of the proposed accelerated framework for uncertainties

The accelerated framework for calculating uncertainties revolves around a proposed caching algorithm, combined with various overall design choices (lazy evaluations, so as to learn as much as possible before planning optimizations; an algorithm that accommodates read-only data;…).

Fundamental assumption

The accelerated framework is based on the assumption that most calculations involve only few correlated variables.

Overview of some design choices

  • Lazy evaluations of the variance yield richer possibilities for optimizations, since the structure of multiple interconnected expressions can be known before performing most of the calculations.

  • The variance of each expression is first (lazily) calculated "naively" by ignoring correlations and calculating the proper linear combination of its term variances (after the linearization from linear error propagation theory). Such a calculation has the advantage of being fast. The result is correct except for the inclusion of correlations.

  • This implies that it is convenient to prevent the error on variables from being updatable. Up to and including uncertainties 3.0.1, f = a+b; print f; a.std_dev = …; print f prints two different errors for f. Offering the same feature in the naive-variance framework would mean marking many variances as incorrect when some variable is updated, etc., which would be a lot of work for something that is not an obviously desirable feature (printing f twice and getting two different results despite not explicitly updating it is a side effect and is therefore a tad dangerous).

  • It is much more robust, convenient and cheap to only work with correct variances, instead of correcting them "at the last minute" and tracking which corrections should be applied (thus, calculating both x+a and x+b would require correcting x twice for correlations, etc.). I therefore decided that each sub-expression stores its (correct) variance (this does not use much memory, and the variances are needed anyway) after it is calculated (infinite caching).

  • How to correct naive variances and include correlations? The assumption of few correlated variables led me to consider efficiently tracking the set of variables in sub-expressions: this allows the program to check that the naive calculation of the variance is correct because there are (often) no correlations.

    This set of variables can be efficiently obtained by taking the largest set of variables among all the sub-expressions of an expression (copying it first if it is read-only—but the caching algorithm below should often make it writeable) and adding the smaller sets of variables to it. In the case of cumsum() on independent variables, this is done in constant time (addition of a single new variable).

    The basic implementation of uncertainty calculations in uncertainties, up to and including version 3.0.1, is instead to calculate and store the linearized version of each calculated expression (f = f0 + alpha da + beta db + …, which gives var(f) = alpha^2 var(a) + beta^2 var(b) + …). If correlations are few and far between, this is more information than what is needed for calculating the variance, which takes up a lot of memory and a long time to build (this gives the quadratic time for cumsum()), and implies long calculation times (because using f in new expressions involves multiplying each of its terms by the same linear factor, which takes linear time). Such long calculation times are avoided by the approach proposed here. In contrast with the maintenance of linear expressions, managing sets of variables is computationally faster: adding a new term to an expression can be done in constant time, whereas building a new expression alpha*a + b requires multiplying all the terms in a by alpha, which is linear in time (with respect to its number of independent variables).

    If common variables between sub-expressions are found (thanks to the set of variables in each sub-expression) when linearly combining them through the calculation of the naive variance, then the naive variance should be corrected. This can be done relatively efficiently by fishing for the coefficients of (only) the (hopefully few) relevant variables by going down the tree of sub-expressions, and using them for correcting the naive variance and obtaining the correct variance. This is basically what uncertainties 3.0.1 does when asked to calculate a variance, except that it implicitly calculates the coefficient in front of each variable for the variance in question.

    The variance correction can be calculated efficiently. In fact, the correct variance contribution of variable x (of standard deviation σ_x) to an expression is (sum_i a_i)^2 σ_x^2, where a_i is the linear coefficient of variable x due to sub-expression i (it is thus simply the product of the coefficient of x in sub-expression i and the weight of this sub-expression in the global expression). The naive variance, obtained by summing the variances of the sub-expressions, contains the contribution of variable x as sum_i (a_i σ_x)^2. The term to be added to the naive variance can thus be expressed in an efficient way by using the sum of all the coefficients, S = sum_i a_i, as [sum_i a_i * (S - a_i)] σ_x^2 (the number of terms is linear in the number of sub-expressions, instead of being quadratic as in the naive formula for the correction, [sum_i sum_{j≠i} a_i * a_j] σ_x^2). A small numeric check of these formulas follows this list.

    An added bonus of working with sets of variables is that looking for the coefficient of a few specific variables can be sped up thanks to the set of variables that some nodes might store (thus, there is no need to look for the coefficient of variable x if it is not in the variables of a sub-expression).

    Based on the assumption that correlated variables should be relatively few, it makes sense to cache the linear coefficients that each sub-expression had to calculate: this makes future calculations faster and should not take too much memory. This caching could be made optional, though (at the cost of more calculations if some coefficients must be recalculated). There is a risk here, though: in the case of many correlated variables (which runs against my basic assumption), this could end up taking a more serious amount of memory (just a bit more than in uncertainties versions 1 and 2, which did not use lazy calculations of linear coefficients).

  • How to efficiently obtain the set of variables needed when naively combining variances so as to be able to correct the final variance if needed? This is the subject of the next section.
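
As referenced above, here is a small numeric check of the naive-variance correction formulas (plain NumPy; the coefficients and standard deviation are arbitrary):

import numpy as np

sigma_x = 1.5                    # Standard deviation of the common variable x.
a = np.array([3.0, 5.0, -2.0])   # Linear coefficients of x in the sub-expressions.

naive = np.sum((a * sigma_x)**2)   # Sum of the sub-expression contributions.
exact = (a.sum() * sigma_x)**2     # Correct contribution: (sum_i a_i)^2 sigma_x^2.

S = a.sum()
correction = np.sum(a * (S - a)) * sigma_x**2   # [sum_i a_i (S - a_i)] sigma_x^2.

assert np.isclose(naive + correction, exact)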

Optimized calculation of sets of variables

The challenge

There is a major challenge in efficiently calculating the set of variables in each expression: indefinitely caching this set in each sub-expression would typically imply for cumsum() a quadratic data structure building time since the total number of variables in all the terms is quadratic (I could not find or imagine any data structure that would be linear in this case while allowing constant-time membership testing like sets). In order to tackle this challenge, I thus devised a caching scheme for the sets of variables that addresses the need to both limit copies (of sets of variables in sub-expressions) and limit recalculating the same sets over and over again (which would also typically result in a quadratic calculation time for cumsum() since each result contains 1, 2, 3, etc. variables).

Proposed caching scheme

Key principles

The sets of variables needed for deciding whether and how to correct a naive variance can be obtained efficiently through temporary caching: in fact, as seen above, a full storage of the variables in each sub-expression typically requires "long" data structure building times, so caches must be temporary. If a cache is too short-lived, then calculations (of a set of variables) will have to be redone, which slows things down. If it is too long-lived, then building all the caches will take too much time because this multiplies the amount of data cached (in the extreme case, everything remains cached, which is why uncertainties 3.0.1 is slow for cumsum()). The algorithm below is "exact" in that it determines exactly the needed time to live of a set of variables (in the form of a temporary cache).

The caching scheme proposed here can be described as a "data flow management" algorithm: the "data bits" are sets of variables (in expressions), and they flow through the graphs defining the expressions with variables, staying cached at nodes only as long as needed—not longer.

In order for the caching algorithm to be well adapted to the task at hand, it is important to know how many times a given list of variables will be needed: the proposed algorithm assumes that repeated requests for the variance of an expression require a single calculation of the variables common to the sub-expressions of an expression (these variables are necessary for correcting naive variances); this effectively fixes the number of times a given expression will require the variables of its sub-expressions: one time. The choice (described above) of caching forever the variance of an expression once it is calculated is a way of satisfying this constraint.

An important factor to take into account in the "exact" calculation of the time to live of caches is that not all needed calculations are known in advance: the user might build some interconnected expressions (like x = a, a+b, a+b+c, etc.), then ask for their variances, then build more expressions on top of it (like x + y). The algorithm must handle the updated expression structure so as to keep caching sets of variables exactly as needed for the new expression (even if this implies recalculating some sets of variables that looked dispensable and were destroyed just before the update). In other words, it is important to keep in mind that speeding up calculations means being able to not only speed up one uncertainty calculation (z = cumsum([a, b,…])) but also subsequent calculations that involve previous results (z + …)—from which caches might have already been emptied because the last calculation was not yet requested when z.std_dev was calculated.

A general algorithm: definitions

The proposed caching algorithm is general. It caches data bits (meaning a bit of data; here: sets of variables) that allow each node in a tree (here: mathematical expression) to calculate some result (here: a corrected variance) by combining the data bits of its sub-nodes, through a fixed, positive number of data bit requests to its sub-nodes (one request per sub-expression), independently of how many times the node result is requested itself (in uncertainties, the variance will be indefinitely cached). I will therefore use the general terms when describing the algorithm.

"Exact" caching algorithm

The exact caching algorithm revolves around the (exact) counting of the nodes that will need the data bit cached in a given node, given all the currently known information (subsequent calculations of the result of a (new) node generally require more calculations). Each node is thus assigned a state that describes this number of known nodes waiting for the data bit, and a bit more (the set of states is a bit richer than this). Thus:

  • States 1, 2, 3, etc. mean "the data bit is still cached in the node, because there are that many nodes that should require it".

  • State 0 is special: it is the state of a node when it is created (as the parent of sub-nodes), and it means: "the data bit has not yet been requested (and therefore obtained) from the sub-nodes, so the node's result is not available yet" (because the result of the node has not yet been requested). In this state, the data bit cache is therefore empty.

  • State EMPTIED should be the basic state of a node after a set of calculations is done: it is the state of a node that has given off its data bit to all its parents, and it means: "the cache was emptied, but the result is calculated already". Ideally, an EMPTIED node should not be asked to give its data bit again, because it would have to be recalculated. However, this might happen if more parents are added after a first set of calculations—this need is impossible to predict with certainty, but the lazy evaluation of results allows the algorithm to know as much as possible about multiple upcoming result calculations, which avoids recalculating data bits as much as possible.

The algorithm is essentially defined through the transitions between these states (a sketch in code follows this list):

  • Again, creating a node sets its state to 0.

  • Adding a parent to a node in a non-EMPTIED state increases its state by 1 (meaning that there is one additional declared need for the data bit of the node).

    Doing this to a node in an EMPTIED state sets its state to 1 (the node had satisfied all the previous needs, but learns that there is one additional declared need—typically because new nodes are built on top of nodes whose result was previously already obtained), and all its sub-nodes (the ones it will send data bit requests to in order to recalculate its own data bit) should also perform the state transition associated with "adding a parent" (so as to learn that there is a new, additional need for their data bit downstream)—this is necessarily done recursively, by this definition of "adding a parent".

  • Responding to the data bit request of a parent node by giving it a reference to the data bit decreases the state by 1 (each parent node is required to only perform one such request—for instance by indefinitely caching the node result), unless the state is initially 1: in this case the last known request is being made, and the node goes to the EMPTIED state. In order to limit data copies, the data bits returned before the last one (i.e., while the state is above 1) can be references but should be marked as read-only; the last data bit returned should be writeable, since the node does not need it anymore.

    No node should be in the 0 or EMPTIED state when a data bit request arrives because the number of upcoming requests is supposed to be encoded in the node's state (a numeric state represents the number of expected data bit requests). However, a proper implementation should gracefully handle this case. A warning about an internal bug could be issued, and the request should in any case be honored (by propagating it to its sub-nodes).
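
Here is a minimal sketch in Python of the state machine described above (a generic illustration, not the uncertainties implementation: the Node class, its methods and the use of plain sets as "data bits" are all made up for the example):

EMPTIED = "EMPTIED"

class Node:
    def __init__(self, children=()):
        self.children = list(children)  # Sub-nodes (sub-expressions).
        self.state = 0                  # A node is created in state 0.
        self.cache = None               # Temporarily cached data bit.
        for child in self.children:
            child.add_parent()

    def add_parent(self):
        """Declare one additional future need for this node's data bit."""
        if self.state == EMPTIED:
            # The cache was already emptied: the sub-nodes must learn
            # (recursively) that their data bit will be needed again.
            self.state = 1
            for child in self.children:
                child.add_parent()
        else:
            self.state += 1

    def request_data_bit(self):
        """Return (data_bit, writeable) to a requesting parent."""
        if self.state in (0, EMPTIED):
            # Should not happen if the bookkeeping is exact; honor the
            # request anyway by recalculating from the sub-nodes.
            self.state = 1
        if self.cache is None:
            self.cache = self._calculate_data_bit()
        if self.state == 1:
            # Last known request: hand over the data bit as writeable.
            data, self.cache, self.state = self.cache, None, EMPTIED
            return data, True
        self.state -= 1
        return self.cache, False  # To be treated as read-only by the caller.

    def _calculate_data_bit(self):
        if not self.children:
            return {self}  # A leaf stands for a single variable.
        result = set()
        for child in self.children:
            data, writeable = child.request_data_bit()
            if writeable and len(data) > len(result):
                data.update(result)   # Extend the largest writeable set.
                result = data
            else:
                result.update(data)   # Read-only data is copied into result.
        return result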

Conclusion on the caching algorithm

This "exact" algorithm for caching data bits uses as much information as possible from the structure of the tree where these data bits should flow (and built while result evaluations are lazily delayed), so as to keep track of the exact time to live of the data bit caches of its nodes. This time-to-live evaluation, combined with as much read-only data bits as possible, and some convenient writeable data bits (that can be used as the base for bigger data bits, like the set of variables of a large expression, obtained by adding a single variable to a previous large expression), should give a fast algorithm for obtaining data bits (the set of variables in each expression, in the case of uncertainties).

Conclusion on the global uncertainty calculation algorithm

The design choices and algorithm above should provide an additional boost to the speed of uncertainty calculations in uncertainties that goes beyond the specific case of cumsum(). This boost comes in particular from:

  • the "exact" caching algorithm for sets of variables…

  • … combined with their optimized aggregation (the largest set is extended with the smaller ones, and sets of variables are either not copied and read-only, or writeable with no need to be copied),

  • with the lazy calculation of uncertainties, which allows calculation plan optimizations to be performed,

  • with the calculation of corrected naive variances, which in effect gives a fast and small cache of key results,

  • and with the assumption that most expressions should have relatively few correlated variables, which allows variances to first be calculated quickly as approximations (before being corrected).

Expressions like cumsum() and cumsum(…)[::-1] for independent variables, which motivated this study, are, with this algorithm, linear in time and memory, which is a nice result since these quantities contain a quadratic total number of variables. Both calculations require the same amount of time because they have the exact same underlying tree structure. (The basic reason is that the calculation of a given sum a+b+c+… prompts the calculation of the variance of a, a+b, a+b+c, etc., with each variance being cached indefinitely after being calculated in constant time, thanks to the addition of a single variable at each step and the absence of correlations.) The algorithm above applies to more general expressions and should make the calculation of their uncertainty faster in the majority of cases than with uncertainties 3.0.1 (despite its massive speed-up compared to versions 1 and 2).

Current open points

The recursive algorithm cannot be directly implemented

Python's recursion limit is too small for performing variance calculations recursively (there can easily be more terms than the recursion limit). This can be handled by implementing a kind of call stack—maybe tailored to the kind of required calculations (combining the variances of sub-terms, calculating sets of variables and caching them as needed,…). Frames in the stack would typically be responsible for performing a certain calculation and then sending the results to the appropriate frames in the stack (e.g. sending the variables of a node to a parent node that will collect them later when its frame is executed).

Maybe it makes sense to implement frames as functions, with additional attributes that contain the variables necessary for the execution of the function?
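
For what it's worth, a generic non-recursive traversal with an explicit stack could look like the following sketch (unrelated to the actual uncertainties internals; the children attribute is assumed):

def collect_leaves(root):
    """Gather the leaves (variables) under a node without recursion."""
    leaves = set()
    stack = [root]
    while stack:
        node = stack.pop()
        children = getattr(node, "children", ())
        if children:
            stack.extend(children)  # Defer the sub-nodes instead of recursing.
        else:
            leaves.add(node)
    return leaves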

Performance of the calculation of some linear coefficients

I have yet to look closely at the performance of "fishing for the linear coefficients of some specific variables common to sub-expressions", in terms of both memory and time. I should look at possible caching schemes, from no caching, to full caching, to partial caching. The performance should be evaluated in the context of various possible usage scenarios of the (temporarily) cached sets of variables (maybe an expression knows that it does not have variable x because its set of variables is still cached, maybe the sequence of accesses to these sets has already emptied the cache).

Example calculations that could be checked for performance: a+a+a+…, a+b+c+d+…, a+b+c+ … +c+b+a (where each variable is independent), with x = cumsum(…), x+x and x+x[::-1].

Caution with read-only and writeable results

The implementation and use of read-only and writeable sets of variables should be done with care. For example, if the read-only version contains a reference to the same contents as the writeable version, updating the writeable version will update the read-only version, which is not what is intended. Maybe "read-only" should instead be understood as "immutable"/"must not be mutated".

Optimization of common cases

In addition to being correct, the implementation should also minimize copies in typical cases (unary operators like cosine, and binary ones like division).

Conclusion

This seems to be a solution to the initial problem, and more generally a way of accelerating many of the calculations done by uncertainties.

It implies forgoing the possibility of updating the standard deviation of existing variables—as is possible in uncertainties version 3.0.1 and below.

Implementing it involves rewriting the way uncertainties are calculated, with a more intricate flow of information (on-demand, no-copy transfer of sets of variables between various expressions), a simulation of a call stack (with values flowing between different function bodies), etc.

The calculation of derivatives should be less useful internally (in the approach above, there is no need to calculate the derivatives of a complex function whose linear expansion terms contain only independent variables, since their variances can be readily combined). However, it is useful for users to know where the total uncertainty comes from, so it should probably be kept as a feature (as it is directly needed by UFloat.error_components).

The new approach is closer to what a human calculator would do in order to save work (with a minimal amount of bookkeeping, with a maximum reuse of previous results, etc.).

lebigot avatar Feb 19 '17 22:02 lebigot

Would it be possible to use reverse autodifferentiation or some more advanced methods and speed things up? See maybe:

  • https://github.com/ageron/handson-ml2/blob/master/extra_autodiff.ipynb
  • https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html
  • http://videolectures.net/deeplearning2017_johnson_automatic_differentiation/

Is this fully adapted to step by step calculations like in Python?

lebigot avatar Apr 19 '20 12:04 lebigot

Sorry for the dumb question, but is there any sort of workaround available for computing the std_devs of a cumulative sum efficiently/in linear time as the code currently stands?

For context, my specific situation is that I have about 10 Variables, and then an array of ~6000 elements where each element depends on all of the Variables. After evaluating the cumulative sum of this array, trying to access all the standard deviations runs very slowly, as expected based on the explanation in this issue.

joshburkart avatar Feb 19 '24 16:02 joshburkart

I guess that custom code can run faster, in this particular case where the number of independent variables is small (10). The idea is to leverage the fact that we know that the cumulative sum is over numbers that depend on the same small number of variables, so as to use fast array operations:

  1. You would first build a 10 x ~6000 (NumPy) array with the partial derivatives of each of your ~6000 elements with respect to each of the variables (uncertainties gives access to these partial derivatives through the .derivatives mapping of each element; .error_components() gives the corresponding contributions to the standard deviation).
  2. The partial derivatives of the terms of the cumulative sum (with respect to the same variables) are then simply the cumulative sum of this array along the axis of the elements, which is a fast calculation.
  3. You could then build back numbers with uncertainties (with a simple matrix product between the partial derivative array and the original variables). Alternatively, if you are only interested in the variance of the cumulative sums, you could do a matrix multiplication of the element-wise squared partial derivative array with the variances of the variables.

I expect most of the speed gain to be in step 2 (fast array operation), and step 3 (same reason) if you are only interested in the final uncertainties (instead of combining the cumulative sums with other quantities, etc.).
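
A minimal sketch of this recipe, under the assumption of independent variables (the array sizes are illustrative; the .derivatives mapping gives the partial derivative of an expression with respect to each of its variables; only the standard deviations of the cumulative sums are computed, as in the alternative of step 3):

import numpy as np
import uncertainties

np.random.seed(0)

variables = [uncertainties.ufloat(np.random.random(), np.random.random())
             for _ in range(10)]
arr = [sum(c * v for c, v in zip(np.random.random(len(variables)), variables))
       for _ in range(6000)]

# Step 1: (number of elements) x (number of variables) array of partial derivatives.
derivs = np.array([[elem.derivatives.get(v, 0.0) for v in variables]
                   for elem in arr])

# Step 2: the derivatives of the cumulative sums are the cumulative sums of the
# derivatives (fast array operation).
cum_derivs = np.cumsum(derivs, axis=0)

# Step 3: variances of the cumulative sums, for independent variables:
# var_i = sum_j (d_ij)^2 * var(variable_j), done as a single matrix product.
variances = np.array([v.std_dev**2 for v in variables])
cumsum_std_devs = np.sqrt(cum_derivs**2 @ variances)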

PS: @joshburkart: your question wasn't dumb!! 😀

lebigot avatar Feb 19 '24 18:02 lebigot

@lebigot Thank you for the info! I actually looked into what I was doing a bit more, and it seems that computing the standard deviations was slow due to how I was doing unit conversions (using pint), not because of uncertainties. When I do unit conversions differently, I'm assuming I'm still incurring O(N^2) runtime, but for my input sizes this only takes a second or so. Here's an MRE:

import numpy as np
import pint
import uncertainties
import uncertainties.unumpy

np.random.seed(1)

UNITS = pint.UnitRegistry()

variables = np.array([uncertainties.ufloat(np.random.random(), std_dev=np.random.random()**2) for _ in range(10)])

# Compute array and cumulative sum without units (pure NumPy array).
array_j = np.array([variables @ np.random.random(len(variables)) for _ in range(10000)])
cumsum_j = np.cumsum(array_j)

# Compute unit-full array and cumulative sum.
array = array_j * UNITS.J
cumsum = np.cumsum(array)

# Compute standard deviations of pure NumPy array, then convert units. Fast.
print(uncertainties.unumpy.std_devs(cumsum_j) * (1. * UNITS.J).to(UNITS.kWh).m)
# Convert units, then compute standard deviations. Hangs.
print(uncertainties.unumpy.std_devs(cumsum.to(UNITS.kWh).m))

joshburkart avatar Feb 19 '24 19:02 joshburkart

Interesting. Good to hear that uncertainties is fast enough for you.

The runtime should actually be O(number of variables x number of terms to sum cumulatively), in your case.

The O(N^2) runtime is for N independent variables that you sum together cumulatively (this is also the memory footprint, since each cumulative-sum term keeps track of which variables it depends on).

lebigot avatar Feb 19 '24 20:02 lebigot