
Missing check type in CompositeDistribution ctors

Open sofianehaddad opened this issue 5 years ago • 5 comments

Hi,

The CompositeDistribution class allows creating a distribution of the form g(X) with:

  • g a scalar function
  • X a distribution

However, if X is itself of type CompositeDistribution, we should improve the ctors so that the antecedent attribute is always of a non-composite type.

In the following example,

import openturns as ot
g = ot.SymbolicFunction(['x'], ['sin(x) + cos(x)'])
distY = ot.CompositeDistribution(g, ot.Normal(1.0, 0.5))
h = ot.SymbolicFunction(['x'], ['abs(x)'])
distZ = ot.CompositeDistribution(h, distY)

distZ.getAntecedent() returns a CompositeDistribution, which seems to be less performant than defining

distW = ot.CompositeDistribution(ot.ComposedFunction(h, distY.getFunction()),
                                 distY.getAntecedent())

Here, we also get a better reading of the distribution's components.
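The equivalence of the two constructions can be sketched with plain Python callables (illustrative names, not the OpenTURNS API): nesting the transforms and composing the functions first produce the same values, so the two forms differ only in how the distribution is represented internally.

```python
import math

# Pure-Python analogue of the two equivalent constructions above.
def g(x):  # inner transform, as in distY
    return math.sin(x) + math.cos(x)

def h(x):  # outer transform, as in distZ
    return abs(x)

def composed(f_outer, f_inner):
    # analogue of ot.ComposedFunction(f_outer, f_inner)
    return lambda x: f_outer(f_inner(x))

samples = [-1.0, 0.0, 0.5, 2.0]
nested = [h(g(x)) for x in samples]          # h applied to g(X), step by step
flat = [composed(h, g)(x) for x in samples]  # (h o g) applied to X directly
assert nested == flat
```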

sofianehaddad avatar May 11 '20 13:05 sofianehaddad

I made a small benchmark:

import openturns as ot
from time import time

U = ot.Uniform(-14.0, 14.0)
f1 = ot.SymbolicFunction(["x"], ["sin(x)"])
f2 = ot.SymbolicFunction(["x"], ["abs(x)"])
d2 = ot.CompositeDistribution(f2, ot.CompositeDistribution(f1, U))
d3 = ot.CompositeDistribution(ot.ComposedFunction(f2, f1), U)
d4 = ot.CompositeDistribution(f1, ot.CompositeDistribution(f2, U))
d5 = ot.CompositeDistribution(ot.ComposedFunction(f1, f2), U)

size = 1000000
sample = U.getSample(size)
for name, d in [("d2", d2), ("d3", d3), ("d4", d4), ("d5", d5)]:
    print("#" * 50)
    t0 = time()
    d.computePDF(sample)
    print(name, "pdf=", size / (time() - t0), "evals/s")
    t0 = time()
    d.computeCDF(sample)
    print(name, "cdf=", size / (time() - t0), "evals/s")
    t0 = time()
    d.getSample(size)
    print(name, "rng=", size / (time() - t0), "rngs/s")

and the result is:

##################################################
d2 pdf= 864756.0941917921 evals/s
d2 cdf= 1126075.6209411258 evals/s
d2 rng= 3666769.533923902 rngs/s
##################################################
d3 pdf= 437918.6236194139 evals/s
d3 cdf= 670671.2142362593 evals/s
d3 rng= 4423615.373744149 rngs/s
##################################################
d4 pdf= 951401.5142837117 evals/s
d4 cdf= 1309934.8577946094 evals/s
d4 rng= 3655313.4156140466 rngs/s
##################################################
d5 pdf= 423262.3207686145 evals/s
d5 cdf= 671111.5114803113 evals/s
d5 rng= 4391964.355834324 rngs/s

If we compare the results for d2 and d3, then d4 and d5, we see that a composite distribution based on a composite distribution is much more efficient than a composite distribution based on the composed function for the PDF and CDF (essentially 2x faster), and slower for the sampling (-18%). So there is no definitive conclusion here: it all depends on how you use your composite distribution. If you want to build its orthonormal polynomial family, the PDF speed is crucial; if you want to use the associated isoprobabilistic transformation, the CDF speed is crucial. But if you only want to sample it, you don't care about PDF or CDF speed...
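For context on why the PDF speed is sensitive to the construction: evaluating the PDF of g(X) relies on the change-of-variables formula, which requires inverting g, whereas sampling only requires evaluating g forward on samples of X. A minimal pure-Python sketch of the formula (illustrative only, not OpenTURNS internals), with g = exp and X standard normal:

```python
import math

# Change of variables for a monotone transform Y = g(X):
#   pdf_Y(y) = pdf_X(g^{-1}(y)) / |g'(g^{-1}(y))|
# Here g = exp, so g^{-1} = log and |g'(g^{-1}(y))| = y.
def pdf_x(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def pdf_y(y):
    # density of Y = exp(X), i.e. the standard lognormal density
    x = math.log(y)      # invert g
    return pdf_x(x) / y  # divide by |g'(x)| = exp(x) = y

# check against the closed-form standard lognormal density at y = 2
ref = math.exp(-0.5 * math.log(2.0) ** 2) / (2.0 * math.sqrt(2.0 * math.pi))
assert abs(pdf_y(2.0) - ref) < 1e-12
```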

regislebrun avatar May 17 '20 17:05 regislebrun

Let us come back to this. Are the results mentioned above related to some parallelism mechanism of the symbolic function? With version 1.18 on my laptop, I get the following results:

##################################################
d2 pdf= 295771.7089972465 evals/s
d2 cdf= 386665.16707260185 evals/s
d2 rng= 1698046.5378393217 rngs/s
##################################################
d3 pdf= 172208.5059827033 evals/s
d3 cdf= 243021.40507642683 evals/s
d3 rng= 1664697.0105891505 rngs/s
##################################################
d4 pdf= 327039.0290529803 evals/s
d4 cdf= 473602.84583741287 evals/s
d4 rng= 1734277.0689285889 rngs/s
##################################################
d5 pdf= 166687.9548663672 evals/s
d5 cdf= 250604.91246820838 evals/s
d5 rng= 1605765.3287024489 rngs/s

It seems that the current implementation is definitely better in terms of CPU performance. Could you please check on your laptop with the current implementation?

sofianehaddad avatar May 09 '22 11:05 sofianehaddad

Here are the results on my laptop using the current master:

##################################################
d2 pdf= 1077999.7332182592 evals/s
d2 cdf= 1404892.5840847993 evals/s
d2 rng= 5230548.980777816 rngs/s
##################################################
d3 pdf= 529994.7850978897 evals/s
d3 cdf= 772879.6044411006 evals/s
d3 rng= 5413129.695330256 rngs/s
##################################################
d4 pdf= 1212368.1493446056 evals/s
d4 cdf= 1728886.2599777826 evals/s
d4 rng= 5232611.000910712 rngs/s
##################################################
d5 pdf= 512198.20169670146 evals/s
d5 cdf= 790148.7610031638 evals/s
d5 rng= 5268855.4574050065 rngs/s

and using the 1.18 version:

##################################################
d2 pdf= 1089805.8056924066 evals/s
d2 cdf= 1447121.6132338892 evals/s
d2 rng= 5456033.592369911 rngs/s
##################################################
d3 pdf= 558224.63892045 evals/s
d3 cdf= 814484.4215414544 evals/s
d3 rng= 5608610.47854267 rngs/s
##################################################
d4 pdf= 1240370.0849387874 evals/s
d4 cdf= 1779555.6015662688 evals/s
d4 rng= 5567913.935900618 rngs/s
##################################################
d5 pdf= 540968.9116324134 evals/s
d5 cdf= 836865.5451672124 evals/s
d5 rng= 5643399.087487874 rngs/s

but what is clear is the 2x penalty incurred by composing the functions instead of the composite distributions, I mean: CompositeDistribution(f, CompositeDistribution(g, X)) is faster than CompositeDistribution(ComposedFunction(f, g), X).

regislebrun avatar May 09 '22 12:05 regislebrun

Yes, this is the conclusion, and it is the "opposite" of what I had in mind.

sofianehaddad avatar May 09 '22 12:05 sofianehaddad

Yes, I expected the penalty of the two nested CompositeDistribution objects to be higher; maybe it would be with more complex functions / antecedents?

jschueller avatar May 09 '22 16:05 jschueller