New formatting option with implicit uncertainty?
From a user:
May I suggest you add an additional formatting specifier to "uncertainties"?
In many cases, particularly in tabular presentation of data, one wants to simply round according to the uncertainty and let the "significant figures" communicate the implied precision. It is not, of course, as elegant or as explicit as the shorthand notation, but sometimes it is desirable.
Then one will have the option of presenting the number (3.1415926,0.03) as … 3.14
One possibility would be to use the precision modifier "u" as you already do, but have it affect the number of digits past the one that is constrained by the error: .1u would then print exactly the same data value as the existing default format prints before the +/-, etc. This keeps the digits displayed consistent across the various formats. For tunability, the user can modify the std_dev. In fact, another utility function might be useful for modifying it on the fly: x.std_mult(2.0) would return a ufloat with the existing std_dev multiplied by 2, so one could pass such a "temporarily" modified value into the formatter, e.g. '{:.2u}'.format(x.std_mult(2.0)).
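Until such an option exists, the suggested std_mult() can be approximated with a plain helper (a sketch; std_mult is only a proposal, not part of the library, and building a fresh ufloat discards any correlations the original value carried):

```python
from uncertainties import ufloat

def std_mult(x, factor):
    # Sketch of the suggested std_mult(): same nominal value, standard
    # deviation multiplied by factor. The result is a new independent
    # variable, so correlations with x are lost.
    return ufloat(x.nominal_value, factor * x.std_dev)

x = ufloat(3.1415926, 0.03)
print('{:.2u}'.format(std_mult(x, 2.0)))  # 3.142+/-0.060
```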
After discussing with this user, I am in favor of doing the following:
- Add a formatting option that requests that the uncertainty not be printed, only the nominal part. This option could be combined with most existing formats. The need arises for instance when producing tables without uncertainty but with reasonably meaningful numbers. An advantage of this approach is that the format of the nominal part is not surprising, as it is similar to an existing format. A typical usage would be to combine it with the .1u format, as this prints the uncertainty with a single significant digit and truncates the nominal value at the same location.
- Add a formatting option that only prints the uncertainty, functioning in a way similar to the option that only prints the nominal value. The name of the option could be the lowercase version of the "nominal value only" option. This can be useful for separately printing the uncertainty, for instance in a table.
- Related to formatting, but not restricted to it: give UFloats a new method that returns a number with uncertainty with the same nominal value, but a random part multiplied by some factor (this is more convenient than ufloat(x.nominal_value, 2*x.std_dev)). With this, users can for instance easily print numbers with an uncertainty at 2 standard deviations, or even tune the digits printed when only the nominal value is shown (for instance by temporarily multiplying the uncertainty by 10).

PS: does it make sense to instead introduce the concept of a "number with an uncertainty that is a certain factor of the standard deviation of another number with uncertainty", correlated with the same variables as the original variable? This might be useful for people who need something like a "95 % confidence interval" (if they know enough about the shape of the random distribution). Warning: implementing this efficiently can currently be done, but might be hard with a new (e.g. faster) implementation of the uncertainty calculation.
Concretely, the option could be Q (print only nominal value) and q (print only standard deviation), as this chooses which "quantity" is printed. V and v would work too (defines which "value" is printed).
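In the meantime, the effect of the proposed Q/q options can be emulated by splitting the output of the existing formats on "+/-" (a sketch that relies only on the current, documented formatting):

```python
from uncertainties import ufloat

x = ufloat(3.1415926, 0.03)

# '{:.1u}' already exists: one significant digit on the uncertainty.
nominal_part, uncert_part = '{:.1u}'.format(x).split('+/-')
print(nominal_part)  # 3.14  (what the proposed Q would print)
print(uncert_part)   # 0.03  (what the proposed q would print)
```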
The method could be scaled_uncert(). Thus x.scaled_uncert(2.0) would always have an uncertainty that is twice as big as the uncertainty of x (even if the uncertainty of x changes because the uncertainty of one of the variables is changed).
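This correlation-preserving variant (the PS above) can in fact already be expressed with the current API (a sketch; scaled_uncert is only the proposed name): scaling the random part of x around its nominal value yields a value whose uncertainty tracks the original.

```python
from uncertainties import ufloat

def scaled_uncert(x, factor):
    # Scale the random part of x around its nominal value. Unlike
    # rebuilding a ufloat from scratch, this keeps the correlations
    # with the underlying variables, so the result's std_dev stays
    # equal to factor * x.std_dev even if x's uncertainty changes.
    return x.nominal_value + factor * (x - x.nominal_value)

x = ufloat(3.1415926, 0.03)
y = scaled_uncert(x, 2.0)
print(y.std_dev)            # 0.06
print((y - 2 * x).std_dev)  # 0.0: y is fully correlated with x
```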
I have written a function for my own use to solve this problem:

```python
import math
import numpy as np
from uncertainties import ufloat

def justif_prec(uf) -> str:
    # Keep only the decimal places above the position of the
    # uncertainty's first significant digit:
    decimals = max(1, math.floor(-np.log10(max(1e-9, uf.std_dev))))
    return '{0:.{1}f}'.format(uf.nominal_value, decimals)
```
```python
>>> myrand = ufloat(3.6125837, 0.035)
>>> print(myrand)
3.613+/-0.035
>>> print(justif_prec(myrand))
3.6
```
As for why I feel this function is required:

```python
>>> def sample_from(uf, ns):
...     return list(np.random.normal(uf.nominal_value, uf.std_dev, ns))
>>> ['{0:.1f}'.format(s) for s in sample_from(myrand, 8)]
['3.6', '3.6', '3.6', '3.6', '3.6', '3.6', '3.7', '3.6']
>>> print(justif_prec(myrand))
3.6
```
Whereas using the PDG rules for significant figures:

```python
>>> def ppg_prec(uf):
...     return str(uf).split('+/-')[0]
>>> print(ppg_prec(myrand))
3.613
>>> ['{0:.3f}'.format(s) for s in sample_from(myrand, 8)]
['3.608', '3.568', '3.554', '3.641', '3.596', '3.581', '3.553', '3.556']
```
The 13 in 3.613 is misleading information if you want to display the ufloat, especially if the ufloat has been estimated from data, and especially if you are comparing two ufloats by eye. The only digits that reliably predict the actual value are the first two (3.6), as displayed by justif_prec.
I can see why the PDG rules display these extra digits; in physics, people will often manually copy a value to be used in further calculations or otherwise communicated, and in that case carrying the two extra digits along is a good idea. However, if an uncertain number is only being displayed while the full precision remains stored in a numerical variable, displaying these two extra, misleading digits is a bad idea.
I think the justif_prec I have written could be improved; it's just a first attempt, but it gets at what I think is needed.
As a further example:
```python
>>> baseline_result = ufloat(8.66448136457, 0.0353459)
>>> result_with_new_feature = ufloat(8.671507820, 0.035175)
>>> print(baseline_result)
8.664+/-0.035
>>> print(result_with_new_feature)
8.672+/-0.035
```
Trap: "Seems like a small improvement: it changed two digits." Reality: "I was fooled by spurious precision."
```python
>>> from scipy.stats import norm
>>> def probability_greater_than(uf1, uf2):
...     mean_diff = uf1.nominal_value - uf2.nominal_value
...     std_diff = (uf1.std_dev**2 + uf2.std_dev**2)**0.5
...     return 1 - norm.cdf(0, loc=mean_diff, scale=std_diff)
>>> probability_greater_than(result_with_new_feature, baseline_result)
0.556
```
Thanks for sharing @RokoMijic.
Any modification of the way numbers with uncertainty are printed must be included in the general formatting code, which is very complicated, in part because multiple options can be combined (like LaTeX output and a given output precision). So until this is done, you can indeed use your function.
Now, comparing two results to see by how much they differ should be done by subtracting them, as they might be correlated. This means that the code for probability_greater_than() is incorrect: it should depend only on uf2-uf1 (std_diff = (uf2-uf1).std_dev, etc.). This won't change the result in your particular case, but it will in general. probability_greater_than() also assumes that the random part is Gaussian, which is not a given, though it might be a reasonable first assumption.
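Concretely, a corrected version along these lines (still a sketch, and it still assumes a Gaussian distribution for the difference):

```python
from scipy.stats import norm

def probability_greater_than(uf1, uf2):
    # Subtracting the two numbers lets uncertainties propagate any
    # correlation between them into the uncertainty of the difference:
    diff = uf1 - uf2
    return 1 - norm.cdf(0, loc=diff.nominal_value, scale=diff.std_dev)
```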
I still stand by the things that I wrote I am in favor of doing, above: I understand that they are consistent with your need, no?
@lebigot Well, the code is correct for its purpose: to work out the probability in that particular case (not in general).
But the use-case I have in mind is basically getting more people to do data science with the IMO excellent uncertainties module you have here.
One key sticking point I have seen is that 0.3618 +/- 0.0462 is information that is very difficult for a human to interpret correctly at a glance.
There is too much information there, and most of it is misleading.
When you're looking at a big list or table of uncertain numbers, all that irrelevant information overloads you and you can't really make sense of it.
The example I gave was there to demonstrate that these PDG representations can be very misleading.
Regarding printing, LaTeX and formatting: yes, I realise that that's going to be complicated.
It may be possible to sidestep that by just supplying a method called .jp or something that turns a ufloat into a string. Then if you want to print out the values of ufloats in a human-friendly format, you print out x.jp().
Of course it would be ideal to have a formatting option, but you know better than I do how that would work!
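For what it's worth, such a method can already be grafted on from user code (a sketch: jp is purely hypothetical, it reuses the justif_prec() defined above, and AffineScalarFunc is the class of numbers with uncertainty in uncertainties.core):

```python
from uncertainties import ufloat
from uncertainties.core import AffineScalarFunc

# Hypothetical: attach justif_prec (defined earlier) as a .jp() method.
AffineScalarFunc.jp = justif_prec

x = ufloat(3.6125837, 0.035)
print(x.jp())  # 3.6
```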
regarding this:
Add a formatting option that requests to not print the uncertainty, but to print the nominal part.
It doesn't really seem to get at the problem here (though it's nice to have), because it prints a FIXED number of decimal places that have no relation to the actual uncertainty. Yes?
The idea would actually be to "combine it with most existing formats". Thus, your examples would be almost covered by using the existing .1u format ("{:.1u}".format(x)) and adding an option to only print the nominal value (as I suggested).
Now, I understand that you want to have only correct digits printed. This is not possible in uncertainties (with, say, a hypothetical .0u format) because it is not possible in principle: take a number like 3.00±0.01: 3.0 is certainly not a list of correct digits, as there is a high chance that the real number starts with 2.9…. Thus, it is generally not possible to print only correct digits. uncertainties can therefore print up to the last digit that can vary (.1u), up to the last 2 digits that can vary (.2u), etc., because there is no robust concept of "digits that cannot vary".
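A quick simulation makes the point (a sketch, assuming a Gaussian distribution):

```python
import numpy as np

# For 3.00 +/- 0.01, about half of the plausible values start with
# "2.9...", so not even the leading "3.0" is a guaranteed digit sequence:
samples = np.random.normal(3.00, 0.01, 100_000)
print((samples < 3).mean())  # close to 0.5
```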
Right, it isn't possible using conventional decimal notation to display only the correct digits (correct with some probability).
Can you point me to the documentation for exactly what .1u does? I incorrectly assumed that it prints 1 decimal place irrespective of the uncertainty. Or is this use of .1u just a plan?
By the way, I did spend some time trying to solve the problem of wanting guaranteed digits and came up with some new notation ;)
http://uncertainties-python-package.readthedocs.io/en/latest/user_guide.html#printing
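For reference, the digit before u in a format like .1u sets the number of significant digits displayed for the uncertainty, and the nominal value is rounded at the same place. A quick illustration (expected outputs shown as comments):

```python
from uncertainties import ufloat

x = ufloat(3.1415926, 0.0462)
print('{:.1u}'.format(x))  # 3.14+/-0.05   (1 significant digit on the uncertainty)
print('{:.2u}'.format(x))  # 3.142+/-0.046 (2 significant digits)
```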