OpenRefine
OpenRefine copied to clipboard
toString() doesn't cast int to float
I had a column in my projects with numerical values (imported as strings) that I wanted to serialise into a standard format. I used value.toNumber().toString("%.4f") and found that this mostly worked but it fell over with an error when the source value was represented as an integer. So it seems passing an integer into the format "%f" gives an error:
toString(1,"%f)
Error: java.util.IllegalFormatConversionException: f != java.lang.Long
The workaround is to force a cast to float by multiplying the value by 1.0, but I think this feels counter-intuitive. Given that OR only gives a single function toNumber() (not toFloat() and toInteger() ), I think it's appropriate for the toString function to apply whatever type casting is required for the format.
OpenRefine version 3.8.2
I'd rather not do automatic (unknown to user) conversions - which causes other issues downstream when users become unaware of precisely the conversions taking place. It's better to show an error (maybe improve the error and offer a hint to "Cannot convert Integer to Long; You might want to first use toFloat()"?
Alternatively, not transforming data when there's an error, can be done - which is actually a best practice - always checking the radio button "store error" - which in my opinion should actually BE THE DEFAULT, since Expression preview doesn't give you ALL the rows that might be producing an error, and if there was an error, the original value is kept and not changed, and hence folks think the transform worked when it did not. (broken feature in this case, so new issue #6943 )
Anyways:
if(
isError(value.toNumber().toString("%.4f")),
value,
value.toNumber().toString("%.4f")
)
But we try to make things simple for folks and I think a better idea might just be a chomp() like capability for reducing number precision to some decimal places without a lot of format syntax fuss...
precision(4) would format to 4 decimal places.
But I see in Commons Math 3 there's the idea of precision() https://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/util/Precision.html but no idea where this moved to in Commons Math 4.
Precision.round(value, 4);
which actually can take 3 parameters: round(value, scale, rounding method)
or decimalFormat() , whatever named, be useful?
DecimalFormat df = new DecimalFormat("##.####");
System.out.println(df.format(PI));
@tfmorris could my proposed new GREL function be implemented just for Numbers and throw errors when not applicable? Other ideas? Hmm, maybe newer JDKMath library has something to ease the work? I see this legacy trunc function: https://commons.apache.org/proper/commons-math/javadocs/api-4.0-beta1/org/apache/commons/math4/legacy/core/dfp/Dfp.html#trunc(org.apache.commons.math4.legacy.core.dfp.DfpField.RoundingMode)
My preference would be to expand the current GREL round() to take those 3 parameters like Commons Math's Precision.round() method.
Not transforming integers into the format I want (rather than raising an error) doesn't really simplify the task - I just need all numbers in a consistent textual format, so as it stands I think the best way is still
toString(value*1.0,"%f)
especially since it's never wrong (and therefore not something to consider an 'error') to represent an integer with a decimal place.
Although I agree that automatic conversions can be baffling, I do think int-to-float casting is so common that it is intuitive. From the GREL documentation on 'Math functions':
If either operand is a floating point number, they both get promoted to floating point and a floating point result is returned.
So promoting an int to a float for the purpose of converting with the %f format feels like the same logic. Moreover, it's not obvious (or documented!) that when you convert a whole column with toNumber() that you may end up with a mix of types that you then need to treat differently. I think it behooves a 'refining' intention to be able to treat all of the 'numbers' thus created in the same way.
All that being said, I think providing a more sophisticated round() function would be good too, though it would have to have the (possibly peculiar) property of having a a different output type depending on the precision parameter (i.e. int only if precision is 0).
I'd rather not do automatic (unknown to user) conversions
Like it or not, we've doing this since day 1 in our arithmetic processing.
We probably want "<numeric string>".toNumber().toString() to return the original string, but that's problematic currently because toString() uses the default local while toNumber().
We probably also ideally want toNumber() to produce consistent data types for all cells in a column, but that's not what it currently does.
While we could theoretically introspect the format string to look for the floating point conversion characters (xXeEfgGaA), we'd also need to reliably parse the %[argument_index$][flags][width][.precision]conversion syntax. Not really terribly difficult, but it just seems like a can of worms that has limited value in opening given the existence of an easy workaround.
Certainly if we redesign this particular corner of GREL, we should see what we can do to improve things. My vote would probably be for fixing ToNumber to not return variable data types.
We probably also ideally want toNumber() to produce consistent data types for all cells in a column, but that's not what it currently does.
Oh gosh, didn't realize that! Messy data stays...somewhat messy because of us and our inconsistency here? That definitely doesn't help users trying to align data values and what OpenRefine users expect.
My vote would probably be for fixing ToNumber to not return variable data types.
@tfmorris Agree. Definitely we should fix that first and be consistent.
Another example of somewhat counter-intuitive numerical behavior is the sum() function. Although the argument's elements are effectively always promoted to floats (which is logical enough), it may not be not obvious that the result can never be an int:
toString(sum([1,2]), "total: %d")
Error: java.util.IllegalFormatConversionException: d != java.lang.Double
The only solution here would be to use round() or floor() after the sum to cast back to integer.
On the whole, I think it's a good thing that the user doesn't need to know the underlying type of 'numbers' in most situations. It doesn't matter, for instance, if your column is integers, floats, or a mixture in order to make a working numerical facet on it, and no one thinks this is spooky (or expects errors). So I'd still vote for toString() to just be similarly intelligent and impervious to nuisance errors. Something like:
- For a float representation like %f, always promote ints to floats (same as x*1.0)
- For an integer representation like %d, cast to int (using floor(), say) if floor(x)*1.0==x , so an error is only raised if x has a fractional part (and then the user will know they need to cast with tools like round(), floor(), or ceil()).
So I actually don't quite agree with
My vote would probably be for fixing ToNumber to not return variable data types.
because there are situations where you have only ints or only floats and you want to keep them that way for your further calculations. If we made toNumber always output floats, for instance, then users would have to start putting round() or floor() after a call to toNumber() just to ensure integer processing.
I like the current way that GREL appears to have a generic 'number' type which works logically wherever you use it, so I'm in favor of making the other tools/functions that use numbers be somewhat impervious to the underlying Java type.
In our function discussions, I'm always looking around at what other projects do, just to get a feel for alternative viewpoints.
What if OpenRefine began to support NaN like other programs do? How then might this issue be redescribed?
https://docs.pola.rs/user-guide/expressions/missing-data/#null-and-nan-values
https://docs.pola.rs/user-guide/expressions/missing-data/#not-a-number-or-nan-values