fastnumbers
fastnumbers copied to clipboard
Proposal: change behavior of isfloat with respect to treatment of float("nan")
Is there any param in is_float
to check for nan and return False?
I am trying:
x = float("nan")
isfloat(x, allow_nan=False)
True
Currently, no - the allow_nan
option is only for strings.
The way I have found to check for nan quickly is with x != x
- which is only true for nan.
I would bet that x != x
would actually be faster than calling a function because of the overhead of calling functions in python.
Thanks @SethMMorton that works.
Will be possible to add this feature?
I'm on the fence about this, for the reason that technically float("nan")
is a float. But, if I were to implement this I would do it like so: I would change the default value of allow_nan
(and allow_inf
) to the equivalent of None
. If the value is None
, the behavior is the same as today ("nan"
returns False
, but float("nan")
returns True
). Then, if the user explicitly sets this this option to False
, it will return False
, for any form of NaN
.
Do you mean that allow_nan
will accept None
, False
, and True
?
So, when implementing a function in C, it's possible to actually have an execution path for when an argument is not given (as opposed to set to None
), by defining the initial value to NULL
, and then checking to see if it is still NULL
after arguments are passed. In this case, I was proposing it would still only accept False
and True
, but the default mode would behave as it currently does.
HOWEVER, I still have misgivings about this, especially after work done in #38 - or, more accurately, I do not know how to properly implement/name/document this to avoid confusion.
Here is the current behavior, and why I am not sure how to approach the problem.
-
fast_float
- This has the
nan
option that can be substitute eitherfloat("nan")
input or"nan"
- This has the
-
isfloat
- The
allow_nan
only toggles how"nan"
is parsed, and as a resultfloat("nan")
always returns True.
- The
-
query_type
(not yet released, part of PR #38)- The
allow_nan
only toggles how"nan"
is parsed, and as a resultfloat("nan")
always returns True.
- The
(Assume the same goes for the infinity-related option)
So, in the existing API there is already a disconnect between fast_float
's nan
and isfloat
's allow_nan
- if using a user wanted to identify isfloat
any value that would be substituted with nan
they would be unable because of how isfloat
behaves with float("nan")
. This gets even more complicated with query_type
. Ideally the API and output of query_type
would be consistent with isfloat
or isint
, but if we disable identification of float("nan")
as a float with allow_nan
, then what type would it return?
At the end of the day, I would like the behavior of all three of these functions to behave consistently. Do you have any suggestions to ensure this?
Copied from (a now-deleted) comment by @argenisleon in #39 that was intended from this issue.
I think fastnumbers could handle 'nan' as True. From the readme
>>> # Customize handling for nan or inf
>>> isfloat('nan')
False
>>> isfloat('nan', allow_nan=True)
True
In this way fastnumbers will answer True if a string to number is float.
I think the allow_nan
param should no be necessary
I have cleaned/moved up some conversation that belonged as part of #39.
@argenisleon The issue with the proposal to have isfloat
always handle "nan"
is that it removes flexibility from users. Not everybody has the same goal in mind when using these functions, and I can see use cases where people actively do not want to introduce NaN in to their dataset.
When I think about this request as well as the requests in #39 and #40, it boils down to giving the user more flexibility. I think this is a good idea. I also believe that I have a solution that will address all three issues that will not require changing the internal logic - just modifying the user interface. Further, this can be introduced as the new, improved interface while keeping the old interface for backwards compatibility (though I would probably undocument the old interface).
Here would be the proposed new API for fast_float
(the **kwargs
is for the backwards-compatibility stuff). The text of the API is first-draft and rather verbose.
def fast_float(
x,
*,
nan=ALLOWED,
inf=ALLOWED,
on_fail=INPUT,
on_typerror=RAISE,
allow_underscores=True,
**kwargs
):
"""
Quickly convert input to a `float`.
<more text>
Parameters
---------------
x
The input you wish to convert to a `float`.
nan : optional
Control how NaN is interpreted/handled. The default is the enum *ALLOWED*, which
indicates that both the string "nan" or the float NaN are accepted. Other valid values
are *STRING_ONLY* to indicate that only "nan" is accepted, *NUMBER_ONLY* to
indicate only NaN is accepted, *INPUT* to indicate that if given this value it should
be returned as-is, *RAISE* to indicate a *ValueError*, should be raised, a callable
accepting a single argument that will be called with the input to return an alternate
value, or a default value to be returned instead of NaN.
inf : optional
Control how INF is interpreted/handled. The default is the enum *ALLOWED*, which
indicates that both the string "inf" or the float INF are accepted. Other valid values
are *STRING_ONLY* to indicate that only "inf" is accepted, *NUMBER_ONLY* to
indicate only INF is accepted, *INPUT* to indicate that if given this value it should
be returned as-is, *RAISE* to indicate a *ValueError* should be raised, a callable
accepting a single argument that will be called with the input to return an alternate
value, or a default value to be returned instead of INF.
on_fail : optional
Control what happens when an input string cannot be converted to a float. The
default is the enum value *INPUT* which indicates that the value should be
returned as-is. Other valid values are *RAISE* to indicate a *ValueError* should
be raised, a callable accepting a single argument that will be called with the input
to return an alternate value, or a default value to be returned instead of the input
string.
on_typerror : optional
Control what happens when the input is neither numeric nor string. The default is
the enum value *RAISE* which indicates that a *TypeError* should be raised. Other
valid values are *INPUT" to indicate the input should be returned as-is, a callable
accepting a single argument that will be called with the input to return an alternate
value, or a default value to be returned instead of the input value.
"""
Here would be the proposed new API for isfloat
def isfloat(
x,
*,
nan=ALLOWED,
inf=ALLOWED,
consider=*,
strict=False,
allow_underscores=True,
**kwargs
):
"""
Quickly determine if a string is a float.
<more text>
Parameters
---------------
x
The input you wish to test if it is a float.
nan : optional
Control if and in what form NaN is interpreted. The default is the enum value *ALLOWED*,
which indicates that either "nan" or NaN will return *True*. Other allowed values are
*STRING_ONLY*, which indicates that only "nan" will return *True*, or *NUMBER_ONLY*,
which indicates that only NaN will return *True*, or *DISALLOWED*, which means neither
will return *True*.
inf : optional
Control if and in what form INF is interpreted. The default is the enum value *ALLOWED*,
which indicates that either "inf" or INF will return *True*. Other allowed values are
*STRING_ONLY*, which indicates that only "inf" will return *True*, or *NUMBER_ONLY*,
which indicates that only INF will return *True*, or *DISALLOWED*, which means neither
will return *True*.
consider : optional
Control the data types that may be interpreted. By default but string and numeric input
may be considered. If given *STRING_ONLY*, then only string input may return *True*.
if given *NUMBER_ONLY*, then only numeric input may return *True*.
strict : bool, optional
Control what string values may be considered as float. If *True*, then the string "56.0"
would return *True* but "56" would return *False*. If *False* (the default), then both of
the strings "56.0" and "56" would return *True*.
"""
With the above changes, your request becomes possible:
>>> from fastnumbers import isfloat, DISALLOWED
>>> isfloat(float("nan"), nan=DISALLOWED)
False
The request from #39 becomes possible:
>>> from fastnumbers import isfloat
>>> isfloat("56", strict=True)
True
The request from #40 becomes possible:
>>> from fastnumbers import fast_float, INPUT
>>> fast_float(None, on_typerror=INPUT)
None
@argenisleon I am on vacation next week and would like to add these changes during that time. I will implement all your requests as I have described above unless you indicate the proposal does not meet your request somehow.
@SethMMorton Just wanted to chime in and +1 to this feature/enhancement. The new API sounds very intuitive to me.
Would you be able to extend this functionality to the isreal
and fast_real
functions? I have a current use case where I'm testing to see which elements of a numpy input array of Object dtype are finite values, essentially making a boolean mask. The Object dtype makes it so that each value in the numpy array can be a float or an int. The current implementation I have is along the lines of np.vectorize(lambda value: isreal(value) and np.isfinite(fast_real(value)))
, which is slowed down by all the excessive function calls.
I tested with np.vectorize(isreal)
and saw a 4-5X speed up, but because of the Object dtype, nan and inf values are considered floats.
@wiltonwu Yes, the proof-of-concept that I showed here used the float
functions, but I would apply the API to all functions - I hate disjointed APIs.
@wiltonwu I'm also interested to hear that you found just giving the function to np.vectorize
gives a nice performance improvement. I would be interested in a PR that adds your findings on this to the documentation.
@SethMMorton sorry for the late response. Yes, the request still handles my needs. Thanks
@wiltonwu I'm also interested to hear that you found just giving the function to
np.vectorize
gives a nice performance improvement. I would be interested in a PR that adds your findings on this to the documentation.
Oh the performance improvement was between (1)np.vectorize(lambda value: isreal(value) and np.isfinite(fast_real(value)))(array)
and (2)np.vectorize(isreal)(array)
, which makes me suspect that np.vectorize(isreal)(array, nan=DISALLOWED, inf=DISALLOWED)
will also be faster than implementation (1).
Tangential to the discussion in this thread, but the conversation has me thinking that it may be beneficial to implement the "checking" functions as numpy universal functions. My hypothesis is that leveraging numpy broadcasting will be significantly faster than np.vectorize
. If you think that's valuable, I can open a separate issue for that proposal.
Just so you know - I am still planning to do this. Unfortunately, open source does not take a priority over family, so this has been pushed aside.
Lol - I realize it's been two years, but I have started working on this again.
I have thought a lot about backwards compatibility of this, and I have decided that instead of adding more kwargs to existing functions and then un-documenting old kwargs, I will create new functions with the new desired behavior and keep the old as-is, never to be removed. I will mark them as deprecated in the documentation.
The new functions will have names that I feel better reflect their intent, and I wish they were the names I had used originally. The fast_*
function updates will be named to try_*
, and the is*
function updates will be named check_*
.
I hope to release this and some performance updates in the next month or so.
@argenisleon I realize it's been a couple of years since you filed these issues, and you may have found other solutions, but I have finally released them as part of version 4.0.0 if you are still interested.