fastnumbers icon indicating copy to clipboard operation
fastnumbers copied to clipboard

Proposal: change behavior of isfloat with respect to treatment of float("nan")

Open argenisleon opened this issue 4 years ago • 17 comments

Is there any param in is_float to check for nan and return False?

I am trying:

x = float("nan")
isfloat(x, allow_nan=False)
True

argenisleon avatar May 14 '20 19:05 argenisleon

Currently, no - the allow_nan option is only for strings.

The way I have found to check for nan quickly is with x != x - which is only true for nan.

SethMMorton avatar May 14 '20 20:05 SethMMorton

I would bet that x != x would actually be faster than calling a function because of the overhead of calling functions in python.

SethMMorton avatar May 14 '20 20:05 SethMMorton

Thanks @SethMMorton that works.

Will be possible to add this feature?

argenisleon avatar May 16 '20 00:05 argenisleon

I'm on the fence about this, for the reason that technically float("nan") is a float. But, if I were to implement this I would do it like so: I would change the default value of allow_nan (and allow_inf) to the equivalent of None. If the value is None, the behavior is the same as today ("nan" returns False, but float("nan") returns True). Then, if the user explicitly sets this this option to False, it will return False, for any form of NaN.

SethMMorton avatar May 16 '20 17:05 SethMMorton

Do you mean that allow_nan will accept None, False, and True?

argenisleon avatar May 19 '20 06:05 argenisleon

So, when implementing a function in C, it's possible to actually have an execution path for when an argument is not given (as opposed to set to None), by defining the initial value to NULL, and then checking to see if it is still NULL after arguments are passed. In this case, I was proposing it would still only accept False and True, but the default mode would behave as it currently does.


HOWEVER, I still have misgivings about this, especially after work done in #38 - or, more accurately, I do not know how to properly implement/name/document this to avoid confusion.

Here is the current behavior, and why I am not sure how to approach the problem.

  • fast_float
    • This has the nan option that can be substitute either float("nan") input or "nan"
  • isfloat
    • The allow_nan only toggles how "nan" is parsed, and as a result float("nan") always returns True.
  • query_type (not yet released, part of PR #38)
    • The allow_nan only toggles how "nan" is parsed, and as a result float("nan") always returns True.

(Assume the same goes for the infinity-related option)

So, in the existing API there is already a disconnect between fast_float's nan and isfloat's allow_nan - if using a user wanted to identify isfloat any value that would be substituted with nan they would be unable because of how isfloat behaves with float("nan"). This gets even more complicated with query_type. Ideally the API and output of query_type would be consistent with isfloat or isint, but if we disable identification of float("nan") as a float with allow_nan, then what type would it return?

At the end of the day, I would like the behavior of all three of these functions to behave consistently. Do you have any suggestions to ensure this?

SethMMorton avatar May 26 '20 03:05 SethMMorton

Copied from (a now-deleted) comment by @argenisleon in #39 that was intended from this issue.


I think fastnumbers could handle 'nan' as True. From the readme

>>> # Customize handling for nan or inf
>>> isfloat('nan')
False
>>> isfloat('nan', allow_nan=True)
True

In this way fastnumbers will answer True if a string to number is float. I think the allow_nan param should no be necessary

SethMMorton avatar Jun 07 '20 00:06 SethMMorton

I have cleaned/moved up some conversation that belonged as part of #39.

SethMMorton avatar Jun 07 '20 00:06 SethMMorton

@argenisleon The issue with the proposal to have isfloat always handle "nan" is that it removes flexibility from users. Not everybody has the same goal in mind when using these functions, and I can see use cases where people actively do not want to introduce NaN in to their dataset.

SethMMorton avatar Jun 07 '20 00:06 SethMMorton

When I think about this request as well as the requests in #39 and #40, it boils down to giving the user more flexibility. I think this is a good idea. I also believe that I have a solution that will address all three issues that will not require changing the internal logic - just modifying the user interface. Further, this can be introduced as the new, improved interface while keeping the old interface for backwards compatibility (though I would probably undocument the old interface).


Here would be the proposed new API for fast_float (the **kwargs is for the backwards-compatibility stuff). The text of the API is first-draft and rather verbose.

def fast_float(
    x,
    *,
    nan=ALLOWED,
    inf=ALLOWED,
    on_fail=INPUT,
    on_typerror=RAISE,
    allow_underscores=True,
    **kwargs
):
    """
    Quickly convert input to a `float`.

    <more text>

    Parameters
    ---------------
    x
        The input you wish to convert to a `float`.
    nan : optional
        Control how NaN is interpreted/handled. The default is the enum *ALLOWED*, which
        indicates that both the string "nan" or the float NaN are accepted. Other valid values
        are *STRING_ONLY* to indicate that only "nan" is accepted, *NUMBER_ONLY* to
        indicate only NaN is accepted, *INPUT* to indicate that if given this value it should
        be returned as-is, *RAISE* to indicate a *ValueError*, should be raised, a callable
        accepting a single argument that will be called with the input to return an alternate
        value, or a default value to be returned instead of NaN.
    inf : optional
        Control how INF is interpreted/handled. The default is the enum *ALLOWED*, which
        indicates that both the string "inf" or the float INF are accepted. Other valid values
        are *STRING_ONLY* to indicate that only "inf" is accepted, *NUMBER_ONLY* to
        indicate only INF is accepted, *INPUT* to indicate that if given this value it should
        be returned as-is, *RAISE* to indicate a *ValueError* should be raised, a callable
        accepting a single argument that will be called with the input to return an alternate
        value, or a default value to be returned instead of INF.
    on_fail : optional
        Control what happens when an input string cannot be converted to a float. The
        default is the enum value *INPUT* which indicates that the value should be
        returned as-is. Other valid values are *RAISE* to indicate a *ValueError* should
        be raised, a callable accepting a single argument that will be called with the input
        to return an alternate value, or a default value to be returned instead of the input
        string.
    on_typerror : optional
        Control what happens when the input is neither numeric nor string. The default is
        the enum value *RAISE* which indicates that a *TypeError* should be raised. Other
        valid values are *INPUT" to indicate the input should be returned as-is, a callable
        accepting a single argument that will be called with the input to return an alternate
        value, or a default value to be returned instead of the input value.
    """

Here would be the proposed new API for isfloat

def isfloat(
    x,
    *,
    nan=ALLOWED,
    inf=ALLOWED,
    consider=*,
    strict=False,
    allow_underscores=True,
    **kwargs
):
    """
    Quickly determine if a string is a float.

    <more text>

    Parameters
    ---------------
    x
        The input you wish to test if it is a float.
    nan : optional
        Control if and in what form NaN is interpreted. The default is the enum value *ALLOWED*,
        which indicates that either "nan" or NaN will return *True*. Other allowed values are
        *STRING_ONLY*, which indicates that only "nan" will return *True*, or *NUMBER_ONLY*,
        which indicates that only NaN will return *True*, or *DISALLOWED*, which means neither
        will return *True*.
    inf : optional
        Control if and in what form INF is interpreted. The default is the enum value *ALLOWED*,
        which indicates that either "inf" or INF will return *True*. Other allowed values are
        *STRING_ONLY*, which indicates that only "inf" will return *True*, or *NUMBER_ONLY*,
        which indicates that only INF will return *True*, or *DISALLOWED*, which means neither
        will return *True*.
    consider : optional
        Control the data types that may be interpreted. By default but string and numeric input
        may be considered. If given *STRING_ONLY*, then only string input may return *True*.
        if given *NUMBER_ONLY*, then only numeric input may return *True*.
    strict : bool, optional
        Control what string values may be considered as float. If *True*, then the string "56.0"
        would return *True* but "56" would return *False*. If *False* (the default), then both of
        the strings "56.0" and "56" would return *True*.
    """

With the above changes, your request becomes possible:

>>> from fastnumbers import isfloat, DISALLOWED
>>> isfloat(float("nan"), nan=DISALLOWED)
False

The request from #39 becomes possible:

>>> from fastnumbers import isfloat
>>> isfloat("56", strict=True)
True

The request from #40 becomes possible:

>>> from fastnumbers import fast_float, INPUT
>>> fast_float(None, on_typerror=INPUT)
None

SethMMorton avatar Jun 15 '20 05:06 SethMMorton

@argenisleon I am on vacation next week and would like to add these changes during that time. I will implement all your requests as I have described above unless you indicate the proposal does not meet your request somehow.

SethMMorton avatar Aug 09 '20 02:08 SethMMorton

@SethMMorton Just wanted to chime in and +1 to this feature/enhancement. The new API sounds very intuitive to me.

Would you be able to extend this functionality to the isreal and fast_real functions? I have a current use case where I'm testing to see which elements of a numpy input array of Object dtype are finite values, essentially making a boolean mask. The Object dtype makes it so that each value in the numpy array can be a float or an int. The current implementation I have is along the lines of np.vectorize(lambda value: isreal(value) and np.isfinite(fast_real(value))), which is slowed down by all the excessive function calls.

I tested with np.vectorize(isreal) and saw a 4-5X speed up, but because of the Object dtype, nan and inf values are considered floats.

wiltonwu avatar Aug 11 '20 03:08 wiltonwu

@wiltonwu Yes, the proof-of-concept that I showed here used the float functions, but I would apply the API to all functions - I hate disjointed APIs.

SethMMorton avatar Aug 11 '20 03:08 SethMMorton

@wiltonwu I'm also interested to hear that you found just giving the function to np.vectorize gives a nice performance improvement. I would be interested in a PR that adds your findings on this to the documentation.

SethMMorton avatar Aug 11 '20 03:08 SethMMorton

@SethMMorton sorry for the late response. Yes, the request still handles my needs. Thanks

argenisleon avatar Aug 11 '20 17:08 argenisleon

@wiltonwu I'm also interested to hear that you found just giving the function to np.vectorize gives a nice performance improvement. I would be interested in a PR that adds your findings on this to the documentation.

Oh the performance improvement was between (1)np.vectorize(lambda value: isreal(value) and np.isfinite(fast_real(value)))(array) and (2)np.vectorize(isreal)(array), which makes me suspect that np.vectorize(isreal)(array, nan=DISALLOWED, inf=DISALLOWED) will also be faster than implementation (1).

Tangential to the discussion in this thread, but the conversation has me thinking that it may be beneficial to implement the "checking" functions as numpy universal functions. My hypothesis is that leveraging numpy broadcasting will be significantly faster than np.vectorize. If you think that's valuable, I can open a separate issue for that proposal.

wiltonwu avatar Aug 14 '20 19:08 wiltonwu

Just so you know - I am still planning to do this. Unfortunately, open source does not take a priority over family, so this has been pushed aside.

SethMMorton avatar Nov 23 '20 03:11 SethMMorton

Lol - I realize it's been two years, but I have started working on this again.

I have thought a lot about backwards compatibility of this, and I have decided that instead of adding more kwargs to existing functions and then un-documenting old kwargs, I will create new functions with the new desired behavior and keep the old as-is, never to be removed. I will mark them as deprecated in the documentation.

The new functions will have names that I feel better reflect their intent, and I wish they were the names I had used originally. The fast_* function updates will be named to try_*, and the is* function updates will be named check_*.

I hope to release this and some performance updates in the next month or so.

SethMMorton avatar Jan 24 '23 23:01 SethMMorton

@argenisleon I realize it's been a couple of years since you filed these issues, and you may have found other solutions, but I have finally released them as part of version 4.0.0 if you are still interested.

SethMMorton avatar Feb 02 '23 07:02 SethMMorton