wtfpython icon indicating copy to clipboard operation
wtfpython copied to clipboard

NaN breaks sorting in Python

Open LiquidFun opened this issue 3 years ago • 2 comments

While reading some issues for Python I stumbled upon this weird NaN behaviour.

When supplying NaN to max or min, it will return NaN if it is the first argument:

>>> max(float("NaN"), 1)
nan
>>> max(1, float("NaN"))
1

It's essentially to do with NaN always returning False in comparisons with other numbers:

>>> float("NaN") < 1
False
>>> float("NaN") > 1
False

The implementation of both min and max take the first element as the current min/max value and then compare every other value with it. But since every other number returns False in that comparison, it will never change the current min/max value. The opposite is the case then NaN is in any other position, then the min/max value will never become NaN.

This alone might not be enough for a snippet, but I feel the implications for this when sorting lists are quite huge. Python's sort straight up breaks when there is a NaN value in the list:

>>> sorted([3, 2, 1, float("NaN"), 1, 2, 3])
[1, 1, 2, 3, nan, 2, 3]  # ???

It seems to work correctly when NaN is at the front or the back of the list, but as soon as the NaN is somewhere in the middle it doesn't work most of the time. Timsort (and I assume algorithms such as quicksort and mergesort) struggle with that property of NaN. But even in selection sort it remains where it was, although the rest is at least sorted around it.

Selection sort (click to expand)
def selection_sort(a):
    for i in range(len(a)):
        smallest_j = i
        for j in range(i+1, len(a)):
            if a[j] < a[smallest_j]:
                smallest_j = j
        a[i], a[smallest_j] = a[smallest_j], a[i]
    return a

mylist = [110, -1, 3, 2, 1, 20, float("NaN"), 100, 1, 2, 3]
print(selection_sort(mylist))
# [-1, 1, 1, 2, 2, 3, nan, 3, 20, 100, 110]

Unfortunately, NaNs are quite common in data science to show non-existing data. For example, when pandas encounters missing entries in a .csv file it automatically puts NaNs there. So, suppose you have a large dataset and only a few entries in it have NaNs, sorting it with python's sort function will look correct at the beginning, until it encounters a NaN, where all hell breaks loose, but you might not notice it if you only look at the start of the array. It feels like introducing bugs this way is quite easy.

You can sort it with both pandas or numpy, which then works correctly. However, if you do not know that NaN has this issue, you may not think about it:

>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series([-1, -4, 3, -5, float("nan"), -2, 2, 4])
>>> a.sort_values()  # I believe it uses np.sort internally
3   -5.0
1   -4.0
5   -2.0
0   -1.0
6    2.0
2    3.0
7    4.0
4    NaN
dtype: float64
>>> np.sort(a)
array([-5., -4., -2., -1.,  2.,  3.,  4., nan])
>>> sorted(a)  # but Python's sort looks like this:
[-5.0, -4.0, -2.0, -1.0, 3.0, nan, 2.0, 4.0]

The status of this by the core developers of CPython as of 2019 was either "won't fix" or "not a bug" (link). The main reason being that the IEEE has some odd definitions for NaN. However, as recently as two weeks ago this has been brought up again, and some developers are considering "biting the bullet" and changing this behaviour (link). It remains to be seen if and when this changes, probably not that soon.

And for good measure, to further show that this is a wtfpython, some other languages handle this more intuitively:

JavaScript handles it well (click to expand)
>> Math.max(1, NaN)
NaN
>> Math.max(NaN, 1)
NaN
>> [3, 2, 1, NaN, 1, 2, 3].sort()
[ 1, 1, 2, 2, 3, 3, NaN ]
Java handles it well too (click to expand)
import java.util.Arrays;
public class Main
{
    public static void main(String[] args) {
        System.out.println(Math.max(1, Float.NaN));
        // prints: NaN
        
        System.out.println(Math.max(Float.NaN, 1));
        // prints: NaN
        
        float[] nums = new float[]{3, 2, 1, Float.NaN, 1, 2, 3};
        Arrays.sort(nums);
        System.out.println(Arrays.toString(nums));
        // prints: [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, NaN]
    }
}
Rust avoids the issue somewhat (click to expand)
fn main() {
    let nan = std::f64::NAN;
    println!("{} {}", f64::max(1.0, nan), f64::max(nan, 1.0));
    // prints 1 1
    
    let vector = vec![3., 2., 1., nan, 1., 2., 3.];
    // vector.sort();
    // The above line would not compile, since f64 does not implement 
    // std::cmp::Ord, so it is not sortable.
    
    // This avoids the issue, but when people do want to sort such a list
    // they look for a solution online, finding a thread, which suggests
    // a custom comparator for f64 in the first comment. Then the second
    // comment points out that this does not work for NaN values. Then
    // the third comment shows a longer solution which does work for
    // NaN values...
    
    // For example: 
    // https://users.rust-lang.org/t/sorting-vector-of-vectors-of-f64/16264/2
}
C++ has the issue as well though (click to expand)
#include <iostream> // cout
#include <vector> // vector
#include <algorithm> // sort()
#include <cmath> // sqrt()
using namespace std;

int main() {
    cout << sqrt(-1) << endl;   
    // prints: -nan
    
    cout << max(1.0, sqrt(-1)) << " " << max(sqrt(-1), 1.0) << endl;  
    // prints: 1 -nan
    
    vector<double> nums = {3., 2., 1., sqrt(-1), 1., 2., 3.};
    sort(nums.begin(), nums.end());
    for (auto i : nums)
        cout << i << " ";
    cout << endl;
    // prints: 1 2 3 -nan 1 2 3
}

As far as I can tell there are no snippets and no issues/PRs discussing this. If you think this is fitting as a snippet I'd be happy to make a PR and add it (possibly after the nan-reflexivity snippet).

LiquidFun avatar Jun 24 '21 15:06 LiquidFun

Hi @LiquidFun thanks for the insightful write-up, I agree that this will be a good addition to the collection.

Feel free to create a PR whenever you get time, I'm looking forward to it :) And yes, it makes sense to add it right after the nan-reflexivity example.

satwikkansal avatar Jun 25 '21 16:06 satwikkansal

To be fair same behavior is true for other languages because NaN breaks weak ordering axioms which are mandated by sort routines.

yugr avatar Jan 20 '22 04:01 yugr