wtfpython
wtfpython copied to clipboard
NaN breaks sorting in Python
While reading some issues for Python I stumbled upon this weird NaN
behaviour.
When supplying NaN
to max or min, it will return NaN
if it is the first argument:
>>> max(float("NaN"), 1)
nan
>>> max(1, float("NaN"))
1
It's essentially to do with NaN
always returning False
in comparisons with other numbers:
>>> float("NaN") < 1
False
>>> float("NaN") > 1
False
The implementation of both min and max take the first element as the current min/max value and then compare every other value with it. But since every other number returns False
in that comparison, it will never change the current min/max value. The opposite is the case then NaN
is in any other position, then the min/max value will never become NaN
.
This alone might not be enough for a snippet, but I feel the implications for this when sorting lists are quite huge. Python's sort straight up breaks when there is a NaN
value in the list:
>>> sorted([3, 2, 1, float("NaN"), 1, 2, 3])
[1, 1, 2, 3, nan, 2, 3] # ???
It seems to work correctly when NaN
is at the front or the back of the list, but as soon as the NaN
is somewhere in the middle it doesn't work most of the time. Timsort (and I assume algorithms such as quicksort and mergesort) struggle with that property of NaN
. But even in selection sort it remains where it was, although the rest is at least sorted around it.
Selection sort (click to expand)
def selection_sort(a):
for i in range(len(a)):
smallest_j = i
for j in range(i+1, len(a)):
if a[j] < a[smallest_j]:
smallest_j = j
a[i], a[smallest_j] = a[smallest_j], a[i]
return a
mylist = [110, -1, 3, 2, 1, 20, float("NaN"), 100, 1, 2, 3]
print(selection_sort(mylist))
# [-1, 1, 1, 2, 2, 3, nan, 3, 20, 100, 110]
Unfortunately, NaN
s are quite common in data science to show non-existing data. For example, when pandas
encounters missing entries in a .csv file it automatically puts NaN
s there. So, suppose you have a large dataset and only a few entries in it have NaN
s, sorting it with python's sort function will look correct at the beginning, until it encounters a NaN
, where all hell breaks loose, but you might not notice it if you only look at the start of the array. It feels like introducing bugs this way is quite easy.
You can sort it with both pandas
or numpy
, which then works correctly. However, if you do not know that NaN
has this issue, you may not think about it:
>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series([-1, -4, 3, -5, float("nan"), -2, 2, 4])
>>> a.sort_values() # I believe it uses np.sort internally
3 -5.0
1 -4.0
5 -2.0
0 -1.0
6 2.0
2 3.0
7 4.0
4 NaN
dtype: float64
>>> np.sort(a)
array([-5., -4., -2., -1., 2., 3., 4., nan])
>>> sorted(a) # but Python's sort looks like this:
[-5.0, -4.0, -2.0, -1.0, 3.0, nan, 2.0, 4.0]
The status of this by the core developers of CPython as of 2019 was either "won't fix" or "not a bug" (link). The main reason being that the IEEE has some odd definitions for NaN
. However, as recently as two weeks ago this has been brought up again, and some developers are considering "biting the bullet" and changing this behaviour (link). It remains to be seen if and when this changes, probably not that soon.
And for good measure, to further show that this is a wtfpython, some other languages handle this more intuitively:
JavaScript handles it well (click to expand)
>> Math.max(1, NaN)
NaN
>> Math.max(NaN, 1)
NaN
>> [3, 2, 1, NaN, 1, 2, 3].sort()
[ 1, 1, 2, 2, 3, 3, NaN ]
Java handles it well too (click to expand)
import java.util.Arrays;
public class Main
{
public static void main(String[] args) {
System.out.println(Math.max(1, Float.NaN));
// prints: NaN
System.out.println(Math.max(Float.NaN, 1));
// prints: NaN
float[] nums = new float[]{3, 2, 1, Float.NaN, 1, 2, 3};
Arrays.sort(nums);
System.out.println(Arrays.toString(nums));
// prints: [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, NaN]
}
}
Rust avoids the issue somewhat (click to expand)
fn main() {
let nan = std::f64::NAN;
println!("{} {}", f64::max(1.0, nan), f64::max(nan, 1.0));
// prints 1 1
let vector = vec![3., 2., 1., nan, 1., 2., 3.];
// vector.sort();
// The above line would not compile, since f64 does not implement
// std::cmp::Ord, so it is not sortable.
// This avoids the issue, but when people do want to sort such a list
// they look for a solution online, finding a thread, which suggests
// a custom comparator for f64 in the first comment. Then the second
// comment points out that this does not work for NaN values. Then
// the third comment shows a longer solution which does work for
// NaN values...
// For example:
// https://users.rust-lang.org/t/sorting-vector-of-vectors-of-f64/16264/2
}
C++ has the issue as well though (click to expand)
#include <iostream> // cout
#include <vector> // vector
#include <algorithm> // sort()
#include <cmath> // sqrt()
using namespace std;
int main() {
cout << sqrt(-1) << endl;
// prints: -nan
cout << max(1.0, sqrt(-1)) << " " << max(sqrt(-1), 1.0) << endl;
// prints: 1 -nan
vector<double> nums = {3., 2., 1., sqrt(-1), 1., 2., 3.};
sort(nums.begin(), nums.end());
for (auto i : nums)
cout << i << " ";
cout << endl;
// prints: 1 2 3 -nan 1 2 3
}
As far as I can tell there are no snippets and no issues/PRs discussing this. If you think this is fitting as a snippet I'd be happy to make a PR and add it (possibly after the nan-reflexivity
snippet).
Hi @LiquidFun thanks for the insightful write-up, I agree that this will be a good addition to the collection.
Feel free to create a PR whenever you get time, I'm looking forward to it :) And yes, it makes sense to add it right after the nan-reflexivity example.
To be fair same behavior is true for other languages because NaN breaks weak ordering axioms which are mandated by sort routines.