
apply coerces to matrix, inane design decision

Open ifly6 opened this issue 6 years ago • 3 comments

Let's be honest. Apply is just broken for data frames. Defending it by saying that the user just doesn't understand the language, that the language is just fine, and the function is functioning correctly is like saying that your toolbox of misshapen tools where the hammer is just the curved end on both sides is 'just fine'.

The 'correct' way to do this in R apparently is just to write out a for loop. Fortunately for you, you can't just make a for loop iterate over rows, like for row in df.iterrows() in Pandas, you have to explicitly index them.

And fortunately for you, you can't just make a range like 1:nrow(df) (also, who made the stupid choice to call it nrow when nrows makes more sense, there being more than one row...) because if nrow(df) == 0 then it returns the sequence (1, 0), which breaks your code when you try to run it. R is just built for robustness!
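To make the complaint concrete, here is a minimal sketch of the empty-frame case. The data frame df and the counter iterations are made up for illustration only.

```r
# Hypothetical empty data frame, used only to show the edge case.
df <- data.frame(x = numeric(0))

nrow(df)     # 0
1:nrow(df)   # 1 0  (the colon operator counts down when the end is below the start)

# The loop body therefore runs twice, even though there are no rows.
iterations <- 0
for (i in 1:nrow(df)) {
    iterations <- iterations + 1
}
iterations   # 2
```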

But if you're doing lots of manipulation with lists, so you're familiar with sapply, you can probably fix that issue by using apply with the proper functions, right? Wrong.

a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
b = c('a', 'b', 'c', 'de', 'f', 'g')
c = c(1, 2, 3, 4, 5, 6)
d = c(0, 0, 0, 0, 0, 1)

wtf = data.frame(a, b, c, d)
wtf$huh = apply(wtf, 1, function(row) {
    if (row['a'] == T) { return('we win') }
    if (row['c'] < 5) { return('hooray') }
    if (row['d'] == 1) { return('a thing') }
    return('huh?')
})

You get this. Because R inexplicably decides that the best way to deal with data frames is to turn them all into data matrices first. So, here, the a column turns into ' TRUE' and 'FALSE'. Silently. Fantastic behaviour.

> wtf
      a  b c d     huh
1  TRUE  a 1 0  hooray
2 FALSE  b 2 0  hooray
3  TRUE  c 3 0  hooray
4 FALSE de 4 0  hooray
5  TRUE  f 5 0    huh?
6  TRUE  g 6 1 a thing
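A rough sketch of where the coercion comes from, and of one way to sidestep it in base R. The helper name classify_row is an assumption made for this example; it is not from the original post. The idea is to hand the columns to mapply as separate vectors, so that each argument keeps its own type and no matrix is ever built.

```r
wtf <- data.frame(
    a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE),
    b = c('a', 'b', 'c', 'de', 'f', 'g'),
    c = c(1, 2, 3, 4, 5, 6),
    d = c(0, 0, 0, 0, 0, 1)
)

# apply() converts its input with as.matrix(). A matrix holds one type only,
# so a single text column turns every cell into a string.
typeof(as.matrix(wtf))   # "character"

# classify_row is a hypothetical helper: it receives one value per column,
# each still in its original type (logical, numeric, numeric).
classify_row <- function(a, c, d) {
    if (a) return('we win')
    if (c < 5) return('hooray')
    if (d == 1) return('a thing')
    'huh?'
}

# mapply() walks the three columns in parallel, one row's values at a time.
wtf$huh <- mapply(classify_row, wtf$a, wtf$c, wtf$d)
wtf$huh   # "we win" "hooray" "we win" "hooray" "we win" "we win"
```

The trade-off is that the columns have to be named explicitly in the call, which is more typing than apply(wtf, 1, ...) but leaves no room for silent type changes.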

But in a reasonable and sensibly constructed system like Pandas, you can run the exact same thing, like this:

import pandas as pd
df = pd.DataFrame({
    'a': [True, False, True, False, True, True],
    'b': ['a', 'b', 'c', 'de', 'f', 'g'],
    'c': [1, 2, 3, 4, 5, 6],
    'd': [0, 0, 0, 0, 0, 1]
})
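def funct(row):
    # row is a Series holding the values of one row, keyed by column name
    if row['a']: return 'we win'
    if row['c'] < 5: return 'horray'
    if row['d'] == 1: return 'a thing'  # == compares values; `is` would test object identity
    return 'huh?'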

df['huh'] = df.apply(funct, axis=1)
print(df)

And get reasonable answers like these that follow. Look what is possible when you don't make stupid design decisions!

       a   b  c  d     huh
0   True   a  1  0  we win
1  False   b  2  0  horray
2   True   c  3  0  we win
3  False  de  4  0  horray
4   True   f  5  0  we win
5   True   g  6  1  we win

ifly6, Apr 25 '18 18:04

At university, I learned that one type of programming language specification is to simply take an implementation of the programming language and define that as the specification. That was mostly done as a thought experiment, before moving on to the actual serious definitions, because it would lead to the insane consequence that it is actually impossible for there to be bugs in the reference implementation, since any behavior is per definition in accordance with the specification!

From a thread discussing why PHP has a left-associative ternary operator for inconceivable reasons.

Given that the response to raising this issue on the R forums was 'this is correct behaviour', I guess we shouldn't complain about anything. There are no bugs.

ifly6, May 03 '18 15:05

library(plyr)

a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
b = c('a', 'b', 'c', 'de', 'f', 'g')
c = c(1, 2, 3, 4, 5, 6)
d = c(0, 0, 0, 0, 0, 1)

wtf = data.frame(a, b, c, d)

foo.huh <- function(row) {
    if (row['a'] == T) { return('we win') }
    if (row['c'] < 5) { return('hooray') }
    if (row['d'] == 1) { return('a thing') }
    return('huh?')
}


plyr::adply(wtf, 1, .fun = foo.huh, .expand = TRUE, .id = NULL)
#>       a  b c d     V1
#> 1  TRUE  a 1 0 we win
#> 2 FALSE  b 2 0 hooray
#> 3  TRUE  c 3 0 we win
#> 4 FALSE de 4 0 hooray
#> 5  TRUE  f 5 0 we win
#> 6  TRUE  g 6 1 we win

Created on 2018-07-06 by the reprex package (v0.2.0).

Eluvias, Jul 06 '18 10:07

Running ifly6's code in R, I got the same result that was offered as the "more correct" result in Python (and that Eluvias then also obtained via the plyr construction).

This whole rant seems to ignore the fact that apply is designed (and documented as such) to be used for matrices. It's not "broken" for dataframes; it's just the wrong tool for dataframes. There are a bunch of other reasons NOT to use apply for dataframes, such as the coercion of each row to the "lowest common denominator" data type, so factors become, not character, but rather integers. (You didn't use the "b" column, but it would have been a factor unless your Rprofile.site sets the stringsAsFactors option to FALSE.) Furthermore, the R code used row['a'] == T while the Python code used just the value of the logical itself. The same would have been correct in R. Unnecessarily testing for equality to TRUE is a common, error-prone practice.
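A small sketch of why the comparison with TRUE is fragile once a value has become text. The variable cell is a made-up stand-in for one element of a row after coercion; whether a given R version actually pads the value this way is not something this example asserts.

```r
# cell is a hypothetical value standing in for one element of a row that
# has already been turned into text.
cell <- " TRUE"

"TRUE" == TRUE   # TRUE: the logical is converted to the string "TRUE" first
cell == TRUE     # FALSE: the padded string does not match, and no warning is raised

# With a real logical value there is nothing to compare against.
flag <- TRUE
result <- if (flag) "we win" else "huh?"
result           # "we win"
```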

And the correct way to create a range that iterates over a sequence like rownames(df) is: seq_along(rownames(df)). And that is precisely because of the potential error mechanism you point out for zero length vectors.
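For illustration, a short sketch of that idiom next to the closely related seq_len(nrow(df)). The data frames empty_df and small_df are invented for the example.

```r
# Hypothetical data frames: one with no rows, one with three.
empty_df <- data.frame(x = numeric(0))
small_df <- data.frame(x = c(10, 20, 30))

# Both forms give a zero-length index for an empty frame, so a loop is skipped.
seq_along(rownames(empty_df))   # integer(0)
seq_len(nrow(empty_df))         # integer(0)

# For a non-empty frame they give the usual 1, 2, ..., n.
total <- 0
for (i in seq_along(rownames(small_df))) {
    total <- total + small_df$x[i]
}
total   # 60
```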

dwinsemius, Oct 26 '19 22:10