ROBUSTNESS: Should comparisons with calls be an error?

Open lionel- opened this issue 6 years ago • 6 comments

These comparisons lead to surprising results when computing-on-the-language has gone wrong (e.g. forgot to unquote a call). And they are dependent on the deparser formatting anyway:

"~f" == ~f
#> [1] TRUE

"~ f" == ~f
#> [1] FALSE

It would probably make sense to disallow quote(foo) == "foo" as well.
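For reference, a small sketch of the symbol case. It assumes only base R; `sym` is just an illustrative name. `identical()` is shown as one way to compare without the implicit conversion:

```r
sym <- quote(foo)

# The symbol is silently compared by its name
sym == "foo"                     # TRUE

# identical() does not coerce: a symbol is not a character vector
identical(sym, "foo")            # FALSE
identical(sym, as.name("foo"))   # TRUE
```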

lionel- • Apr 05 '18 10:04

I honestly feel it would be better to throw an error here as well, but looking at the code, the behavior is very intentional (there is even a comment): these things are deparsed into strings and then hit the string relop code path.

See https://github.com/wch/r-source/blob/a0d62ef933967e26b070f90c42cf3e1595654992/src/main/relop.c#L138-L155
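At the R level, that code path can be approximated roughly as follows. This is an emulation under the assumption that `deparse()` gives the same text as the internal deparser, not a transcription of the C code; `emulate_eq` is a made-up name:

```r
# Hypothetical emulation: a call operand is replaced by the first line
# of its deparsed text, then the ordinary string comparison runs.
emulate_eq <- function(lhs, rhs) {
  if (is.call(lhs)) lhs <- deparse(lhs)[1]
  if (is.call(rhs)) rhs <- deparse(rhs)[1]
  lhs == rhs
}

expr <- quote(x + y)
deparse(expr)               # "x + y"
emulate_eq(expr, "x + y")   # TRUE
emulate_eq(expr, "x+y")     # FALSE: same code, different whitespace
```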

I think the larger issue here is that R does "cascading conversions" when the two operands are not of the same class (probably SEXP type, actually...). That's just the general rule. See:

> "1" == 1
[1] TRUE
> " 1" == 1
[1] FALSE

The behavior of expressions doesn't seem much less correct than the above.
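Spelled out a bit more, as a sketch of why the two results differ (base R only; `num` is an illustrative name): the number is converted to a string and the comparison is then done on strings.

```r
num <- 1

as.character(num)          # "1": the string that takes part in the comparison
"1" == num                 # TRUE
" 1" == num                # FALSE: the leading space is part of the string
"1.0" == num               # FALSE: "1.0" is not the string "1"

# Converting explicitly in the other direction compares numbers instead
as.numeric("1.0") == num   # TRUE
```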

gmbecker • Apr 10 '18 22:04

The subtyping relation in R is kind of fuzzy. It makes sense to me that any type could be a subtype of character when character is the target domain (e.g. when you paste() stuff). However, when the domain is unspecified, as with c() or the binary operators, I think it should use much safer implicit coercions. Automatic deparsing should only happen for explicit coercions.

We are currently exploring these issues to find better semantics for our library stack (implicit coercions come up all the time in dplyr and purrr because of split-apply-combine workflows).

lionel- • Apr 11 '18 10:04

I'm not sure what you mean by fuzzy. It is pretty well defined in my mind. The fundamental contract for, e.g., c() is that it tries to make everything the same type, if possible without loss of information. The vast majority of things (including all the atomic types, and symbols/expressions) can be converted to character without loss of information, so if it can't find a more precise conversion (e.g., logical to numeric) that is what it does.

I'm not saying it's always what the user intends, but it is well defined and predictable behavior. It's also pretty core to the design of R as I understand it (much like the recycling rule, which also involves gotchas for unwary users).

For binary operators it's tempting to say you don't want much coercion, but then there are cases like character == factor and integer == numeric where you really do.
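Two quick illustrations of those cases, as a sketch in base R (`fct` is an illustrative name):

```r
fct <- factor(c("a", "b"))

# character == factor: compared through the factor's labels
fct == "a"         # TRUE FALSE

# integer == numeric: the integer is promoted to double
1L == 1            # TRUE

# without coercion the two are different objects
identical(1L, 1)   # FALSE
```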

gmbecker • Apr 11 '18 18:04

> The fundamental contract for, e.g., c() is that it tries to make everything the same type

That sounds fuzzy to me. And in practice it is (e.g. factor("foo") == "foo" works but not c(factor("foo"), "foo")).
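The c() side of that contrast, as a sketch (base R; `fct` and `combined` are illustrative names). When a factor is mixed with a character vector, the factor contributes its integer codes rather than its labels:

```r
fct <- factor("foo")

combined <- c(fct, "foo")
combined                      # "1" "foo"

# Converting explicitly first keeps the label
c(as.character(fct), "foo")   # "foo" "foo"
```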

Logical :> integer :> double makes sense, but numeric :> character less so. And this relation isn't enforced everywhere (though consistently enough for base operators and functions).

This can be explained by R's design goal of trading strictness for flexible interactive usage. Still, I don't think forcing users to explicitly convert to character would be too painful in practice. Implicit coercion to character should at least emit a warning (including for factor to character, because we lose the levels information).
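A minimal sketch of what such stricter semantics could look like in user code. `strict_eq` is a hypothetical helper written for illustration, not part of base R or any package, and the exact rules are an assumption:

```r
# Hypothetical helper: errors for symbols and calls, warns when exactly
# one side is a character vector (so the other would be coerced to it).
strict_eq <- function(x, y) {
  if (is.language(x) || is.language(y)) {
    stop("comparison with a symbol or call is not allowed", call. = FALSE)
  }
  if (is.character(x) != is.character(y)) {
    warning("implicit coercion to character", call. = FALSE)
  }
  x == y
}

strict_eq(1L, 1)                                    # TRUE, silently
suppressWarnings(strict_eq(factor("foo"), "foo"))   # TRUE, warns if not suppressed

err <- tryCatch(
  strict_eq(quote(foo), "foo"),
  error = function(e) conditionMessage(e)
)
err   # "comparison with a symbol or call is not allowed"
```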

> tries to make everything the same type, if possible without loss of information

For coercion of vectors to be assembled, lists might be a better way of achieving this goal, especially for classed vectors.

lionel- • Apr 11 '18 20:04

> > The fundamental contract for, e.g., c() is that it tries to make everything the same type
>
> That sounds fuzzy to me.

I mean, the hierarchy of possible conversions between atomic types is well defined: logical > integer > numeric > (complex) > character. You give me any two things whose class falls on that ladder and I expect I'd be able to tell you what the result would be coming out of c(). Other things (e.g. expressions) go straight to character as the catchall that everything can be converted to. Factors go to characters for comparisons as that is the natural thing you'd compare them to other than another factor. The reason factors don't work in c() is that c() removes attributes before combining.
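The ladder for atomic types can be checked directly. A sketch in base R; `tagged` and its `unit` attribute are made up for the illustration:

```r
# Each combination lands on the higher rung of the ladder
typeof(c(TRUE, 1L))    # "integer"
typeof(c(1L, 1.5))     # "double"
typeof(c(1.5, 2i))     # "complex"
typeof(c(2i, "a"))     # "character"

# c() drops attributes other than names
tagged <- structure(1:2, unit = "cm")
attributes(c(tagged))  # NULL
```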

Honestly, converting an expression to a character doesn't seem super unnatural. Expressions are code, and code can be naturally represented as text.

> For coercion of vectors to be assembled, lists might be a better way of achieving this goal, especially for classed vectors.

I don't understand what you mean by this. Classed vectors (i.e. not atomic vectors) should probably have their own methods for doing these things, methods that are semantically correct for that class.

gmbecker • Apr 11 '18 21:04

I know they are defined and not decided by an RNG. I guess we'll have to disagree about how natural expr -> chr feels. It's helpful to be able to compare 1 and "1" but I'd say it's more magical than natural. That's how I felt about comparisons when I started learning R anyway. And "~foo" == ~foo is just unsound semantics to me. This makes code depend on the formatting of the R deparser :(

> Classed vectors (ie not atomic vectors) should probably have their own methods for doing these things

Of course but there should be a reasonable fallback. Stripping attributes and possibly coercing to character is lossy and likely not helpful (well then we see people using c() to strip attributes so it's helpful in a way but I don't think that's a good use of c()).

The hard part is to get an extensible coercion scheme for variadic combinations. The variadic genericity in c() is probably unfixable because of backward compatibility though so that will need to happen elsewhere. I think ad hoc polymorphism (à la multi-methods) wouldn't help much here. We are going to experiment with a reducing approach.
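To make "reducing" concrete, here is one possible reading: a strict pairwise rule folded over the inputs with `Reduce()`. This is a sketch under that assumption, not the actual design; `common_type2` and `common_type` are hypothetical names:

```r
# Hypothetical pairwise rule: only logical -> integer -> double is implicit.
common_type2 <- function(x, y) {
  ladder <- c("logical", "integer", "double")
  if (identical(x, y)) {
    return(x)
  }
  if (x %in% ladder && y %in% ladder) {
    return(ladder[max(match(x, ladder), match(y, ladder))])
  }
  stop("no implicit common type for <", x, "> and <", y, ">", call. = FALSE)
}

# Variadic version: fold the pairwise rule over the types of the inputs.
common_type <- function(...) {
  types <- vapply(list(...), typeof, character(1))
  Reduce(common_type2, types)
}

common_type(TRUE, 1L, 2.5)   # "double"
common_type("a", "b")        # "character"

msg <- tryCatch(common_type(1, "a"), error = function(e) conditionMessage(e))
msg   # "no implicit common type for <double> and <character>"
```

Extending the scheme then only requires teaching the pairwise rule about a new pair of types; the variadic part stays the same.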

lionel- • Apr 12 '18 05:04