facets icon indicating copy to clipboard operation
facets copied to clipboard

Add Array#remove

Open TylerRick opened this issue 3 years ago • 8 comments

Motivation

There are many cases when

  1. "duplicate" elements in an array are allowed and
  2. both the number and order of those elements is important (for example, 'racecar'.chars). (If they weren't important, why aren't you using Set?)

So why isn't there an easy way to remove an element from such an array in a way that respects both the order and number (count) of each element? Why do all methods for removing elements from an array assume that you always want to remove all matching elements from the receiver, with no option to do otherwise?

I could not find any way to do this with the standard library. (Let me know if I'm missing something.) You would think this would be possible with Arrays, which do allow you to have duplicate values (as opposed to Sets, which do not).

One might even think (as I did until I tried it) that this would be the behavior that the built-in Array#- would give us, but alas it is not.

This is similar to Array#- and Array#difference, except that instead of removing all matches, it only removes as many occurrences as you actually ask it to subtract:

   > [1, 1, 2].subtract [1]
  => [1, 2]

    > [1, 1, 2].subtract [1, 1]
  => [2]

   > [1, 1, 2] - [1]
  => [2]

   > [1, 1, 2].difference [1]
  => [2]

Name

I'm not sure of the best name for this, but for now I'm calling it subtract since it's a nice and simple name.

I would prefer to call it Array#- but that operator is obviously taken by the language itself.

subtract?

The biggest downside I can think of for subtract is that one might assume that it is completely synonymous and identical to Array#- operator. Just like Set#- is identical to Set#subtract (technically dup.subtract). (I was going to point to String#- as an example (from string/remove.rb / string/op_sub.rb, but I guess that is aliased as remove rather than subtract.)

The other downside is that Array#subtract would be inconsistent with Set#subtract in that Set#subtract modifies the receiver but Array#subtract does not (does a dup). I would like to have both an in-place modification version and a dup version, so I was thinking subtract! and subtract. But that's not consistent with the analogs in Set: subtract (in-place modification) and - (dup). I guess that's further evidence that users should rightfully expect subtract to have the same semantics as - (the only allowed difference being in-place modification).

So as much as I like the name subtract, I think I may have reluctantly talked myself out of that name, since I value consistency of semantics with built-in method names more highly.

One thing I like about the name subtract is that it still has the connotation of the arithmetic operation of subtraction, which "represents the operation of removing objects from a collection", and I believe this implementation behaves more like the intuitive understanding of subtraction than the built-in Array#- does. If you have 3 apples (['apple']*3) and you take away 2 of them (['apple']*2), you should be left with 1 apple (['apple']*1), not 0 apples. Yet:

 > ['apple']*3 - ['apple']*2
=> []

Wat. Therefore, I propose subtract:

 > (['apple']*3).subtract(['apple']*2)
=> ["apple"]

Doesn't read as nicely as an operator would, but what alternative do we have? The only obvious operator, -, is taken. ... Unless we could abuse -@ somehow to our benefit?

Comparison table

A table might be helpful...

in-place modification returning a new object
Set: operation to remove all occurrences of a single element/condition delete, delete_if reject
Set: operation to remove all occurrences of every element in object subtract -
Array: operation to remove all occurrences of a single element/condition delete, delete_at, delete_if reject
Array: operation to remove all occurrences of every element in object delete_values (Facets) -, difference, without (ActiveSupport)
Array: operation to remove only first/n occurrences of a single element/condition delete_first (proposed) (none)
Array: operation to remove only first/n occurrences of every element in object delete_first_each! (proposed) delete_first_each (proposed)

I guess the analog to Set#subtract in Array is delete_values (Facets).

So I propose adding subtract as alias to delete_values and a new method, perhaps subtract_preserving_duplicates for the new operation I am proposing.

The semantics of generic operations like - are open to interpretation

The problem with generic-sounding methods like subtract is that everyone may have their own assumptions about what it will/should do. Hash#- is a good example of this. But unlike Hash#-, which probably has dozens of ways it could behave, I can only think of 2 main plausible variations of Array#-: either remove all matches or only one of each el in other_array.

Other names considered

  • subtract_first_occurrence

    • Con: Too long. Unclear.
  • subtract_first/ remove_first

    • Con: Unclear.
  • subtract_once / delete_each_once / remove_once

    • Con: The "once" part is unclear.
  • delete_only / remove_only / subtract_only (to contrast with the implied all (delete_all, subtract_all) implied by delete, -, etc.)

    • Con: It's hard to make clear that we're referring to elements of the other_array argument, and not the other_array argument itself. We would almost have to change the method signature to delete_only(*elements) instead of accepting an array/enumerable, in order for this name to make sense.
  • elementwise_delete_once

    • Makes it clear that we're deleting elements of object, but ... too long.
  • subtract_respecting_counts

    • Con: Too long. A bit ambiguous.
  • subtract_preserving_duplicates / difference_preserving_duplicates

    • Pro: One of the clearest names.
    • Con: Too long.
  • subtract_occurrences

  • subtract_n (as in, "only n occurrences of each element in other")

  • subtract_allowing_for_dups / diff_allowing_for_dups

  • delete_each_from

    • Con: Implies that it uses delete, but delete removes all copies of el, so this name is ambiguous)
  • delete_each_first

    • I like the "each" part, but implies delete semantics...
  • delete_first_each

    • If we move the "each" after delete_first, then it may make it clearer that it uses the delete_first operation for each, rather than delete for each.
  • remove

    • Pro: Short. Doesn't imply equivalence to - like subtract does.
    • Pro: Analogous to String#remove
    • Con: May sound like it's going to remove them in-place (like delete does). But as a counterexample, String#remove doesn't modify the receiver (there is a #remove! variant that does).
    • Con: If we're intentionally being analogous to String#remove, then that suffers from the same problem that the subtract <=> - analogy suffers from: String#remove removes all occurrences (gsub).
  • Looking at other synonyms for "subtract"...

  • whittle

  • downsize

Perhaps you can think of a better name?

How can we make the semantics clear without being wordy?

TylerRick avatar Jul 09 '20 02:07 TylerRick

As an aside: The docs for Array#- say:

If you need set-like behavior, see the library class Set.

but I don't understand what it means by that. Do you?

Array#- (just like Array#|, #delete) already has rather set-like behavior: it removes all copies of any element in self that appear in other (almost as if you had called to_set first to get rid of duplicate items, except that it doesn't de-dupe any copies of elements that you aren't deleting).

TylerRick avatar Jul 09 '20 02:07 TylerRick

Crazy idea: abuse -@ to give different semantics for subtraction?

It looks like the Array#-@ operator is not currently used for anything. Would it be a terrible idea to add a -@ operator that returns a Subtrahend object and override the built-in + to accept this Subtrahend object (delegating anything else to super (the existing, built-in behavior)?

Then we could write:

['apple']*3 + -['apple']*2
#=> ["apple"]

Is that any better? It's obviously still not as nice as ['apple']*3 - ['apple']*2.

And subtrahend is technically the number being subtracted, so semantically it should be passed to the - operator... but that wouldn't read correctly because 2 - make a +:

['apple']*3 - -['apple']*2

(if it existed) would be expected to be equivalent to:

['apple']*3 + ['apple']*2

How about a suffix/flag on the array to indicate count should be preserved?

Since Ruby's default seems to be to de-duplicate during operations such as Array#- and Array#difference and (obviously, because it's "set union") Array#|, maybe we need a flag to be able to explicitly "opt out" of that behavior.

How about something like...

(['apple']*3).preserve_duplicates - (['apple']*2)

?

Maybe that would wrap the array in something like a ArrayPreservingDuplicates object... which seems redundant since arrays already preserve duplicate elements (at rest) — just not through - operation.

If we did that, would the .preserve_duplicates suffix go on the minuend or on the subtrahend? Or maybe putting it on either operand would be enough to effect this behavior...

TylerRick avatar Jul 09 '20 18:07 TylerRick

As an aside: The docs for Array#- say:

If you need set-like behavior, see the library class Set.

but I don't understand what it means by that. Do you?

It probably means or refers on how is implemented Set#-

# File lib/set.rb, line 462
def -(enum)
  dup.subtract(enum)
end

# File lib/set.rb, line 441
def subtract(enum)
  do_with_enum(enum) { |o| delete(o) }
  self
end

Looks familiar?

esparta avatar Jul 09 '20 19:07 esparta

Thanks for the reply, @esparta!

It probably means or refers on how is implemented Set#-

# File lib/set.rb, line 462
def -(enum)
  dup.subtract(enum)
end

# File lib/set.rb, line 441
def subtract(enum)
  do_with_enum(enum) { |o| delete(o) }
  self
end

Looks familiar?

Ah, okay. I admit, I didn't look closely enough at Set to even notice that it already had a subtract method. That seems to put the final nail in the coffin of the idea of using subtract for this new operation. If anything, Array#subtract should be an alias for Array#delete_values. I've modified my proposal to reflect this.

Haha, I guess it looks familiar in the sense that it looks remarkably like the implementation I came up with for Array#subtract! (not sure if that's what you were alluding to, or something else):

  def subtract!(other)
    other.each do |el|
      delete_at(index(el))
    end
    self
  end

So to clarify, what I find confusing about them mentioning "If you need set-like behavior, see the library class Set" in the documentation for Array#- is that that statement is generally true, and makes sense as a note in the Array class in general — but as a note specific to Array#-, it's a bit misleading because Array#- already has rather set-like behavior. Array#- (and friends) are kind of half-way array-like (in that it leaves intact duplicate occurrences of elements that you aren't deleting) and half-way set-like (in that it deletes duplicates of elements that you are deleting) in their behavior.

If example in Array#- had been this:

   [ 1, 1, 2, 2, 3, 3, 4, 5 ] - [ 1, 2, 4 ]  #=>  [ 1, 2, 3, 3, 5 ]

instead of this:

   [ 1, 1, 2, 2, 3, 3, 4, 5 ] - [ 1, 2, 4 ]  #=>  [ 3, 3, 5 ]

, then I could see making a call-out to Set as providing contrasting behavior.

But since their example was:

[ 1, 1, 2, 2, 3, 3, 4, 5 ] - [ 1, 2, 4 ]
#=>  [ 3, 3, 5 ]

, the behavior this would be contrasted with in Set would be:

[ 1, 1, 2, 2, 3, 3, 4, 5 ].to_set - [ 1, 2, 4 ].to_set
# => #<Set: {3, 5}>

But that result is remarkably similar! The only difference is that Set#- results in the duplicate occurrences of 3 being collapsed into just 1 occurrence. The other duplicate occurrences are treated the same in both cases.

But that difference in behavior seems due more to the inherent differences in the data structures in general (Array allows duplicate elements at rest, Set doesn't) than a result of a difference in behavior between Array#- and Set#- specifically. Right?

Maybe it's just me, but the rb_ary_includes_by_eql in the implementation of Array#- already seems set-like to me:

    if (RARRAY_LEN(ary1) <= SMALL_ARRAY_LEN || RARRAY_LEN(ary2) <= SMALL_ARRAY_LEN) {
	for (i=0; i<RARRAY_LEN(ary1); i++) {
	    VALUE elt = rb_ary_elt(ary1, i);
	    if (rb_ary_includes_by_eql(ary2, elt)) continue;
	    rb_ary_push(ary3, elt);
	}
	return ary3;
    }

(Sorry this went long...)

TylerRick avatar Jul 09 '20 20:07 TylerRick

Haha, I guess it looks familiar in the sense that it looks remarkably like the implementation I came up with for Array#subtract! (not sure if that's what you were alluding to, or something else

:) that's what I meant. When I did lookup for the implementation of Set#- I saw a remarkable similar implementation.

So to clarify, what I find confusing about them mentioning "If you need set-like behavior, see the library class Set" in the documentation for Array#- is that that statement is generally true, and makes sense as a note in the Array class in general — but as a note specific to Array#-, it's a bit misleading because Array#- already has rather set-like behavior.

I know what you meant, and if it helps it confused me too. It needs to be changed somehow in order to either be clear enough or remove it if doesn't make sense in the current behavior. There's a on going process of re-write the documentation on ruby core, coincidentally it's nowadays on array.c, at some point it will be cover, check Burdette Lamar PRs: https://github.com/ruby/ruby/pulls?q=is%3Apr+author%3ABurdetteLamar

About the Set behavior...

[ 1, 1, 2, 2, 3, 3, 4, 5 ].to_set - [ 1, 2, 4 ].to_set
# => #<Set: {3, 5}>

But that result is remarkably similar! The only difference is that Set#- results in the duplicate occurrences of 3 being collapsed into just 1 occurrence. The other duplicate occurrences are treated the same in both cases.

It's tricky to explain without going too deep into mathematics, but I'd say the gist is: the subtraction operator is performed after applying the conversion to Set, so at the end of this, it's basically the same. Again, is confusing to read.

Maybe it's just me, but the rb_ary_includes_by_eql in the implementation of Array#- already seems set-like to me:

    if (RARRAY_LEN(ary1) <= SMALL_ARRAY_LEN || RARRAY_LEN(ary2) <= SMALL_ARRAY_LEN) {
	for (i=0; i<RARRAY_LEN(ary1); i++) {
	    VALUE elt = rb_ary_elt(ary1, i);
	    if (rb_ary_includes_by_eql(ary2, elt)) continue;
	    rb_ary_push(ary3, elt);
	}
	return ary3;
    }

IMO, it's kind of different, on both array's branches (for smallers < 16 or big > 16) is the same: add to C if A[i] is not in B; for Set#- and your proposal is Delete on A[i] if B[n]

esparta avatar Jul 09 '20 22:07 esparta

I know what you meant, and if it helps it confused me too.

Thanks, glad I'm not the only one :smile:

It needs to be changed somehow in order to either be clear enough or remove it if doesn't make sense in the current behavior. There's a on going process of re-write the documentation on ruby core, coincidentally it's nowadays on array.c, at some point it will be cover, check Burdette Lamar PRs: https://github.com/ruby/ruby/pulls?q=is%3Apr+author%3ABurdetteLamar

Wow, Burdette Lamar is doing some excellent work there! Thanks for pointing that out. I hope this part of the docs can be fixed too as part of that effort.


It's tricky to explain without going too deep into mathematics, but I'd say the gist is: the subtraction operator is performed after applying the conversion to Set, so at the end of this, it's basically the same. Again, is confusing to read.

Not sure if I quite get your point here, but at least we agree that it is confusing. :smile:

When you say the - operator is performed after converting to Set, I assume you're referring to when we explicitly cast to_set and use Set#-?:

[ 1, 1, 2, 2, 4, 5 ].to_set - [ 1, 2, 4 ].to_set
# => #<Set: {5}>

So sure, it makes sense to give that result when you explicitly convert the objects to Sets but it doesn't make as much sense that Array#- (which doesn't actually convert anything to a Set) gives an identical (in this example), set-like result:

[ 1, 1, 2, 2, 4, 5 ] - [ 1, 2, 4 ]
#=>  [ 5 ]

I think @olivierlacan said it best when he said:

But it is not a set. It’s an array. If I wanted a set, I would to_set.

and:

One may expect Array#- to behave like mathematical subtraction or difference when it doesn't. One could be forgiven to expect the following behavior:

[1,1,2,2,3,3,4,4] - [1,2,3,4]
=> [1,2,3,4]

TylerRick avatar Jul 13 '20 19:07 TylerRick

It occurred to me that this would be easier to think about and explain if we actually split this into 2 operations, which I'm tentatively calling:

  • delete_first: does just what it sounds like: it deletes the given first occurrence of the element from the receiver (destructively).
    • This name intentionally starts with delete_ to make it very clear that it behave similarly to delete. And the _first part makes it clear in which way it is different from delete
  • delete_first_each:
    • This name tries to make it clear that it uses the delete_first operation (rather than delete as Set#subtract does) for each element in other.
    • The _each part makes it clear that it does this operation for each element of the passed enumerable, similar to reverse_each, each_index, each_cons, each_slice. (Could call it each_delete_first but that doesn't read as well in English: as in, "delete each element..." but replacing the delete operation with delete_first.)

This would fill in 3 holes in this table...

in-place modification returning a new object
Set: operation to remove all occurrences of a single element/condition delete, delete_if reject
Set: operation to remove all occurrences of every element in object subtract -
Array: operation to remove all occurrences of a single element/condition delete, delete_at, delete_if reject
Array: operation to remove all occurrences of every element in object delete_values (Facets) -, difference, without (ActiveSupport)
Array: operation to remove only first/n occurrences of a single element/condition delete_first (proposed) (none)
Array: operation to remove only first/n occurrences of every element in object delete_first_each! (proposed) delete_first_each (proposed)

I don't love the name delete_first_each, but it's the clearest name I've come up with so far. Can anyone come up with a better name?

Downsides of name delete_first_each:

  • It seems strange to name a method each_ when intuitively it acts more like an operation on 2 arrays (like "subtract"). This name is maybe too much describing how its implementation details than describing the operation at a higher level.
  • Lacks any connection to the mathematical subtraction operation.

This operation really should be the operation that bears the name -/difference since it is the truest to the mathematical subtraction operation, but since those names are already taken by existing methods in Ruby (which as Olivier pointed out would have been more appropriately named Array#set_difference maybe or even Array#subtract_all), we have no choice but too pick a less-ideal name for this one. (Unless Matz is willing to revert the new Array#difference and accept https://bugs.ruby-lang.org/issues/11815 instead, but that is probably unlikely.)

So, is delete_first_each better than subtract?

I still like the name subtract better. It's just that it's not consistent with Set#subtract (the destructive version of Set#-) and consistency is probably a more important concern...

TylerRick avatar Jul 13 '20 19:07 TylerRick

How about Array#remove? That is the name Cary Swoveland proposed in https://bugs.ruby-lang.org/issues/11815 after his original Array#difference name became no longer available.

And thinking about that option more, it doesn't seem as objectionable to me as it once did:

  • Since the goal is to ultimately get it accepted in Ruby core, it doesn't really matter if it matches the semantics of anything in Facets, such as String#remove. (We might even want to rename String#remove to remove_all.) It only matters whether it is consistent with or conflicts with what is in Ruby core.
  • remove may sound like it removes the elements destructively like delete does — but not necessarily. In fact, the fact that the name doesn't contain the word "delete" (which always means it modifies the receiver in place in Array, Hash, and Set) may be enough to clarify that it doesn't have the same destructive semantics that delete has. This allows us to have both remove and remove! variants.

TylerRick avatar Jul 14 '20 00:07 TylerRick