Allow `substr` operator to be overloaded

Open leonerd opened this issue 2 years ago • 12 comments

RFC0013 calls for core's substr() operator to support operator overloading, allowing strings to provide overloading that can perform arbitrary behaviours.

Still TODO:

  • [ ] Think about how to preserve the real $pos and $len SVs when invoking overload via LVALUE magic
  • [ ] Think about how nomethod is going to play with this, given the odd shape of its arguments
  • [ ] Documentation
  • [ ] More testing

leonerd avatar Jan 23 '23 16:01 leonerd

I should clarify the comments about nomethod:

nomethod is invoked if we attempt to run an overload method but the class doesn't define it. It's a bit like AUTOLOAD. The calling signature for it is always:

$nomethod->($this, $other, $swap, $methodname);

This works for all the regular unary and binary operators. However, because substr is invoked like sub ($sv, $pos, $len) {...}, there isn't space for those extra args to fit in nomethod.
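To make that four-argument shape concrete, here is a minimal sketch. The class name Demo::Catchall is made up for illustration; it defines only nomethod and reports back what it was handed for a binary operator in both argument orders.

```perl
use strict;
use warnings;

# Demo::Catchall is a hypothetical class, used only for illustration.
package Demo::Catchall {
    use overload nomethod => sub {
        my ($this, $other, $swap, $methodname) = @_;
        # Report what we were given rather than computing anything.
        return [ $methodname, $other, $swap ? 1 : 0 ];
    };
    sub new { return bless {}, shift }
}

my $obj = Demo::Catchall->new;

my $plain   = $obj - 5;    # object on the left: not swapped
my $swapped = 5 - $obj;    # object on the right: swapped

printf "op=%s other=%s swap=%d\n", @$plain;
printf "op=%s other=%s swap=%d\n", @$swapped;
```

The operator name always arrives as the fourth and last argument, which is why both "take the fourth element" and "pop the last element" work today.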

There doesn't seem to be a very satisfactory solution. I can think of the following:

  1. substr() does not invoke nomethod (as current code)
  2. We'd invoke nomethod with the extra pos/len args at the end, as $nomethod->($sv, undef, 0, "substr", $pos, $len)
  3. We'd invoke nomethod with the extra pos/len args in the middle, before the name, as $nomethod->($sv, undef, 0, $pos, $len, "substr")

These options all suck. It's equally likely that code will be doing my ($this, $other, $swap, $methodname) = @_ or my $methodname = pop;, so options 2 and 3 will each break one of those code shapes.
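A rough sketch of that breakage, calling two plain subs directly with the argument lists from options 2 and 3. Nothing here goes through the real overload machinery; the variable names and sample values are illustrative only.

```perl
use strict;
use warnings;

# Two common ways an existing nomethod handler finds the operator name.
my $by_position = sub {
    my ($this, $other, $swap, $methodname) = @_;
    return $methodname;
};
my $by_pop = sub {
    my $methodname = pop;    # pops the last element of @_
    return $methodname;
};

# Sample values, chosen only for illustration.
my ($sv, $pos, $len) = ("some object", 2, 3);

my @option2 = ($sv, undef, 0, "substr", $pos, $len);    # extras at the end
my @option3 = ($sv, undef, 0, $pos, $len, "substr");    # extras in the middle

for my $case ([ "option 2", \@option2 ], [ "option 3", \@option3 ]) {
    my ($label, $args) = @$case;
    printf "%s: by position sees '%s', by pop sees '%s'\n",
        $label, $by_position->(@$args), $by_pop->(@$args);
}
```

Under option 2 the pop-style handler picks up $len instead of the name; under option 3 the positional handler picks up $pos.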

If nomethod had originally been defined with the method name upfront, as ($mname, $self, @args) then it would be neater, but that's not the world we live in.

Boo. This is complicated. Anyone any ideas?

leonerd avatar Jan 23 '23 19:01 leonerd

Maybe support a ref in $methodname, to allow extension of the API.

druud avatar Jan 23 '23 20:01 druud

Isn't this a fundamental issue with using overload? Perl's overload is for unary and binary operators, whereas substr is a function, not an operator.

My understanding of the gist of the RFC is to make substr treat its first argument differently if it is an object, performing a method lookup. This is more like an override of a method rather than an overload of an operator (if I fall back on C++ parlance).

In this case, rather than overloading overload (sorry!), how about creating a new pragma override which would be used to associate a method with a CORE::builtin, e.g.

package My::Package;
use override substr => '_my_substr';

sub _my_substr { ... }

djerius avatar Jan 24 '23 16:01 djerius

Many builtin "functions" are largely indistinguishable from operators, such as readline which is the same operator as <>, and functions such as sin, int, and log, which are all overloadable currently.
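For illustration, a minimal sketch of overloading two of those named functions through the existing overload pragma. The class Demo::Angle and its label field are hypothetical.

```perl
use strict;
use warnings;

# Demo::Angle is a hypothetical class, used only for illustration.
package Demo::Angle {
    use overload
        'sin' => sub { my ($self) = @_; return "sin(" . $self->{label} . ")" },
        'int' => sub { my ($self) = @_; return "int(" . $self->{label} . ")" };

    sub new {
        my ($class, $label) = @_;
        return bless { label => $label }, $class;
    }
}

my $theta = Demo::Angle->new("theta");

# These look like function calls, but dispatch through overload.
my $s = sin($theta);
my $i = int($theta);

print "$s\n$i\n";
```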

Grinnz avatar Jan 24 '23 16:01 Grinnz

Thanks for the clarification; I overlooked overloading those functions.

So we have a bucket (CORE) of functions where some may be treated as operators because they accept one parameter (and are expressly recognized by the overload pragma) and others which cannot because they have more complicated signatures.

Regardless of the historic reason for this, it's a confusing model to present to the user, as an operator is now defined by a call signature rather than by some abstract concept of what an operator is. (I personally would never have guessed that sin is an operator.)

djerius avatar Jan 24 '23 16:01 djerius

You're interpreting a distinction where there isn't one - it is not only these functions but all core functions which may be interpreted as operators; they are not distinguished in the optree.

Grinnz avatar Jan 24 '23 17:01 Grinnz

As they're all operators, and thus should be treated uniformly in terms of overloading, my suggestion of a second pragma dealing with entities with more complex calling signatures breaks that model.

Since the limitations of the existing overload pragma artificially create a distinction due to the number of arguments it can handle, maybe it's time to create a new uniform interface and deprecate overload.

djerius avatar Jan 24 '23 19:01 djerius

@djerius

> Since the limitations of the existing overload pragma artificially create a distinction due to the number of arguments it can handle, maybe it's time to create a new uniform interface and deprecate overload.

Indeed. This very question is being discussed on the mailing list: https://www.nntp.perl.org/group/perl.perl5.porters/2023/01/msg265565.html

leonerd avatar Jan 24 '23 20:01 leonerd

> You're interpreting a distinction where there isn't one - it is not only these functions but all core functions which may be interpreted as operators; they are not distinguished in the optree.

That feels like ignoring a distinction when there is one. The nature of Perl's execution and opcode model means that an opcode can be used to represent either a function or an operator. That doesn't mean they are the same thing at all. Consider that at the CPU level all code ends up being executed by CPU ops, so functions and operators are not distinguished by the CPU either. You could apply your argument to any aspect of a programming language.

I think this is a really slippery slope that has the real potential to lead to extremely inefficient code and weird outcomes. I think @djerius's original suggestion makes more sense than trying to squeeze this functionality into overload, which has significant performance implications for all of our code.

It also opens up a lot of ambiguity. Consider the case of join. Some people think join should be similar to a reduce statement using concatenation. I'd argue that is incorrect: it is already defined as the concatenation of a sequence of stringify operations. And in fact defining it as a sequence of concatenation operations would mean that we have to implement it in an inefficient way.

If we don't define join(",",@array) as a series of concatenation operations, but rather as the concatenation of a series of stringify operations, then we could in theory implement it in such a way that we join half in one thread and half in the other. If we define it as a series of concatenation operations we essentially forbid changing the order we do the join in (at least when it contains an overloaded object). A more practical example might be a smarter join that scans over its arguments, concatenates all the "normal" strings as efficiently as possible using memcpy() or whatnot, and then finally calls the stringify methods of the elements which have overloads. If we make this use concatenation, that option is out. That will either prevent us from writing a more optimal join, or will result in bifurcated code which has to detect that there is an overload, and then fall back to the slower logic when there is. (It wouldn't be the first place we do this.)
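The two readings of join can be sketched side by side. This spells out both models explicitly rather than relying on what the builtin does; the class Demo::Str is hypothetical, and reduce comes from the core List::Util module.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Demo::Str is a hypothetical string-like class, used only for illustration.
package Demo::Str {
    use overload
        '""' => sub { $_[0]{text} },
        '.'  => sub {
            my ($self, $other, $swap) = @_;
            my $o = ref $other ? $other->{text} : $other;
            my $joined = $swap ? $o . $self->{text} : $self->{text} . $o;
            return Demo::Str->new($joined);    # concatenation keeps the class
        };

    sub new {
        my ($class, $text) = @_;
        return bless { text => $text }, $class;
    }
}

my @items = map { Demo::Str->new($_) } qw(a b c);

# Model A: join as a left-to-right series of concatenation operations.
my $folded = reduce { $a . "," . $b } @items;

# Model B: join as the concatenation of a series of stringify operations.
my $flat = join ",", map { "$_" } @items;

printf "model A: '%s' (ref: '%s')\n", "$folded", ref $folded;
printf "model B: '%s' (ref: '%s')\n", $flat,     ref $flat;
```

Both models produce the same text here, but model A fixes the evaluation order and yields an object, while model B yields a plain string and leaves the implementation free to reorder or batch the copying.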

I think this notion that "since there is no difference between operators and functions at the opcode level there is no difference" is mistaken; there is a difference. The way the code looks is different, and the problems it exposes are different, and thus it should use a different API.

I look at substr as being an operation that should call the stringify overload method, and then operate on the result. RFC13 seems to want to look at substr() as something else which I don't understand and which I find hard to model in my head. With operators, I can model in my head what the "correct" behavior is. Functions leave a lot more room for ambiguity.

I think that if we merge the work product from RFC13 it must be feature guarded (and the feature NOT be made a standard one) so we don't make everyone pay for what I consider to be a curiosity. If every join() and substr() operation slows down because three people are going to use this new functionality then I would view it as a serious backwards step.

demerphq avatar Jan 25 '23 03:01 demerphq

@demerphq

> I look at substr as being an operation that should call the stringify overload method, and then operate on the result. RFC13 seems to want to look at substr() as something else which I don't understand and which I find hard to model in my head. With operators, I can model in my head what the "correct" behavior is. Functions leave a lot more room for ambiguity.

The point of (this half of) RFC0013 is to make substr the same kind of thing for strings, that int, sqrt, sin, etc... are for numbers. I.e. the expression of a conceptual operation, that individual object classes can decide to implement their own way. Perl (and CPAN) happens to provide a number of object classes that "behave a bit like numbers", and use all these operator overloads to smudge over the fact, allowing you to write code naturally that operates on complex numbers, bigints, bigfloats, vectors, matrices, whatever, as if they were normal native numbers. One really really crucial point is that other code does not need to know about this, to still work. I can take the mean of a bunch of numbers by doing

my $count = scalar @numbers;
my $total = shift @numbers;
$total += $_ for @numbers;
return $total / $count;

and it doesn't actually matter if these were real SVt_NV numbers, or some object type that represents complex numbers, bigints, whatever... All that matters is these objects say to the world "treat me like you'd treat a number" by using the overload interface, and such code will work just fine.
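As a runnable sketch of that point, the same averaging code can be wrapped in a sub (the name mean is just for this example) and fed both native numbers and Math::BigInt objects, which ship with core Perl:

```perl
use strict;
use warnings;
use Math::BigInt;

# The averaging code from above, wrapped in a sub. It has no knowledge
# of any particular number class.
sub mean {
    my @numbers = @_;
    my $count = scalar @numbers;
    my $total = shift @numbers;
    $total += $_ for @numbers;
    return $total / $count;
}

# Native numbers.
my $native = mean(2, 4, 6);

# Objects that are too large for a native integer but overload + and /.
my $big = mean(
    map { Math::BigInt->new($_) }
        qw(100000000000000000000 200000000000000000000 300000000000000000000)
);

print "native: $native\n";
print "big:    $big (", ref($big), ")\n";
```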

But that's only numbers. Ironically for a language that claims to be good at "text processing", Perl has almost nothing in the way of making objects that behave like string values. This is what RFC0013 wanted to begin to address.

My use-case here is for String::Tagged or subclasses thereof. These are objects that are supposed to "feel like strings", but have extra data hanging around inside them, that you can get at if you ask them the right way (i.e. by object methods). I first wrote overload::substr (the CPAN module) in response to the fact that Text::Wrap cannot work on overloaded objects. What I'd love to be able to do is take one of these String::Tagged instances and throw it at code such as Text::Wrap, which has (and needs) no specific knowledge of the object classes in question, and have it know that it can split the text up into lines using substr() and then join them back together using join "\n", @lines and the result would still be my object class with all the associated data attached to it. Right now that cannot happen, because pp_substr first stringifies the object into a plain SVt_PV then slices out character ranges from that. It never gives the object an opportunity to solve the question better, in a way that e.g. pp_add or pp_sqrt do for numbers and objects that want to be treated as numbers.
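A minimal sketch of the current behaviour described here, using a made-up stand-in class (Demo::TaggedText, not the real String::Tagged API) so that it runs with core Perl only:

```perl
use strict;
use warnings;

# Demo::TaggedText is a hypothetical stand-in for a string-like object
# that carries extra data; it is not the String::Tagged interface.
package Demo::TaggedText {
    use overload
        '""'     => sub { $_[0]{text} },
        fallback => 1;

    sub new {
        my ($class, $text, %tags) = @_;
        return bless { text => $text, tags => {%tags} }, $class;
    }
}

my $obj = Demo::TaggedText->new("Hello, world", bold => [ 0, 5 ]);

# substr stringifies the object first, then slices the plain string,
# so the tags never get a chance to follow the extracted range.
my $piece = substr($obj, 0, 5);

printf "piece='%s' is_object=%s\n", $piece, (ref $piece ? "yes" : "no");
```

The extracted piece is an ordinary string, so any code built on substr() silently drops the object's associated data.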

leonerd avatar Jan 25 '23 16:01 leonerd

Furthermore: If we want to have a discussion about this overall concept, I would suggest we should discuss it in the wider context of RFC0013, rather than specifically within this PR that tries to implement a specific part of it. Ideally it would have been useful to have these words a year ago when RFC0013 was being discussed in the first place.

That was the point of having an RFC process - to allow such "should we be doing this and if so how?" debates to take place well before people go out and do the work to actually implement it.

leonerd avatar Jan 25 '23 16:01 leonerd

Resolved conflicts as promised.

demerphq avatar Feb 20 '23 14:02 demerphq