perl5 icon indicating copy to clipboard operation
perl5 copied to clipboard

PerlIO-encoding: Detect no progress made on input

Open khwilliamson opened this issue 1 year ago • 8 comments

This fixes GH #10258

If retrying getting the next byte doesn't make progress, fail

I'm unsure what all should be done, or if this needs to worry about input from a slow source. @leont ? But it starts the discussion

khwilliamson avatar May 18 '24 15:05 khwilliamson

Because of change in XS file, needs a version bump.

$ git diff
diff --git a/ext/PerlIO-encoding/encoding.pm b/ext/PerlIO-encoding/encoding.pm
index 886afbc488..1e8f0b379d 100644
--- a/ext/PerlIO-encoding/encoding.pm
+++ b/ext/PerlIO-encoding/encoding.pm
@@ -1,7 +1,7 @@
 package PerlIO::encoding;
 
 use strict;
-our $VERSION = '0.31';
+our $VERSION = '0.32';
 our $DEBUG = 0;
 $DEBUG and warn __PACKAGE__, " called by ", join(", ", caller), "\n";
 

jkeenan avatar May 18 '24 18:05 jkeenan

I'm unsure what all should be done, or if this needs to worry about input from a slow source. @Leont ? But it starts the discussion

I have a feeling it's wrong, but I'm not sure yet. The code is hard to follow.

Leont avatar May 19 '24 00:05 Leont

I'm wondering if

https://rt.cpan.org/Ticket/Display.html?id=129086&results=cd7972ee07e26803d6c87aca3dabb0cb

has any bearing

khwilliamson avatar May 19 '24 14:05 khwilliamson

I think it's a reasonable solution for now.

I do think it could use tests.

tonycoz avatar May 22 '24 00:05 tonycoz

As far as I can tell, it's entirely legal (but unlikely) that one needs more than two reads to complete a codepoint. This change has the potential to break working code. I would suggest to only enable this new behavior if e->flags & NEEDS_LINES

Fundamentally Encode has the wrong shape for what we're doing, and the real solution is probably to rewrite a lot of that logic. Not that I expect that to happen.

Leont avatar May 23 '24 00:05 Leont

This change has the potential to break working code. I would suggest to only enable this new behavior if e->flags & NEEDS_LINES

It should probably depend on seeing eof from the underlying layer.

I had thought there was an infinite loop issue with non-NEED_LINES encodings as well, but I wasn't able to get that infinite loop.

There is a problem with stop-at-partial handling though, any incomplete characters at end of file are discarded rather than decoded with the fallback:

$  ./perl -Ilib -CS -E 'my $x = "xx\xF1\x81\x80xx"; open my $fh, "<:encoding(UTF-8)", \$x or die; say <$fh>' 
xx\xF1\x81\x80xx
$  ./perl -Ilib -CS -E 'my $x = "xx\xF1\x81\x80"; open my $fh, "<:encoding(UTF-8)", \$x or die; say <$fh>' 
xx

tonycoz avatar May 23 '24 04:05 tonycoz

There is a problem with stop-at-partial handling though, any incomplete characters at end of file are discarded rather than decoded with the fallback:

Yeah that's part of the "we need a better encode that actually understand the concept of streaming", the decoding should have a concept of EOF.

It could also be good for performance reasons, because currently two pieces of C code are talking through Perl method calls and SV*s and that's just silly.

Leont avatar May 23 '24 11:05 Leont

I don't understand the global picture here, but looking at the detail of just the part I changed, I don't understand how that could change the incomplete chars at the end of file. It does seem to me that -a 1 return meaning both EOF and error is a problem.

The boolean I added could be changed to a count, so it quits only after retrying n times without progress. But I didn't see any advantage to that in my very limited understanding of how things work, unless there is a timeout below that can get triggered by, for example, a slow connection to the source of the stream, and a higher level protocol detects that and retries.

khwilliamson avatar May 23 '24 14:05 khwilliamson

Reviewing the discussion in this ticket, my impression is that we collectively don't have enough understanding of the problem to agree on a path forward. I recommend that the p.r. be closed and that a new Issue be opened in which we can further clarify the problem.

jkeenan avatar Aug 03 '24 11:08 jkeenan

Why not merge this as an experiment since it is early in the development cycle?

khwilliamson avatar Aug 05 '24 22:08 khwilliamson

Why not merge this as an experiment since it is early in the development cycle?

I really think that's a bad idea, because I believe it's wrong in a non-obvious way. No current test is going to find it, and when the issue does hit the cause will entirely non-obvious.

Leont avatar Aug 05 '24 23:08 Leont

Unless some committer explicitly says this ticket should remain open, I'm going to close it in no less than 7 days.

jkeenan avatar Aug 15 '24 11:08 jkeenan

Unless some committer explicitly says this ticket should remain open, I'm going to close it in no less than 7 days.

No one spoke up; closing.

jkeenan avatar Aug 23 '24 22:08 jkeenan