perl5 Deparse forgets use utf8

Migrated from rt.perl.org#90590 (status was 'open')

Searchable as RT90590$

May 14 '11 21:05 p5pRT

From [email protected]

This should certainly be emitting a use utf8 at the top:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

--tom

Summary of my perl5 (revision 5 version 12 subversion 3) configuration:
  Platform:
    osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
    uname='openbsd chthon 4.4 generic#0 i386 '
    config_args='-des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-lgdbm -lm -lutil -lc
    perllibs=-lm -lutil -lc
    libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP USE_LARGE_FILES USE_PERLIO USE_PERL_ATOF
  Built under openbsd
  Compiled at Feb 14 2011 07:32:03
  %ENV:
    PERL_UNICODE="SA"
  @INC:
    /usr/local/lib/perl5/site_perl/5.12.3/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/site_perl/5.12.3
    /usr/local/lib/perl5/5.12.3/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/5.12.3
    /usr/local/lib/perl5/site_perl/5.11.3
    /usr/local/lib/perl5/site_perl/5.10.1
    /usr/local/lib/perl5/site_perl/5.10.0
    /usr/local/lib/perl5/site_perl/5.8.7
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl/5.005
    /usr/local/lib/perl5/site_perl
    .

May 14 '11 21:05 p5pRT

From @cpansprout

On Sat May 14 14:14:45 2011, tom christiansen wrote:

This should certainly be emitting a use utf8 at the top:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

If it were to put ‘use utf8’ at the top, would it not make sense for it to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist of utf8.

And should it behave differently depending on whether the output is going straight to STDERR/OUT (whichever it is) or being returned by coderef2text?

--

Father Chrysostomos

Jan 05 '12 22:01 p5pRT

The RT System itself - Status changed from 'new' to 'open'

Jan 05 '12 22:01 p5pRT

From @cpansprout

On Thu Jan 05 14:04:42 2012, sprout wrote:

On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for it to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist of utf8.

And should it behave differently depending on whether the output is going straight to STDERR/OUT (whichever it is) or being returned by coderef2text?

Don’t forget that (under use v5.16) eval("'\x{100}'") does the same thing as evalbytes("use utf8; '\xc4\x80'").
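The equivalence mentioned above can be checked with a short script. This is only a sketch: it assumes a perl of 5.16 or later, where `use v5.16` turns on the `unicode_eval` and `evalbytes` features, and the variable names are made up for the illustration.

```perl
use v5.16;      # enables the unicode_eval and evalbytes features
use strict;
use warnings;

# Character-string eval: the source text is treated as characters.
my $from_chars = eval "'\x{100}'";

# Byte-string eval: 'use utf8' inside says the bytes that follow are UTF-8.
my $from_bytes = evalbytes "use utf8; '\xc4\x80'";

# Both should yield the single character U+0100.
say length($from_chars);
say $from_chars eq $from_bytes ? "same string" : "different strings";
```

The point for Deparse is that a ‘use utf8’ line only makes sense in the second form, where the text being compiled is a sequence of bytes.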

--

Father Chrysostomos

Jan 05 '12 22:01 p5pRT

From @ikegami

On Thu, Jan 5, 2012 at 5:04 PM, Father Chrysostomos via RT < perlbug-followup@perl.org> wrote:

On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for it to output a stream of utf8 bytes without -CS?

That's not relevant. The issue is that the string built by the program generated by Deparse is different than the string built by the original program.

$ perl -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;'
3

$ perl -MO=Deparse -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;' | perl
Wide character in print at /home/eric/usr/perlbrew/perls/perl-5.14.0t/lib/5.14.0/B/Deparse.pm line 1213.
-e syntax OK
5
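The 3 versus 5 difference is the difference between counting characters and counting UTF-8 bytes. A minimal sketch using the core Encode module (the variable names are illustrative, not taken from Deparse):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $original = "\N{U+3b1}-\N{U+3c9}";        # alpha, hyphen, omega
my $octets   = encode('UTF-8', $original);   # what ends up in the piped source

print length($original), "\n";   # 3 characters
print length($octets), "\n";     # 5 bytes: two for alpha, one for '-', two for omega

# Without 'use utf8', the second perl reads the literal as those 5 bytes.
# Decoding them, which is what 'use utf8' does for source text, restores the string.
my $restored = decode('UTF-8', $octets);
print $restored eq $original ? "round-trips\n" : "differs\n";
```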

Jan 05 '12 22:01 p5pRT

From @cpansprout

On Thu Jan 05 14:24:41 2012, ikegami@adaelis.com wrote:

On Thu, Jan 5, 2012 at 5:04 PM, Father Chrysostomos via RT < perlbug-followup@perl.org> wrote:
On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for it to output a stream of utf8 bytes without -CS?
That's not relevant. The issue is that the string built by the program generated by Deparse is different than the string built by the original program.

$ perl -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;'
3

$ perl -MO=Deparse -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;' | perl
Wide character in print at /home/eric/usr/perlbrew/perls/perl-5.14.0t/lib/5.14.0/B/Deparse.pm line 1213.
-e syntax OK
5

Note the wide character warning. If one were to eval() the string instead of outputting it, it would produce the same result.

It makes sense to me to *encode* output as utf8 by default, with ‘use utf8’. But unless we are going to encode it before it reaches the PerlIO layer (because when we have ‘use utf8’, the bytes, not the characters, make up the source code), it doesn’t make sense (to me) to add ‘use utf8’.

--

Father Chrysostomos

Jan 05 '12 22:01 p5pRT

From @ikegami

On Thu, Jan 5, 2012 at 5:39 PM, Father Chrysostomos via RT < perlbug-followup@perl.org> wrote:

unless we are going to encode it before it reaches the PerlIO layer (because when we have ‘use utf8’, the bytes, not the characters, make up the source code), it doesn’t make sense (to me) to add ‘use utf8’.

Agree.

It makes sense to me to *encode* output as utf8 by default, with ‘use utf8’.

Agree.

Jan 05 '12 22:01 p5pRT

From @ap

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-05 23:05]:

On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for it to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist of utf8.

And should it behave differently depending on whether the output is going straight to STDERR/OUT (whichever it is) or being returned by coderef2text?

I think Deparse needs to do better for strings. The output here should look like this instead:

BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say("\x{03B1}-\x{03C9}");

That would be independent of encodings (well, beyond… ASCII I guess) as well as semantically explicit.

Currently Deparse actually does the opposite transform for strings – if you have a "\x{03B1}" in your source it will claim Perl saw a 'α'. That is correct but not truly in the spirit of “showing you what Perl thought you meant”, esp. when you consider that it gives you no (easy) way to tell whether that 'ñ' was really "\x{F1}" or actually "n\x{0303}".
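To make the ambiguity concrete, here is a small sketch comparing the two spellings. It assumes the precomposed lowercase ñ is U+00F1 and uses the core Unicode::Normalize module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{F1}";       # LATIN SMALL LETTER N WITH TILDE
my $decomposed  = "n\x{0303}";    # 'n' followed by COMBINING TILDE

print length($precomposed), "\n";   # 1
print length($decomposed), "\n";    # 2

# The two render alike but are different strings to Perl.
print $precomposed eq $decomposed ? "equal\n" : "not equal\n";

# Only after normalisation do they compare equal.
print NFC($decomposed) eq $precomposed ? "equal after NFC\n" : "still different\n";
```

Output with `\x{...}` escapes would keep this distinction visible, whereas a literal glyph hides it.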

(Maybe it should even use \N by default. In fact I would be sure, if it weren’t for the verbosity that this entails. As things are, I’d say that it should be requestable by argument instead and can in any case be left out for later.)

Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>

Jan 06 '12 00:01 p5pRT

From @cpansprout

On Thu Jan 05 16:33:41 2012, aristotle wrote:

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-05 23:05]:
On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for it to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist of utf8.

And should it behave differently depending on whether the output is going straight to STDERR/OUT (whichever it is) or being returned by coderef2text?
I think Deparse needs to do better for strings. The output here should look like this instead:
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say("\x{03B1}-\x{03C9}");
That would be independent of encodings (well, beyond… ASCII I guess) as well as semantically explicit.

Currently Deparse actually does the opposite transform for strings – if you have a "\x{03B1}" in your source it will claim Perl saw a 'α'. That is correct but not truly in the spirit of “showing you what Perl thought you meant”, esp. when you consider that it gives you no (easy) way to tell whether that 'ñ' was really "\x{F1}" or actually "n\x{0303}".

What about symbol names?

--

Father Chrysostomos

Jan 06 '12 00:01 p5pRT

From @ap

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-06 01:45]:

What about symbol names?

They don’t have an easy answer I can think of.

The subset of lexical variable names sort of has one, insofar as they are irrelevant beyond the question of whether they’re identical or not, so B::Deobfuscate demonstrates one way of dealing with them: they could be replaced with some unambiguous representation of the original names (when so requested by some switch).

But no generalised solution for all identifiers comes to mind.

Then again identifiers should be getting normalised anyway (which Brian is working on anyhow, I think?), so the question may be less pressing for them in the first place than it is for strings, which the parser obviously has to retain faithfully.

Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>

Jan 06 '12 03:01 p5pRT

From @cpansprout

This ticket is about whether B::Deparse output should use "\x{100}" or "Ā" and whether the latter should be encoded or not and whether the output should include ‘use utf8’.

On Thu Jan 05 19:30:00 2012, aristotle wrote:

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-06 01:45]:

What about symbol names?

They don’t have an easy answer I can think of.

The subset of lexical variable names sort of has one, insofar as they are irrelevant beyond the question of whether they’re identical or not, so B::Deobfuscate demonstrates one way of dealing with them: they could be replaced with some unambiguous representation of the original names (when so requested by some switch).

But no generalised solution for all identifiers comes to mind.

To make things more complex: What about /(?<айдэнтыфайер>)/? You can’t escape those characters, because you get a syntax error. You can’t change them, because they correspond to hash keys.

Also, the question as to whether coderef2text output should be evallable or evalbytesable is still unanswered.

(My gut feeling is that output from -MO=Deparse should be a stream of bytes, so it can be output without wide char warnings:

$ ./perl -Ilib -MO=Deparse -e 'use utf8; our $фу'
Wide character in print at lib/B/Deparse.pm line 1588.
use utf8;
our $фу;
-e syntax OK

But that coderef2text should be a Unicode string so it can be fed to ‘eval’.)
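A rough sketch of that split, from the caller's side: treat the `coderef2text` result as a character string for `eval`, and encode it explicitly before printing. This illustrates the proposal only and says nothing about what any particular Deparse version does internally; the exact text returned varies between versions.

```perl
use strict;
use warnings;
use B::Deparse;
use Encode qw(encode);

my $code = sub { return "\x{3b1}-\x{3c9}" };

# coderef2text returns the body of the sub as a Perl string.
my $text = B::Deparse->new->coderef2text($code);

# A character string can be handed straight to a character-based eval.
my $rebuilt = eval "sub $text" or die $@;

# For output, encode first so that print only ever sees bytes
# and cannot emit a "Wide character" warning.
my $octets = encode('UTF-8', $text);
print $octets, "\n";
```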

--

Father Chrysostomos

Dec 11 '14 06:12 p5pRT

From @cpansprout

On Wed Dec 10 22:00:32 2014, sprout wrote:

This ticket is about whether B::Deparse output should use "\x{100}" or "Ā" and whether the latter should be encoded or not and whether the output should include ‘use utf8’.

On Thu Jan 05 19:30:00 2012, aristotle wrote:

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-06 01:45]:

What about symbol names?

They don’t have an easy answer I can think of.

The subset of lexical variable names sort of has one, insofar as they are irrelevant beyond the question of whether they’re identical or not, so B::Deobfuscate demonstrates one way of dealing with them: they could be replaced with some unambiguous representation of the original names (when so requested by some switch).

But no generalised solution for all identifiers comes to mind.

To make things more complex: What about /(?<айдэнтыфайер>)/? You can’t escape those characters, because you get a syntax error. You can’t change them, because they correspond to hash keys.

Also, the question as to whether coderef2text output should be evallable or evalbytesable is still unanswered.

(My gut feeling is that output from -MO=Deparse should be a stream of bytes, so it can be output without wide char warnings:

$ ./perl -Ilib -MO=Deparse -e 'use utf8; our $фу'
Wide character in print at lib/B/Deparse.pm line 1588.
use utf8;
our $фу;
-e syntax OK

But that coderef2text should be a Unicode string so it can be fed to ‘eval’.)

And here is a similar issue:

    use utf8;
    my $e = "Böck";
    ok(utf8::is_utf8($e),"got a unicode string - rt75680");

I recently made it so that the "Böck" is output with an escape, just to avoid malformation errors. (It was being emitted as Latin-1, so the output fed back to perl resulted in corrupt strings.)

But now the problem is that the test (from t/re/pat.t) fails, because we no longer have a utf8-flagged string. Granted, this test is too sensitive, in that it is checking the internal storage of a scalar. But this is a *core* test that just ensures that the tests that follow are testing what we think they are testing. This is another case where the core tests don’t lend themselves to being deparsed and re-run.

--

Father Chrysostomos

Dec 11 '14 06:12 p5pRT

From @ap

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2014-12-11 07:05]:

To make things more complex: What about /(?<айдэнтыфайер>)/? You can’t escape those characters, because you get a syntax error. You can’t change them, because they correspond to hash keys.

Ugh. *scrunchface* Your nose for lurking evil is just too good… :-)

Now, what answer do you expect? If that leaves no other option, then it leaves no other option. If there is only one way it can work, then that is the way it has to work. It would still be nice to get string literals with escapes… But quite evidently now they are a special case, with the general case going the other way.

Also, the question as to whether coderef2text output should be evallable or evalbytesable is still unanswered.

(My gut feeling is that output from -MO=Deparse should be a stream of bytes, so it can be output without wide char warnings:

$ ./perl -Ilib -MO=Deparse -e 'use utf8; our $фу'
Wide character in print at lib/B/Deparse.pm line 1588.
use utf8;
our $фу;
-e syntax OK

Certainly.

But that coderef2text should be a Unicode string so it can be fed to ‘eval’.)

Seems a wash outside of the usability issue that people are probably more likely to use `eval` and not even know about `evalbytes`, so yeah, I suppose.

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2014-12-11 07:20]:

And here is a similar issue:
    use utf8;
    my $e = "Böck";
    ok(utf8::is_utf8($e),"got a unicode string - rt75680");
I recently made it so that the "Böck" is output with an escape, just to avoid malformation errors. (It was being emitted as Latin-1, so the output fed back to perl resulted in corrupt strings.)

But now the problem is that the test (from t/re/pat.t) fails, because we no longer have a utf8-flagged string. Granted, this test is too sensitive, in that it is checking the internal storage of a scalar. But this is a *core* test that just ensures that the tests that follow are testing what we think they are testing. This is another case where the core tests don’t lend themselves to being deparsed and re-run.

Is that testing the regexp engine or the parser? If it’s not testing the parser – and it looks to me like it isn’t – then why is it testing how the string was parsed, instead of just forcibly up- and downgrading it as needed to ensure the UTF-8 flag value required by following tests?

It should still assert that the flag has the required value, of course, just not rely on parser internals to set it.

I don’t like perl making promises that particular forms of writing the same string as a literal will reliably yield a particular UTF-8 flag value, and user code should not be relying on that. Of course, due to the imperfect state of the platform, some code has to care about the state of the UTF-8 flag, even though ideally none ever would. But even code which has such legitimate needs should not be relying on exactly how literals are parsed, IMO. It should upgrade or downgrade explicitly.
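As a sketch of what “upgrade or downgrade explicitly” could look like in a test setup (this is not the actual code in t/re/pat.t; the variable names are invented):

```perl
use strict;
use warnings;

# Written with an escape, so the parser stores it as Latin-1 bytes.
my $e = "B\x{F6}ck";

# Force the internal representation the following tests need.
utf8::upgrade($e);
print utf8::is_utf8($e) ? "flag on\n" : "flag off\n";

# The reverse direction works for any string whose characters are all below 256.
my $copy = $e;
utf8::downgrade($copy);
print utf8::is_utf8($copy) ? "flag on\n" : "flag off\n";

# The string value itself is unchanged either way.
print $copy eq $e ? "same string\n" : "different string\n";
```

The assertion on the flag can then stay in the test without depending on how the literal happened to be written.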

So as far as I care, this is a bug in the test. Not a bug in Deparse. As far as I care, Deparse here is working correctly (after your fix).

— • —

OTOH, if the test *were* trying to test the parser, I would say this is somewhat of a conundrum case. I would still maintain that Deparse works correctly. It’s just that the test tests something that depends on the exact form of the source – which Deparse will never be able to promise to preserve in pristine perfection.

It’s one thing for Deparse to preserve the exact semantics of a program. It certainly ought to try its damnedest to do that, even if that is not attainable in the general case. I.e. deviations from this ideal are bugs even if they have to be considered unfixable.

But because there are many semantically identical representations of any one thing in a program, and the verbatim original representation is not preserved, Deparse is, by its very nature, inapplicable to the class of programs whose semantics depend on the specific choice among multiple possible representations.

So at best you can test that Deparse re-deparses them consistently after they have been parsed again, as I recently mused elsewhere. (It ought to always roundtrip identically when fed its own output.)
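One way such a consistency check might be written, as a sketch only: deparse a sub, compile the result, deparse that again, and compare the two texts. Whether they match can depend on the perl version and the pragmas in scope, so the script reports the result instead of assuming it.

```perl
use strict;
use warnings;
use B::Deparse;

my $deparser = B::Deparse->new;

my $code = sub { my $s = "\x{3b1}-\x{3c9}"; return length $s };

# First pass: original optree to text.
my $first = $deparser->coderef2text($code);

# Compile Deparse's own output, then deparse it a second time.
my $reparsed = eval "sub $first" or die $@;
my $second   = $deparser->coderef2text($reparsed);

print $first eq $second ? "stable round trip\n" : "output changed on re-deparse\n";

# The recompiled sub should also behave like the original.
print $reparsed->() == $code->() ? "same result\n" : "different result\n";
```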

Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>

Dec 13 '14 09:12 p5pRT

What this does in 5.35.10 is

BEGIN { $^W = 1; }
use feature 'current_sub', 'bitwise', 'evalbytes', 'fc', 'postderef_qq', 'say', 'state', 'switch', 'unicode_strings', 'unicode_eval';
say("\x{3b1}-\x{3c9}");
-e syntax OK

And we get this

perl -CS -MO=Deparse,-p -E 'use utf8; qr/(?<айдэнтыфайер>)/'
BEGIN { $^W = 1; }
use feature 'current_sub', 'bitwise', 'evalbytes', 'fc', 'postderef_qq', 'say', 'state', 'switch', 'unicode_strings', 'unicode_eval';
use utf8;
qr/(?<\x{430}\x{439}\x{434}\x{44d}\x{43d}\x{442}\x{44b}\x{444}\x{430}\x{439}\x{435}\x{440}>)/u;

This is related to http://nntp.perl.org/group/perl.perl5.porters/262961

Apr 11 '22 21:04 khwilliamson

I think the behavior is acceptable, and am taking this ticket for the purpose of closing if I don't hear objections by May 31, 2024.

May 02 '24 03:05 khwilliamson
