perl5 Tie::File ruins UTF-8 encoded files [rt.cpan.org #96014]

Migrated from rt.cpan.org#96014 (status was 'new')

Requestors:

[email protected]

From [email protected] on 2014-05-27 21:10:00 :

It is not safe to use Tie::File to operate on a file with a multibyte encoding, such as UTF-8. This has been a defect in Tie::File since it was first written. I was aware at the time that this would be a problem, but at the time it seemed too hard to fix. I know a little bit more now, and the problem has become considerably more urgent since 2005. It should be fixed. I would like to fix this, but I need some help.

The essential problem is as follows. Tie::File holds an array that records the tell() position at which each record in the file begins, as reported by seek(). Tie::File makes the not-completely-warranted assumption that these offsets are actually byte counts, but I think this is a minor issue.

The major issue this this: Suppose that record 3 begins at tell() offset 100, and record 4 begins at tell() offset 150. In some places Tie::File then concludes from this that record 3 is 50 bytes long, which I think is generally correct.

But in other places Tie::File will calculate the length of record 3 by reading it in and calling length() on the result. Since what is wanted is the length in bytes, this is wrong except for legacy encodings where each character is one byte.

Now suppose Tie::File is asked to replace record 3 with some string $s. Tie::File uses length() on $s to find out how many bytes $s is, which is wrong, and it uses length() on record 3 similarly wrongly. Then it compares these two wrong lengths to decide whether record 3 can be overwritten with $s in-place, or whether the tail of the file needs to be copied upwards (if $s is shorter than the old record 3) or downwards (if it is longer). All these calculations are being done with character lengths, but they should be done with byte lengths. Since they are done wrongly, Tie::File mangles the data file when it modifies the record.

On review, I see the following relevant ten-year-old comment:

 # length($oldrec) here is not consistent with text mode  TODO XXX BUG

The enclosed program demonstrates the problem. It writes a correct UTF-8 encoded file, copies it to stdout, then uses Tie::File to modify a record in the middle, then copies the resulting carbled file to stdout.

What I think Tie::File needs to do is to find out how many bytes $s will occupy once written to the file, and use that in place of length($s) in its length calculations. When I wrote Tie::File in 2005 there was no way to do this, Encode not having been invented. But I think now I can use Encode to transform $s to a suitably-encoded byte string, and then use the existing Tie::File machinery to write the byte string to the file. But I'm not sure this is correct; I need someone with some domain knowledge to help me make the right changes.

Also it seems to me that to choose the correct encoding, Tie::File needs some way to interrogate the filehandle it is given, if it is given one, and it needs to allow an "encoding => ...." option to be supplied in the tie() call when it is given a filename. There may be other encoding-related options it should support that I have no thought of.

#!/usr/bin/perl

use Tie::File;
use Fcntl;

{ open my($fh), ">:raw", "tf-test-data" or die $!;
  print $fh "Fl\303\274ghafen
Chinese:\344\270\255\345\234\213\345\223\262\345\255\270\346\233\270\351\233\273\345\255\220\345\214\226\350\250\210\345\212\203
Potat\303\270es\n";
}

binmode(STDOUT, ":utf8");
{ open my($fh), "+<:utf8", "tf-test-data"
  or die $!;
  print while <$fh>;
  print "-------\n";
}

{
  open my($fh), "+<:utf8", "tf-test-data";
  my @A;
  tie @A, Tie::File => $fh or die;
  $A[1] = "octopus";
}

{ open my($fh), "+<:utf8", "tf-test-data"
  or die $!;
  print while <$fh>;
  print "-------\n";
}
__DATA__

Jan 19 '20 03:01 toddr

@khwilliamson any ideas?

Jan 19 '20 04:01 toddr

I think @tonycoz and @leont may be better for this than I. I don't know if github will send them this, as it didn't autocomplete their ids.

This might actually be a legitimate use of the bytes pragrma. One could write

sub byte_length { use bytes; return length shift; }

Jan 19 '20 05:01 khwilliamson

I'm transferring this issue to https://github.com/Perl/perl5 since it is issue with a module in dist.

Jan 27 '20 05:01 toddr

@tonycoz @Leont any thoughts

Apr 16 '22 19:04 khwilliamson

I don't think this really can work. :encoding makes sense on a stream, for random access you'd want Tie::File to do the encoding itself.

Apr 17 '22 12:04 Leont

I've made an attempt at fixing this issue, https://github.com/brainbuz/perl5/tree/trytofix17494. Where previously the file could be badly mangled, I'm getting much smaller errors, and seeing weird behavior of an additional character at the beginning of the following line.

Jan 24 '24 01:01 brainbuz

perl5 perl5 copied to clipboard

Tie::File ruins UTF-8 encoded files [rt.cpan.org #96014]

perl5
perl5 copied to clipboard