
Non UTF-8 encoding support

Open danielpclark opened this issue 8 years ago • 5 comments

The Ruby spec has a test with a Windows-encoded string (basename_spec.rb#L162-L166). This encoding is not UTF-8 compatible and is likely a variation on UTF-16 or UCS-2. Rust's standard String and &str types weren't built to support such encodings, so custom types would need to be written to handle them.

The occurrence of these encodings should be virtually non-existent in web frameworks so problems would likely only arise in Windows specific applications.

Work that has been done in the community towards making a working solution includes

This would make much more sense to implement in FasterPath once Windows support has been added and the code compiles specifically for Windows, so it should be considered after https://github.com/danielpclark/faster_path/issues/102

danielpclark avatar Jun 01 '17 22:06 danielpclark

The test in question is

  it "returns the basename with the same encoding as the original" do
    basename = File.basename('C:/Users/Scuby Pagrubý'.encode(Encoding::Windows_1250))
    basename.should == 'Scuby Pagrubý'.encode(Encoding::Windows_1250)
    basename.encoding.should == Encoding::Windows_1250
  end

To make Rust happy the following works, but the character data does not survive the round trip: the result keeps its UTF-8 bytes and is only relabelled with the original encoding

  def self.basename(pth, ext="")
    Rust.basename(
      pth.encode(Encoding::UTF_8),
      ext.encode(Encoding::UTF_8)
    ).force_encoding(pth.encoding)
  end 

The test output result is

File.basename returns the basename with the same encoding as the original FAILED
Expected "Scuby Pagrub\xC3\xBD"
 to equal "Scuby Pagrub\xFD"
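
The mismatch above can be reproduced without Rust. This is only a sketch: `utf8_only_basename` is a hypothetical stand-in for `Rust.basename` that assumes the Rust side accepts nothing but UTF-8. It compares relabelling the result with `force_encoding` against transcoding it back with `encode`, which may be enough to satisfy the spec for encodings Ruby can convert both ways.

```ruby
# Hypothetical stand-in for Rust.basename; assumes the Rust side only takes UTF-8.
def utf8_only_basename(pth, ext = "")
  raise ArgumentError, "expected UTF-8" unless pth.encoding == Encoding::UTF_8
  File.basename(pth, ext)
end

# The workaround from this thread: relabels the UTF-8 bytes, converts nothing.
def basename_relabelled(pth, ext = "")
  utf8_only_basename(pth.encode(Encoding::UTF_8), ext.encode(Encoding::UTF_8))
    .force_encoding(pth.encoding)
end

# Possible alternative: transcode the UTF-8 result back to the caller's encoding.
def basename_transcoded(pth, ext = "")
  utf8_only_basename(pth.encode(Encoding::UTF_8), ext.encode(Encoding::UTF_8))
    .encode(pth.encoding)
end

# "\u00FD" is the "ý" from the spec, written as an escape to avoid source-encoding issues.
path = "C:/Users/Scuby Pagrub\u00FD".encode(Encoding::Windows_1250)

p basename_relabelled(path).bytes.last(2) # => [195, 189], i.e. \xC3\xBD
p basename_transcoded(path).bytes.last    # => 253, i.e. \xFD
```

Transcoding costs two conversions per call and raises for characters that have no mapping in the target encoding, so it is a stopgap rather than real encoding support in the Rust code.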

danielpclark avatar Jun 01 '17 22:06 danielpclark

This encoding is not UTF-8 compatible and is likely a variation on UTF-16 or UCS-2.

Not that this is relevant, but just an FYI: Windows-1250 is a single-byte encoding. It encodes only 256 possible characters. The first 128 characters match ASCII, and the second half is mostly accented Latin letters.
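
A quick way to see this from Ruby, using only core `String` and `Encoding` methods (the variable names are just for illustration):

```ruby
y_utf8 = "\u00FD"                               # "ý" as a UTF-8 string
y_1250 = y_utf8.encode(Encoding::Windows_1250)

p y_utf8.bytes # => [195, 189]  two bytes in UTF-8
p y_1250.bytes # => [253]       one byte in Windows-1250

# The ASCII range is encoded identically in both.
ascii = "C:/Users/"
p ascii.encode(Encoding::Windows_1250).bytes == ascii.bytes # => true
p Encoding::Windows_1250.ascii_compatible?                  # => true
```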

glebm avatar Sep 11 '17 23:09 glebm

Thanks @glebm. Since I've rewritten this project in ruru, the point of this issue is now to update the capabilities of RString in ruru. New error on TravisCI:

- returns the basename with the same encoding as the originalthread '<unnamed>' panicked
at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 21, error_len: Some(1) }', /checkout/src/libcore/result.rs:906:4
stack backtrace:
   0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
             at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at /checkout/src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at /checkout/src/libstd/sys_common/backtrace.rs:60
             at /checkout/src/libstd/panicking.rs:381
   3: std::panicking::default_hook
             at /checkout/src/libstd/panicking.rs:397
   4: std::panicking::rust_panic_with_hook
             at /checkout/src/libstd/panicking.rs:577
   5: std::panicking::begin_panic
             at /checkout/src/libstd/panicking.rs:538
   6: std::panicking::begin_panic_fmt
             at /checkout/src/libstd/panicking.rs:522
   7: rust_begin_unwind
             at /checkout/src/libstd/panicking.rs:498
   8: core::panicking::panic_fmt
             at /checkout/src/libcore/panicking.rs:71
   9: core::result::unwrap_failed
  10: ruru::class::string::RString::to_str
  11: r_basename

That is, unless ruru already provides encoding support through some other means. I still need to look into this.
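
For reference, the `valid_up_to: 21` in the panic lines up with the position of the Windows-1250 `\xFD` byte in the spec's path. The sketch below checks that from Ruby; `valid_up_to` here is a hypothetical helper written for this comment, not part of ruru or Ruby.

```ruby
# Hypothetical helper: byte offset of the first sequence that is not valid in
# the string's current encoding, or nil if the whole string is valid.
def valid_up_to(str)
  offset = 0
  str.each_char do |ch|
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

raw = "C:/Users/Scuby Pagrub\u00FD".encode(Encoding::Windows_1250)

# Treat the Windows-1250 bytes as if they were UTF-8, which is what the
# Utf8Error suggests the Rust side is doing.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

p as_utf8.valid_encoding? # => false
p valid_up_to(as_utf8)    # => 21
p raw.bytes[21]           # => 253, i.e. \xFD
```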

danielpclark avatar Sep 12 '17 11:09 danielpclark

I've been thinking that looking at FFI and Fiddle may give insight into where to integrate encoding support from Ruby's C code.

danielpclark avatar May 01 '18 20:05 danielpclark

Good News

With the addition of encoding support in Rutie and the CodepointIterator we can move forward more easily with adding encoding support. Many of the algorithms will need to be redesigned to work by individual codepoint rather than by individual char.
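
As a rough illustration of what working by codepoint means, here is a Ruby sketch rather than the Rust implementation. It does not use Rutie's CodepointIterator; it relies on Ruby's own `String#each_codepoint`, and `basename_by_codepoint` is a hypothetical, heavily simplified basename (only `/` separators, no extension or trailing-slash handling).

```ruby
# Hypothetical, simplified basename: scan codepoints in the string's own
# encoding, remember the last "/" and slice by character index.
def basename_by_codepoint(path)
  sep = "/".ord
  last_sep = nil
  path.each_codepoint.with_index { |cp, i| last_sep = i if cp == sep }
  last_sep ? path[(last_sep + 1)..-1] : path
end

win  = "C:/Users/Scuby Pagrub\u00FD".encode(Encoding::Windows_1250)
base = basename_by_codepoint(win)

p base.encoding   # => #<Encoding:Windows-1250>
p base.bytes.last # => 253, i.e. \xFD
```

Because the slice is taken by character index, the result keeps the original encoding and no transcoding step is involved. The same function works on multi-byte UTF-8 input, where byte offsets and character offsets differ.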

danielpclark avatar Dec 22 '18 02:12 danielpclark