faster_path
Non UTF-8 encoding support
The Ruby spec has a test with a Windows-encoded string (basename_spec.rb#L162-L166). This encoding is not UTF-8 compatible and is likely a variation on UTF-16 or UCS-2. Rust wasn't built to support these with the standard String or &str, so custom types would need to be written to support such encodings.
The occurrence of these encodings should be virtually non-existent in web frameworks, so problems would likely only arise in Windows-specific applications.
Work that has been done in the community towards a working solution includes:
- The WTF-8 encoding standard with the Rust crate implementation rust-wtf8.
- The rust-encoding crate — Character encoding support for Rust.
This would make much more sense to implement in FasterPath once Windows support has been added and the code compiles specifically for Windows, so it should be considered after https://github.com/danielpclark/faster_path/issues/102
The test in question is:
it "returns the basename with the same encoding as the original" do
  basename = File.basename('C:/Users/Scuby Pagrubý'.encode(Encoding::Windows_1250))
  basename.should == 'Scuby Pagrubý'.encode(Encoding::Windows_1250)
  basename.encoding.should == Encoding::Windows_1250
end
To make Rust happy, the following works, but some bytes of character data are lost in translation:
def self.basename(pth, ext="")
  Rust.basename(
    pth.encode(Encoding::UTF_8),
    ext.encode(Encoding::UTF_8)
  ).force_encoding(pth.encoding)
end
The test output is:
File.basename returns the basename with the same encoding as the original FAILED
  Expected "Scuby Pagrub\xC3\xBD"
  to equal "Scuby Pagrub\xFD"
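The mismatch above is consistent with `force_encoding` only relabelling the UTF-8 bytes, whereas `encode` transcodes them. A minimal sketch of the difference follows. It assumes a hypothetical `utf8_only_basename` stand-in for `Rust.basename`; it is not the real FFI call, just Ruby's own `File.basename` restricted to UTF-8 input. The names `basename_relabelled` and `basename_transcoded` are also made up for illustration.

```ruby
# Hypothetical stand-in for Rust.basename (an assumption for illustration):
# it accepts only UTF-8 input, as a &str-based Rust function would.
def utf8_only_basename(path, ext = "")
  raise ArgumentError, "expected UTF-8" unless path.encoding == Encoding::UTF_8
  File.basename(path, ext)
end

# The failing approach: the UTF-8 result is relabelled, not converted.
def basename_relabelled(pth, ext = "")
  utf8_only_basename(pth.encode(Encoding::UTF_8), ext.encode(Encoding::UTF_8))
    .force_encoding(pth.encoding)
end

# Possible fix: transcode the UTF-8 result back into the caller's encoding.
def basename_transcoded(pth, ext = "")
  utf8_only_basename(pth.encode(Encoding::UTF_8), ext.encode(Encoding::UTF_8))
    .encode(pth.encoding)
end

win = 'C:/Users/Scuby Pagrubý'.encode(Encoding::Windows_1250)
p basename_relabelled(win).bytes.last(2) # the two UTF-8 bytes of "ý"
p basename_transcoded(win).bytes.last    # the single Windows-1250 byte
```

This round trip only helps when every character in the path can be represented in both encodings, which holds for Windows-1250 text converted to UTF-8 and back.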
"This encoding is not UTF-8 compatible and is likely a variation on UTF-16 or UCS-2."
Not that this is relevant, but just an FYI: Windows-1250 is a single-byte encoding. It only encodes 256 possible characters. The first 128 characters match 7-bit ASCII, and the second half is mostly accented Latin letters.
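The single-byte point can be checked directly in Ruby. This is a small sketch; `bytes_in` is a hypothetical helper introduced only for this illustration.

```ruby
# Hypothetical helper: bytes of a UTF-8 string after transcoding it.
def bytes_in(text, encoding)
  text.encode(encoding).bytes
end

p bytes_in('ý', Encoding::Windows_1250) # one byte: 0xFD
p bytes_in('ý', Encoding::UTF_8)        # two bytes: 0xC3 0xBD
p bytes_in('ý', Encoding::UTF_16LE)     # two bytes: 0xFD 0x00
p bytes_in('A', Encoding::Windows_1250) # same as ASCII: 0x41
```

The lone `0xFD` byte is what the spec expects at the end of the basename, and it is not a valid UTF-8 sequence on its own.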
Thanks @glebm. Since I've rewritten this project in ruru, the point of this issue is now to update the capabilities of RString in ruru. New error on TravisCI:
- returns the basename with the same encoding as the originalthread '<unnamed>' panicked
at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 21, error_len: Some(1) }', /checkout/src/libcore/result.rs:906:4
stack backtrace:
0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
1: std::sys_common::backtrace::_print
at /checkout/src/libstd/sys_common/backtrace.rs:71
2: std::panicking::default_hook::{{closure}}
at /checkout/src/libstd/sys_common/backtrace.rs:60
at /checkout/src/libstd/panicking.rs:381
3: std::panicking::default_hook
at /checkout/src/libstd/panicking.rs:397
4: std::panicking::rust_panic_with_hook
at /checkout/src/libstd/panicking.rs:577
5: std::panicking::begin_panic
at /checkout/src/libstd/panicking.rs:538
6: std::panicking::begin_panic_fmt
at /checkout/src/libstd/panicking.rs:522
7: rust_begin_unwind
at /checkout/src/libstd/panicking.rs:498
8: core::panicking::panic_fmt
at /checkout/src/libcore/panicking.rs:71
9: core::result::unwrap_failed
10: ruru::class::string::RString::to_str
11: r_basename
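The `valid_up_to: 21` in the panic lines up with the position of the `ý` byte in the spec's path: the 21 bytes before it are plain ASCII, and the Windows-1250 byte `0xFD` is not valid UTF-8. A rough Ruby sketch of the same check is below. `first_invalid_utf8_offset` is a hypothetical helper, not something ruru or FasterPath provides.

```ruby
# Hypothetical helper: byte offset of the first sequence that is not valid
# UTF-8, or nil if the whole string is valid. Comparable in spirit to the
# valid_up_to field of Rust's Utf8Error.
def first_invalid_utf8_offset(str)
  offset = 0
  # Reinterpret a copy of the raw bytes as UTF-8 without converting them.
  str.dup.force_encoding(Encoding::UTF_8).each_char do |ch|
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

win = 'C:/Users/Scuby Pagrubý'.encode(Encoding::Windows_1250)
p first_invalid_utf8_offset(win) # => 21
```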
That said, ruru may already offer encoding support through some alternate means. I still need to look into this.
I've been thinking that looking at FFI and Fiddle may give insight into where to integrate encoding from Ruby's C code.
Good News
With the addition of encoding support in Rutie and the CodepointIterator, we can move forward more easily with adding encoding support. Many of the algorithms will need to be redesigned to work by individual codepoint rather than by individual char.
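To illustrate what working per character in the path's own encoding could look like, here is a simplified Ruby sketch. It is not FasterPath's algorithm and does not use Rutie's CodepointIterator. `basename_by_character` is a hypothetical helper that only handles "/" separators and ignores the extension argument.

```ruby
# Hypothetical helper: a simplified basename that operates on characters in
# the path's own encoding, so no UTF-8 round trip is needed.
def basename_by_character(path)
  sep = '/'.encode(path.encoding)
  # Ignore trailing separators, but leave a lone "/" alone.
  stop = path.length
  stop -= 1 while stop > 1 && path[stop - 1] == sep
  trimmed = path[0, stop]
  # String#rindex and String#[] count characters, not bytes.
  last_sep = trimmed.rindex(sep)
  return trimmed if last_sep.nil? || trimmed == sep
  trimmed[(last_sep + 1)..-1]
end

win = 'C:/Users/Scuby Pagrubý'.encode(Encoding::Windows_1250)
base = basename_by_character(win)
p base.encoding # the result keeps the encoding of the input
```

Because the slicing never leaves the original encoding, the result needs neither `force_encoding` nor `encode`.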