rbx-dom icon indicating copy to clipboard operation
rbx-dom copied to clipboard

Fix "stream did not contain valid UTF-8" using String::from_utf8_lossy

Open krakow10 opened this issue 1 year ago • 18 comments

I am using rbx_binary::from_reader(input) to load files in my rust program, and some of the files make the function spit out the error "stream did not contain valid UTF-8".

I decided that I'm proficient enough at Rust to find out how to fix this pesky issue myself, and here's what I've come up with. I found two places which would produce this error and have patched them using the String::from_utf8_lossy function.

Here is an image of an offending section of a Script.Source property after it was mercilessly butchered into valid UTF-8: image

krakow10 avatar Jan 06 '24 07:01 krakow10

I'm testing this and it didn't entirely fix the issue, only one case. I will keep investigating.

krakow10 avatar Jan 06 '24 07:01 krakow10

Hi, what problem are you running into and trying to fix? Right now I have no idea beyond that you're evidently having trouble with non-UTF8 names for Instances?

Dekkonot avatar Jan 06 '24 07:01 Dekkonot

^ Added String::from_utf8_lossy to Name property

I have .rbxl files which, when I decode with

rbx_binary::from_reader(input)

Throw this UTF-8 error.

krakow10 avatar Jan 06 '24 08:01 krakow10

Just to clarify, is there a reason why you want/need invalid UTF-8 for the name property on Instances?

We've had discussions about this before internally and it's unclear if it's actually beneficial to support names like that. If we were going to, we'd probably want to swap to using a string replacement like BStr over lossy conversions though.

Dekkonot avatar Jan 06 '24 08:01 Dekkonot

So far I have found the issue with both Name and Source property. My current project is that I'm converting the 1006 .rbxl files I downloaded off my place version history into git version history, but I've also run into this issue with map files for my actual game which I wanted to scan for scripts. So it's both reading files I created a very long time ago, and reading files that I didn't create.

krakow10 avatar Jan 06 '24 08:01 krakow10

All 1006 files have decoded with these changes! I now have a git history all the way back to 2015!

krakow10 avatar Jan 06 '24 09:01 krakow10

I updated the original message with some contextual information, sorry about the initial lack of clarity.

krakow10 avatar Jan 06 '24 21:01 krakow10

Going to tag @kennethloeffler in so that he can share his thoughts on this. I don't want to leave you hanging on this and between us we should be able to make a final call on how we want to solve this problem.

Dekkonot avatar Feb 15 '24 16:02 Dekkonot

We've had a bit of discussion about this aspect of rbx-dom already: https://github.com/rojo-rbx/rbx-dom/issues/313.

A lossy conversion is okay, but is... well, lossy. Roblox's implementation doesn't do any conversions whatsoever on String property contents, and diverging from that doesn't feel great to me. It could lead to busted results - for example, when Lua code expects a certain non UTF-8 instance name, when binary data is stored in a StringValue, etc.

I think I'm okay with doing a lossy conversion so rbx-dom can successfully-ish decode more files, as long as we document it and print a warning when we have to do it so that there are no murky surprises. But we should really consider bringing rbx-dom's String properties up to parity with Roblox's by representing them as Vec<u8> or similar. That is a breaking change, so we'll have to think about how it fits into the next major version.

kennethloeffler avatar Feb 15 '24 18:02 kennethloeffler

Do we need to do anything to rbx_xml for this? @Dekkonot

kennethloeffler avatar Feb 15 '24 23:02 kennethloeffler

How about making lossy string conversion an optional decoder setting for now? There are lots of maps on my game that simply don't work because of this, so I'm using this patched version. I can agree that effectively destroying strings by default can be seen as a bug that should not be relied on.

krakow10 avatar Feb 16 '24 01:02 krakow10

How about making lossy string conversion an optional decoder setting for now? There are lots of maps on my game that simply don't work because of this, so I'm using this patched version. I can agree that effectively destroying strings by default can be seen as a bug that should not be relied on.

Meh, it's better than the behavior we have now (which is just to fall over and exit without producing a result). I think it should be fine until we properly support non UTF-8 String properties. Most of the time, these sorts of strings are present only because of "virus" instances, which no one wants or really cares about anyway.

Make sure to import std::borrow::Cow too, right now this branch will not compile

kennethloeffler avatar Feb 16 '24 01:02 kennethloeffler

This needs an entry in the changelog but otherwise looks good to me.


Do we need to do anything to rbx_xml for this?

Unicode in general doesn't currently work in rbx_xml so no, but only because we're in a bad spot anyway. @kennethloeffler

Dekkonot avatar Feb 19 '24 16:02 Dekkonot

Squashed and rebased

krakow10 avatar Feb 25 '24 02:02 krakow10

Added changelog

krakow10 avatar Apr 24 '24 08:04 krakow10

Does Roblox use UTF-16? I know there's a crate for that, maybe that's desirable instead of BStr.

krakow10 avatar May 30 '24 09:05 krakow10

No, Roblox uses UTF-8 encoding. It just also allows binary data to be put into strings. What we really need to do is not pretend that Roblox strings are always UTF-8. That's outside of scope for this PR though.

Dekkonot avatar May 30 '24 19:05 Dekkonot

@kennethloeffler When you have a moment, can you take a look at this again? Turns out it needs your approval, haha.

Dekkonot avatar May 30 '24 19:05 Dekkonot