rbx-dom
rbx-dom copied to clipboard
Fix "stream did not contain valid UTF-8" using String::from_utf8_lossy
I am using rbx_binary::from_reader(input) to load files in my rust program, and some of the files make the function spit out the error "stream did not contain valid UTF-8".
I decided that I'm proficient enough at Rust to find out how to fix this pesky issue myself, and here's what I've come up with. I found two places which would produce this error and have patched them using the String::from_utf8_lossy function.
Here is an image of an offending section of a Script.Source property after it was mercilessly butchered into valid UTF-8:
I'm testing this and it didn't entirely fix the issue, only one case. I will keep investigating.
Hi, what problem are you running into and trying to fix? Right now I have no idea beyond that you're evidently having trouble with non-UTF8 names for Instances?
^ Added String::from_utf8_lossy to Name property
I have .rbxl files which, when I decode with
rbx_binary::from_reader(input)
Throw this UTF-8 error.
Just to clarify, is there a reason why you want/need invalid UTF-8 for the name property on Instances?
We've had discussions about this before internally and it's unclear if it's actually beneficial to support names like that. If we were going to, we'd probably want to swap to using a string replacement like BStr over lossy conversions though.
So far I have found the issue with both Name and Source property. My current project is that I'm converting the 1006 .rbxl files I downloaded off my place version history into git version history, but I've also run into this issue with map files for my actual game which I wanted to scan for scripts. So it's both reading files I created a very long time ago, and reading files that I didn't create.
All 1006 files have decoded with these changes! I now have a git history all the way back to 2015!
I updated the original message with some contextual information, sorry about the initial lack of clarity.
Going to tag @kennethloeffler in so that he can share his thoughts on this. I don't want to leave you hanging on this and between us we should be able to make a final call on how we want to solve this problem.
We've had a bit of discussion about this aspect of rbx-dom already: https://github.com/rojo-rbx/rbx-dom/issues/313.
A lossy conversion is okay, but is... well, lossy. Roblox's implementation doesn't do any conversions whatsoever on String property contents, and diverging from that doesn't feel great to me. It could lead to busted results - for example, when Lua code expects a certain non UTF-8 instance name, when binary data is stored in a StringValue, etc.
I think I'm okay with doing a lossy conversion so rbx-dom can successfully-ish decode more files, as long as we document it and print a warning when we have to do it so that there are no murky surprises. But we should really consider bringing rbx-dom's String properties up to parity with Roblox's by representing them as Vec<u8> or similar. That is a breaking change, so we'll have to think about how it fits into the next major version.
Do we need to do anything to rbx_xml for this? @Dekkonot
How about making lossy string conversion an optional decoder setting for now? There are lots of maps on my game that simply don't work because of this, so I'm using this patched version. I can agree that effectively destroying strings by default can be seen as a bug that should not be relied on.
How about making lossy string conversion an optional decoder setting for now? There are lots of maps on my game that simply don't work because of this, so I'm using this patched version. I can agree that effectively destroying strings by default can be seen as a bug that should not be relied on.
Meh, it's better than the behavior we have now (which is just to fall over and exit without producing a result). I think it should be fine until we properly support non UTF-8 String properties. Most of the time, these sorts of strings are present only because of "virus" instances, which no one wants or really cares about anyway.
Make sure to import std::borrow::Cow too, right now this branch will not compile
This needs an entry in the changelog but otherwise looks good to me.
Do we need to do anything to rbx_xml for this?
Unicode in general doesn't currently work in rbx_xml so no, but only because we're in a bad spot anyway. @kennethloeffler
Squashed and rebased
Added changelog
Does Roblox use UTF-16? I know there's a crate for that, maybe that's desirable instead of BStr.
No, Roblox uses UTF-8 encoding. It just also allows binary data to be put into strings. What we really need to do is not pretend that Roblox strings are always UTF-8. That's outside of scope for this PR though.
@kennethloeffler When you have a moment, can you take a look at this again? Turns out it needs your approval, haha.