Handle Malformed/Obfuscated strtab Without UTF-8 Validation Error
Hi everyone,
I encountered an issue while parsing malformed/obfuscated ELF files. Goblin fails with a "bad input invalid UTF-8" error when it encounters invalid UTF-8 strings in the .strtab section.
I believe this behavior isn't ideal: we should still be able to parse these ELF files even if the strings within .strtab aren't valid UTF-8. As far as I can tell, there is no explicit requirement that strings in strtab be valid UTF-8, is there?
Here is an example of the kind of malformed input causing the issue:
64: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND __sF
65: 0000000000000000 0 FUNC GLOBAL DEFAULT UND calloc
66: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND B|n5:Bn72FI?:#n[...]
67: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND B|n5:Bn72FI?:#n[...]
68: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND B|n5:Bn72FI?:#n[...]
69: 000000000008cbfc 1950 FUNC GLOBAL DEFAULT 1 p:z^I�^V^�9�^X3[...]
70: 00000000000c41ea 434 FUNC GLOBAL DEFAULT 1 ^�7,�j5!�j8#�P71[...]
71: 00000000000c4118 435 FUNC GLOBAL DEFAULT 1 ^\^��T^I)�X^I^T�[...]
72: 00000000000c4137 432 FUNC GLOBAL DEFAULT 1 �[@7^I"Y;�)S"O/E[...]
73: 00000000000c4127 435 FUNC GLOBAL DEFAULT 1 |7��i51�i8o�S7��[...]
Please find the attached binary as an example. As a temporary workaround, I simply replace invalid strings with an empty string ("") when parsing fails, but I don't think this is the most appropriate solution.
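For reference, the fallback I'm using looks roughly like the following. It's a simplified sketch over a raw string-table slice; the helper name and the direct byte handling are mine for illustration, not goblin's API:

```rust
use std::str;

/// Simplified sketch of the workaround: read the NUL-terminated name at
/// `offset` out of a raw string-table slice, falling back to "" when the
/// bytes aren't valid UTF-8.
fn name_or_empty(strtab: &[u8], offset: usize) -> &str {
    strtab
        .get(offset..)                                   // out-of-range offsets become None
        .and_then(|rest| rest.split(|&b| b == 0).next()) // take bytes up to the NUL terminator
        .and_then(|bytes| str::from_utf8(bytes).ok())    // UTF-8 validation
        .unwrap_or("")                                   // invalid/obfuscated names become ""
}

fn main() {
    let strtab = b"\0calloc\0\xff\xfe\xfdnot-utf8\0";
    assert_eq!(name_or_empty(strtab, 1), "calloc");
    assert_eq!(name_or_empty(strtab, 8), ""); // non-UTF-8 entry is replaced with ""
}
```

This keeps parsing going, but it silently loses the obfuscated names, which is part of why I don't think it's the right long-term answer.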
I'd like to open a discussion on how we should handle these cases. Should we consider skipping over invalid UTF-8 strings instead of failing, or is there another approach we’d prefer to implement?
Thanks a lot!
I've been thinking about this a lot and I think this is going to be very tricky to handle right. This came up once in the past in another context; the suggestion was to return raw bytes, but Rust &str is guaranteed to be UTF-8. If the obfuscator uses non-UTF-8 sequences in the symbol table, then we need to either return raw bytes, or something like:
enum GoblinStr<'a> {
    Raw(&'a [u8]),
    Str(&'a str),
}
E.g., if we fail to validate a string as UTF-8 when it comes out of the string table, then it gets returned as the raw bytes.
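Purely as a sketch of that behavior (the lookup function and its signature are made up for illustration, not an existing goblin API):

```rust
use std::str;

/// Either a validated UTF-8 string, or the raw bytes when validation fails.
enum GoblinStr<'a> {
    Raw(&'a [u8]),
    Str(&'a str),
}

/// Sketch of the lookup: pull the NUL-terminated entry at `offset` out of
/// the string table, and only fall back to raw bytes if it isn't UTF-8.
fn get_str<'a>(strtab: &'a [u8], offset: usize) -> Option<GoblinStr<'a>> {
    let bytes = strtab.get(offset..)?.split(|&b| b == 0).next()?;
    match str::from_utf8(bytes) {
        Ok(s) => Some(GoblinStr::Str(s)),
        Err(_) => Some(GoblinStr::Raw(bytes)),
    }
}
```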
However, this would massively pessimize string handling for the entire library and make it highly unergonomic, all for what is arguably a very niche (albeit important) case when loading and parsing binaries.
So this is a tough one, I think.
I agree. So perhaps we keep strings as they are, add a permissive mode that skips any strings we can't parse instead of failing entirely, and add a separate API to extract the raw strings for those who need them? Something like the sketch below.
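A hypothetical sketch of the two APIs side by side; none of these names exist in goblin today, this is only to illustrate the shape:

```rust
use std::str;

/// Hypothetical sketch of the two proposed APIs on a string table: a
/// permissive, &str-based lookup that skips non-UTF-8 entries, plus a
/// separate raw accessor for callers who still want the bytes.
struct StrtabSketch<'a> {
    bytes: &'a [u8],
}

impl<'a> StrtabSketch<'a> {
    /// Raw bytes of the NUL-terminated entry at `offset`, no validation.
    fn get_raw(&self, offset: usize) -> Option<&'a [u8]> {
        self.bytes.get(offset..)?.split(|&b| b == 0).next()
    }

    /// Permissive lookup: returns None for out-of-range *or* non-UTF-8
    /// entries instead of an error that aborts the whole parse.
    fn get_permissive(&self, offset: usize) -> Option<&'a str> {
        str::from_utf8(self.get_raw(offset)?).ok()
    }
}
```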
Another example, this time not related to strtab, but to section names. I’ll add it to this issue, as I think we can treat it as part of the broader UTF-8 meta issue. Let me know if you’d prefer to track it separately instead.
Yeah, thanks for your comment; I'm going to close this. As I mentioned, I don't think returning an empty string is the best solution. However, we could definitely add functions/features so that strings that fail to validate are ignored/dropped, e.g., on construction of the strtab, etc. That could itself be driven by config options, like we do with PE. There are a few approaches. Please feel free to create a tracking issue and list a couple of places you'd like to see better "fortified" parsing for the use case of looking at binaries which have been obfuscated. And thank you!
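For the config-option direction, one possible shape, purely hypothetical and only to illustrate the idea (goblin doesn't expose anything like this for ELF string tables today):

```rust
/// Hypothetical parse options in the spirit of the PE config options,
/// controlling what happens when a string-table entry isn't valid UTF-8.
/// None of this exists in goblin today; it only sketches the direction.
#[derive(Clone, Copy, Debug)]
enum InvalidStrPolicy {
    /// Current behavior: fail the whole parse with a UTF-8 error.
    Error,
    /// Drop entries that don't validate, e.g. when constructing the strtab.
    Skip,
    /// Replace entries that don't validate with an empty string.
    Empty,
}

#[derive(Clone, Copy, Debug)]
struct StrtabOptions {
    invalid_str_policy: InvalidStrPolicy,
}

impl Default for StrtabOptions {
    fn default() -> Self {
        // Strict by default; permissiveness is opt-in.
        StrtabOptions {
            invalid_str_policy: InvalidStrPolicy::Error,
        }
    }
}
```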