
Handle Malformed/Obfuscated strtab Without UTF-8 Validation Error

Open chf0x opened this issue 8 months ago • 1 comments

Hi everyone,

I encountered an issue while parsing malformed/obfuscated ELF files. Goblin tends to fail with a "bad input invalid UTF-8" error when it encounters non-valid UTF-8 strings in the .strtab section.

I believe this behavior isn't ideal: we should still be able to parse these ELF files correctly even if the strings within .strtab aren't valid UTF-8. As far as I can tell, there is no explicit requirement for the strings in strtab to be valid UTF-8.

Here is an example of the kind of malformed input causing the issue:

    64: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  UND __sF
    65: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND calloc
    66: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND B|n5:Bn72FI?:#n[...]
    67: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND B|n5:Bn72FI?:#n[...]
    68: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND B|n5:Bn72FI?:#n[...]
    69: 000000000008cbfc  1950 FUNC    GLOBAL DEFAULT    1 p:z^I�^V^�9�^X3[...]
    70: 00000000000c41ea   434 FUNC    GLOBAL DEFAULT    1 ^�7,�j5!�j8#�P71[...]
    71: 00000000000c4118   435 FUNC    GLOBAL DEFAULT    1 ^\^��T^I)�X^I^T�[...]
    72: 00000000000c4137   432 FUNC    GLOBAL DEFAULT    1 �[@7^I"Y;�)S"O/E[...]
    73: 00000000000c4127   435 FUNC    GLOBAL DEFAULT    1 |7��i51�i8o�S7��[...]

Please find the attached binary as an example. As a temporary workaround, I simply replace invalid strings with an empty string ("") when parsing fails, but I don't think this is the most appropriate solution.
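For reference, the workaround described above amounts to something like the following. This is only a minimal sketch using the standard library; `name_or_empty` is a hypothetical helper name, not a goblin function.

```rust
/// Hypothetical helper (not part of goblin): return the name as UTF-8,
/// or an empty string when the bytes are not valid UTF-8.
fn name_or_empty(bytes: &[u8]) -> &str {
    std::str::from_utf8(bytes).unwrap_or("")
}

fn main() {
    // A valid symbol name passes through unchanged.
    println!("{:?}", name_or_empty(b"calloc"));
    // An obfuscated name with invalid UTF-8 bytes collapses to "".
    println!("{:?}", name_or_empty(b"\xff\xfe7,"));
}
```

One drawback of this approach is that every invalid name maps to the same empty string, so distinct obfuscated symbols can no longer be told apart by name.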

I'd like to open a discussion on how we should handle these cases. Should we consider skipping over invalid UTF-8 strings instead of failing, or is there another approach we’d prefer to implement?

Thanks a lot!

chf0x avatar Apr 24 '25 14:04 chf0x

file.zip

chf0x avatar Apr 24 '25 14:04 chf0x

I've been thinking about this a lot and I think this is going to be very tricky to handle right. This came up once in the past in another context; the suggestion was to return raw bytes, but Rust &str values are guaranteed to be UTF-8. If the obfuscator uses non-UTF-8 sequences in the symbol table, then we need to return either raw bytes or something like:

    enum GoblinStr<'a> {
        Raw(&'a [u8]),
        Str(&'a str),
    }

e.g., if we fail to validate a string as utf8 when it comes out of the string table, then it gets returned as the raw bytes.
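A rough sketch of how that fallback could look, using only the standard library. The `get_at` function and its signature are assumptions made for illustration; this is not goblin's actual strtab API.

```rust
/// Hypothetical type from the comment above; not part of goblin's API.
#[derive(Debug, PartialEq)]
enum GoblinStr<'a> {
    Raw(&'a [u8]),
    Str(&'a str),
}

/// Hypothetical lookup: read the NUL-terminated entry starting at `offset`.
/// Returns None if `offset` is past the end of the table.
fn get_at(strtab: &[u8], offset: usize) -> Option<GoblinStr<'_>> {
    let tail = strtab.get(offset..)?;
    // An unterminated final entry runs to the end of the table.
    let end = tail.iter().position(|&b| b == 0).unwrap_or(tail.len());
    let bytes = &tail[..end];
    Some(match std::str::from_utf8(bytes) {
        Ok(s) => GoblinStr::Str(s),
        // Validation failed, so hand back the raw bytes instead of erroring.
        Err(_) => GoblinStr::Raw(bytes),
    })
}

fn main() {
    // Made-up table: "", "calloc", and one entry with invalid UTF-8 bytes.
    let strtab = b"\0calloc\0\xff\xfe7,\0";
    println!("{:?}", get_at(strtab, 1));
    println!("{:?}", get_at(strtab, 8));
}
```

Every caller then has to match on both variants, which is the ergonomic cost discussed in this thread.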

However, this would massively pessimize string handling for the entire library and make it highly unergonomic, all for what is arguably a very niche (albeit important) case when loading and parsing binaries.

so this is a tough one i think.

m4b avatar Jun 16 '25 04:06 m4b

I agree. So probably we keep strings as is, add a permissive mode for strings to skip any strings we can't parse, instead of failing entirely? And add a separate API to extract raw strings for those who would need them?
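To make the proposal concrete, here is a minimal sketch of what a permissive mode plus a separate raw accessor might look like. All names (`Utf8Mode`, `strings`, `raw_strings`) are hypothetical and chosen for illustration; none of them exist in goblin.

```rust
use std::str::Utf8Error;

/// Hypothetical parsing option; not an existing goblin setting.
#[derive(Clone, Copy)]
enum Utf8Mode {
    Strict,
    Permissive,
}

/// Hypothetical raw accessor: every non-empty NUL-terminated entry as bytes.
fn raw_strings(strtab: &[u8]) -> Vec<&[u8]> {
    strtab
        .split(|&b| b == 0)
        .filter(|entry| !entry.is_empty())
        .collect()
}

/// Strict mode fails on the first invalid entry; permissive mode skips it.
fn strings(strtab: &[u8], mode: Utf8Mode) -> Result<Vec<&str>, Utf8Error> {
    let mut out = Vec::new();
    for entry in raw_strings(strtab) {
        match (std::str::from_utf8(entry), mode) {
            (Ok(s), _) => out.push(s),
            (Err(_), Utf8Mode::Permissive) => continue,
            (Err(e), Utf8Mode::Strict) => return Err(e),
        }
    }
    Ok(out)
}

fn main() {
    // Made-up table with two valid names and one invalid entry.
    let strtab = b"\0__sF\0calloc\0\xff\xfe7,\0";
    println!("{:?}", strings(strtab, Utf8Mode::Strict).is_err());
    println!("{:?}", strings(strtab, Utf8Mode::Permissive));
    println!("{}", raw_strings(strtab).len());
}
```

With this split, the default API keeps returning plain `&str`, and only callers who care about obfuscated names need to touch the byte-level accessor.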

chf0x avatar Jun 16 '25 11:06 chf0x

Another example, this time not related to strtab, but to section names. I’ll add it to this issue, as I think we can treat it as part of the broader UTF-8 meta issue. Let me know if you’d prefer to track it separately instead.

example.zip

image

chf0x avatar Jun 17 '25 12:06 chf0x

Yeah, thanks for your comment. I'm going to close this; as I mentioned, I don't think returning an empty string is the best solution. However, we could definitely add functions/features that ignore or drop strings that fail to parse, e.g., on construction of the strtab. That behavior could itself be driven by config options, like we do with PE. There are a few approaches. Please feel free to create a tracking issue, along with a couple of places where you'd be interested in seeing better "fortified" parsing for the use case of looking at binaries that have been obfuscated. And thank you!

m4b avatar Aug 15 '25 04:08 m4b