String search with `str.encoding=guess` is very unreliable.
Environment information
- Operating System: All
- Cutter version: https://github.com/rizinorg/cutter/pull/3418
- Obtained from:
- [x] Built from source
- [ ] Downloaded release from Cutter website or GitHub
- [ ] Distribution repository
- File format: All
Describe the bug
To Reproduce
Steps to reproduce the behavior:
Search for a UTF-32 string that contains non-ASCII characters while str.encoding is set to guess. The search won't find it. The same happens with other non-UTF-8 encodings.
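For anyone who wants a small input to test against, here is a minimal sketch in Python that builds a buffer containing one non-ASCII string in UTF-32LE. The sample text, the filler bytes, and the output file name are arbitrary assumptions for illustration, not part of the original report.

```python
# Build a small test buffer with one non-ASCII string encoded as UTF-32LE.
# SAMPLE and the 0xFF filler are arbitrary choices for illustration.
SAMPLE = "Grüße, мир"


def build_test_blob(text: str = SAMPLE, encoding: str = "utf-32-le") -> bytes:
    filler = bytes([0xFF]) * 16  # filler that is unlikely to look like text
    terminator = b"\x00" * 4     # one NUL code point in UTF-32
    return filler + text.encode(encoding) + terminator + filler


blob = build_test_blob()

# To write it to disk (hypothetical file name):
# with open("utf32_test.bin", "wb") as f:
#     f.write(blob)
```

Opening such a file and searching for the sample text with the encoding left on guess should be enough to check whether the problem shows up.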
Expected behavior
It finds the strings.
Screenshots
Additional context
The problem is Rizin's string encoding detection, which is unreliable. Cutter's search widget doesn't allow setting the string encoding, so it falls back to the configured settings. Those are set to guess by default (for compatibility reasons).
If any user finds this, the workaround for now is to search for strings with a known encoding via the command line.
List the string search commands with /z?.
I agree that the search widget should be extended with search-type-specific options (with sane defaults so that they don't have to be touched most of the time). I saw that ROP search has a search context struct which supports multiple options. But there is probably also room for improving the guess mode search.
What does the rizin "guess" mode currently do?
- Encodes search string in single encoding and searches for bytes?
- Searches in all known strings (automatically detected ones + manually defined by the user) which have their own encoding metadata already attached?
- Tries multiple encodings and searches for bytes corresponding to each of them?
Something else?
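To make the third option concrete, here is a minimal sketch in Python of "try multiple encodings and search for the bytes of each". It only illustrates the idea from the question and is not a description of what Rizin does; the function name and the list of candidate encodings are assumptions.

```python
# Sketch of the "encode the needle in several encodings and search for each
# byte pattern" idea. The candidate list is an assumption for illustration.
CANDIDATE_ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def search_all_encodings(haystack: bytes, needle: str,
                         encodings=CANDIDATE_ENCODINGS):
    """Return sorted (offset, encoding) pairs for every byte-level match."""
    hits = []
    for enc in encodings:
        try:
            pattern = needle.encode(enc)
        except UnicodeEncodeError:
            continue  # the needle cannot be represented in this encoding
        start = 0
        while True:
            idx = haystack.find(pattern, start)
            if idx < 0:
                break
            hits.append((idx, enc))
            start = idx + 1  # continue after this hit, allowing overlaps
    return sorted(hits)
```

With this approach the search never has to guess the encoding of the data in the buffer. The cost is one pass over the buffer per candidate encoding.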
> What does the rizin "guess" mode currently do?
The search /z command and the ps command have different "guess string encoding" functions.
ps simply checks whether it can decode a number of consecutive valid code points in any encoding (see rz_str_guess_encoding_from_buffer). It is a little buggy, because UTF-16 is only checked for ASCII code points.
The search does something similar, though more complex (see rz_scan_strings_raw). Common to both is that they return early once they have confirmed 1-2 consecutive valid code points of the same encoding. They assume the first hit is correct. So if, at a given buffer offset, both UTF-16 and UTF-32 could be valid, it just accepts whichever was "matched" first (ignoring the other possibility).
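The first-hit problem can be shown with a simplified model in Python. This is not Rizin's code: the decoders, the candidate order (UTF-16LE before UTF-32LE), the `min_chars` threshold, and the printable-ratio score are all assumptions made for illustration. The scoring variant is just one possible way to compare candidates.

```python
def _is_scalar(cp: int) -> bool:
    # Valid Unicode scalar value: in range and not a surrogate.
    return cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def prefix_utf32le(buf: bytes) -> str:
    """Decode consecutive valid UTF-32LE code points from the start of buf."""
    chars = []
    for i in range(0, len(buf) - 3, 4):
        cp = int.from_bytes(buf[i:i + 4], "little")
        if not _is_scalar(cp):
            break
        chars.append(chr(cp))
    return "".join(chars)


def prefix_utf16le(buf: bytes) -> str:
    """Decode consecutive valid UTF-16LE code points from the start of buf."""
    chars = []
    i = 0
    while i + 2 <= len(buf):
        unit = int.from_bytes(buf[i:i + 2], "little")
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate needs a low surrogate
            if i + 4 > len(buf):
                break
            low = int.from_bytes(buf[i + 2:i + 4], "little")
            if not 0xDC00 <= low <= 0xDFFF:
                break
            chars.append(chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)))
            i += 4
        elif 0xDC00 <= unit <= 0xDFFF:  # unpaired low surrogate
            break
        else:
            chars.append(chr(unit))
            i += 2
    return "".join(chars)


# Candidate order is an assumption; it decides what "first hit" means.
DECODERS = (("utf-16-le", prefix_utf16le), ("utf-32-le", prefix_utf32le))


def guess_first_hit(buf: bytes, min_chars: int = 2):
    """Return the first encoding that yields min_chars valid code points."""
    for name, decode in DECODERS:
        if len(decode(buf)) >= min_chars:
            return name
    return None


def guess_best(buf: bytes, min_chars: int = 2):
    """Check every candidate and prefer the highest share of printable chars."""
    best, best_score = None, 0.0
    for name, decode in DECODERS:
        text = decode(buf)
        if len(text) < min_chars:
            continue
        score = sum(ch.isprintable() for ch in text) / len(text)
        if score > best_score:
            best, best_score = name, score
    return best
```

For the UTF-32LE bytes of "Añ", every 16-bit unit is also a valid code point (each character is followed by U+0000), so `guess_first_hit` in this model settles on UTF-16LE, while `guess_best` picks UTF-32LE because half of the UTF-16 reading is NUL characters.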
I implemented an alternative (see this comment) but we didn't merge it, because it got too big.