
[Feature Request] Matching strings in memory

Open suXinjke opened this issue 2 years ago • 2 comments

As RetroAchievements starts to support more powerful consoles such as PlayStation Portable and PlayStation 2, and maybe GameCube and Wii in the future, achievement developers encounter string-based values more frequently when developing for these consoles.

At the moment, strings are not comfortable to work with: you have to define either string-to-number dictionaries or constants. If you do it by hand, without scripting or some automated conversion, it's error-prone, as the string ABCD is 41 42 43 44 as bytes but 0x44434241 as a number. It becomes even more uncomfortable if your strings take more than 4 bytes: now you have to avoid mistakes with offsets, and instead of representing one string constant, you will likely have to chunk it and pass it around in awkward ways.
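The byte-order pitfall described above can be reproduced with a short sketch. Python is used here purely for illustration (it is not RATools script), and the variable names are made up for the example.

```python
# Bytes of the ASCII string "ABCD" in the order they sit in memory.
raw = "ABCD".encode("ascii")
print(raw.hex(" "))            # 41 42 43 44

# The same four bytes read as one 32-bit little-endian number:
# the first byte in memory becomes the lowest byte of the value.
as_dword = int.from_bytes(raw, "little")
print(hex(as_dword))           # 0x44434241
```

The reversal between the two printed lines is exactly what has to be done by hand today when a string is turned into a numeric constant.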

Matching strings in memory

As strings are a representation of a byte array, why not start with byte array matching:

byte_array = [ 0x43, 0x41, 0x4B, 0x45, 0x32, 0x32 ]
match_byte_array(0xCAFE, byte_array, max_size = 16)
// which is equal to
all_of(range(0, length(byte_array) - 1), idx => byte(0xCAFE + idx) == byte_array[idx])
// which is equal to
0xH00cafe=67_0xH00caff=65_0xH00cb00=75_0xH00cb01=69_0xH00cb02=50_0xH00cb03=50
// or, if optimizer aims to save space, 32-bit for first 4 bytes, then 16-bit for remaining 2 bytes
0xX00cafe=1162559811_0x00cb02=12850
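A minimal sketch of how such an expansion could be computed, written in Python for illustration. The function names are hypothetical, the `0xH` / `0x` / `0xX` size prefixes simply mirror the serialized examples in this issue, and the greedy "largest chunk first" strategy is an assumption rather than how RATools actually optimizes.

```python
# Size in bytes -> serialized prefix, copied from the examples in this issue.
PREFIX = {4: "0xX", 2: "0x", 1: "0xH"}

def per_byte_conditions(address, data):
    """One 8-bit comparison per byte, joined with '_' (AND)."""
    return "_".join(f"0xH{address + i:06x}={b}" for i, b in enumerate(data))

def chunked_conditions(address, data):
    """Greedy chunking: 32-bit reads first, then 16-bit, then 8-bit."""
    parts = []
    offset = 0
    while offset < len(data):
        remaining = len(data) - offset
        size = 4 if remaining >= 4 else 2 if remaining >= 2 else 1
        # Memory is little-endian, so the first byte is the lowest byte.
        value = int.from_bytes(data[offset:offset + size], "little")
        parts.append(f"{PREFIX[size]}{address + offset:06x}={value}")
        offset += size
    return "_".join(parts)

cake = bytes([0x43, 0x41, 0x4B, 0x45, 0x32, 0x32])
print(per_byte_conditions(0xCAFE, cake))
# 0xH00cafe=67_0xH00caff=65_0xH00cb00=75_0xH00cb01=69_0xH00cb02=50_0xH00cb03=50
print(chunked_conditions(0xCAFE, cake))
# 0xX00cafe=1162559811_0x00cb02=12850
```

The per-byte form uses the decimal value of each byte, and the chunked form needs two conditions instead of six for the same match.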

You may notice that I included a max_size parameter; consider it optional with some default value. It's important to have it to prevent accidentally passing big arrays, which may stress RATools. Maybe there isn't much need to let the user specify max_size, and such a limit can be hardcoded internally instead.

Now, this allows the actual string matching:

match_string(0xCAFE, "CAKE22", max_size = 16)
// which will be equal to
match_byte_array(0xCAFE, [ 0x43, 0x41, 0x4B, 0x45, 0x32, 0x32 ], max_size = 16)

And that's it - given byte array matching, the whole job internally is converting the string to a byte array.

Unicode strings have to be kept in mind, though: such strings sometimes exist in games, but I don't think they'd be used for non-presentational purposes. If it happens to be trivial to convert a string to a UTF-8 byte array - why not allow specifying them?
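To illustrate why Unicode needs separate thought, here is a small Python sketch (illustration only, not RATools script). It assumes UTF-8 as the target encoding, as suggested above; a game may store text in a different encoding, which this sketch does not cover.

```python
ascii_text = "CAKE22"
accented_text = "caf\u00e9"   # "café", written with an escape for clarity

# For pure ASCII text, UTF-8 yields exactly one byte per character.
ascii_bytes = ascii_text.encode("utf-8")
print([hex(b) for b in ascii_bytes])
# ['0x43', '0x41', '0x4b', '0x45', '0x32', '0x32']

# Outside ASCII, one character can become several bytes.
accented_bytes = accented_text.encode("utf-8")
print(len(accented_text), len(accented_bytes))   # 4 5
print(accented_bytes.hex(" "))                   # 63 61 66 c3 a9
```

Once a character can span several bytes, character count and byte count diverge, which is why size limits and indexing get harder to define for Unicode strings.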

Alternate representation

If we take inspiration from Cheat Engine and the existing type system, I can also suggest syntax like this:

someData = byte_array(0xCAFE, size = 6)
// you can now refer to it in other places instead of writing match function calls

someData == [ 0x43, 0x41, 0x4B, 0x45, 0x32, 0x32 ]

someString = string(0xCAFE, size = 6)
// you can now refer to it in other places instead of writing match function calls

someString == "CAKE22"

Representing byte arrays and strings this way allows developers to define their own matching functions. If you do it this way, RATools has to check that the array/string you compare to does not exceed the specified size. If the byte array or string you compare to has fewer bytes, compare as if the original byte array or string was defined with that smaller size.
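The size rule proposed above could be sketched like this in Python (illustration only). The function name `bytes_to_compare` is hypothetical, and raising an error for an oversized value is an assumption about how the check might surface.

```python
def bytes_to_compare(declared_size, expected):
    """Return how many bytes an equality check would actually compare.

    declared_size: the size given when the memory reference was defined.
    expected: the byte array (or encoded string) on the other side of ==.
    """
    if len(expected) > declared_size:
        # The value cannot fit in the declared region: reject the comparison.
        raise ValueError(
            f"expected value has {len(expected)} bytes, "
            f"but only {declared_size} were declared"
        )
    # A shorter value shrinks the comparison to its own length.
    return len(expected)

print(bytes_to_compare(6, b"CAKE22"))    # 6
print(bytes_to_compare(12, b"CAKE22"))   # 6
```

Under this rule a reference declared with size 12 still compares only the 6 bytes of "CAKE22", while comparing a 6-byte value against a size-4 reference is rejected.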

Miscellaneous functions

Some additional related proposals, not sure how useful they will be:

String iteration - Unicode problematic

any_of(some_string, x => x == "B")

for x in some_string {
    if (x == "B") {
        // do something
    }
}

String indexing - Unicode problematic

some_string = "ABC"
if (some_string[1] == "B") {
    // do something
}

String to byte array

array = string_to_byte_array("CAKE22") // [ 0x43, 0x41, 0x4B, 0x45, 0x32, 0x32 ]
// so you can do something like
array_push(array, 0xFF) // [ 0x43, 0x41, 0x4B, 0x45, 0x32, 0x32, 0xFF ]

suXinjke avatar Sep 17 '22 10:09 suXinjke

Firstly, I want to explicitly differentiate between ASCII string functions and Unicode string functions. While the toolkit currently only supports ASCII strings, RATools could potentially support both.

Here are my proposals, based on your suggestions:

Matching strings in memory

ascii_string_equals(address, string, length = -1)

This would allow you to match the memory starting at address to the first length bytes of string. The default value for length would be -1 or something similar that would tell the function to just use the length of the string parameter + 1 for the null terminator.

I would expect it to collapse the comparison into 32-bit chunks.

So ascii_string_equals(0xCAFE, "CAKE22") would generate dword(0xCAFE) == 0x454B4143 && tbyte(0xCAFE+4) == 0x3232

Allowing length to be specified allows for not including the null terminator when desirable, or only matching the first part of a string without having to first truncate it in the code.

I don't like the suggestion to have a variable representation of the string in memory (s=string(0xCAFE, 6)) for comparison operators as the only operators that would ever be supported are equality and inequality, and the definition of the comparison logic relies more on what it's being compared to than where the memory is. If you had string(0xCAFE, size=12) == "CAKE22", the compiler could assume it was always false since "CAKE22" is less than 12 characters long. Conversely, ascii_string_equals(0xCAFE, "CAKE22") will only compare the 7 bytes (null terminated) and doesn't care that the memory could hold up to 12 characters.
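A rough Python emulation of the proposed ascii_string_equals expansion, for illustration only. It returns the comparison as text built from the existing byte/word/tbyte/dword accessors; the 4-byte chunking and the handling of length follow the description in this comment, and everything else (the output formatting, silently ignoring a length larger than the string plus terminator) is an assumption.

```python
ACCESSOR = {4: "dword", 3: "tbyte", 2: "word", 1: "byte"}

def ascii_string_equals(address, text, length=-1):
    """Build the comparison for `text` at `address` in chunks of up to 4 bytes."""
    data = text.encode("ascii") + b"\x00"   # string plus null terminator
    if length < 0:
        length = len(data)                  # default: include the terminator
    data = data[:length]

    clauses = []
    for offset in range(0, len(data), 4):
        chunk = data[offset:offset + 4]
        value = int.from_bytes(chunk, "little")
        width = 2 * len(chunk)              # hex digits for this chunk size
        target = f"0x{address:X}" if offset == 0 else f"0x{address:X} + {offset}"
        clauses.append(f"{ACCESSOR[len(chunk)]}({target}) == 0x{value:0{width}X}")
    return " && ".join(clauses)

print(ascii_string_equals(0xCAFE, "CAKE22"))
# dword(0xCAFE) == 0x454B4143 && tbyte(0xCAFE + 4) == 0x003232
print(ascii_string_equals(0xCAFE, "CAKE22", 6))
# dword(0xCAFE) == 0x454B4143 && word(0xCAFE + 4) == 0x3232
```

With the default length, the trailing chunk covers the two "2" characters plus the null terminator, which is why a 24-bit read is used; passing 6 drops the terminator and the last chunk becomes a 16-bit read.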

String iteration, indexing, and conversion to byte array

There's a separate issue for indexing and iterating strings: #141. In my response there, I recommend having a way to convert the string to an array of integers and use that for indexing and iterating. As such, I would propose:

ascii_string_to_byte_array(string)

The example provided in #141 could also be handled by the functionality I described above: ascii_string_equals(0x1234, "game over", 9), and I feel that having ascii_string_equals would supersede #141.

The examples you've provided for string iteration are mixing runtime logic and script processing time logic. any_of generates runtime logic. You can't use it at processing time, so while any_of(string, x => x == "B") could be written as any_of(range(string_start, string_start + string_length - 1), a => byte(a) == 0x42), it couldn't tell you if "CABLE" contained "B" when processing the script. For that, I'd recommend a string_contains function that looked at script strings. If a helper function was desirable to do the same thing at runtime, maybe something like ascii_string_find(range_start, range_length, string).

Splitting

The other major string function currently missing from RATools is splitting. There's no way to extract a substring (or character) from a string. I'm not sure how often you'd want to do this, but the substring functionality would fit better with individual character matching, as RATools doesn't have the concept of a character data type and you would just be comparing to a string with length 1.

I think splitting would be sufficient for your example of somestring[1] == "B", as "B" is a string, so you could do substring(somestring, 1, 1) == "B". For a runtime check, you'd just have to do the index offset yourself: ascii_string_equals(address + 1, "B", 1)

Jamiras avatar Sep 17 '22 18:09 Jamiras

I don't like ascii_string_equals implicitly adding a null terminator. I don't believe achievement devs will often work with entire strings including the null terminator. If we speak of level ID strings etc. - they are usually brief enough, and/or a beginning/substring portion of them is distinct enough to represent with just 4 bytes or less. So if I happen not to read the docs well enough to notice the size thing, and if I also don't know C programming, the default behavior of adding a null terminator may be a WTF moment.

Maybe move the implicit null terminator behavior into a separate function, like ascii_cstring_equals?

Also, you didn't address RATools behavior if length is abnormally big: should there be an internal hardcoded string size limit?


About any_of, I didn't have any mixing in mind, but I wasn't clear enough and my example sucked, here's a better one:

any_of(some_string, x => x == byte(someAddress))

To check whether the byte matches any letter of the specified string.

It'd implicitly convert the string to a byte array, but being implicit is bad there, especially when we don't know if ASCII or Unicode is expected. So it has to stay explicit:

any_of(ascii_string_to_byte_array(some_string), x => x == byte(someAddress))
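As a sketch of what that any_of call would amount to at runtime, here is a Python illustration. The function name `any_char_matches` and the textual output format are hypothetical; dropping duplicate characters is an assumption, not described behavior.

```python
def any_char_matches(address, text):
    """Expand 'byte at address equals any character of text' into OR clauses."""
    # Explicit ASCII conversion; dict.fromkeys removes duplicates, keeps order.
    codes = list(dict.fromkeys(text.encode("ascii")))
    return " || ".join(f"byte(0x{address:X}) == 0x{code:02X}" for code in codes)

print(any_char_matches(0x1234, "ABC"))
# byte(0x1234) == 0x41 || byte(0x1234) == 0x42 || byte(0x1234) == 0x43
```

Because the conversion is explicitly ASCII, a non-ASCII character in the string fails loudly (Python raises UnicodeEncodeError) instead of silently producing unexpected bytes.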


The substring function idea allows for more flexibility - ok, but array-like indexing seems like a classic programming feature that RATools would then lack. A substring function call to get only one character is verbose, but I can live with verbose.


I'm fine with ascii / unicode separation


> If you had string(0xCAFE, size=12) == "CAKE22", the compiler could assume it was always false since "CAKE22" is less than 12 characters long.

I probably should've named the argument max_size instead of size, and the equality comparison would act similarly to ascii_string_equals, but that's implicit, and thinking about what != should do gives me a headache. This makes me think it's better to keep the explicit comparison functions instead.

suXinjke avatar Sep 18 '22 15:09 suXinjke