f2e-spec Check if song text is not ascii

trafficstars

What does this PR do?

This PR checks if song text contains non ASCII respecting UTF8-BOM. This prevents that performous crashes while loading defect songs.

Closes Issue(s)

None.

Motivation

Since I use USDB-Syncer there are defect songs in my song database. This causes that performous crashes while trying to load header data.

Feb 17 '24 10:02 twollgam

Download the artifacts for this pull request:

This service is provided by nightly.link. These artifacts will expire in 90 days and will not be available for download after that time.

Feb 17 '24 11:02 github-actions[bot]

Do you have such a song available? USDB and ultimately USDB_Syncer should deliver txt files in UTF8 (without BOM) as that's been decided on in regard of the Format project

Feb 26 '24 12:02 Baklap4

For example: But there are only 10k times 0xff. How ever this could happen?

Here is another example: Now there are 7k zeros in the txt-file.

Feb 26 '24 20:02 twollgam

Are you taking into account the fact that unicode.cc looks for a BOM and actively removes it if it finds it?

Mar 10 '24 11:03 Lord-Kamina

Are you taking into account the fact that unicode.cc looks for a BOM and actively removes it if it finds it?

Yes, the BOM is ignored if it is found but it is not necessary. (That is important, cause at the place where the function is used BOM occurred and those files must not be filtered out ;-) )

Mar 10 '24 13:03 twollgam

The utf8 bom is by definition not necessary.

I'm not really sure what problem you're trying to fix.

Mar 10 '24 17:03 Lord-Kamina

Look at the description. USDB-Syncer saves why ever binary data in lyrics files that causes Performous to crash.

Mar 10 '24 19:03 twollgam

The MR functionality itself looks good to me. I would, however, prefer to use an utf8 check function like the following, making the utf-8 parsing more readable IMO. It should also work on files with utf-8 BOM, as it's a valid utf-8 'character'. I would say checking for ascii/utf-8 in a sufficient length of std::string should sufficient to reject binary, all 0x00 and 0xff files, although would like your opinion on this.

bool isUTF8(const std::string& text, size_t bytesToCheck) {
    bytesToCheck = std::min(bytesToCheck, text.size());
    for (size_t i = 0; i < bytesToCheck; ) {
        if ((text[i] & 0b10000000) == 0) {
            // Single byte character (0xxxxxxx)
            i++;
        } else if ((text[i] & 0b11100000) == 0b11000000) {
            // Two byte character (110xxxxx 10xxxxxx)
            if (i + 1 >= text.size() || (text[i + 1] & 0b11000000) != 0b10000000)
                return false;
            i += 2;
        } else if ((text[i] & 0b11110000) == 0b11100000) {
            // Three byte character (1110xxxx 10xxxxxx 10xxxxxx)
            if (i + 2 >= text.size() || 
                (text[i + 1] & 0b11000000) != 0b10000000 || 
                (text[i + 2] & 0b11000000) != 0b10000000)
                return false;
            i += 3;
        } else if ((text[i] & 0b11111000) == 0b11110000) {
            // Four byte character (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)
            if (i + 3 >= text.size() || 
                (text[i + 1] & 0b11000000) != 0b10000000 || 
                (text[i + 2] & 0b11000000) != 0b10000000 || 
                (text[i + 3] & 0b11000000) != 0b10000000)
                return false;
            i += 4;
        } else {
            // Invalid UTF-8 sequence
            return false;
        }
    }
    return true;
}

Mar 14 '24 12:03 HetorusNL

Your suggestion is good and takes more respect to the utf-8 encoding. I take it and added the ascii check and helper functions to avoid the code repetition:

bool isSingleByteCharacter(unsigned char c) {
    return (c & 0b10000000) == 0;
}

bool isTwoByteCharacter(unsigned char c) {
    return (c & 0b11100000) == 0b11000000;
}

bool isThreeByteCharacter(unsigned char c) {
    return (c & 0b11110000) == 0b11100000;
}

bool isFourByteCharacter(unsigned char c) {
    return (c & 0b11100000) == 0b11000000;
}

bool isInvalidFollowByte(unsigned char c) {
    return (c & 0b11000000) != 0b10000000;
}

bool isASCII(unsigned char c) {
    return c >= 32 || std::isspace(c);
}

bool isText(const std::string& text, size_t bytesToCheck) {
    bytesToCheck = std::min(bytesToCheck, text.size());
    for (size_t i = 0; i < bytesToCheck; ++i) {
        if (isSingleByteCharacter(text[i])) {
            if (!isASCII(text[i]))
                return false;
        }
        else if (isTwoByteCharacter(text[i])) {
            if (i + 1 >= text.size() || isInvalidFollowByte(text[++i]))
                return false;
        }
        else if (isThreeByteCharacter(text[i])) {
            if (i + 2 >= text.size() || isInvalidFollowByte(text[++i]) || isInvalidFollowByte(text[++i]))
                return false;
        }
        else if (isFourByteCharacter(text[i])) {
            if (i + 3 >= text.size() || isInvalidFollowByte(text[++i]) || isInvalidFollowByte(text[++i]) || isInvalidFollowByte(text[++i]))
                return false;
        }
        else // Invalid UTF-8 sequence
            return false;
    }

    return true;
}

@HetorusNL I think it needs some type conversions for the mac build. But do you think we can use the reworked version of your code?

Mar 14 '24 18:03 twollgam

BTW: There are two additional tests for the check of boundaries. 👍

Mar 14 '24 19:03 twollgam

The changes look good. You might want to add a/some 3-byte and 4-byte utf-8 strings to the test cases, also somewhere in the middle of the string to verify the i-increments, I guess. Regarding type conversion for the mac build, I don't know; haven't built anything for mac before 😅

Mar 17 '24 23:03 HetorusNL

@HetorusNL good job 👍 In the 4-byte code there was a copy&paste error covered by the new unit tests. Thanks. I hope it is now ready to test and merge.

Mar 18 '24 22:03 twollgam

@ooshlablu could you test this one. Test hints:

This PR filters out songs with broken song text files (else Performous crashed). So the number of songs available in performous should not differ in this branch compared to master branch.

Mar 18 '24 22:03 twollgam

I tried just loading the songs with the version from master and this PR on Ubuntu 20.04 & 22.04 and the number of songs loaded on both is the same, so it seems like it works. I doubt these changes will affect game play at all, but I will try to at least give it a sing-test tomorrow.

Mar 19 '24 01:03 ooshlablu

I just did a sing-test on Ubuntu 20.04, and everything looks normal. Will try to get a test with an instrument at least in a little bit

Mar 19 '24 13:03 ooshlablu

@ooshlablu how far is your testing? If it is finished I will merge the branch. So please give me feedback. TIA

Mar 22 '24 18:03 twollgam

@twollgam I just tested with all the instruments, looks good to go! I have those songs you sent me in place also, and it ignores them like it should. :+1:

Mar 23 '24 17:03 ooshlablu

f2e-spec f2e-spec copied to clipboard

Check if song text is not ascii

What does this PR do?

Closes Issue(s)

Motivation

f2e-spec
f2e-spec copied to clipboard