f2e-spec
f2e-spec copied to clipboard
Check if song text is not ascii
What does this PR do?
This PR checks if song text contains non ASCII respecting UTF8-BOM. This prevents that performous crashes while loading defect songs.
Closes Issue(s)
None.
Motivation
Since I use USDB-Syncer there are defect songs in my song database. This causes that performous crashes while trying to load header data.
Download the artifacts for this pull request:
- Performous-1.3.1-970-49214bf-alpha-ubuntu_22.04.deb.zip
- Performous-1.3.1-970-49214bf-alpha-debian_11.deb.zip
- Performous-1.3.1-970-49214bf-alpha-fedora_36.rpm.zip
- Performous-1.3.1-970-49214bf-alpha-fedora_39.rpm.zip
- Performous-1.3.1-970-49214bf-alpha-fedora_37.rpm.zip
- Performous-1.3.1-970-49214bf-alpha-fedora_35.rpm.zip
- Performous-1.3.1-970-49214bf-alpha-fedora_38.rpm.zip
- Performous-1.3.1-970-49214bf-alpha-debian_12.deb.zip
- Performous-1.3.1-970-49214bf-alpha.AppImage.zip
- Performous-1.3.1-970-49214bf-alpha-debian_10.deb.zip
- Performous-1.3.1-970-49214bf-alpha-ubuntu_20.04.deb.zip
- Performous-1.3.1-970-49214bf-alpha-msvc.exe.zip
- Performous-1.3.1-970-49214bf-alpha.dmg.zip
- Performous-1.3.1-970-49214bf-alpha-mingw-w64.exe.zip
This service is provided by nightly.link. These artifacts will expire in 90 days and will not be available for download after that time.
Do you have such a song available? USDB and ultimately USDB_Syncer should deliver txt files in UTF8 (without BOM) as that's been decided on in regard of the Format project
For example:
But there are only 10k times 0xff. How ever this could happen?
Here is another example:
Now there are 7k zeros in the txt-file.
Are you taking into account the fact that unicode.cc looks for a BOM and actively removes it if it finds it?
Are you taking into account the fact that unicode.cc looks for a BOM and actively removes it if it finds it?
Yes, the BOM is ignored if it is found but it is not necessary. (That is important, cause at the place where the function is used BOM occurred and those files must not be filtered out ;-) )
The utf8 bom is by definition not necessary.
I'm not really sure what problem you're trying to fix.
Look at the description. USDB-Syncer saves why ever binary data in lyrics files that causes Performous to crash.
The MR functionality itself looks good to me. I would, however, prefer to use an utf8 check function like the following, making the utf-8 parsing more readable IMO. It should also work on files with utf-8 BOM, as it's a valid utf-8 'character'. I would say checking for ascii/utf-8 in a sufficient length of std::string should sufficient to reject binary, all 0x00 and 0xff files, although would like your opinion on this.
bool isUTF8(const std::string& text, size_t bytesToCheck) {
bytesToCheck = std::min(bytesToCheck, text.size());
for (size_t i = 0; i < bytesToCheck; ) {
if ((text[i] & 0b10000000) == 0) {
// Single byte character (0xxxxxxx)
i++;
} else if ((text[i] & 0b11100000) == 0b11000000) {
// Two byte character (110xxxxx 10xxxxxx)
if (i + 1 >= text.size() || (text[i + 1] & 0b11000000) != 0b10000000)
return false;
i += 2;
} else if ((text[i] & 0b11110000) == 0b11100000) {
// Three byte character (1110xxxx 10xxxxxx 10xxxxxx)
if (i + 2 >= text.size() ||
(text[i + 1] & 0b11000000) != 0b10000000 ||
(text[i + 2] & 0b11000000) != 0b10000000)
return false;
i += 3;
} else if ((text[i] & 0b11111000) == 0b11110000) {
// Four byte character (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)
if (i + 3 >= text.size() ||
(text[i + 1] & 0b11000000) != 0b10000000 ||
(text[i + 2] & 0b11000000) != 0b10000000 ||
(text[i + 3] & 0b11000000) != 0b10000000)
return false;
i += 4;
} else {
// Invalid UTF-8 sequence
return false;
}
}
return true;
}
Your suggestion is good and takes more respect to the utf-8 encoding. I take it and added the ascii check and helper functions to avoid the code repetition:
bool isSingleByteCharacter(unsigned char c) {
return (c & 0b10000000) == 0;
}
bool isTwoByteCharacter(unsigned char c) {
return (c & 0b11100000) == 0b11000000;
}
bool isThreeByteCharacter(unsigned char c) {
return (c & 0b11110000) == 0b11100000;
}
bool isFourByteCharacter(unsigned char c) {
return (c & 0b11100000) == 0b11000000;
}
bool isInvalidFollowByte(unsigned char c) {
return (c & 0b11000000) != 0b10000000;
}
bool isASCII(unsigned char c) {
return c >= 32 || std::isspace(c);
}
bool isText(const std::string& text, size_t bytesToCheck) {
bytesToCheck = std::min(bytesToCheck, text.size());
for (size_t i = 0; i < bytesToCheck; ++i) {
if (isSingleByteCharacter(text[i])) {
if (!isASCII(text[i]))
return false;
}
else if (isTwoByteCharacter(text[i])) {
if (i + 1 >= text.size() || isInvalidFollowByte(text[++i]))
return false;
}
else if (isThreeByteCharacter(text[i])) {
if (i + 2 >= text.size() || isInvalidFollowByte(text[++i]) || isInvalidFollowByte(text[++i]))
return false;
}
else if (isFourByteCharacter(text[i])) {
if (i + 3 >= text.size() || isInvalidFollowByte(text[++i]) || isInvalidFollowByte(text[++i]) || isInvalidFollowByte(text[++i]))
return false;
}
else // Invalid UTF-8 sequence
return false;
}
return true;
}
@HetorusNL I think it needs some type conversions for the mac build. But do you think we can use the reworked version of your code?
BTW: There are two additional tests for the check of boundaries. 👍
The changes look good. You might want to add a/some 3-byte and 4-byte utf-8 strings to the test cases, also somewhere in the middle of the string to verify the i-increments, I guess. Regarding type conversion for the mac build, I don't know; haven't built anything for mac before 😅
@HetorusNL good job 👍 In the 4-byte code there was a copy&paste error covered by the new unit tests. Thanks. I hope it is now ready to test and merge.
@ooshlablu could you test this one. Test hints:
This PR filters out songs with broken song text files (else Performous crashed). So the number of songs available in performous should not differ in this branch compared to master branch.
I tried just loading the songs with the version from master and this PR on Ubuntu 20.04 & 22.04 and the number of songs loaded on both is the same, so it seems like it works. I doubt these changes will affect game play at all, but I will try to at least give it a sing-test tomorrow.
I just did a sing-test on Ubuntu 20.04, and everything looks normal. Will try to get a test with an instrument at least in a little bit
@ooshlablu how far is your testing? If it is finished I will merge the branch. So please give me feedback. TIA
@twollgam I just tested with all the instruments, looks good to go! I have those songs you sent me in place also, and it ignores them like it should. :+1: