fix: Use correct surrogate pairs for emojis
The android implementation of the transformer encodes the emoji in a form called "surrogate", but aapt and apktool do not understand the surrogate form. Thus, detect surrogate and recompute the traditional code point value before writing it to file allowing aapt and apktool to rebuild the app successfully.
It should fix some cases (mainly for google app) of https://github.com/ReVanced/revanced-manager/issues/2142
For reference, this different solution appears to fix the same issue by exporting to a utf16 string then recoding to utf8.
At first glance I'm not sure if it's better or worse than the solution in this PR.
I tried the modifications in the abandoned PR without success for the rebuild of the google phone app.
Should I rebase my branch on dev ?
Rebasing is best since it's merging to dev.
I dislike band-aid solutions. For a couple of painful days I am experimenting a proper solution such as using a different transformer implementation like Saxon. Hopefully a proper solution can be found. If not, I'd prefer to outsource this fix-up solution as a wrapped document() API in patches repo
I agree, but the cleanest would be to fix the android transformer (to keep unicode point) or the aapt code (to handle surrogate form). I was not sure about introducing a new lib in the source code. Thus, I opted for the less impacting solution.
The wrapped solution or a argument for document() could be possible to. I could update the PR for this.
The core of the issue lies in the Android SDK based on OpenJDK, which in turn include Apache Xalan code. Java represents characters using UTF-16 (source). As a result, high Unicode code points are split into surrogate pairs (two char values). When parsed by Xalan's TreeWalker (link), these pairs are processed as two separate characters.
Subsequently, ToXMLStream or ToStream fails to recognize the two utf16 chars as a single Unicode character and encodes each part separately, producing an incorrect surrogate representation. This is a known issue, tracked by the following tickets:
- https://issues.apache.org/jira/browse/XALANJ-2560
- https://issues.apache.org/jira/browse/XALANJ-2419
The ToHTMLStream has been fixed (PR #163), but no release has been published since 2023, so the fix is not available in official SDKs like Android or OpenJDK.
Conclusion: A fix for the Xalan transformer is unlikely to be available in the near future.
I also explored using the Saxon-HE transformer. Unfortunately, Saxon is not designed to support Android. During transformer creation, it depends on javax.xml.transform.stax and javax.xml.stream—two packages that Google chose not to include in Android’s core Java libraries. You can verify this in the list of supported packages: https://developer.android.com/reference/packages.html
To use Saxon, we would need to port those two packages to Android, which is likely beyond the scope of this repository.
That leaves us with three options:
- Adopt a third-party XSLT transformer that supports Android. But which one ?
- Use the regex-based approach proposed in this PR. On 10 runs I have an average of 1m21s for patching the google phone app
- Convert the full string to UTF-16 and back to UTF-8, letting the cast merge surrogate bytes correctly On 10 runs I have an average of 1m18s for patching the google phone app
Regex based : https://github.com/Taknok/revanced-patcher/commit/c20e804b2b1a3672c1df79e96e2fb456b42de54a UTF casting : https://github.com/Taknok/revanced-patcher/commit/ba57bcbaa13349e7ff1c1ddf74d4ff929438cbf8
@oSumAtrIX What would be the best ? the 3rd option with an optional parameter for the context.document(resource, enforce_utf8=true).use... ?
By default the enforce would be false
I was thinking of a more efficient solution which is to hook the output stream and modify it. This avoids writing to string, modifying it, then writing it to the file:
ChatGPT gave this as a starter:
import java.io.OutputStream
import java.io.Writer
import java.io.OutputStreamWriter
class UnicodeCorrectingWriter(outputStream: OutputStream, charset: String = "UTF-8") : Writer() {
private val writer = OutputStreamWriter(outputStream, charset)
private var pendingHighSurrogate: Char? = null
override fun write(cbuf: CharArray, off: Int, len: Int) {
for (i in off until off + len) {
val ch = cbuf[i]
if (Character.isHighSurrogate(ch)) {
pendingHighSurrogate = ch
} else if (Character.isLowSurrogate(ch) && pendingHighSurrogate != null) {
val codePoint = Character.toCodePoint(pendingHighSurrogate!!, ch)
writer.write(String(Character.toChars(codePoint)))
pendingHighSurrogate = null
} else {
if (pendingHighSurrogate != null) {
// Write invalid pending high surrogate as-is
writer.write(pendingHighSurrogate!!.toInt())
pendingHighSurrogate = null
}
writer.write(ch.toInt())
}
}
}
override fun flush() {
if (pendingHighSurrogate != null) {
writer.write(pendingHighSurrogate!!.toInt())
pendingHighSurrogate = null
}
writer.flush()
}
override fun close() {
flush()
writer.close()
}
}
I haven't tested it but maybe this works. Would merge if this approach can be used.
I will look into it, thank for the hint
After quick look, the code never enter the highsurrogate, thus it never correctly encode the emoji.
if (Character.isHighSurrogate(ch)) {
pendingHighSurrogate = ch
Furthermore it takes more than 4min to reach the crash when repacking the res. I will continue to look into it to make it works, but it seems to not be the best performance strategy.
So unfortunately, the chatGPT method does not work as is, as the ToStream serializer write char by char to the stream when we reach this block : https://android.googlesource.com/platform/external/apache-xml/+/650a6cfd4d6b2d38b88ada03694ae19cc448d07b/src/main/java/org/apache/xml/serializer/ToStream.java#1609 The writer has to record the state of the potential surrogate and process the logic (as done in the xalan updated ToStream.java code). The performance of this method is the lower among all. It tooks 1m42s average (when ~1m20s for the others). Due to the complexity of the writer, this method seems to be the less futur proof as it depends of the calls of ToStream.
My tracking branch: https://github.com/Taknok/revanced-patcher/commit/ca2de8eee31b284ec0ca282ee57a35dd660fe7fd
My personnal ranking of the techniques:
- UTF casting https://github.com/Taknok/revanced-patcher/commit/ba57bcbaa13349e7ff1c1ddf74d4ff929438cbf8
- regex https://github.com/Taknok/revanced-patcher/commit/c20e804b2b1a3672c1df79e96e2fb456b42de54a
- writer https://github.com/Taknok/revanced-patcher/commit/ca2de8eee31b284ec0ca282ee57a35dd660fe7fd
Available on discord if needed.
- UTF casting https://github.com/Taknok/revanced-patcher/commit/ba57bcbaa13349e7ff1c1ddf74d4ff929438cbf8
Wait, does setOutputProperty(OutputKeys.ENCODING, "UTF-16") work on Android Runtime!?
Yeah, I also confirmed with Android 6.0 and 15. I didn’t know that at all...
Because it doesn't work on JRE.
On JRE, setOutputProperty(OutputKeys.ENCODING, "XXX") only has an effect when a Document instance is created on the memory.
If a Document is created from an existing XML file, it will have no effect.
References: https://discord.com/channels/952946952348270622/953965039105232906/1333603549904375909 https://stackoverflow.com/questions/15592025/transformer-setoutputpropertyoutputkeys-encoding-utf-8-is-not-working https://bugs.openjdk.org/browse/JDK-8227616
Then, you don't need to use StringWriter or convert to UTF-8.
AAPT2's XML parser understands encoding="UTF-16" and can compile UTF-16 XML files successfully.
So just output with UTF-16.
Adding just one line will fix the issue.
TransformerFactory.newInstance()
.newTransformer()
.apply { setOutputProperty(OutputKeys.ENCODING, "UTF-16") } // Add like this?
.transform(DOMSource(this), StreamResult(stream))
(a condition for checking AndroidRuntime could be added)
If this is all and a conditional check is added, I'd merge it with a comment referencing the Android issue tracker
There is not surrogate issue on JRE as the openjdk correctly implement the conversion. Thus the conditional check is still needed ?
I will try without the UTF8 , but I think y my tests it was not working without
Here are the requested modifications. Tell me if you want some adjustments.
It worked! I patched GPhotos on Manager with Patches 5.19.0 (the last patches version that photos compiling is not fixed) successfully.
You can also use google phone app with this patch to check if issue is fixed : https://github.com/Taknok/revanced-patches
Nope, outputting in UTF-16 caused problems when patching other apps Converting to UTF-8 is probably necessary
Sorry for jumping to conclusions...
- Device Info
ReVanced Manager: 1.25.0-dev.1
Model: Pixel 3
Android version: 15
Supported architectures: arm64-v8a, armeabi-v7a, armeabi
Root permissions: Yes
- Patch Info
App: com.google.android.youtube v20.12.46 (Suggested: 20.12.46)
Patches version: v5.22.0
Patches added: Default
Patches removed: None
Default patch options changed: None
- Settings
Allow changing patch selection: false
Version compatibility check: true
Show universal patches: false
Patches source: revanced/revanced-patches
- Logs
Reading APK
Decoding app manifest
Loading patches
Deleting existing temporary files directory
Decoding resources
Initializing lookup maps
Executing patches
Applied 52 patches
GmsCore support failed: app.revanced.patcher.patch.PatchException: The patch "GmsCore support" depends on "ResourcePatch", which raised an exception:
app.revanced.patcher.patch.PatchException: Unexpected token (position:TEXT ��
Draft waiting UTF8 revert
In the end, we went back to the approach mentioned in LisoUseInAIKyrios's first comment.
But I'm thinking of using a FileWriter directly instead of using a StringWriter. Since FileWriter uses the system default encoding, and Android's default encoding is always UTF-8, I expect it to write in UTF-8.
If this were possible, the serialized result would be written directly to a file, so there would be no performance impact. I'll try this tomorrow
This function patches youtube and google phone without error:
override fun close() {
file?.let {
if (readerCount[it]!! > 1) {
throw IllegalStateException(
"Two or more instances are currently reading $it." +
"To be able to close this instance, no other instances may be reading $it at the same time.",
)
} else {
readerCount.remove(it)
}
val transformer = TransformerFactory.newInstance().newTransformer().apply {
if (isAndroid) {
setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes")
setOutputProperty(OutputKeys.ENCODING, "UTF-16")
}
}
it.writer().use { writer ->
transformer.transform(DOMSource(this), StreamResult(writer))
}
}
}
Why is
setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes")
needed?
Because we want to have an UTF-8 XML at the end, thus we have to explicitly say to not write the encoding as we perform the conversion : it.writer() set by default a charsets.UTF-8
So as I understand the transformer writes as utf16 and then the writer as utf8? Why are we omitting the xml declaration. It doesn't have to do anything with encoding from what I am seeing
The origin:
- App
- Apktool decode XML to UTF-8
- Java read UTF-8 and set it well in memory, string IB memory handles wells UTF-8. When writing to a file on the string it calls .toCharsArray() which encode it into a 16bits chars list (but some unicode needs overflow and are split into 2 16bits chars) 4.1 openjdk, when writing, is aware of this potential issue. So it handle the writing by checking if the 16chars is a high point unicode and if the encoding is UTF-8, if yes, it goes to a special flow writing the 16charsunicode. 4.2 Android is not aware of it, it thinks that it is handling UTF-8 that are only 16chars, if this unicode is unknown, so it writes it as an entity as would do JavaScript.
- Apktool recompiles the res, it do not understand surrogate unicode, if it reads one (like in Android senario) it crashes.
The fix:
- In memory we have a UTF-8 string with correct high point emoji.
- The toStream class will be called and split the strings with toCharsArray() (else we need to rewrite the whole ToStream from xalan, I tried this but there is private class etc in xalan increasing complexity and code copy).
- We tell to the writer that it will handle 16 bits char (UTF-16), but the charset used will be UTF-8 (default value of the file writers), so the output file should not have an xml header forcing to UTF-18 as we used UTF-8 charset.
- Aapt2 reads the file with UTF-8 and high point encoded value.
If you want to follow the flow in the debugging mode, you have to set a breaking in the characters() function for the ToStream.java class (called by the xml TreeWalker.java) and follow step by step the code for each case
So as I understand the transformer writes as utf16 and then the writer as utf8?
Yes.
We are omitting because the resulting xml file is written with UTF-8 and aapt2 will read using UTF-8. Setting the headers in the xml file may leads to an aapt2 reading UTF-8 value thinking it is utf-16 (as declared in the headers)
I know this is hacky, but if google was consistent and keeps only 16bits chars with surrogate support OR UTF-8 correct writing this would not be needed :/
Can't you set the encoding in the header to utf8 explicitly but write in utf16?
I did not find a way to do it, but I also did not find a proof that is not feasible.
For JRE, there is no need to use Writer I suggest this code Also I tried adding some comments, I hope it helps
val transformer = TransformerFactory.newInstance().newTransformer()
if (isAndroid) {
// Set to UTF-16 to prevent surrogate pairs from being escaped to broken numeric character references.
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16")
// The XML declaration will have encoding="UTF-16", but we're going to write it back in UTF-8, so omit it.
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes")
// Use FileWriter to output the XML file in UTF-8 encoding.
it.writer().use { writer ->
transformer.transform(DOMSource(this), StreamResult(writer))
}
} else {
transformer.transform(DOMSource(this), StreamResult(it))
}
Perf test with youtube and default patch: Manager main branch + patcher main with surrogate fix : 334s Manager main branch + patcher main branch vanilla : 349s