revanced-patcher icon indicating copy to clipboard operation
revanced-patcher copied to clipboard

fix: Use correct surrogate pairs for emojis

Open Taknok opened this issue 9 months ago • 31 comments

The android implementation of the transformer encodes the emoji in a form called "surrogate", but aapt and apktool do not understand the surrogate form. Thus, detect surrogate and recompute the traditional code point value before writing it to file allowing aapt and apktool to rebuild the app successfully.

It should fix some cases (mainly for google app) of https://github.com/ReVanced/revanced-manager/issues/2142

Taknok avatar Apr 19 '25 18:04 Taknok

For reference, this different solution appears to fix the same issue by exporting to a utf16 string then recoding to utf8.

At first glance I'm not sure if it's better or worse than the solution in this PR.

LisoUseInAIKyrios avatar Apr 19 '25 18:04 LisoUseInAIKyrios

I tried the modifications in the abandoned PR without success for the rebuild of the google phone app.

Should I rebase my branch on dev ?

Taknok avatar Apr 19 '25 19:04 Taknok

Rebasing is best since it's merging to dev.

LisoUseInAIKyrios avatar Apr 19 '25 19:04 LisoUseInAIKyrios

I dislike band-aid solutions. For a couple of painful days I am experimenting a proper solution such as using a different transformer implementation like Saxon. Hopefully a proper solution can be found. If not, I'd prefer to outsource this fix-up solution as a wrapped document() API in patches repo

oSumAtrIX avatar Apr 19 '25 20:04 oSumAtrIX

I agree, but the cleanest would be to fix the android transformer (to keep unicode point) or the aapt code (to handle surrogate form). I was not sure about introducing a new lib in the source code. Thus, I opted for the less impacting solution.

The wrapped solution or a argument for document() could be possible to. I could update the PR for this.

Taknok avatar Apr 19 '25 21:04 Taknok

The core of the issue lies in the Android SDK based on OpenJDK, which in turn include Apache Xalan code. Java represents characters using UTF-16 (source). As a result, high Unicode code points are split into surrogate pairs (two char values). When parsed by Xalan's TreeWalker (link), these pairs are processed as two separate characters.

Subsequently, ToXMLStream or ToStream fails to recognize the two utf16 chars as a single Unicode character and encodes each part separately, producing an incorrect surrogate representation. This is a known issue, tracked by the following tickets:

  • https://issues.apache.org/jira/browse/XALANJ-2560
  • https://issues.apache.org/jira/browse/XALANJ-2419

The ToHTMLStream has been fixed (PR #163), but no release has been published since 2023, so the fix is not available in official SDKs like Android or OpenJDK.

Conclusion: A fix for the Xalan transformer is unlikely to be available in the near future.

I also explored using the Saxon-HE transformer. Unfortunately, Saxon is not designed to support Android. During transformer creation, it depends on javax.xml.transform.stax and javax.xml.stream—two packages that Google chose not to include in Android’s core Java libraries. You can verify this in the list of supported packages: https://developer.android.com/reference/packages.html

To use Saxon, we would need to port those two packages to Android, which is likely beyond the scope of this repository.

That leaves us with three options:

  1. Adopt a third-party XSLT transformer that supports Android. But which one ?
  2. Use the regex-based approach proposed in this PR. On 10 runs I have an average of 1m21s for patching the google phone app
  3. Convert the full string to UTF-16 and back to UTF-8, letting the cast merge surrogate bytes correctly On 10 runs I have an average of 1m18s for patching the google phone app

Regex based : https://github.com/Taknok/revanced-patcher/commit/c20e804b2b1a3672c1df79e96e2fb456b42de54a UTF casting : https://github.com/Taknok/revanced-patcher/commit/ba57bcbaa13349e7ff1c1ddf74d4ff929438cbf8

@oSumAtrIX What would be the best ? the 3rd option with an optional parameter for the context.document(resource, enforce_utf8=true).use... ? By default the enforce would be false

Taknok avatar Apr 22 '25 23:04 Taknok

I was thinking of a more efficient solution which is to hook the output stream and modify it. This avoids writing to string, modifying it, then writing it to the file:

ChatGPT gave this as a starter:

import java.io.OutputStream
import java.io.Writer
import java.io.OutputStreamWriter

class UnicodeCorrectingWriter(outputStream: OutputStream, charset: String = "UTF-8") : Writer() {
    private val writer = OutputStreamWriter(outputStream, charset)
    private var pendingHighSurrogate: Char? = null

    override fun write(cbuf: CharArray, off: Int, len: Int) {
        for (i in off until off + len) {
            val ch = cbuf[i]
            if (Character.isHighSurrogate(ch)) {
                pendingHighSurrogate = ch
            } else if (Character.isLowSurrogate(ch) && pendingHighSurrogate != null) {
                val codePoint = Character.toCodePoint(pendingHighSurrogate!!, ch)
                writer.write(String(Character.toChars(codePoint)))
                pendingHighSurrogate = null
            } else {
                if (pendingHighSurrogate != null) {
                    // Write invalid pending high surrogate as-is
                    writer.write(pendingHighSurrogate!!.toInt())
                    pendingHighSurrogate = null
                }
                writer.write(ch.toInt())
            }
        }
    }

    override fun flush() {
        if (pendingHighSurrogate != null) {
            writer.write(pendingHighSurrogate!!.toInt())
            pendingHighSurrogate = null
        }
        writer.flush()
    }

    override fun close() {
        flush()
        writer.close()
    }
}

I haven't tested it but maybe this works. Would merge if this approach can be used.

oSumAtrIX avatar May 02 '25 13:05 oSumAtrIX

I will look into it, thank for the hint

Taknok avatar May 02 '25 13:05 Taknok

After quick look, the code never enter the highsurrogate, thus it never correctly encode the emoji.

if (Character.isHighSurrogate(ch)) {
  pendingHighSurrogate = ch

Furthermore it takes more than 4min to reach the crash when repacking the res. I will continue to look into it to make it works, but it seems to not be the best performance strategy.

Taknok avatar May 03 '25 13:05 Taknok

So unfortunately, the chatGPT method does not work as is, as the ToStream serializer write char by char to the stream when we reach this block : https://android.googlesource.com/platform/external/apache-xml/+/650a6cfd4d6b2d38b88ada03694ae19cc448d07b/src/main/java/org/apache/xml/serializer/ToStream.java#1609 The writer has to record the state of the potential surrogate and process the logic (as done in the xalan updated ToStream.java code). The performance of this method is the lower among all. It tooks 1m42s average (when ~1m20s for the others). Due to the complexity of the writer, this method seems to be the less futur proof as it depends of the calls of ToStream.

My tracking branch: https://github.com/Taknok/revanced-patcher/commit/ca2de8eee31b284ec0ca282ee57a35dd660fe7fd

My personnal ranking of the techniques:

  1. UTF casting https://github.com/Taknok/revanced-patcher/commit/ba57bcbaa13349e7ff1c1ddf74d4ff929438cbf8
  2. regex https://github.com/Taknok/revanced-patcher/commit/c20e804b2b1a3672c1df79e96e2fb456b42de54a
  3. writer https://github.com/Taknok/revanced-patcher/commit/ca2de8eee31b284ec0ca282ee57a35dd660fe7fd

Available on discord if needed.

Taknok avatar May 05 '25 20:05 Taknok

  1. UTF casting https://github.com/Taknok/revanced-patcher/commit/ba57bcbaa13349e7ff1c1ddf74d4ff929438cbf8

Wait, does setOutputProperty(OutputKeys.ENCODING, "UTF-16") work on Android Runtime!?

Yeah, I also confirmed with Android 6.0 and 15. I didn’t know that at all...

Because it doesn't work on JRE. On JRE, setOutputProperty(OutputKeys.ENCODING, "XXX") only has an effect when a Document instance is created on the memory. If a Document is created from an existing XML file, it will have no effect.

References: https://discord.com/channels/952946952348270622/953965039105232906/1333603549904375909 https://stackoverflow.com/questions/15592025/transformer-setoutputpropertyoutputkeys-encoding-utf-8-is-not-working https://bugs.openjdk.org/browse/JDK-8227616


Then, you don't need to use StringWriter or convert to UTF-8. AAPT2's XML parser understands encoding="UTF-16" and can compile UTF-16 XML files successfully. So just output with UTF-16.

Adding just one line will fix the issue.

TransformerFactory.newInstance()
    .newTransformer()
    .apply { setOutputProperty(OutputKeys.ENCODING, "UTF-16") }  // Add like this?
    .transform(DOMSource(this), StreamResult(stream))

(a condition for checking AndroidRuntime could be added)

kitadai31 avatar May 08 '25 13:05 kitadai31

If this is all and a conditional check is added, I'd merge it with a comment referencing the Android issue tracker

oSumAtrIX avatar May 08 '25 13:05 oSumAtrIX

There is not surrogate issue on JRE as the openjdk correctly implement the conversion. Thus the conditional check is still needed ?

I will try without the UTF8 , but I think y my tests it was not working without

Taknok avatar May 08 '25 13:05 Taknok

Here are the requested modifications. Tell me if you want some adjustments.

Taknok avatar May 08 '25 15:05 Taknok

It worked! I patched GPhotos on Manager with Patches 5.19.0 (the last patches version that photos compiling is not fixed) successfully.

kitadai31 avatar May 08 '25 16:05 kitadai31

You can also use google phone app with this patch to check if issue is fixed : https://github.com/Taknok/revanced-patches

Taknok avatar May 08 '25 16:05 Taknok

Nope, outputting in UTF-16 caused problems when patching other apps Converting to UTF-8 is probably necessary

Sorry for jumping to conclusions...

- Device Info
ReVanced Manager: 1.25.0-dev.1
Model: Pixel 3
Android version: 15
Supported architectures: arm64-v8a, armeabi-v7a, armeabi
Root permissions: Yes

- Patch Info
App: com.google.android.youtube v20.12.46 (Suggested: 20.12.46)
Patches version: v5.22.0
Patches added: Default
Patches removed: None
Default patch options changed: None

- Settings
Allow changing patch selection: false
Version compatibility check: true
Show universal patches: false
Patches source: revanced/revanced-patches

- Logs
Reading APK
Decoding app manifest
Loading patches
Deleting existing temporary files directory
Decoding resources
Initializing lookup maps
Executing patches
Applied 52 patches
GmsCore support failed: app.revanced.patcher.patch.PatchException: The patch "GmsCore support" depends on "ResourcePatch", which raised an exception:
app.revanced.patcher.patch.PatchException: Unexpected token (position:TEXT ��

kitadai31 avatar May 08 '25 17:05 kitadai31

Draft waiting UTF8 revert

Taknok avatar May 08 '25 19:05 Taknok

In the end, we went back to the approach mentioned in LisoUseInAIKyrios's first comment.

But I'm thinking of using a FileWriter directly instead of using a StringWriter. Since FileWriter uses the system default encoding, and Android's default encoding is always UTF-8, I expect it to write in UTF-8.

If this were possible, the serialized result would be written directly to a file, so there would be no performance impact. I'll try this tomorrow

kitadai31 avatar May 09 '25 18:05 kitadai31

This function patches youtube and google phone without error:

    override fun close() {
        file?.let {
            if (readerCount[it]!! > 1) {
                throw IllegalStateException(
                    "Two or more instances are currently reading $it." +
                        "To be able to close this instance, no other instances may be reading $it at the same time.",
                )
            } else {
                readerCount.remove(it)
            }

            val transformer = TransformerFactory.newInstance().newTransformer().apply {
                if (isAndroid) {
                    setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes")
                    setOutputProperty(OutputKeys.ENCODING, "UTF-16")
                }
            }

            it.writer().use { writer ->
                transformer.transform(DOMSource(this), StreamResult(writer))
            }
        }
    }

Taknok avatar May 09 '25 20:05 Taknok

Why is

                    setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes")

needed?

oSumAtrIX avatar May 10 '25 15:05 oSumAtrIX

Because we want to have an UTF-8 XML at the end, thus we have to explicitly say to not write the encoding as we perform the conversion : it.writer() set by default a charsets.UTF-8

Taknok avatar May 10 '25 16:05 Taknok

So as I understand the transformer writes as utf16 and then the writer as utf8? Why are we omitting the xml declaration. It doesn't have to do anything with encoding from what I am seeing

oSumAtrIX avatar May 10 '25 16:05 oSumAtrIX

The origin:

  1. App
  2. Apktool decode XML to UTF-8
  3. Java read UTF-8 and set it well in memory, string IB memory handles wells UTF-8. When writing to a file on the string it calls .toCharsArray() which encode it into a 16bits chars list (but some unicode needs overflow and are split into 2 16bits chars) 4.1 openjdk, when writing, is aware of this potential issue. So it handle the writing by checking if the 16chars is a high point unicode and if the encoding is UTF-8, if yes, it goes to a special flow writing the 16charsunicode. 4.2 Android is not aware of it, it thinks that it is handling UTF-8 that are only 16chars, if this unicode is unknown, so it writes it as an entity as would do JavaScript.
  4. Apktool recompiles the res, it do not understand surrogate unicode, if it reads one (like in Android senario) it crashes.

The fix:

  1. In memory we have a UTF-8 string with correct high point emoji.
  2. The toStream class will be called and split the strings with toCharsArray() (else we need to rewrite the whole ToStream from xalan, I tried this but there is private class etc in xalan increasing complexity and code copy).
  3. We tell to the writer that it will handle 16 bits char (UTF-16), but the charset used will be UTF-8 (default value of the file writers), so the output file should not have an xml header forcing to UTF-18 as we used UTF-8 charset.
  4. Aapt2 reads the file with UTF-8 and high point encoded value.

If you want to follow the flow in the debugging mode, you have to set a breaking in the characters() function for the ToStream.java class (called by the xml TreeWalker.java) and follow step by step the code for each case

Taknok avatar May 10 '25 17:05 Taknok

So as I understand the transformer writes as utf16 and then the writer as utf8?

Yes.

We are omitting because the resulting xml file is written with UTF-8 and aapt2 will read using UTF-8. Setting the headers in the xml file may leads to an aapt2 reading UTF-8 value thinking it is utf-16 (as declared in the headers)

Taknok avatar May 10 '25 17:05 Taknok

I know this is hacky, but if google was consistent and keeps only 16bits chars with surrogate support OR UTF-8 correct writing this would not be needed :/

Taknok avatar May 10 '25 17:05 Taknok

Can't you set the encoding in the header to utf8 explicitly but write in utf16?

oSumAtrIX avatar May 10 '25 17:05 oSumAtrIX

I did not find a way to do it, but I also did not find a proof that is not feasible.

Taknok avatar May 10 '25 18:05 Taknok

For JRE, there is no need to use Writer I suggest this code Also I tried adding some comments, I hope it helps

            val transformer = TransformerFactory.newInstance().newTransformer()
            if (isAndroid) {
                // Set to UTF-16 to prevent surrogate pairs from being escaped to broken numeric character references.
                transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16")
                // The XML declaration will have encoding="UTF-16", but we're going to write it back in UTF-8, so omit it.
                transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes")
                // Use FileWriter to output the XML file in UTF-8 encoding.
                it.writer().use { writer ->
                    transformer.transform(DOMSource(this), StreamResult(writer))
                }
            } else {
                transformer.transform(DOMSource(this), StreamResult(it))
            }

kitadai31 avatar May 10 '25 18:05 kitadai31

Perf test with youtube and default patch: Manager main branch + patcher main with surrogate fix : 334s Manager main branch + patcher main branch vanilla : 349s

Taknok avatar May 11 '25 21:05 Taknok