project-replicator xml parser skips files with DOCTYPE entries

Open jhansche opened this issue 1 year ago • 0 comments

> Task :my-module:gatherModuleInfo
e: Invalid xml file my-module/src/main/res/values/strings.xml
   line 4; column 10: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
   <!DOCTYPE resources [
            ↑
Skipping file parsing

This does not terminate the process; it just skips the file.
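The message matches what a JAXP parser reports when the Xerces `disallow-doctype-decl` feature named in the log is enabled. Below is a minimal reproduction sketch. It assumes the tool configures a `DocumentBuilderFactory` roughly this way, which the log suggests but does not confirm; the class name `DoctypeRepro` and the sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Hypothetical reproduction; not the tool's actual parser setup.
final class DoctypeRepro {
    // Cut-down strings.xml with an internal DOCTYPE subset.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
        + "<!DOCTYPE resources [\n"
        + "    <!ENTITY hellip \"&#8230;\">\n"
        + "]>\n"
        + "<resources><string name=\"string_name\">One last thing&hellip;</string></resources>";

    /** Returns true if a parser with disallow-doctype-decl enabled fails to parse the input. */
    static boolean rejectsDoctype(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The feature quoted in the error message above.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.newDocumentBuilder()
                   .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return false;
        } catch (SAXParseException e) {
            // Any fatal parse error counts as a rejection in this sketch.
            return true;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected with DOCTYPE: " + rejectsDoctype(SAMPLE));
        System.out.println("rejected without DOCTYPE: " + rejectsDoctype("<resources/>"));
    }
}
```

The default JAXP error handler may also print the fatal error to stderr before the exception is caught.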

I'm not sure what the reason is for disallowing DOCTYPE, but a DOCTYPE is useful for adding named entities for unusual characters. For example, we use it to define entity aliases like these, which we can then use in our strings:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE resources [
    <!ENTITY ldquo  "&#8220;">
    <!ENTITY rdquo  "&#8221;">
    <!ENTITY lsquo  "&#8216;">
    <!ENTITY rsquo  "&#8217;">
    <!ENTITY hellip "&#8230;">
    <!ENTITY prime  "&#8242;">
    <!ENTITY Prime  "&#8243;">
    <!ENTITY bull   "&#8226;">
    <!ENTITY thinsp "&#8201;">
    <!ENTITY hairsp "&#8202;">
    ]>

Then later we can refer to these standard entity names in the strings:

<string name="string_name">One last thing&hellip;</string>

In this case, we prefer the single ellipsis character … (U+2026) over three periods (...), because it is more meaningful for translators and typographically more correct. It also translates differently in some languages: for example, some prefer a different type of ellipsis, such as the midline (⋯) or vertical ellipsis (⋮).

An inlined Unicode character is often hard to tell apart from its similar-looking ASCII counterpart when reading the file (e.g., ' apostrophe vs ’ right single quotation mark, or rsquo), which is why we use the numeric &#<>; notation. But with that notation inlined into the string, someone reading the file won't know what the number represents unless they look it up.

So the workaround is a named XML entity: it gives us the exact Unicode character under a meaningful name, and it keeps the file itself free of multi-byte characters, which matters for some older editors that aren't well equipped to handle them.
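One possible direction, sketched under assumptions: if the DOCTYPE ban exists to guard against external-entity (XXE) attacks, a JAXP parser can accept an internal DOCTYPE subset while still refusing external entities and external DTDs. The feature URIs below are standard JAXP/Xerces ones; the class `InternalEntityParser`, its method names, and the idea that project-replicator could adopt this setup are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch; not the tool's actual code.
final class InternalEntityParser {
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
        + "<!DOCTYPE resources [\n"
        + "    <!ENTITY hellip \"&#8230;\">\n"
        + "]>\n"
        + "<resources><string name=\"string_name\">One last thing&hellip;</string></resources>";

    /** Builds a factory that allows DOCTYPE but blocks external entities and DTDs. */
    static DocumentBuilderFactory newFactory() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Keep the JDK's processing limits (e.g. entity expansion caps) in place.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Allow <!DOCTYPE ...> so internal entity declarations are accepted.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false);
        // Refuse anything that would be fetched from outside the document.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        factory.setXIncludeAware(false);
        return factory;
    }

    /** Parses the XML and returns the text content of the root element. */
    static String rootText(String xml) {
        try {
            Document doc = newFactory().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // &hellip; is resolved through the internal entity declaration.
        System.out.println(rootText(SAMPLE));
    }
}
```

With this configuration the internally declared `&hellip;` resolves to U+2026, so the parsed string ends with the single ellipsis character rather than the entity name.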

May 16 '23 16:05 jhansche
