
Discussion: libass' {,}-escapes and VSFilter-compatibility

Open avih opened this issue 10 years ago • 34 comments

Not sure if it's a bug, but I expected this to work: x{\u1}\\{\u0}x - it didn't, and I also tried the following: x{\u1}\{\u0}x, x{\u1}\u2764{\u0}x, which also didn't work (tested with libass 0.12.3, and also with the fonts branch from 2015-09-03)

I'd appreciate help on how to do that, or otherwise please treat it as a bug report.

avih avatar Sep 09 '15 00:09 avih

Commit a6fe61a3 introduced an escape mechanism to make it possible to print the characters '}' and '{'. This clearly interferes with what you are doing here, and I believe we had some discussion about this problem a long time ago. I'm not sure what the outcome was.

grigorig avatar Sep 09 '15 09:09 grigorig

Hmm.. would it make sense then to add another escape case (to the changes of a6fe61a) like this?

        } else if (p[1] == '\\') {
            p += 2;
            *str = p;
            return '\\';

since \ is clearly an escape control char (can be followed by N/n/h/{/}), it should be possible to output it, and the common approach is to escape it as well, such that x{\u1}\\{\u0}x would work.
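To make the proposal concrete, here is a minimal standalone sketch of a plain-text decoder that treats \{, \} and \\ as escapes. It is not the libass source: the names next_text_char and decode_text are invented for this illustration, and \n, \N, \h and override blocks are deliberately ignored.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not libass code: return the next character of
 * plain (non-tag) text and advance *str past it.
 * A backslash followed by '{', '}' or another backslash yields that
 * second character; any other backslash is returned unchanged.
 */
char next_text_char(const char **str)
{
    const char *p = *str;

    if (p[0] == '\\' && (p[1] == '{' || p[1] == '}' || p[1] == '\\')) {
        *str = p + 2;   /* skip the backslash and the escaped character */
        return p[1];
    }
    if (p[0] != '\0')
        *str = p + 1;   /* ordinary character; never step past the NUL */
    return p[0];
}

/* Decode a NUL-terminated string into out (capacity cap, NUL included). */
size_t decode_text(const char *in, char *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return 0;
    while (*in != '\0' && n + 1 < cap)
        out[n++] = next_text_char(&in);
    out[n] = '\0';
    return n;
}
```

With this rule, the text x\\x decodes to x\x, while a sequence such as \N is passed through untouched for whatever handles line breaks.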

avih avatar Sep 09 '15 10:09 avih

The commit is actually wrong. It breaks things that work in VSFilter (there was at least one real-life case).

ghost avatar Sep 09 '15 10:09 ghost

Is it wrong because it breaks compatibility? or because it should/could have been solved more compliantly, e.g. maybe using {\}} ? (which could be extended to also escape backslash, e.g. c:\New Folder\ to be printed using c:{\\}New Folder\ )

In general though, the fact that \ can be used both for escaping (e.g. \n) and as plain literal if it's not followed by a recognized escape sequence, is a recipe for trouble IMO.

IMO either confine all escaping to within {...}, or always treat \ as an escape control char.

avih avatar Sep 09 '15 10:09 avih

Is it wrong because it breaks compatibility?

Yes.

ghost avatar Sep 09 '15 10:09 ghost

How would VSFilter be used to output a literal \ or { or } ?

avih avatar Sep 09 '15 11:09 avih

It can't.

ghost avatar Sep 09 '15 11:09 ghost

In that case, would these work without breaking compatibility for stuff which does work (assuming VSFilter would break on those)?

  • {\{} for {
  • {\}} for }
  • {\\} for \

And possibly for other literals that are unprintable with VSFilter, maybe even for Unicode sequences such as {\uHHHH} (H = hex digit)?

avih avatar Sep 09 '15 11:09 avih

Is the goal of libass to maintain 100% compatibility with VSFilter, including limitations such as no way to output e.g. {?

Or to support everything which works in VSFilter, but allow some extensions to overcome limitations of VSFilter? (such that some stuff which libass supports would not work in VSFilter).

avih avatar Sep 09 '15 11:09 avih

I think the consensus is that we want to be as compatible with VSFilter as needed. If there is a real-life case that breaks with the escape feature, we should probably consider removing or changing it.

grigorig avatar Sep 11 '15 13:09 grigorig

Your reply implies that the existing (5 years old) patch to support \{ etc should be reverted, and while this is a meaningful decision, it still does not help with the original issue of this bug.

Therefore, I think a more relevant question is whether or not libass wants to extend on VSFilter where VSFilter has limitations (e.g. to allow backslash literals):

  • Without breaking anything which works in VSFilter (the existing \{ support already doesn't comply with this).
  • But ending up with features which libass supports but which would not work in VSFilter (the \{ support already does this).

My suggestion is to add a new escape format which doesn't conflict with existing tags, something similar to:

  • {\{} for {
  • {\}} for }
  • {\\} for \

As far as I can tell, this would not break anything in VSFilter, although it also wouldn't work in VSFilter.

avih avatar Sep 11 '15 13:09 avih

Going back to your original question, you can break the \{ escape in libass by adding a zero-width space (or any other zero-width character) after the backslash.

Sadly, there’s literally no way to display a backslash (before n, N or h) or an opening brace in VSFilter. A closing brace is easy, though: it doesn’t need any escaping at all. Every override tag block consumes the nearest closing brace, and any closing brace that hasn’t been consumed is displayed verbatim.
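The brace rule described above can be sketched in a few lines of C. This is only an illustration of the stated rule, not code from VSFilter or libass. The function name visible_text is made up, and copying an opening brace that has no closing brace after it is a choice of this sketch, not a claim about any renderer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the visible text of an event line into out, skipping override
 * blocks. Rule modelled: a '{' starts a block that ends at the nearest
 * '}'; a '}' that no block has consumed is ordinary text.
 * Assumption of this sketch: a '{' with no '}' after it is copied as text.
 */
size_t visible_text(const char *in, char *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return 0;
    while (*in != '\0' && n + 1 < cap) {
        if (*in == '{') {
            const char *close = strchr(in, '}');
            if (close != NULL) {
                in = close + 1;   /* the block swallows its closing brace */
                continue;
            }
        }
        out[n++] = *in++;
    }
    out[n] = '\0';
    return n;
}
```

For example, x{\b1}}y yields x}y: the block takes the first closing brace and the second one is left over as text.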

Is the goal of libass to maintain 100% compatibility with VSFilter, including limitations such as no way to output e.g. {?

Personally, in my eyes the goal of libass is to maintain 100% compatibility with all scripts designed for VSFilter. Additional features are acceptable as long as they don’t affect scripts that aren’t knowingly targeting them.

Of course, technically we can never be entirely sure because anyone could activate a libass feature by accident while designing for VSFilter, but the rule of thumb is that if we’ve never heard of anyone doing this and it’s incredibly unlikely (e. g. we have a whole new ASS header with a name dissimilar to any other, so a simple typo isn’t enough to activate it), then it’s OK. Sometimes we happen to accept features that aren’t that hard to activate by accident, such as these escapes or BorderStyle=4 (#105), and then we may end up having to back out and remove or change them later when someone discovers a script that they do break.

@wm4 Could you perhaps dig out that real-life case?

astiob avatar Sep 11 '15 14:09 astiob

See also this old related discussion over at xy-VSFilter: https://code.google.com/p/xy-vsfilter/issues/detail?id=149. (Do note that the discussion originator is none other than our own @wm4.) Everyone seemed to agree that it would be nice to have some portable way of outputting literal backslashes and braces, but no consensus was reached on the specifics, and then the discussion (and later xy-VSFilter development) died out.

Maybe we should try to restart the discussion with more/other people: first of all, the developers of MPC-HC’s VSFilter (@Underground78, @kasper93) and Aegisub (@tgoyne), but I think we should also invite fansubbers (@line0, @torque, @Daiz?). Of course, backslashes and braces don’t occur often in fansubs, but they will still be stuck with having to use whatever syntax we come up with if the need does arise, and it would be nice and potentially interesting to hear what they have to say about it.

The ideal solution:

  • allows new scripts to display literal backslashes and braces at will in all new renderers,
  • makes any and all backslashes in old scripts except those in \n, \N and \h display as literal backslashes in all new renderers,
  • attempts to make new scripts degrade as gracefully as possible in old renderers but does not necessarily make any strong guarantees.

We may also want to similarly allow new scripts to explicitly say they want literal angle brackets displayed, to ensure VSFilter doesn’t parse them as HTML tags. (But this is already achievable by inserting a {} somewhere within the tag.)

Then again, judging by the recent commits of mpc-hc/mpc-hc@11bb014771fcc1f73f4b92c06d29f0662fdda457 and mpc-hc/mpc-hc@b20e86fad2a5d7729638d772a0a5cd144f9e7189, MPC-HC may not be interested in preserving compatibility with existing scripts all that much.

Currently, I see the following ways of achieving this, including those originally proposed by @wm4:

  1. Add one or more new override tags:
    1. Add individual override tags that don’t override anything and instead output the backslash, the opening brace and the closing brace (the last one is optional, because a closing brace that is not paired to a previous opening brace is literal anyway). This has been the most suggested option in the previous discussion, with lots of different potential names for the tags. Example: {\fbl}text in braces{\fbr}
    2. Add an override tag that doesn’t override anything and instead outputs an arbitrary Unicode character. Specifying the character literally will break old parsers, so specify its code point instead. Example: {\U&H7b}text in braces{\U&H7d}
    3. Add an override tag that doesn’t override anything and instead outputs an arbitrary literal string. Whatever the details, this will inevitably wreak havoc in old parsers when backslashes and braces are involved.
      1. Make it take the literal string as an argument. This complicates parsing and breaks compatibility with old renderers if the backslash or the closing brace occur in the argument. Further, we now need yet another escape mechanism to escape the argument delimiter! Example: {\lit({text in braces})}
      2. Make it take a number n as an argument and take the following n bytes or characters (beware of surrogate pairs) as the literal string. The literal string can be kept inside the brace-enclosed override tag block or outside (in an attempt to improve degradation in old parsers). Example: {\lit16{text in braces}}
    4. Add an override tag that switches plain-text parsing to a new mode. Like \p switches between plain text and vector drawings, this tag will switch between plain text that has only \n, \N and \h and plain text that has arbitrary-character escapes, possibly even Unicode-code-point escapes. Example: {\e1}\{text in braces\}
  2. Add a new header that switches plain-text parsing in the whole file to a new mode, just like in option 1.4.

Of these, a new header is by far the easiest to use as a script writer. For automatic conversion from other subtitle formats where human-readable output is not a goal, any option is as good as any other. Finally, in terms of graceful degradation, individual override tags for the affected characters or a single-character Unicode escape tag is probably best, as they let all unaffected characters show up in old renderers and handle all affected characters consistently. Of course, a dedicated script writer can use some of the other options with varying levels of graceful degradation too: for example, \{{}text in braces} shows up correctly in current libass and merely substitutes a backslash for the opening brace in current VSFilter.
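For the single-character Unicode escape tag (option 1.2), a renderer would need to turn the tag argument into a code point and then into bytes. The sketch below shows one way that could look, assuming UTF-8 output and the {\U&H7b} spelling from the example. The tag does not exist in any renderer, and parse_codepoint_tag and utf8_encode are names invented here.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Parse a hypothetical tag of the form \U&H<hex> with an optional
 * trailing '&'. Returns 1 and stores the code point on success,
 * 0 if the tag is malformed or not a valid Unicode scalar value.
 */
int parse_codepoint_tag(const char *tag, unsigned long *cp)
{
    unsigned long value = 0;
    int digits = 0;
    const char *p;

    if (strncmp(tag, "\\U&H", 4) != 0)
        return 0;
    for (p = tag + 4; isxdigit((unsigned char)*p); p++) {
        int c = tolower((unsigned char)*p);
        value = value * 16 + (unsigned long)(c <= '9' ? c - '0' : c - 'a' + 10);
        if (++digits > 6)
            return 0;                 /* longer than any valid code point */
    }
    if (*p == '&')
        p++;
    if (digits == 0 || *p != '\0')
        return 0;
    if (value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return 0;                     /* out of range or a surrogate */
    *cp = value;
    return 1;
}

/* Encode one code point as UTF-8; out needs room for 4 bytes plus NUL. */
size_t utf8_encode(unsigned long cp, char out[5])
{
    size_t n = 0;

    if (cp < 0x80) {
        out[n++] = (char)cp;
    } else if (cp < 0x800) {
        out[n++] = (char)(0xC0 | (cp >> 6));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out[n++] = (char)(0xE0 | (cp >> 12));
        out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    } else {
        out[n++] = (char)(0xF0 | (cp >> 18));
        out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    }
    out[n] = '\0';
    return n;
}
```

An old parser that skips unknown tags would simply drop the character, which is the graceful-degradation behaviour discussed above.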

astiob avatar Sep 11 '15 14:09 astiob

Something that doesn't break the normal parsing rules would be nice. There's a lot of code out there which merely scans for '{' and '}' to distinguish text and tags.

ghost avatar Sep 11 '15 16:09 ghost

PS: currently, my favorite choice would be {\U&H7b}. (Not too fond of the VB style prefix - but at least it's consistent.)

ghost avatar Sep 11 '15 16:09 ghost

For what it's worth, I believe the classic workaround for vsfilter involved a fullwidth { and some negative \fsp magic.

I think having a generic unicode character escape syntax is a good thing, but I think {\U&H7b} is confusing. Nothing else inside of {} ever prints characters, and making an exception to that seems really inconsistent. I'd be more inclined to spring for some kind of \uXXXX syntax within the lines themselves, but, this may be troublesome to parse, and it may also cause backwards compatibility issues.

I personally think a new header is the cleanest solution to pretty much all problems with ASS and versions, and there's already been precedent set for introducing new headers (YCbCr Matrix) to change behavior in a backwards compatible manner.

torque avatar Sep 11 '15 17:09 torque

PS: currently, my favorite choice would be {\U&H7b}. (Not too fond of the VB style prefix - but at least it's consistent.)

I'd be more inclined to spring for some kind of \uXXXX syntax within the lines themselves, but, this may be troublesome to parse, and it may also cause backwards compatibility issues.

How about {\UHHHH} ? existing parsers should ignore the unknown U tag, and it's still close enough to the mostly familiar \uHHHH.

avih avatar Sep 11 '15 18:09 avih

There's a lot of code out there which merely scans for '{' and '}' to distinguish text and tags.

Very good point. This is why I used \fbl and \fbr rather than \{ and \} for override tags in my example, but of course, the same applies to text, too. If we go for specific new escapes in text, we don’t necessarily have to use \{ and \}: for example, we could have \l and \r or even \[ and \].

astiob avatar Sep 12 '15 09:09 astiob

PS: currently, my favorite choice would be {\U&H7b}. (Not too fond of the VB style prefix - but at least it's consistent.)

I think I also like this the most due to the consistency.

avih avatar Sep 15 '15 03:09 avih

I also like this, seems to be a good and clean approach. Only nitpick, while ASS overrides are case sensitive, the similarity with \u for underline may confuse some users. Not really a problem, though.

grigorig avatar Sep 15 '15 10:09 grigorig

I think having a generic unicode character escape syntax is a good thing, but I think {\U&H7b} is confusing. Nothing else inside of {} ever prints characters, and making an exception to that seems really inconsistent.

Indeed, not only would this completely destroy ASS semantics, it would also raise interesting questions as far as state is concerned. Consider this hypothetical line: A {\i1\U&H42&\b1} C. Would this format to A B C with the B in bold italics, or to A B C with the B in italics only? Now the second option (introducing additional intermediate state wherever a code point insert appears) is terrible for obvious reasons, but the first option (treating a \U as a promise to insert the character with the specified code point at the start of the next text section) is counterintuitive too, especially when you start thinking about what happens when you have two code points in that override block. Would you suddenly care about order again, or perhaps treat it as an override tag and only ever insert one character per block (which would effectively prevent you from ever being able to write {{ without sneaking some non-printable character in between)?

This kind of example might appear contrived, given how nobody would voluntarily mix markup with not-quite-markup but this is where compatibility comes in: My scripts will happily simplify A{\i1}{\U&H42&}{\b1}C into A{\i1\U&H42&\b1}C or even move unknown tags to the end: A{\b1\i1\U&H42&}C. Fixing this would be major work to the point of abandoning the notion of override blocks altogether.
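As an illustration of the kind of simplification meant here, the following sketch merges consecutive override blocks that have no text between them. It is a hypothetical stand-in for such a script (the name merge_adjacent_blocks is invented), not code from any existing tool.

```c
#include <stddef.h>
#include <string.h>

/*
 * Merge override blocks that directly follow each other, so that
 * "{a}{b}" becomes "{ab}". Text between blocks is left alone.
 * The pass assumes tags never produce output of their own, which is
 * exactly the assumption a character-inserting tag would break.
 */
size_t merge_adjacent_blocks(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    int in_block = 0;

    if (cap == 0)
        return 0;
    while (*in != '\0' && n + 1 < cap) {
        if (!in_block && *in == '{' && strchr(in, '}') != NULL) {
            in_block = 1;
        } else if (in_block && *in == '}') {
            if (in[1] == '{' && strchr(in + 1, '}') != NULL) {
                in += 2;      /* drop "}{" so the two blocks become one */
                continue;
            }
            in_block = 0;
        }
        out[n++] = *in++;
    }
    out[n] = '\0';
    return n;
}
```

Applied to A{\i1}{\U&H42&}{\b1}C this produces A{\i1\U&H42&\b1}C, which is harmless only as long as no tag depends on where its block boundary was.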

I can see how this solution appeals to some of you from an aesthetics standpoint (incompatible renderers would simply omit the characters inserted with \U), but a fail-silently solution would ultimately make your viewers assume you simply can't spell for shit rather than considering compatibility issues to be the culprit.

I'd be more inclined to spring for some kind of \uXXXX syntax within the lines themselves, but, this may be troublesome to parse, and it may also cause backwards compatibility issues.

I'm also in favor of this and while it is just as incompatible as the override tag solution, it will break scripts and renderers in very obvious ways while an override tag would often mask the issue. Actually, as far as scripts are concerned - those which are only concerned with override tags in existing tag blocks (e.g. tag stripping) wouldn't run into any compatibility issues whatsoever as long as you keep the escape strings out of the tag blocks.

I personally think a new header is the cleanest solution to pretty much all problems with ASS and versions

Indeed. Drop \{ support for files that lack the new header and support standard escape sequences such as \uXXXX for files that have it.
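A header-gated variant could be sketched as follows. The flag new_escapes stands for "the file carries the new header"; no such header has been defined, and the function name decode_line_text, the exact escape set and the UTF-8 output are all assumptions of this sketch. Only text outside override blocks is meant to be passed in.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit; the caller has already checked isxdigit(). */
static unsigned hex_value(char ch)
{
    int c = tolower((unsigned char)ch);
    return (unsigned)(c <= '9' ? c - '0' : c - 'a' + 10);
}

/* 1 if s starts with four hex digits; stops safely at the NUL. */
static int is_hex4(const char *s)
{
    int i;
    for (i = 0; i < 4; i++)
        if (!isxdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/*
 * Decode plain text into out (capacity cap). Without the hypothetical
 * header (new_escapes == 0) the text is copied verbatim. With it, a
 * backslash before '{', '}' or a backslash is an escape, and a backslash
 * followed by 'u' and four hex digits becomes that BMP code point in
 * UTF-8. Decoding stops early if fewer than 4 bytes of room remain.
 */
size_t decode_line_text(const char *in, int new_escapes, char *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return 0;
    while (*in != '\0' && n + 3 < cap) {
        if (new_escapes && in[0] == '\\') {
            if (in[1] == '{' || in[1] == '}' || in[1] == '\\') {
                out[n++] = in[1];
                in += 2;
                continue;
            }
            if (in[1] == 'u' && is_hex4(in + 2)) {
                unsigned cp = hex_value(in[2]) << 12 | hex_value(in[3]) << 8
                            | hex_value(in[4]) << 4 | hex_value(in[5]);
                /* NUL and surrogates are left as literal text */
                if (cp != 0 && (cp < 0xD800 || cp > 0xDFFF)) {
                    if (cp < 0x80) {
                        out[n++] = (char)cp;
                    } else if (cp < 0x800) {
                        out[n++] = (char)(0xC0 | (cp >> 6));
                        out[n++] = (char)(0x80 | (cp & 0x3F));
                    } else {
                        out[n++] = (char)(0xE0 | (cp >> 12));
                        out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
                        out[n++] = (char)(0x80 | (cp & 0x3F));
                    }
                    in += 6;
                    continue;
                }
            }
        }
        out[n++] = *in++;
    }
    out[n] = '\0';
    return n;
}
```

Because the flag is off by default, a script without the header keeps every backslash as literal text, which is the compatibility property asked for earlier in the thread.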

That said, if you were really set on implementing this as an override tag, there's a way to do it while keeping ASS semantics intact: Instead of having {\U&H42&} insert a character, make it work like a proper override tag and have it turn every text character following this tag block (until \r, \U or \U&HXX& is encountered) into the specified Unicode code point. While that sure wouldn't be pretty, it would at least be consistent and I can already think of creative ways to abuse such a feature.

line0 avatar Sep 15 '15 21:09 line0

Indeed. Drop \{ support for files that lack the new header and support standard escape sequences such as \uXXXX for files that have it.

I think this is my favourite option too.

I can already think of creative ways to abuse such a feature.

Out of curiosity, what might those be?

astiob avatar Sep 15 '15 21:09 astiob

Out of curiosity, what might those be?

{\fnFansub Block\U&H41&}This will make torque mad

Seriously, though, you could use it to visually censor expletives while still maintaining an unmolested script, thus making the style choice reversible.

line0 avatar Sep 15 '15 22:09 line0

Ooh, neat! (Not that this should significantly influence our decision.)

astiob avatar Sep 15 '15 22:09 astiob

I don't understand your concerns. Drawing mode already influences how text outside of the override tags is treated.

ghost avatar Sep 18 '15 14:09 ghost

PS: and if you want total backwards compatibility, this might be the only way.

ghost avatar Sep 18 '15 15:09 ghost

I’m confused by your latest comments, @wm4.

@line0’s concerns are about the syntax where an override tag itself produces output, rather than affects the interpretation of text outside of override tags. It’s unintuitive to humans, and it can break automatic processing of ASS, or rather cause existing automatic processing tools to break new scripts.

Admittedly, ASS tags can’t be freely reordered even now (think \r) and it’s easy to say that such tools should be smart and leave unknown tags alone and in place precisely because touching or moving them can start breaking scripts in the future, but we can also be sympathetic to people writing such tools when they don’t expect ASS to get new position-sensitive tags in the future.

Everyone speaks of a new header and I find it most attractive myself, but there is also the option of adding an override tag that enables new escapes in the rest of the line—escapes that are themselves written outside of override tag blocks. @line0’s expressed concerns do not apply to such a solution.

astiob avatar Sep 18 '15 20:09 astiob

an override tag itself produces output, rather than affects the interpretation of text outside of override tags.

Well yes, but it's not like it's been terribly elegant and orthogonal before either.

it can break automatic processing of ASS, or rather cause existing automatic processing tools to break new scripts.

But why does it matter if old tools misinterpret our shiny new escapes? They won't do anything reasonable with them, because they, duh, don't support them. If you don't want anything to break, then you can't have your feature.

ghost avatar Sep 19 '15 19:09 ghost

There’s nothing for them to do with them. We’re talking about batch processing tools that don’t care about the text, just the tags. Aegisub plugins probably, and standalone tools too. Definitely not renderers. Just as you said that there’s plenty of code that takes { and } as override tag delimiters, so it’s better to avoid escapes like \{, there’s also plenty of code that assumes all override tags other than \r can be freely reordered within an override tag block and that consecutive blocks with no text in between can be merged (and reordered).

All he is saying is that if our shiny new escapes break this assumption, this will break much more existing code than if they don’t, for no apparent benefit. If we use a header or a tag that changes how non-tag text is parsed, we’ll still have our shiny new escapes but we’ll also uphold this assumption and thus keep anything that doesn’t care about the actual text (i. e. pretty much everything but renderers) in perfect working order with no updates required.

astiob avatar Sep 19 '15 19:09 astiob

So there are 2 basic choices: a mechanism inside the tags, and one outside of the tags. It seems the latter has two advantages:

  • incompatible software will render them as artifacts, instead of silently dropping them, which makes it more apparent that there is a missing feature, instead of just a typo or a font problem
  • scripts which somehow strip tags (or something else weird?) will not accidentally remove them

So the \{ and \uXXXX proposals look preferable.

(Maybe this was all already said and discussed.)

ghost avatar Oct 02 '15 20:10 ghost