liquid icon indicating copy to clipboard operation
liquid copied to clipboard

[Proposal] Scanning Tokenizer with Improved String support

Open shopmike opened this issue 5 years ago • 8 comments

The commit currently isn't the final version but is showing the working version. The proposal is to shift the Tokenizer to be a scanner that identifies the following tokens.

    {% - Tag Open
    %} - Tag Close
    {{ - Variable Open 
    }} - Variable Close
    }  - Variable Incomplete Close
    '  - Single Quote
    "  - Double Quote

Because of this the scanner only ever needs to look at 2 positions and can operate in a single pass.

It then has the following states to control flow

    0 - Outside Any Tag or Variable
    1 - Inside a Tag
    2 - Inside a Variable
    3 - Inside a Tag and Single Quotes
    4 - Inside a Tag and Double Quotes
    5 - Inside a Variable and Single Quotes
    6 - Inside a Variable and Double Quotes
0,1,2 - Not Inside Quotes

The pseudo-code is as follows

    if 'Found Tag Open' && 'Not Inside Quotes'
        State is now 'Inside Tag'
        End Token
        Start Token
        Append to Token

    else if 'Found Variable Open' && 'Not Inside Quotes'
        State is now 'Inside Variable'
        End Token
        Start Token
        Append to Token

    else if 'Found Single Quote' && 'Inside Tag'
        State is now 'Inside Tag and Single Quotes'
        Append to Token

    else if 'Found Single Quote' && 'Inside Tag and Single Quotes'
        State is now 'Inside Tag'
        Append to Token

    else if 'Found Double Quote' && 'Inside Tag'
        State is now 'Inside Tag and Double Quotes'
        Append to Token

    else if 'Found Double Quote' && 'Inside Tag and Double Quotes'
        State is now 'Inside Tag'
        Append to Token

    else if 'Found Single Quote' && 'Inside Variable'
        State is now 'Inside Variable and Single Quotes'
        Append to Token

    else if 'Found Single Quote' && 'Inside Variable and Single Quotes'
        State is now 'Inside Variable'
        Append to Token

    else if 'Found Double Quote' && 'Inside Variable'
        State is now 'Inside Variable and Double Quotes'
        Append to Token

    else if 'Found Double Quote' && 'Inside Variable and Double Quotes'
        State is now 'Inside Variable'
        Append to Token

    else if 'Found Tag Close' && 'Inside Tag'
        State is now 'Outside Any Tag or Variable'
        Append to Token
        End Token
        Start Token

    else if 'Found Variable Close' && 'Inside Variable'
        State is now 'Outside Any Tag or Variable'
        Append to Token
        End Token
        Start Token

    else
        Append to Token
    end

This resolves and has the tests from the following PRs and Issues

Closes #701 Closes #779 Closes #624 Closes #623 Closes #344 Closes #213

Will need matching PR for liquid-c and improvements to this ruby version but the concept is easily implemented in both.

@Shopify/guardians-of-the-liquid @Shopify/liquid

shopmike avatar Sep 20 '19 06:09 shopmike

This approach seems to have increased memory usage for the parse phase significantly:

  +-----------------+-------------------------+--------------------------+
  | Phase           | Parse                   | Render                   |
  +-----------------+-------------------------+--------------------------+
- | Total allocated | 4.53 MB (53197 objects) | 979.68 kB (8827 objects) |
+ | Total allocated | 7.23 MB (88561 objects) | 979.68 kB (8827 objects) |
  | Total retained  | 0 B (0 objects)         | 49.70 kB (276 objects)   |
  +-----------------+-------------------------+--------------------------+

ashmaroli avatar Sep 20 '19 07:09 ashmaroli

Oh yeah, performance for this is likely, not great at the moment. I was just focusing on getting it to pass. This has the ability to be highly efficient though so just needs to be optimised.

shopmike avatar Sep 20 '19 08:09 shopmike

Great to know! I wasn't complaining. Just thought I'd post it so that you'd have it in the back of your head while you develop the concept further.

ashmaroli avatar Sep 20 '19 08:09 ashmaroli

This regex wouldn't exist either, needs to be replaced by a scanner. Just a quick hack to get this moving

source.split(/({%|{{|"|'|}}|%}|})/om).each

shopmike avatar Sep 20 '19 09:09 shopmike

And the conditional statements are double checking things which should be optimised

shopmike avatar Sep 20 '19 09:09 shopmike

Plus the point of all this is to achieve that the following code is valid (basically curly brackets inside strings that are inside tags)

{{ variable | prepend: '{' | append: '}' }}

and

{{ 'blah {{ yy }}' | replace: '{{', 'xx' }}

shopmike avatar Sep 20 '19 10:09 shopmike

I'm also starting to wonder if we should solve the few following issues as these can be fixed with a few more tokens to scan for

Allow escaping in " so that it is possible to use both a single quote and double quote in a string, which is not currently possible

{% assign = "we can handle \"double quotes\" and 'single quotes'" %}

This does not break backwards support for the tokenizer as double quotes are not currently possible inside double quotes. However it will break string detection inside tags

The new liquid tag splits on newlines, this means strings will no longer be able to express new lines in the new liquid tag

{% liquid 
    assign = "we can handle\n new lines" 
%}

This can possibly break templates that use "\n" in a string today. The liquid.format() command could be a way to migrate templates without issues.

Carriage returns and tabs as these are also common, not as much with the web but windows and Tab Seperated Values

{% assign = "we can handle returns\r and tabs \t" %}

This can possibly break templates that use "\r" or "\t" in a string today. The liquid.format() command could be a way to migrate templates without issues.

Finally because we support the above we have to handle escaping escapes

{% assign = "we can handle a real backslash next to the character n \\n we are not on a newline" %}

This can possibly break templates that use "\" before a control character in a string today. The liquid.format() command could be a way to migrate templates without issues.

shopmike avatar Sep 23 '19 11:09 shopmike

An alternative proposal for the changes above is to bring in a third quoting method using backticks

Allow escaping in ` so that it is possible to use both a single quote and double quote, and back ticks in a string, which is not currently possible

{% assign = `we can handle "double quotes" and 'single quotes' and \`back ticks\`` %}

This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated

The new liquid tag splits on newlines, this means strings will no longer be able to express new lines in the new liquid tag

{% liquid 
    assign = `we can handle\n new lines`
%}

This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated

Carriage returns and tabs as these are also common, not as much with the web but windows and Tab Separated Values

{% assign = `we can handle returns\r and tabs \t` %}

This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated

Finally because we support the above we have to handle escaping escapes

{% assign = `we can handle a real backslash next to the character n \\n we are not on a newline` %}

This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated

shopmike avatar Sep 23 '19 11:09 shopmike