liquid
liquid copied to clipboard
[Proposal] Scanning Tokenizer with Improved String support
The commit currently isn't the final version but is showing the working version. The proposal is to shift the Tokenizer to be a scanner that identifies the following tokens.
{% - Tag Open
%} - Tag Close
{{ - Variable Open
}} - Variable Close
} - Variable Incomplete Close
' - Single Quote
" - Double Quote
Because of this the scanner only ever needs to look at 2 positions and can operate in a single pass.
It then has the following states to control flow
0 - Outside Any Tag or Variable
1 - Inside a Tag
2 - Inside a Variable
3 - Inside a Tag and Single Quotes
4 - Inside a Tag and Double Quotes
5 - Inside a Variable and Single Quotes
6 - Inside a Variable and Double Quotes
0,1,2 - Not Inside Quotes
The pseudo-code is as follows
if 'Found Tag Open' && 'Not Inside Quotes'
State is now 'Inside Tag'
End Token
Start Token
Append to Token
else if 'Found Variable Open' && 'Not Inside Quotes'
State is now 'Inside Variable'
End Token
Start Token
Append to Token
else if 'Found Single Quote' && 'Inside Tag'
State is now 'Inside Tag and Single Quotes'
Append to Token
else if 'Found Single Quote' && 'Inside Tag and Single Quotes'
State is now 'Inside Tag'
Append to Token
else if 'Found Double Quote' && 'Inside Tag'
State is now 'Inside Tag and Double Quotes'
Append to Token
else if 'Found Double Quote' && 'Inside Tag and Double Quotes'
State is now 'Inside Tag'
Append to Token
else if 'Found Single Quote' && 'Inside Variable'
State is now 'Inside Variable and Single Quotes'
Append to Token
else if 'Found Single Quote' && 'Inside Variable and Single Quotes'
State is now 'Inside Variable'
Append to Token
else if 'Found Double Quote' && 'Inside Variable'
State is now 'Inside Variable and Double Quotes'
Append to Token
else if 'Found Double Quote' && 'Inside Variable and Double Quotes'
State is now 'Inside Variable'
Append to Token
else if 'Found Tag Close' && 'Inside Tag'
State is now 'Outside Any Tag or Variable'
Append to Token
End Token
Start Token
else if 'Found Variable Close' && 'Inside Variable'
State is now 'Outside Any Tag or Variable'
Append to Token
End Token
Start Token
else
Append to Token
end
This resolves and has the tests from the following PRs and Issues
Closes #701 Closes #779 Closes #624 Closes #623 Closes #344 Closes #213
Will need matching PR for liquid-c
and improvements to this ruby version but the concept is easily implemented in both.
@Shopify/guardians-of-the-liquid @Shopify/liquid
This approach seems to have increased memory usage for the parse
phase significantly:
+-----------------+-------------------------+--------------------------+
| Phase | Parse | Render |
+-----------------+-------------------------+--------------------------+
- | Total allocated | 4.53 MB (53197 objects) | 979.68 kB (8827 objects) |
+ | Total allocated | 7.23 MB (88561 objects) | 979.68 kB (8827 objects) |
| Total retained | 0 B (0 objects) | 49.70 kB (276 objects) |
+-----------------+-------------------------+--------------------------+
Oh yeah, performance for this is likely, not great at the moment. I was just focusing on getting it to pass. This has the ability to be highly efficient though so just needs to be optimised.
Great to know! I wasn't complaining. Just thought I'd post it so that you'd have it in the back of your head while you develop the concept further.
This regex wouldn't exist either, needs to be replaced by a scanner. Just a quick hack to get this moving
source.split(/({%|{{|"|'|}}|%}|})/om).each
And the conditional statements are double checking things which should be optimised
Plus the point of all this is to achieve that the following code is valid (basically curly brackets inside strings that are inside tags)
{{ variable | prepend: '{' | append: '}' }}
and
{{ 'blah {{ yy }}' | replace: '{{', 'xx' }}
I'm also starting to wonder if we should solve the few following issues as these can be fixed with a few more tokens to scan for
Allow escaping in " so that it is possible to use both a single quote and double quote in a string, which is not currently possible
{% assign = "we can handle \"double quotes\" and 'single quotes'" %}
This does not break backwards support for the tokenizer as double quotes are not currently possible inside double quotes. However it will break string detection inside tags
The new liquid tag splits on newlines, this means strings will no longer be able to express new lines in the new liquid tag
{% liquid
assign = "we can handle\n new lines"
%}
This can possibly break templates that use "\n" in a string today. The liquid.format() command could be a way to migrate templates without issues.
Carriage returns and tabs as these are also common, not as much with the web but windows and Tab Seperated Values
{% assign = "we can handle returns\r and tabs \t" %}
This can possibly break templates that use "\r" or "\t" in a string today. The liquid.format() command could be a way to migrate templates without issues.
Finally because we support the above we have to handle escaping escapes
{% assign = "we can handle a real backslash next to the character n \\n we are not on a newline" %}
This can possibly break templates that use "\" before a control character in a string today. The liquid.format() command could be a way to migrate templates without issues.
An alternative proposal for the changes above is to bring in a third quoting method using backticks
Allow escaping in ` so that it is possible to use both a single quote and double quote, and back ticks in a string, which is not currently possible
{% assign = `we can handle "double quotes" and 'single quotes' and \`back ticks\`` %}
This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated
The new liquid tag splits on newlines, this means strings will no longer be able to express new lines in the new liquid tag
{% liquid
assign = `we can handle\n new lines`
%}
This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated
Carriage returns and tabs as these are also common, not as much with the web but windows and Tab Separated Values
{% assign = `we can handle returns\r and tabs \t` %}
This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated
Finally because we support the above we have to handle escaping escapes
{% assign = `we can handle a real backslash next to the character n \\n we are not on a newline` %}
This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated