minify
minify copied to clipboard
HTML: better support for template syntax
Hi,
I just noticed that if I have a "Type" field in my data struct and use minify for my go template files, it lowercases "Type".
So this:
<option value="0" {{ if eq .Type 0 }}selected{{ end }}>Foo</option>
Becomes this:
<option value=0 {{ if eq .type 0 }}selected{{ end }}>Foo</option>
And it seems like it's not possible to use text/template and minify for a javascript file together, does it? I know, that's probably quite an edge case :)
EDIT: Just tested this, doesn't work unfortunately, but I could still use minify as a package and minify the generates JS, so that's not too much of a problem for my case.
That's correct, the minifiers work according to the specs, and templates are really a syntax layer over original file format. You could make a preprocessor to skip everything between double brackets, but you have to somehow reinsert them after minifying. Like using a unique ID for replacement and afterwards substitute back.
I'm still thinking about a better solution, but it's not as easy as it looks...
Some thoughts I needed to write down:
It would be an awesome feature if the minifier (and then specifically the HTML minifier) could handle Go templates correctly (ignore {{ ... }}). But it would be even better to make that more general so it would ignore common constructs such as <?php ... ?>, <% ... %>, {{ ... }}, etc.
I don't really want to embed that in the HTML parser so it can stay spec compliant and doesn't get cluttered-up with exceptions. Rather there should be a construct where you can wrap the HTML minifier in a 'template-tag-ignorer' or the other way around, so that the user can decide which tags should be omitted but the design stays modular.
Not only should this added module omit template tags from the input (easy) but it should also write the ignored parts to the output at the right point because output often lags behind input (hard). The way I see it the options are:
- Add the template tag handling to the HTML parser and minifier (bad modularisation)
- Replace template tags by a unique symbol and replace those symbols with the original content after minification (slow and ugly)
- Somehow synchronize input and output so that whenever it ignores a template tag for the HTML parser it writes it to the output too (hard)
The problem with the last option is that it is hard because writing to output always lags behind reading from input, but do we know by how much? If we know that the template tag is at the front of the input buffer (not somewhere further on in the buffer due to peeking), than we can safely assume that we can immediately write to output. Does this hold when the template tag is in an attribute name or value?
Anyone any ideas?
Another problem is the following (awful) example: <span attr="data<?php echo 'data"'; ?>> resulting in <span attr="datadata">, but if you ignore the PHP part (which is what the HTML processor would do), it doesn't know that the attribute has ended. This is not solvable and probably rare, so it is a limitation (which is alright I think).
Shouldn't one minify the result of templates instead of templates to begin with? I think lots of optimization maybe missed if templates are minified instead of the actual compiled html.
Yes, I think so too especially since minification is so fast. But if you would minify templates you only have to minify once (and accept that the inserted HTML from the template will not be minified) and not after generation (typically per client request).
I don't think this one has a high priority, but still it would be a great feature nonetheless.
I don't understand why the minifier is lowercasing .Type in the original example. Surely that's a bug? If the file is an XML flavor of HTML, attributes are case-sensitive; if it's HTML5, custom data attributes are case sensitive.
XML flavor should be parsed by the XML minifier, the HTML minifier is for HTML5 only.
The HTML5 spec specifies that tags and attributes are case-insensitive. It also appears that data tags are case-insensitive when defining them. Accessing them through jQuery is case sensitive though, in the way that it has to be all lower-case or camel case. This means that the attribute definition is internally first converted to lower-case before being matched by jQuery. Lowering case improves gzip compression rates.
For example, try this in the W3 validator:
<!doctype html>
<html>
<head>
<title>test</title>
</head>
<body>
<div data-one="a" data-ONE="b"></div>
</body>
</html>
it will say
Error: Duplicate attribute data-one.
Commented out these 5 lines which fixed the issue for me as I don’t need lowercasing at all: https://github.com/tdewolff/parse/blob/master/util.go#L12-L16
Thanks for the amazing project!
@tdewolff sorry for commenting on a closed issue. What do you think about adding new option to Minifier? Something like DontLowercase bool it will be really helpful and would allow to write such a code
<html lang="{{.Lang}}">
That'll probably solve the problem. I'm going to look into supporting this soon!
@tdewolff can you please take a look at my PR?
I really don't like the name of an option DontLowercaseAttributes so maybe you'll pick a better one.
How about a multi-pass solution like this:
- Parse the original file with template parser into an AST
- Walk the AST to find the byte spans with "special" template logic, collect the spans into a big list
- Pass the original file plus the list of spans to the minifier
- The minifier trims bytes as if the indicated spans don't exist, but keeps track of where they would have gone. a. I suspect this can be done with basic (albeit tedious) byte counting
- When the minifier emits the minified html it re-inserts the template spans from the original file unmodified into the output at the right location
- The result is then fed back into the template parser and used like normal
The minifier API might look something like this: m.BytesPreserveSpans("text/html", b, []Span{{24,42},{106,301},...})
I think this may be the only "correct" way of doing it. If we're minifying just once at program startup it's not a big deal to do multiple passes.
Hey @infogulch , thanks for the effort! I believe the problem is a bit more difficult. For example, if your input is:
{{ $foobar := 'attr-value">' }}
<p attr="{{ print $foobar }} p-tag-content</p>
You'll get into trouble at the HTML parser anyways, as it'll believe that p-tag-content is part of the <p> attribute! This is not fixed with a multi-pass solution, which actually I believe can all happen in a single-pass anyways (also for performance considerations).
I've been thinking recently due to your post, that perhaps the best solution is adding a bit of template support to the HTML parser. It is a common enough problem to warrant the changes to the HTML parser, and it would actually be very little work. Principally, we could detect template tags ({{}}, <?php ?>, <% %>) only inside the content of attributes and the content of tags. I believe this is that the Go template language enforces anyways? When encountered, we could just skip ahead till the end of the template tag. I'm pretty confident that would solve 99% of the cases, what is your take on this?
As far as your first example goes, that's not valid for Go's html/template ( https://go.dev/play/p/pehr0IaMZTo
), and even attempting to solve that problem runs straight into the halting problem. In my opinion if your template code breaks across html syntax boundaries then there's no reasonable expectation for any other tool to be able to understand it, and you deserve whatever XSS hell you dig yourself into. 🤣😬
Yes I think some basic built-in template support would be a great idea. My only concern would be if actually supporting template syntax happens to be more complicated than we expect. Note that go's template allows overriding the template start/end tags {{, }} to whatever you want.
For inspiration, I’m now using html-minifier, which may be slower than this library, but it works excellently with my templates if using --ignore-custom-fragments '/{{[{]?.*?[}]?}}/'. Maybe you could consider something similar and put the responsibility on the users?
After 8 years I've finally come around to implement this, thanks for the patience!
In order to use this, you should try:
m.Add("text/html", &html.Minifier{
TemplateDelims: html.GoTemplateDelims,
})
That's great! If I'm reading correctly the strategy is to skip minifying syntax elements that contain the template delims, is that right?
Exactly, there isn't anything meaningful to do when templating gets involved, as anything can happen. Potentially we could minify text and attribute values surrounding the inner templates, but not sure if it's worth it...
You mean like:
<option value="0" {{ if eq .Type 0 }}selected style=" display: inline ; " {{ end }}>Foo</option>
<option value="0" {{ if eq .Type 0 }}selected style="display:inline;"{{ end }}>Foo</option>
Exactly, the output will be:
<option value=0 {{ if eq .type 0 }}selected style=display:inline {{ end }}>Foo
That is, the selected isn't minified, the style is (the parser doesn't look at the content of the templates, and is template language agnostic)
we could minify text and attribute values surrounding the inner templates
To clarify, you mean that content ~1 token away from a template section is skipped and not minified.
In this case:
<option value="0" {{ if eq .Type 0 }}selected style=" color: green ; " {{ end }}>Foo</option>
... is minified to:
<option value=0 {{ if eq .Type 0 }}selected style=color:green {{ end }}>Foo
In particular, selected is not minified because it's immediately adjacent to a template section (though there's nothing to minify in this case), and just after the style retains one space token because there was a space in the original text. This seems reasonable to me. (Not to talk in circles, I'm just trying to understand.)
What about a case with indentation whitespace?
<ol id="feeds-list">
<li><a href="/">All Feeds</a></li>
{{- range .Feeds}}
{{- block "feed" .}}
<li>
<a href="/feed/{{.id}}/">
{{- if .image}}<img src="{{.image}}">{{- else}}<span>{{.title | substr 0 1 | upper}}</span>{{end}}
{{- .title -}}
</a>
</li>
{{- end}}
{{- end}}
</ol>
I'm guessing it will just collapse the LF+indentation into just LF, like this:
<ol id="feeds-list">
<li><a href="/">All Feeds</a></li>
{{- range .Feeds}}
{{- block "feed" .}}
<li>
<a href="/feed/{{.id}}/">
{{- if .image}}<img src="{{.image}}">{{- else}}<span>{{.title | substr 0 1 | upper}}</span>{{end}}
{{- .title -}}
</a>
</li>
{{- end}}
{{- end}}
</ol>
(Of course Go's use of - to strip whitespace next to a template section would complicate the final rendered output, but I think we can focus on what minify does to the template text before being parsed by html/template.)
You are correct. Basically <option value="0" {{ if eq .Type 0 }}selected style=" color: green ; " {{ end }}>Foo</option> is seen as <option value="0" {{***}}selected style=" color: green ; " {{***}}>Foo</option>, and thus we do minify the style attribute.
In your second example, it will disable minifying when the content has a template, so <a href="/feed/{{.id}}/"> Click here {{.title | substr 0 1 | upper}} </a> and <script> var a = '{{.}}';</script> will both keep their content as-is.
What I meant was that we could replace {{***}} with some unique identifier, minify, and replace the identifier with the original template. This would allow us to minify <script> var a = '{{.}}';</script> to e.g. <script> var a = 'OWHJNM728';</script> => <script>var a='OWHJNM728'</script> => <script>var a='{{.}}'</script>. This is a bit complicate however...
Thank you for clarifying, that all makes sense.
I agree that replacing with an identifier could be a bit messy. It might be cleaner if you replace it with a single character like A and kept track of the positions of deleted characters (I think minify only deletes characters, is that right?), then project through the relocations to replace back. This might be more accurate, though probably not less complicated.
True, or use the \uFFFC Unicode object replacement character for that. The problem is that after replacing the templates e.g. in a piece of JavaScript, we then send it to the JS minifier which expect valid JS. It isn't trivial to create a unique identifier or replacement character that results in valid JS and cannot collide with existing definitions in the script. The same for any other mimetype/minifier...