minify icon indicating copy to clipboard operation
minify copied to clipboard

HTML: better support for template syntax

Open stroborobo opened this issue 10 years ago • 10 comments

Hi,

I just noticed that if I have a "Type" field in my data struct and use minify for my go template files, it lowercases "Type".

So this:

<option value="0" {{ if eq .Type 0 }}selected{{ end }}>Foo</option>

Becomes this:

<option value=0 {{ if eq .type 0 }}selected{{ end }}>Foo</option>

And it seems like it's not possible to use text/template and minify for a javascript file together, does it? I know, that's probably quite an edge case :)

EDIT: Just tested this, doesn't work unfortunately, but I could still use minify as a package and minify the generates JS, so that's not too much of a problem for my case.

stroborobo avatar Jul 27 '15 10:07 stroborobo

That's correct, the minifiers work according to the specs, and templates are really a syntax layer over original file format. You could make a preprocessor to skip everything between double brackets, but you have to somehow reinsert them after minifying. Like using a unique ID for replacement and afterwards substitute back.

I'm still thinking about a better solution, but it's not as easy as it looks...

tdewolff avatar Aug 10 '15 07:08 tdewolff

Some thoughts I needed to write down:

It would be an awesome feature if the minifier (and then specifically the HTML minifier) could handle Go templates correctly (ignore {{ ... }}). But it would be even better to make that more general so it would ignore common constructs such as <?php ... ?>, <% ... %>, {{ ... }}, etc.

I don't really want to embed that in the HTML parser so it can stay spec compliant and doesn't get cluttered-up with exceptions. Rather there should be a construct where you can wrap the HTML minifier in a 'template-tag-ignorer' or the other way around, so that the user can decide which tags should be omitted but the design stays modular.

Not only should this added module omit template tags from the input (easy) but it should also write the ignored parts to the output at the right point because output often lags behind input (hard). The way I see it the options are:

  • Add the template tag handling to the HTML parser and minifier (bad modularisation)
  • Replace template tags by a unique symbol and replace those symbols with the original content after minification (slow and ugly)
  • Somehow synchronize input and output so that whenever it ignores a template tag for the HTML parser it writes it to the output too (hard)

The problem with the last option is that it is hard because writing to output always lags behind reading from input, but do we know by how much? If we know that the template tag is at the front of the input buffer (not somewhere further on in the buffer due to peeking), than we can safely assume that we can immediately write to output. Does this hold when the template tag is in an attribute name or value?

Anyone any ideas?

Another problem is the following (awful) example: <span attr="data<?php echo 'data"'; ?>> resulting in <span attr="datadata">, but if you ignore the PHP part (which is what the HTML processor would do), it doesn't know that the attribute has ended. This is not solvable and probably rare, so it is a limitation (which is alright I think).

tdewolff avatar Jan 28 '16 09:01 tdewolff

Shouldn't one minify the result of templates instead of templates to begin with? I think lots of optimization maybe missed if templates are minified instead of the actual compiled html.

omeid avatar Jan 28 '16 10:01 omeid

Yes, I think so too especially since minification is so fast. But if you would minify templates you only have to minify once (and accept that the inserted HTML from the template will not be minified) and not after generation (typically per client request).

I don't think this one has a high priority, but still it would be a great feature nonetheless.

tdewolff avatar Jan 28 '16 11:01 tdewolff

I don't understand why the minifier is lowercasing .Type in the original example. Surely that's a bug? If the file is an XML flavor of HTML, attributes are case-sensitive; if it's HTML5, custom data attributes are case sensitive.

lpar avatar Apr 28 '16 02:04 lpar

XML flavor should be parsed by the XML minifier, the HTML minifier is for HTML5 only.

The HTML5 spec specifies that tags and attributes are case-insensitive. It also appears that data tags are case-insensitive when defining them. Accessing them through jQuery is case sensitive though, in the way that it has to be all lower-case or camel case. This means that the attribute definition is internally first converted to lower-case before being matched by jQuery. Lowering case improves gzip compression rates.

For example, try this in the W3 validator:

<!doctype html>
<html>
<head>
<title>test</title>
</head>
<body>
  <div data-one="a" data-ONE="b"></div>
</body>
</html>

it will say

Error: Duplicate attribute data-one.

tdewolff avatar Apr 28 '16 07:04 tdewolff

Commented out these 5 lines which fixed the issue for me as I don’t need lowercasing at all: https://github.com/tdewolff/parse/blob/master/util.go#L12-L16

Thanks for the amazing project!

zengabor avatar May 03 '18 13:05 zengabor

@tdewolff sorry for commenting on a closed issue. What do you think about adding new option to Minifier? Something like DontLowercase bool it will be really helpful and would allow to write such a code

<html lang="{{.Lang}}">

alex-bacart avatar Jun 15 '20 00:06 alex-bacart

That'll probably solve the problem. I'm going to look into supporting this soon!

tdewolff avatar Jun 15 '20 00:06 tdewolff

@tdewolff can you please take a look at my PR? I really don't like the name of an option DontLowercaseAttributes so maybe you'll pick a better one.

alex-bacart avatar Jul 04 '20 15:07 alex-bacart

How about a multi-pass solution like this:

  1. Parse the original file with template parser into an AST
  2. Walk the AST to find the byte spans with "special" template logic, collect the spans into a big list
  3. Pass the original file plus the list of spans to the minifier
  4. The minifier trims bytes as if the indicated spans don't exist, but keeps track of where they would have gone. a. I suspect this can be done with basic (albeit tedious) byte counting
  5. When the minifier emits the minified html it re-inserts the template spans from the original file unmodified into the output at the right location
  6. The result is then fed back into the template parser and used like normal

The minifier API might look something like this: m.BytesPreserveSpans("text/html", b, []Span{{24,42},{106,301},...})

I think this may be the only "correct" way of doing it. If we're minifying just once at program startup it's not a big deal to do multiple passes.

infogulch avatar Jul 24 '23 05:07 infogulch

Hey @infogulch , thanks for the effort! I believe the problem is a bit more difficult. For example, if your input is:

{{ $foobar := 'attr-value">' }}
<p attr="{{ print $foobar }} p-tag-content</p>

You'll get into trouble at the HTML parser anyways, as it'll believe that p-tag-content is part of the <p> attribute! This is not fixed with a multi-pass solution, which actually I believe can all happen in a single-pass anyways (also for performance considerations).

I've been thinking recently due to your post, that perhaps the best solution is adding a bit of template support to the HTML parser. It is a common enough problem to warrant the changes to the HTML parser, and it would actually be very little work. Principally, we could detect template tags ({{}}, <?php ?>, <% %>) only inside the content of attributes and the content of tags. I believe this is that the Go template language enforces anyways? When encountered, we could just skip ahead till the end of the template tag. I'm pretty confident that would solve 99% of the cases, what is your take on this?

tdewolff avatar Jul 27 '23 10:07 tdewolff

As far as your first example goes, that's not valid for Go's html/template ( https://go.dev/play/p/pehr0IaMZTo ), and even attempting to solve that problem runs straight into the halting problem. In my opinion if your template code breaks across html syntax boundaries then there's no reasonable expectation for any other tool to be able to understand it, and you deserve whatever XSS hell you dig yourself into. 🤣😬

Yes I think some basic built-in template support would be a great idea. My only concern would be if actually supporting template syntax happens to be more complicated than we expect. Note that go's template allows overriding the template start/end tags {{, }} to whatever you want.

infogulch avatar Jul 27 '23 18:07 infogulch

For inspiration, I’m now using html-minifier, which may be slower than this library, but it works excellently with my templates if using --ignore-custom-fragments '/{{[{]?.*?[}]?}}/'. Maybe you could consider something similar and put the responsibility on the users?

zengabor avatar Jul 28 '23 16:07 zengabor

After 8 years I've finally come around to implement this, thanks for the patience!

In order to use this, you should try:

m.Add("text/html", &html.Minifier{
    TemplateDelims: html.GoTemplateDelims,
})

tdewolff avatar Oct 30 '23 20:10 tdewolff

That's great! If I'm reading correctly the strategy is to skip minifying syntax elements that contain the template delims, is that right?

infogulch avatar Oct 30 '23 21:10 infogulch

Exactly, there isn't anything meaningful to do when templating gets involved, as anything can happen. Potentially we could minify text and attribute values surrounding the inner templates, but not sure if it's worth it...

tdewolff avatar Oct 30 '23 21:10 tdewolff

You mean like:

<option value="0" {{ if eq .Type 0 }}selected   style="  display:   inline ; "   {{ end }}>Foo</option>

<option value="0" {{ if eq .Type 0 }}selected style="display:inline;"{{ end }}>Foo</option>

infogulch avatar Oct 30 '23 21:10 infogulch

Exactly, the output will be:

<option value=0 {{ if eq .type 0 }}selected style=display:inline {{ end }}>Foo

That is, the selected isn't minified, the style is (the parser doesn't look at the content of the templates, and is template language agnostic)

tdewolff avatar Oct 30 '23 21:10 tdewolff

we could minify text and attribute values surrounding the inner templates

To clarify, you mean that content ~1 token away from a template section is skipped and not minified.

In this case:

<option value="0" {{ if eq .Type 0 }}selected   style="  color:   green ; "   {{ end }}>Foo</option>

... is minified to:

<option value=0 {{ if eq .Type 0 }}selected style=color:green {{ end }}>Foo

In particular, selected is not minified because it's immediately adjacent to a template section (though there's nothing to minify in this case), and just after the style retains one space token because there was a space in the original text. This seems reasonable to me. (Not to talk in circles, I'm just trying to understand.)

What about a case with indentation whitespace?

		<ol id="feeds-list">
			<li><a href="/">All Feeds</a></li>
			{{- range .Feeds}}
			{{- block "feed" .}}
			<li>
				<a href="/feed/{{.id}}/">
					{{- if .image}}<img src="{{.image}}">{{- else}}<span>{{.title | substr 0 1 | upper}}</span>{{end}}
					{{- .title -}}
				</a>
			</li>
			{{- end}}
			{{- end}}
		</ol>

I'm guessing it will just collapse the LF+indentation into just LF, like this:

<ol id="feeds-list">
<li><a href="/">All Feeds</a></li>
{{- range .Feeds}}
{{- block "feed" .}}
<li>
<a href="/feed/{{.id}}/">
{{- if .image}}<img src="{{.image}}">{{- else}}<span>{{.title | substr 0 1 | upper}}</span>{{end}}
{{- .title -}}
</a>
</li>
{{- end}}
{{- end}}
</ol>

(Of course Go's use of - to strip whitespace next to a template section would complicate the final rendered output, but I think we can focus on what minify does to the template text before being parsed by html/template.)

infogulch avatar Oct 30 '23 21:10 infogulch

You are correct. Basically <option value="0" {{ if eq .Type 0 }}selected style=" color: green ; " {{ end }}>Foo</option> is seen as <option value="0" {{***}}selected style=" color: green ; " {{***}}>Foo</option>, and thus we do minify the style attribute.

In your second example, it will disable minifying when the content has a template, so <a href="/feed/{{.id}}/"> Click here {{.title | substr 0 1 | upper}} </a> and <script> var a = '{{.}}';</script> will both keep their content as-is.

What I meant was that we could replace {{***}} with some unique identifier, minify, and replace the identifier with the original template. This would allow us to minify <script> var a = '{{.}}';</script> to e.g. <script> var a = 'OWHJNM728';</script> => <script>var a='OWHJNM728'</script> => <script>var a='{{.}}'</script>. This is a bit complicate however...

tdewolff avatar Oct 31 '23 14:10 tdewolff

Thank you for clarifying, that all makes sense.

I agree that replacing with an identifier could be a bit messy. It might be cleaner if you replace it with a single character like A and kept track of the positions of deleted characters (I think minify only deletes characters, is that right?), then project through the relocations to replace back. This might be more accurate, though probably not less complicated.

infogulch avatar Oct 31 '23 22:10 infogulch

True, or use the \uFFFC Unicode object replacement character for that. The problem is that after replacing the templates e.g. in a piece of JavaScript, we then send it to the JS minifier which expect valid JS. It isn't trivial to create a unique identifier or replacement character that results in valid JS and cannot collide with existing definitions in the script. The same for any other mimetype/minifier...

tdewolff avatar Nov 02 '23 14:11 tdewolff