grunt-usemin
grunt-usemin copied to clipboard
Parse HTML instead of using regex
As many issues on Usemin show, parsing HTML with regex is prone to failure.
Would you accept PRs to switch Usemin to an HTML5 parser, or is there a reason that the project prefers regexes?
im curious what that would look like
@carols10cents ping.
Hi,
Sorry for the delay in answering. Yes, I would accept PRs: this is something I want to do since a moment now, but being caught by other activities :(
While this is not fixed you can use patterns
option, but make sure that you use regexp with multiple instance behaviour (/g
).
This should be advised in the README.md
IMHO.
usemin: {
html: ['<%= yeoman.dist %>/{,*/}*.html'],
css: ['<%= yeoman.dist %>/styles/{,*/}*.css'],
js: '<%= yeoman.dist %>/scripts/*.js',
options: {
assetsDirs: ['<%= yeoman.dist %>', '<%= yeoman.dist %>/images'],
patterns: {
// FIXME While usemin won't have full support for revved files we have to put all references manually here
js: [
[/(customer-icon\.svg)/g, 'Replacing reference to customer-icon.png'],
[/(iitail-logo\.svg)/g, 'Replacing reference to iitail-logo.png'],
[/(no_image\.png)/g, 'Replacing reference to no_image.png'],
[/(printer-iconx2\.png)/g, 'Replacing reference to printer-iconx2.png'],
[/(email-iconx2\.png)/g, 'Replacing reference to email-iconx2.png'],
[/(coins\.png)/g, 'Replacing reference to coins.png']
]
}
}
},
Thanks! :)
@eddiemonge Re, what this would look like: Instead of running regex on HTML we would use an HTML parser to build a document tree, and then would be able to read values with much more certainty. We could use a CSS parser like postCSS to do the same thing with CSS. JS and other templating languages would still need a patterns/regex matcher.
I'm not sure if this would speed up or slow down the task, but checking well defined object properties for a value, and if the value matches replacing would increase the reliability of parsing. A high percentage of the issues here are "Usemin misses this reference [because it's not in a place the regex expects]" or "Usemin chokes on this string", and I still regularly run into areas where Usemin chokes on some markup.
I think this would be a very good improvement. But it might be a big enough change that little of the original project is left. I wonder if it would be better done as an alternative project (which might take a while to mature) rather than as a swap-out-the-heart pull request on this one.
why an alternative project? i fully agree with @robwierzbowski this is the only future proof way to go for this project because regex causes so many problems and fails in variouse situations. parsing an html file is a core feature of usemin so this part should be absolute failsave and thats only possible using a full html parser and no regex.
@carols10cents and I were thinking of a simplified, more declarative workflow, and wanted to start from scratch. If @eddiemonge / other maintainers want to discuss, I'm sure we'd be interested in prototyping in separate packages and seeing if we could merge in here eventually.
But just FYI, probably not going to start work on it till later in the spring.
I'm moving my work in this space to gulp, so won't be contributing here. Leaving the issue open for anyone else that wants to pick up the torch.
cheerio is a module that gives you a jQuery API. Its much better suited for parsing html, than regex.
We hit a failure because we commented out one of the script tags in a build block. Never expected it to still parse and include the script. This caused a stitch-min only error for us, which is a bummer to debug.
I understand why this happened now that I understand this works on regex, but it does seem like an HTML-based design would be more robust and operate more predictably.
Very helpful Stack Overflow page, explains whether HTML should be parsed using regex:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Just wanted to say, that this "bug" can be a "feature", too. You can comment out a script blog which shouldn't be loaded for local development, but will be included by usemin for the optimized version.
why would want to include a comment out file?
Say you have something which becomes a js file ony in the deploy process, but is something else (like an HTML file) within development. This is mostly true for templates. Some task can add these compiled templates to an existing usemin config like https://github.com/ericclemmons/grunt-angular-templates does, but not all plugins have this option.
Then why not create a plugin that has the option?
I'm rather curious what the rationale for using regex to parse HTML was in the first place — like, hasn't everyone seen this StackOverflow answer?
Another reason for changing the implementation to parse the DOM instead of using regex is because JavaScript regex can't repeat capturing groups, which makes it incredibly difficult to support something like the srcset
attribute, which can contain multiple series of filenames. See #428.
@aendrew srcset is a serious problem with the current architecture, but as you can see here this project currently lacks a maintainer. There is currently no one capable of making such decisions and rewrites. I think https://github.com/yeoman/grunt-usemin/issues/428 isn't the only tough problem. There is also the multiTarget problem (https://github.com/yeoman/grunt-usemin/issues/255) or the problem if you want to rev a file which hasn't changed, but contains references/links with revved files, so you need to re-rev it because of changed URLs. Besides that there are (imo) some API problems (like adding patterns without overwriting the default patterns). I think these problems could qualify a complete rewrite of grunt-usemin. (Just my opinion.)
My sense is that the time for usemin may be passing anyway, so finding another eager maintainer may be difficult and unnecessary. We have projects here still using grunt-usemin and gulp-usemin, but I believe a more modern approach is to automatically determine the list of files to be included using a clever file glob, a module system which understands files (like require) or add-on which uses the angular module system along with a module to file path naming convention to determine the full list of files to load.
@kylecordes I don't think it's just about determining a list of files to include. Usemin power IMHO is it's ability to replace and revved a whole repository full of links which is not a so easy task. HTML parsing should help solving some issues but it's just a part of the whole system.
I will look into using an HTML parser, but no promise. There are lots of issues which I think most of them are users lost in usemin or going the wrong way.
cheerio or jsdom might be a good option. they both depend on htmlparser2
@stevemao thanks for the info. Will look onto it.
thanks for the info. Will look onto it.
@stephanebachelier What's your progress in this?
@arthurvr still working on it, will come back soon.
Just to let you know that I'm working on this as a top priority.
oh thats so great to hear thanks @stephanebachelier
To all the dev
branch is a migration from regexps to an HTML Parser thanks to work from @marcelaraujo.
If anyone wants to try the dev
branch and create an issue. I will review all the linked issues and add some tests cases to be sure.
@stephanebachelier Will the dev branch be merged soon? I'm using v3.1.1 and still having the exact same issue.
@lvarayut Not sure about the time. But I will migrate an existing customer project on the dev branch in the next two weeks so expect something to happen soon.