grunt-usemin Parse HTML instead of using regex

As many issues on Usemin show, parsing HTML with regex is prone to failure.

Would you accept PRs to switch Usemin to an HTML5 parser, or is there a reason that the project prefers regexes?

Nov 20 '13 20:11 robwierzbowski

im curious what that would look like

Nov 20 '13 22:11 eddiemonge

@carols10cents ping.

Nov 20 '13 23:11 robwierzbowski

Hi,

Sorry for the delay in answering. Yes, I would accept PRs: this is something I want to do since a moment now, but being caught by other activities :(

Nov 21 '13 05:11 sleeper

While this is not fixed you can use patterns option, but make sure that you use regexp with multiple instance behaviour (/g).

This should be advised in the README.md IMHO.

    usemin: {
      html: ['<%= yeoman.dist %>/{,*/}*.html'],
      css: ['<%= yeoman.dist %>/styles/{,*/}*.css'],
      js: '<%= yeoman.dist %>/scripts/*.js',
      options: {
        assetsDirs: ['<%= yeoman.dist %>', '<%= yeoman.dist %>/images'],
        patterns: {
          // FIXME While usemin won't have full support for revved files we have to put all references manually here
          js: [
              [/(customer-icon\.svg)/g, 'Replacing reference to customer-icon.png'],
              [/(iitail-logo\.svg)/g, 'Replacing reference to iitail-logo.png'],
              [/(no_image\.png)/g, 'Replacing reference to no_image.png'],
              [/(printer-iconx2\.png)/g, 'Replacing reference to printer-iconx2.png'],
              [/(email-iconx2\.png)/g, 'Replacing reference to email-iconx2.png'],
              [/(coins\.png)/g, 'Replacing reference to coins.png']
          ]
        }
      }
    },

Thanks! :)

Dec 02 '13 16:12 marcalj

@eddiemonge Re, what this would look like: Instead of running regex on HTML we would use an HTML parser to build a document tree, and then would be able to read values with much more certainty. We could use a CSS parser like postCSS to do the same thing with CSS. JS and other templating languages would still need a patterns/regex matcher.

I'm not sure if this would speed up or slow down the task, but checking well defined object properties for a value, and if the value matches replacing would increase the reliability of parsing. A high percentage of the issues here are "Usemin misses this reference [because it's not in a place the regex expects]" or "Usemin chokes on this string", and I still regularly run into areas where Usemin chokes on some markup.

Jan 19 '14 01:01 robwierzbowski

I think this would be a very good improvement. But it might be a big enough change that little of the original project is left. I wonder if it would be better done as an alternative project (which might take a while to mature) rather than as a swap-out-the-heart pull request on this one.

Jan 19 '14 02:01 kylecordes

why an alternative project? i fully agree with @robwierzbowski this is the only future proof way to go for this project because regex causes so many problems and fails in variouse situations. parsing an html file is a core feature of usemin so this part should be absolute failsave and thats only possible using a full html parser and no regex.

Mar 24 '14 20:03 lhwparis

@carols10cents and I were thinking of a simplified, more declarative workflow, and wanted to start from scratch. If @eddiemonge / other maintainers want to discuss, I'm sure we'd be interested in prototyping in separate packages and seeing if we could merge in here eventually.

But just FYI, probably not going to start work on it till later in the spring.

Mar 24 '14 20:03 robwierzbowski

I'm moving my work in this space to gulp, so won't be contributing here. Leaving the issue open for anyone else that wants to pick up the torch.

Jun 23 '14 19:06 robwierzbowski

cheerio is a module that gives you a jQuery API. Its much better suited for parsing html, than regex.

Jun 25 '14 17:06 ralyodio

We hit a failure because we commented out one of the script tags in a build block. Never expected it to still parse and include the script. This caused a stitch-min only error for us, which is a bummer to debug.

I understand why this happened now that I understand this works on regex, but it does seem like an HTML-based design would be more robust and operate more predictably.

Jul 30 '14 17:07 SimplGy

Very helpful Stack Overflow page, explains whether HTML should be parsed using regex:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Jul 30 '14 20:07 kylecordes

Just wanted to say, that this "bug" can be a "feature", too. You can comment out a script blog which shouldn't be loaded for local development, but will be included by usemin for the optimized version.

Jul 31 '14 06:07 donaldpipowitch

why would want to include a comment out file?

Jul 31 '14 06:07 ralyodio

Say you have something which becomes a js file ony in the deploy process, but is something else (like an HTML file) within development. This is mostly true for templates. Some task can add these compiled templates to an existing usemin config like https://github.com/ericclemmons/grunt-angular-templates does, but not all plugins have this option.

Jul 31 '14 06:07 donaldpipowitch

Then why not create a plugin that has the option?

Jul 31 '14 16:07 eddiemonge

I'm rather curious what the rationale for using regex to parse HTML was in the first place — like, hasn't everyone seen this StackOverflow answer?

Another reason for changing the implementation to parse the DOM instead of using regex is because JavaScript regex can't repeat capturing groups, which makes it incredibly difficult to support something like the srcset attribute, which can contain multiple series of filenames. See #428.

Oct 20 '14 16:10 aendra-rininsland

@aendrew srcset is a serious problem with the current architecture, but as you can see here this project currently lacks a maintainer. There is currently no one capable of making such decisions and rewrites. I think https://github.com/yeoman/grunt-usemin/issues/428 isn't the only tough problem. There is also the multiTarget problem (https://github.com/yeoman/grunt-usemin/issues/255) or the problem if you want to rev a file which hasn't changed, but contains references/links with revved files, so you need to re-rev it because of changed URLs. Besides that there are (imo) some API problems (like adding patterns without overwriting the default patterns). I think these problems could qualify a complete rewrite of grunt-usemin. (Just my opinion.)

Oct 21 '14 06:10 donaldpipowitch

My sense is that the time for usemin may be passing anyway, so finding another eager maintainer may be difficult and unnecessary. We have projects here still using grunt-usemin and gulp-usemin, but I believe a more modern approach is to automatically determine the list of files to be included using a clever file glob, a module system which understands files (like require) or add-on which uses the angular module system along with a module to file path naming convention to determine the full list of files to load.

Oct 21 '14 14:10 kylecordes

@kylecordes I don't think it's just about determining a list of files to include. Usemin power IMHO is it's ability to replace and revved a whole repository full of links which is not a so easy task. HTML parsing should help solving some issues but it's just a part of the whole system.

I will look into using an HTML parser, but no promise. There are lots of issues which I think most of them are users lost in usemin or going the wrong way.

Oct 25 '14 01:10 stephanebachelier

cheerio or jsdom might be a good option. they both depend on htmlparser2

Nov 14 '14 02:11 stevemao

@stevemao thanks for the info. Will look onto it.

Nov 15 '14 01:11 stephanebachelier

thanks for the info. Will look onto it.

@stephanebachelier What's your progress in this?

Feb 19 '15 16:02 arthurvr

@arthurvr still working on it, will come back soon.

Feb 19 '15 18:02 stephanebachelier

Just to let you know that I'm working on this as a top priority.

Mar 11 '15 23:03 stephanebachelier

oh thats so great to hear thanks @stephanebachelier

Mar 11 '15 23:03 lhwparis

To all the dev branch is a migration from regexps to an HTML Parser thanks to work from @marcelaraujo. If anyone wants to try the dev branch and create an issue. I will review all the linked issues and add some tests cases to be sure.

Apr 13 '15 15:04 stephanebachelier

@stephanebachelier Will the dev branch be merged soon? I'm using v3.1.1 and still having the exact same issue.

Dec 24 '15 13:12 lvarayut

@lvarayut Not sure about the time. But I will migrate an existing customer project on the dev branch in the next two weeks so expect something to happen soon.

Dec 24 '15 13:12 stephanebachelier

grunt-usemin grunt-usemin copied to clipboard

Parse HTML instead of using regex

grunt-usemin
grunt-usemin copied to clipboard