markdownlint-cli icon indicating copy to clipboard operation
markdownlint-cli copied to clipboard

Suboptimal performance when a large folder is listed in .markdownlintignore

Open kachkaev opened this issue 5 years ago • 17 comments

I have a project with a local data folder, which is listed in all .*ignore files. When this temp folder gets quite large (more than 10K files), Markdownlint seems to become much slower than other linting tools (Prettier and ESLint). This suggests that the ignore logic could be improved.

Here are the reproduction steps for macOS Catalina and zsh. I only compare Markdownlint with Prettier here for simplicity, but according to my observations, ESLint also remains quite performant.

  1. Init a project with README.md , Prettier, Markdownlint and identical .prettierignore + .markdownlintignore files.

    mkdir /tmp/slow-markdownlint-repro && cd "$_"
    
    yarn init --yes
    yarn add markdownlint-cli prettier
    
    cat <<EOF >README.md
    # Hello world
    
    test document
    EOF
    
    cat <<EOF >.prettierignore
    *.*
    !*.md
    
    my-big-temp-folder
    node_modules
    EOF
    
    cp .prettierignore .markdownlintignore
    
  2. Run Prettier and Markdownlint and observe reasonable performance.

    yarn prettier --check "**/*"
    ## a couple of secs
    
    yarn markdownlint "**/*"
    ## a couple of secs
    
  3. Create a large folder, which is already listed in the ignore files.

    cp -r $(yarn cache dir) my-big-temp-folder
    
    du -sh my-big-temp-folder
    ## 4.2G in my case 👀
    
  4. Run Prettier and Markdownlint again.

    yarn prettier --check "**/*"
    ## still a couple of secs
    
    yarn markdownlint "**/*"
    ## 115 secs in my case 👈
    
  5. Delete the project to release disk space.

    cd ~
    rm -rf /tmp/slow-markdownlint-repro
    

It feels like Markdownlint CLI still lists all the files in the ignored folder and then matches each path against the ignore list, while Prettier does not even look into my-big-temp-folder.

I don’t know Markdownlint internals well enough to submit a PR, but I hope that my repro steps may help with the investigation 🙌

I believe that having large local data / var / temp folders is not that rare in software development, so I won’t be the only person who would benefit from performance improvements 🙂

kachkaev avatar Jul 18 '20 17:07 kachkaev

This is great detail, thank you very much! I‘be been working on an experiment that I’d like you to try in a week or so that may help with this. I’ll follow up here when it’s ready.

DavidAnson avatar Jul 18 '20 18:07 DavidAnson

I've published something new: https://github.com/DavidAnson/markdownlint-cli2

It's a slightly different take on a CLI, please see the README for context.

For your purposes, something like the following should be equivalent to what you have today. But faster (according to my own limited testing).

Could you please give it a try and let me know what you find?

.markdownlintignore:

my-big-temp-folder
node_modules

Command-line:

markdownlint-cli2 "**/*.md"

Aside: The pattern you show of including all files and then excluding Markdown files in the ignore file is not supported by CLI2 – and kind of weird anyway. I wouldn't guess it's doing much to help with performance, though I agree that in the abstract it is equivalent to what I show above.

DavidAnson avatar Jul 21 '20 01:07 DavidAnson

By the way, you could put the two ignore paths on the command-line, but I was trying to match your current configuration as closely as possible.

DavidAnson avatar Jul 21 '20 02:07 DavidAnson

Hey David, thanks for the follow up! I’ll try testing the new CLI performance over the weekend, fingers crossed!

I understand that the .markdownlintignore approach I’ve shared may look a bit odd, but it has its reasons. Excluding all files and then allowing certain extensions helps configure several linting tools consistenly. For me, this includes ESLint, Prettier and also Markdownlint since recently. By moving all include patterns into the corresponding ignore file, I can forget about providing the right arguments to CLI and also not worry about editor integration. This reduces repetition of complex globs and thus the probability of human errors.

If for Markdownlint the rule is simply to include *.md(x) files, for other tools the CLI glob would be way more sophisticated. Support for multiple extensions is just one source of complexity, we also need to account for build folders, git-ignored paths and so on. In package.json, there are linting and fixing scripts as well as pre-commit hooks – all three require some kind of globs. Besides, there is an IDE, which would need its own tweaking if the 'source of truth' for ignoring and including files is in CLI args and not in a corresponding .*ignore file.

See this project as an example: https://github.com/kachkaev/njt. It is quite small so does not suffer from CLI performance issues, but it has a full-blown config for the linters. Hope it helps illustrate the coherence being achieved with the described approach to the .*ignore files.

kachkaev avatar Jul 21 '20 12:07 kachkaev

That reasoning makes sense, but interferes with what I’m trying to do which is represent all of the positive and negative pattern matching on the command line so it can be passed into the glob tool for best performance. As you suggested above, the slow implementation of ignore is to apply the ignore file rules with a filter after using the globs to search the disk. I looked briefly at ESLint and it seems that’s what they do, though they apply some default ignore rules at glob time to filter out, for example, the node_modules directory which is probably why it’s not also slow. Though I didn’t analyze it in depth, allowing negative patterns in the ignore file didn’t seem to work with this approach because they are supposed to be scoped to just the ignore operation (“!*.md” in an ignore file does not mean the same as “*.md” on the command line). I may be wrong about this or you may point out a way to handle it better. I’d be happy to consider either! But for now, this compromise seemed to have a fairly minor effect on users but a potentially significant effect on performance and so that’s what I’ve gone with.

DavidAnson avatar Jul 21 '20 15:07 DavidAnson

FYI, I've just published version 0.0.3 of markdownlint-cli2. It now supports everything markdownlint-cli does (though differently). The globbing behavior discussed above is not changed vs. version 0.0.2 - did you get a chance to try that out for your scenario?

DavidAnson avatar Jul 31 '20 21:07 DavidAnson

I've got a .markdownlintignore file with this contents:

node_modules
/prototypes

I've got 35 *.md files and running plain markdownlint in the folder takes about half a minute.

I tried constructing the command differently, not depending on .markdownlintignore, and am betting sub-second time:

npx markdownlint -i prototypes $(fd -g "*.md" -E "prototypes" . | tr '\n' ' ')

Is this the same issue reported here?

borekb avatar Aug 09 '20 18:08 borekb

@borekb Yes, seems like the same thing. However, both approaches should be fast in markdownlint-cli2. If you confirm or refute that, please let me know.

DavidAnson avatar Aug 09 '20 19:08 DavidAnson

Thanks, I'll try to experiment with it sometimes soon.

borekb avatar Aug 09 '20 19:08 borekb

@DavidAnson I can confirm that cli2 doesn't have this problem 🎉.

borekb avatar Aug 13 '20 15:08 borekb

I ran my reproduction steps, but replace markdownlint with markdownlint-cli2. Running yarn markdownlint-cli2 "**/*" took less than a second instead of >100 seconds! 🎉 Many thanks for your work @DavidAnson, v2 looks promising! 🚀

kachkaev avatar Aug 16 '20 14:08 kachkaev

I can also confirm markdownlint-cli2 is way faster than markdownlint, great job @DavidAnson! ❤️

~~So what will happen to the original CLI now? Are you going to deprecate it? Also calling the new CLI via markdownlint-cli2 seems a bit unusual compared to other npm scripts…~~

Edit: I now realized the original CLI is from different author, sorry 🤦.

adamkudrna avatar Aug 21 '20 16:08 adamkudrna

@adamkudrna Thanks for confirming that! My motivation for doing CLI2 was that there were a number of features/improvements that I didn't see a clean way of adding to CLI1. I will continue updating the markdownlint dependency and keep the tests passing and review pull requests for CLI1, but will probably dedicate most of my time on features/requests to CLI2.

DavidAnson avatar Aug 21 '20 16:08 DavidAnson

Just tried [email protected] in a real project in which I experienced slow performance with [email protected]. Seeing:

RangeError: Maximum call stack size exceeded
    at main (/path/to/project/node_modules/markdownlint-cli2/markdownlint-cli2.js:179:30)
    at async /path/to/project/node_modules/markdownlint-cli2/markdownlint-cli2.js:356:9
error Command failed with exit code 2.

The stack points to this line: dirInfo.parent.files.push(...dirInfo.files);. Happy to help with the investigation if there are ideas or new versions to try out 🙂

kachkaev avatar Aug 22 '20 12:08 kachkaev

Great, thank you for letting me know! We should probably move this investigation to an issue in the repository for CLI2. Could you please open something there? Looking at the info above, that’s not a particularly deep call stack, so I’m guessing this is more to do with Promise chains? Also, it might help to know which repository you are seeing this problem with. If you aren’t comfortable sharing that publicly, we can work something out.

DavidAnson avatar Aug 22 '20 16:08 DavidAnson

Quick update, I don’t need an example to reproduce this, I think I figured out what’s going on by inspection. (The message is a little bit misleading.)

DavidAnson avatar Aug 23 '20 02:08 DavidAnson

@kachkaev, that RangeError should be fixed in markdownlint-cli2 version 0.0.6, thanks!

Some info about the bug: https://twitter.com/DavidAns/status/1297361113386872833

DavidAnson avatar Aug 23 '20 22:08 DavidAnson