tiny-glob

Some files are missing when doing glob

Open chenxsan opened this issue 7 years ago • 10 comments

I have this code:

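const glob = require('tiny-glob')

// `content` (the base directory) and `callback` come from the surrounding code
const searchPattern = '{post,page}/**/*.{md}'
glob(searchPattern, {
  cwd: content,
  absolute: true
})
  .then(files => {
    console.log(files.length)
    callback(null, files)
  })
  .catch(err => callback(err))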

I would expect the files matched by {post,page}/**/*.{md} to be the union of those matched by {post}/**/*.{md} and {page}/**/*.{md}, but that's not what I get.

I have 2 files for {page}/**/*.{md} and 53 files for {post}/**/*.{md}, but only 33 for {post,page}/**/*.{md}. Am I doing something wrong here? The same searchPattern works fine with node-glob and fast-glob.

chenxsan avatar Sep 01 '18 05:09 chenxsan

Here's the line causing the problem: https://github.com/terkelg/tiny-glob/blob/master/index.js#L34. If I remove it, everything works as expected in my case, and no tests fail.

chenxsan avatar Sep 01 '18 16:09 chenxsan

I just put up a failing test here https://github.com/chenxsan/tiny-glob/commit/18f54bc003973335915f333ec0ab841ef2c82c71

chenxsan avatar Sep 01 '18 17:09 chenxsan

Hi Sam! Thanks for having a look at the issue. Does it affect the benchmarks when you remove that line? It could also be a problem with the regex coming from globrex. Can you print the regex and the file and test them?

The idea is that every path segment (i.e. directory name) is checked before tiny-glob starts crawling that directory. When globrex converts a glob, it also breaks it into smaller regex segments, one for each folder/path segment. This lets tiny-glob check each directory and avoid spending time crawling folders that can never produce a match. I suspect the regex for the glob {post,page} could be wrong.

terkelg avatar Sep 02 '18 15:09 terkelg

Here's benchmark result after I removed that line:

glob x 13,438 ops/sec ±3.05% (83 runs sampled)
fast-glob x 25,485 ops/sec ±5.20% (76 runs sampled)
tiny-glob x 55,162 ops/sec ±6.97% (55 runs sampled)
Fastest is tiny-glob
┌───────────┬─────────────────────────┬────────────┬────────────────┐
│ Name      │ Mean time               │ Ops/sec    │ Diff           │
├───────────┼─────────────────────────┼────────────┼────────────────┤
│ glob      │ 0.00007441320916659413  │ 13,438.474 │ N/A            │
├───────────┼─────────────────────────┼────────────┼────────────────┤
│ fast-glob │ 0.00003923935167426461  │ 25,484.621 │ 89.64% faster  │
├───────────┼─────────────────────────┼────────────┼────────────────┤
│ tiny-glob │ 0.000018128419620750812 │ 55,162.006 │ 116.45% faster │
└───────────┴─────────────────────────┴────────────┴────────────────┘

chenxsan avatar Sep 03 '18 00:09 chenxsan

Here's the lexer variable:

{ regex: /^(post|page)\/((?:[^\/]*(?:\/|$))*)([^\/]*)\.md$/,
  segments: [ /^(post|page)$/, /^((?:[^\/]*(?:\/|$))*)$/, /^([^\/]*)\.md$/ ],
  globstar: '/^((?:[^\\/]*(?:\\/|$))*)$/' }
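
A quick way to test the full regex on its own is to run it against a few relative paths. This is only a sketch: the regex is copied from the lexer output above, but the file names (index.md, hello.md, notes.md, cover.png) are made up for illustration and are not taken from the real content directory.

```javascript
// Full regex copied from the lexer output above.
const regex = /^(post|page)\/((?:[^\/]*(?:\/|$))*)([^\/]*)\.md$/;

// Hypothetical relative paths; only the directory names appear in this thread.
const samples = [
  'post/hello.md',             // file directly under post
  'post/firefox-os/index.md',  // file inside a subdirectory of post
  'page/about/index.md',       // file inside a subdirectory of page
  'draft/notes.md',            // wrong top-level directory
  'post/firefox-os/cover.png', // wrong extension
];

// Pair each path with the result of the regex test.
const results = samples.map(p => [p, regex.test(p)]);

for (const [p, ok] of results) {
  console.log(ok ? 'match   ' : 'no match', p);
}
```

If the nested paths match here, the full regex is probably not what drops them, which would point at the per-directory segment check instead.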

And part of my directory structure:

$ tree ./post

[image]

All the .md files directly under post are included, while those inside subdirectories of post are filtered out.

So I added a console.log(rgx, file) right before the line if (rgx && !rgx.test(file)) continue; and here's the printed result:

/^(post|page)$/ 'draft'
/^(post|page)$/ 'page'
/^((?:[^\/]*(?:\/|$))*)$/ 'about'
/^(post|page)$/ 'post'
/^([^\/]*)\.md$/ 'firefox-os'
/^([^\/]*)\.md$/ 'github-pages-custom-domain'
/^([^\/]*)\.md$/ 'markdown-and-table'
/^([^\/]*)\.md$/ 'srcset and sizes'
/^([^\/]*)\.md$/ 'telegram-scam-bitcoin'
...
...

Those are folders directly under post, so they should be walked into as well. Could the problem originate from the level value?
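
To make that suspicion concrete, here is a minimal sketch of one hypothetical way a level counter could produce exactly the log above: the counter is incremented in place, so sibling directories share it. This is an assumption for illustration only, not a claim about what tiny-glob's source actually does. The segment regexes are copied from the lexer output; the in-memory tree, the walk function, and the file names are invented.

```javascript
// Regexes copied from the lexer output printed earlier in this thread.
const fullRegex = /^(post|page)\/((?:[^\/]*(?:\/|$))*)([^\/]*)\.md$/;
const segments = [/^(post|page)$/, /^((?:[^\/]*(?:\/|$))*)$/, /^([^\/]*)\.md$/];
const globstar = segments[1].toString();

// Hypothetical in-memory tree: objects are directories, null marks a file.
const tree = {
  draft: {},
  page: { about: { 'index.md': null } },
  post: { 'firefox-os': { 'index.md': null }, 'hello.md': null },
};

// Walks `dir`, skipping directories whose name fails the segment regex for
// `level`. With `mutate` set, the counter is incremented in place (assumed
// bug); otherwise each child simply gets `level + 1`.
function walk(dir, level, mutate, trace, found, prefix = '') {
  const rgx = segments[level]; // undefined for level -1, so nothing is pruned
  for (const [name, child] of Object.entries(dir)) {
    const rel = prefix ? `${prefix}/${name}` : name;
    if (child === null) {
      if (fullRegex.test(rel)) found.push(rel);
      continue;
    }
    trace.push(`${rgx} ${name}`);
    if (rgx && !rgx.test(name)) continue; // prune this directory
    // Once the globstar segment is reached, stop checking segments (-1).
    const pastGlobstar = !rgx || rgx.toString() === globstar;
    const childLevel = pastGlobstar ? -1 : (mutate ? ++level : level + 1);
    walk(child, childLevel, mutate, trace, found, rel);
  }
}

function run(mutate) {
  const trace = [];
  const found = [];
  walk(tree, 0, mutate, trace, found);
  return { trace, found };
}

const shared = run(true);     // counter shared between siblings
const perChild = run(false);  // each child gets its own level

console.log(shared.trace.join('\n'));
console.log(shared.found);
console.log(perChild.found);
```

In the shared-counter run, page is walked with level 1 and post with level 2, so firefox-os is tested against the .md segment and pruned, the same pattern as in the log. In the other run, the nested post/firefox-os/index.md is found as well.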

chenxsan avatar Sep 03 '18 00:09 chenxsan

What's going on with the directory names? Can you post the non-escaped strings for some of them?

terkelg avatar Sep 03 '18 08:09 terkelg

Those are Chinese characters.

image

But I don't think it matters. Folders with English names like firefox-os are ignored by glob too.

chenxsan avatar Sep 03 '18 08:09 chenxsan

@terkelg I thought it might be difficult for you to understand my situation, so I created a new repo here: https://github.com/chenxsan/tiny-glob-demo.

Please run npm install to install deps then run node index.js to check the results.

chenxsan avatar Sep 03 '18 08:09 chenxsan

Thanks a lot @chenxsan. I'll have a look at this when I get some spare time. I appreciate the help and the information you provided.

terkelg avatar Sep 03 '18 15:09 terkelg

It's possible that I have the same problem, but in a different form.

I prepared a repo with a test case at https://github.com/pavelloz/tg-testcase/. To run it: git clone https://github.com/pavelloz/tg-testcase && cd tg-testcase && npm i && node test.js

Shortcut:

Structure:

tinyglob-testcase|master ⇒ tree modules 
modules
└── test
    ├── private
    │   └── views
    │       └── pages
    │           └── mypage.liquid
    └── public
        └── views
            ├── pages
            │   └── page.liquid
            └── partials
                ├── data
                │   ├── one.liquid
                │   └── two.json
                └── hello.liquid

9 directories, 5 files

Code:

const tg = require('tiny-glob');

tg('**', {
  cwd: 'modules/test',
  filesOnly: true
}).then(files => {
  console.log('Non-filtered.', files.length);
  console.log(files);
});

tg('{private,public}/**', {
  cwd: 'modules/test',
  filesOnly: true
}).then(files => {
  console.log('Filtered by private/public (broken)', files.length);
  console.log(files);
});


tg('**/{private,public}/**', {
  cwd: 'modules/test',
  filesOnly: true
}).then(files => {
  console.log('Filtered, with workaround/hack applied.', files.length);
  console.log(files);
});

Results:

tinyglob-testcase|master ⇒ node test.js 
Filtered by private/public (broken) 1
[ 'private/views/pages/mypage.liquid' ]
Non-filtered. 5
[
  'private/views/pages/mypage.liquid',
  'public/views/pages/page.liquid',
  'public/views/partials/data/one.liquid',
  'public/views/partials/data/two.json',
  'public/views/partials/hello.liquid'
]
Filtered, with workaround/hack applied. 5
[
  'private/views/pages/mypage.liquid',
  'public/views/pages/page.liquid',
  'public/views/partials/data/one.liquid',
  'public/views/partials/data/two.json',
  'public/views/partials/hello.liquid'
]
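
To see exactly which files the brace-only pattern drops, the two result lists can be diffed. A small sketch, with the lists copied from the output above; the variable names are only for illustration.

```javascript
// Result lists copied from the output above.
const unfiltered = [
  'private/views/pages/mypage.liquid',
  'public/views/pages/page.liquid',
  'public/views/partials/data/one.liquid',
  'public/views/partials/data/two.json',
  'public/views/partials/hello.liquid',
];
const braceOnly = ['private/views/pages/mypage.liquid'];

// Files returned by '**' but not by '{private,public}/**'.
const missing = unfiltered.filter(f => !braceOnly.includes(f));

// Top-level directories the missing files belong to.
const missingTopDirs = [...new Set(missing.map(f => f.split('/')[0]))];

console.log(missing.length, missingTopDirs);
```

All four missing files sit under public, the second alternative in the braces, while the single file under private is returned.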

pavelloz avatar Jul 29 '19 14:07 pavelloz