doublestar icon indicating copy to clipboard operation
doublestar copied to clipboard

any benchmark

Open storyicon opened this issue 5 years ago • 11 comments

When I thought about developing a wheel, I found it, which would save me a lot of time, thanks! So will there be any official benchmark?

storyicon avatar Jun 04 '19 09:06 storyicon

I think it would be better to provide a function like IsValidPattern(string) (bool, error) to let the developer know whether an expression is valid before call Match(string, string)(bool, error).

storyicon avatar Jun 04 '19 10:06 storyicon

Hey @storyicon, that's not a bad idea. My original intent with this library was to implement a drop-in replacement for path.Match, filepath.Match, and filepath.Glob, so I didn't think to implement some kind of validation method (as path and filepath don't have validation methods either). That said, the complexity of a validation function would be on par with Match/PathMatch and, so, it's probably more efficient to just check for a returned error.

I'll keep this ticket open, though... if I find some free time, I can take a stab at this =)

bmatcuk avatar Jun 16 '19 12:06 bmatcuk

Oh, as for benchmarks: I don't have any benchmarks, but I'd welcome some. Are there other globbing libraries for go that support ** to compare against? I'm not aware of any. Speed wasn't really my goal when I wrote the library, so I wouldn't be surprised if it is "slow".

bmatcuk avatar Jun 16 '19 13:06 bmatcuk

https://github.com/gobwas/glob also supports **, so that's a good candidate to benchmark against.

houqp avatar Mar 08 '20 06:03 houqp

Related - https://github.com/bazelbuild/bazel-gazelle/issues/786

What was the reasoning not to delegate to path.Match if a pattern didn't include "**"?

robfig avatar Aug 13 '20 02:08 robfig

To be clear, the tool you are using is giving you a cumulative sum of all allocations. In this example, doublestar is called in a tight loop - any allocation in that loop will appear to balloon to some unreasonable cumulative sum.

But I'm curious how it got so high, too: the line in question is simply allocating a slice of strings - the length of the slice would be equal to the number of path segments. I'm not sure how go allocates a slice of strings, but I'd assume it to be an array of pointers. If we assume a 64-bit machine and an average of 10 path segments (which, honestly, is probably a bit high), such an array should only allocate 80 bytes plus some overhead for go's record keeping. That would mean doublestar is getting called more than 5 million times?...

Unless the tool is also counting the string allocations as part of the slice. If we assume the average path to be 100 characters (again, probably high, but will give us an idea), that'd give us 100 characters for string allocations, plus 80 for the array, and some overhead for go's record keeping - let's say 300 bytes total because I don't know how much overhead go needs, but surely 120 bytes is enough overhead for 10 strings an a slice. That still means doublestar is getting called ~1.4 million times?

Can I do more to minimize allocations in doublestar? Maybe - as I said above, my goal with doublestar was to build a drop-in replacement for Match and Glob that could handle **, not to build a more performant replacement. But removing the allocations isn't going to fix the fact that it's getting called millions of times in a tight loop.

I see that you submitted a PR to use path.Match if the pattern doesn't include ** - I'd be curious to know if you saw any performance benefits (in terms of time, not memory).

bmatcuk avatar Aug 13 '20 13:08 bmatcuk

Yes, I believe it was called 2 - 5M times. There is a list of patterns that is compared to every file in a repository. It could probably be more efficient, but I think this is pretty much in line with how one would use a library of this type. Gazelle does really non-trivial work, so having this one minor piece of its functionality account for 50% of its allocations in a standard use case seems unnecessary. It's not a microbenchmark or anything.

I didn't measure the difference in time. But I guess the question is, if your explicit goal is to have the same functionality as path.Match, then why not peel off just the patterns that include **? That avoids any chance of subtle differences or bugs. It's an extra bonus that it's also more efficient.

robfig avatar Aug 13 '20 19:08 robfig

It's only more efficient in terms of memory allocation, at the cost of searching the pattern for ** (and respecting escaped sequences). And, doing so would increase the chance of subtle differences and bugs: while it is my goal to make doublestar a drop-in replacement for Match and Glob, including any quirks in their behavior, I can't promise with absolute certainty that doublestar behaves in exactly the same way as Match and Glob. If I branch to Match/Glob in cases where the pattern does not contain **, there may be differences in the way doublestar behaves depending on the pattern - this is undesirable.

Keep in mind, doublestar is not allocating 400MB of memory: it's allocating a couple hundred bytes, doing it's work, and returning. I think that's perfectly reasonable. Go's GC should clean it up. What's gazelle's actual memory usage while it's running, as reported by the OS? I suspect it'd be lower than 900MB, but, it's possible that the GC routine isn't running if gazelle is relatively fast and does its work in a tight loop.

I'd be interested in gazelle's memory usage, as reported by the OS and by pprof, with your change that removes doublestar vs gazelle before your change. And some measurement of runtime with and without doublestar as well. That'd give some pretty useful data points.

bmatcuk avatar Aug 13 '20 21:08 bmatcuk

If strings.Contains allocated a few bytes of memory, would everyone be happy with that? I don't think so, because it's unnecessary for the work it has to do, and it's used as a foundational library to build more complicated tools which may need to call it very often.

Anyway, I'm not going to litigate this point, I was just trying to provide feedback about the usage of your library in the wild. I'm not going to look into timings -- if you were interested, I think that writing benchmarks would be the most direct and best step to know how it compares to path.Match.

robfig avatar Aug 20 '20 01:08 robfig

Contrived benchmarks will not be nearly as useful as timings from a real-world use-case. You say you want to provide feedback about usage in the wild, and you already have the perfect test environment...

bmatcuk avatar Aug 20 '20 12:08 bmatcuk

I just cut doublestar v4 which is a complete rewrite with performance in mind. The repo includes some benchmarks and the results are pretty much identical to path.Match(), filepath.Match(), and io/fs.Glob(). v4 also switches to the io/fs package for Glob, so you'll need go v1.16+.

bmatcuk avatar Apr 25 '21 00:04 bmatcuk