libphonenumber-csharp
libphonenumber-csharp copied to clipboard
Performance issues
I've been using IKVM to "cross-compile" the libphonenumber jars to a .Net DLL for a long time. Unfortunately, IKVM is no longer maintained (for quite some while now). Today I replaced the cross-compiled DLL with this library, which was a breeze since the public API is, besides .Net conventions like getInstance() -> GetInstance() and a single incompatibility with TruncateTooLongNumber requiring a builder, very compatible. All was well, all my unittests passed. 🎉🥳 So thank you, and kudo's for the great work on making this such an easy transition.
However, then I started benchmarking and, it turns out, this lib is slower by a factor 30. When handling a lot of phonenumbers that is quite a lot. I have browsed the issues and found some and I'm quite sure and convinced the culprit is the regex(es) used in this lib. I have tested it on .Net 5.0 in a windows console application.
I wish I could dive in deeper and profile the issue(s) and put in some work to help alleviate this but I'm not in a position right now - too busy. However, should I have some spare time; which (of the many) branch(es) should I base my work on? What would be my best bet? I can't make any promises but I would love to help improve this library.
Hi there, any support is always appreciated. The regexs almost certainly are the bottleneck, there's been some work to make things faster https://github.com/twcclegg/libphonenumber-csharp/pull/121. Master would be the place for any contribution. Thanks in advance (and also for prompting me to clean up the stale branches).
With the regex performance improvements in the earlier this week released .Net 7 there could be some real difference. I haven't gotten around to it since july 2021 but I hope to play a little with this and see what the differences are soon.
Ok, so I haven't done anything with the .Net versions (yet) but could improve performance already with about a factor of 10 but I did have to sacrifice the RegexCache's internal LRU. As the commit says:
I have sacrificed the LRU cache, which is why the
TestRegexInsertioncurrently fails. This means Regexes are currently cached indefinitely. I wonder in how many real-world scenario's caching a few hundred(?) regexes would be an actual problem. Unfortunately I have no numbers on the use of this class nor any idea about the environments this class is used in. But the added complexity of the LRU (and making it thread-safe) seems not worth it to me. Especially if this easy optimization alone results in a 10x improvement.
Edit: I just realized I could have a look at the XML's (or whetever where the regexes come from) and, for a quick estimate, figure out how many unique regexes there are.
A very quick'n'dirty experiment resulted in 1401 unique regexes in PhoneNumberMetadata.xml by my count; which resulted in around 10MB of memory usage if I instantiate each of them as new Regex(<regex>, RegexOptions.Compiled). As each PhoneRegex has a possible of 3 variants of each regex let's assume 3 x 10 = 30. Things get complicated though since I can't imagine a situation where you have all three variants of the entire 1401 regexes in use. But, if you do, worst case is around 30MB of memory.
Being born in the 70's, having grown up with a 16KB Acorn and later C64 with it's 64KB of memory I wholeheartedly agree that 30MB is a lot. Then again, that is:
- worst case where you have all three variants of all regexes in memory (unless I missed something) and...
- in 2022 on machines that won't even boot with less then 8GB of memory (including phones etc.)
Having said that, more limited devices like IoT MCU's etc. are a different story - but how many of them run .Net, how many of those use this library and how many of those run into the worst case scenario?
So I'm on the fence; on one hand I think "30MB, worst case, who cares?" and on the other hand I think being a little frugal with memory is the right thing to do. So... let's discuss?
Edit 1: Most notably the TestFixedLine goes from 3.2 seconds to 8 milliseconds.
~Edit 2: you can see my benchmark project, result and (preliminary) conclusion here. For now, it's time to go enjoy my weekend. I hope to continue this work soon.~ I clearly needed my weekend, the benchmarks made no sense and had too many hidden variables and unknowns so I deleted the project for now. The unittest improvements (10x faster) still stands though.
Either way: I haven't even gotten to the part where we could (potentially) profit from the improved Regex performance in .Net 7 which is why I picked this issue back up in the first place today.
I ran some more tests/benchmarks with the exact same data I used back in juli 2021. These are my results; each test run twice:
Java Crosscompiled with IKVM
| Method | Mean | Error | StdDev |
|---|---|---|---|
| TryParseNumbers | 61.27 ms | 1.143 ms | 1.069 ms |
| TryParseNumbers | 61.77 ms | 0.952 ms | 0.890 ms |
Native libphonenumber-csharp before #161
| Method | Mean | Error | StdDev |
|---|---|---|---|
| TryParseNumbers | 1.653 s | 0.0112 s | 0.0105 s |
| TryParseNumbers | 1.623 s | 0.0175 s | 0.0163 s |
Native libphonenumber-csharp after #161
| Method | Mean | Error | StdDev |
|---|---|---|---|
| TryParseNumbers | 23.98 ms | 0.254 ms | 0.238 ms |
| TryParseNumbers | 23.95 ms | 0.248 ms | 0.232 ms |
As you can see, libphonenumber-csharp is now ~ 3x faster than the cross-compiled Java DLL we've been using until now for my specific testcases (which is ran on a dataset I cannot disclose unfortunately).
Either way - all I can see, from my end, are improvements of at least one order of magnitude. YMMV.
As for the improvements made in .Net 7 in the regexe department; since a lot (most) regexes are created dynamically using the [GeneratedRegex] attribute appears to be out of the question; this would've helped in creating compile-time optimized regexes. However, regexes are compiled (and were before #161 in fact), runtime unfortunately, but also (still) cached.
This port now runs ~3x faster for me than the currently used cross-compiled version so I am happy. All that I would like to see is #161 being merged and a new NuGet package being released 😇 Which just leaves the discussion on memory usage and LRU being used.
Edit: Oh, and it appears the unittests aren't run...
I'd agree that the additional memory usage is worth it given the amount of speedup. I can go ahead and merge #161. Also I'll look into what caused the tests to stop running in the CI. Once merged the changes will be released when Google puts out the next metadata update, in around a week if I had to guess.
I'd agree that the additional memory usage is worth it given the amount of speedup.
Cool, thanks.
Also I'll look into what caused the tests to stop running in the CI.
If I had to guess: wrong image/container/something after switching to .net 6. Before you fix that, you may want to remove the TestFixedLine unittest (after which you may also want to get rid of the ContainsRegexmethod on theRegexCache` class).
Once merged the changes will be released when Google puts out the next metadata update, in around a week if I had to guess.
Yeah, they're on a fairly regular schedule, but I'm in no (big) hurry.
Thanks!
Merged. Thanks again!