safe-regex
safe-regex copied to clipboard
Implement Weideman's analysis to avoid false positives
I guess you mean these things from @ChALkeR's https://github.com/substack/safe-regex/issues/12#issuecomment-368331533 ?
- Code: https://github.com/NicolaasWeideman/RegexStaticAnalysis
- Paper: http://scholar.sun.ac.za/bitstream/handle/10019.1/102879/weideman_static_2017.pdf
Yes. This would be a relatively difficult PR.
Algorithm and notes
- A similar algorithm is described by Wustholz et al. here, although last I checked they only shared a .jar file, not their implementation.
- Rathnayake and Thielecke have also studied this problem (exponential regexes only), code here
- The regexp-tree can build an NFA transition table which is the starting point of these algorithms.
- regexp-tree can't convert all regex features into an NFA transition table. Some of these may be due to "syntax sugar"-type problems (e.g. you could rewrite the regex into an equivalent one that the module supports), while others are due to features that cannot be expressed using an NFA (e.g. backreferences).
- The current heuristic (star height) works on anything regex-tree can parse, which is a superset of the things it can convert into NFAs. So the star height heuristic is a good fallback.
- Algorithms have false positives, so would want to dynamically validate using a timeout on a worker threads.
Science
I would love to study the npm ecosystem ripple effects if safe-regex suddenly supported this algorithm. My experiments suggest that about 10% of the regexes in JavaScript are super-linear (see Figure 6 and section 7.2.2 in my Lingua Franca paper), so a lot of projects would get warnings if the updated safe-regex were landed in eslint.
Anybody who helped out with this process would be welcome to serve as a co-author on a resulting publication.
Hey @davisjam
I'm using safe-regex in my Node.js & JS SAST scanner: https://github.com/fraxken/js-x-ray
One of the project i work on is to analyze the whole npm Registry for 2021 (i still have month of works before pulling real exploitable stats).
- https://github.com/fraxken/npm-security-fetcher
- https://github.com/fraxken/star-analytica (you can check the root results).
You can also check the Node-Secure project who use JS-X-Ray. So you can have a preview of the regex detected on a given npm package.
If we can move things forward then I would be happy to help. If in the near future you have a specific version and need to collect stats then we can work together :)
Also I remain open to any feedback on the use I make of the package :)
Best Regards, Thomas