Use optimized Kotlin grammar
I have been developing pegof as a hobby project for the past few years. Recently it has become good enough to optimize the Kotlin grammar, making it about 3 times faster and roughly 2.7 times less memory hungry.
With this PR, I'd like to start a discussion on how to best use this improvement in ctags. I'd like to keep the original, "human readable" version of the grammar in the repository to ensure simple updates and a readable git history. On the other hand, I don't think it would be a good idea to make the ctags build depend on pegof, since it is not available on all platforms. So my proposal is to allow only manual optimization and keep both the original and the optimized grammar in the repository. What do you think?
Here is a simple benchmark using ctags-codebase to prove that it really makes a difference:
```sh
echo "# optimized"
git checkout optimized-kotlin-peg &> /dev/null
make ctags > /dev/null
/usr/bin/time -f "time: %E\nmax memory (kb): %M" ./ctags -f optimized.ctags ../ctags-codebase/code/kotlin/libraries/**/*.kt
echo "# master"
git checkout master &> /dev/null
make ctags > /dev/null
/usr/bin/time -f "time: %E\nmax memory (kb): %M" ./ctags -f master.ctags ../ctags-codebase/code/kotlin/libraries/**/*.kt
echo "# output comparison"
diff optimized.ctags master.ctags
wc -l optimized.ctags master.ctags
```
Output:

```
# optimized
time: 0:13.84
max memory (kb): 447792
# master
time: 0:44.18
max memory (kb): 1204216
# output comparison
31c31
< !_TAG_PROGRAM_VERSION 6.1.0 /885afcb3e/
---
> !_TAG_PROGRAM_VERSION 6.1.0 /8976ec3d2/
 92478 optimized.ctags
 92478 master.ctags
184956 total
```
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 85.43%. Comparing base (8976ec3) to head (885afcb).
Additional details and impacted files:
```
@@            Coverage Diff            @@
##           master    #4023   +/-   ##
=======================================
  Coverage   85.43%   85.43%
=======================================
  Files         235      235
  Lines       56734    56734
=======================================
  Hits        48468    48468
  Misses       8266     8266
```
Incredible work!
At first glance, the following rules in Makefile.am are the parts we must extend:
```makefile
.peg.c:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"
.peg.h:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"
```
I assume configure looks for pegof. If it finds pegof, it defines HAVE_PEGOF.
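For illustration, the configure side could be sketched like this (a rough sketch; the variable and macro names beyond `HAVE_PEGOF` are my assumptions, not actual ctags build code):

```m4
# configure.ac sketch: look for the pegof executable and expose the
# result both to the C preprocessor and to Automake conditionals.
AC_CHECK_PROG([PEGOF], [pegof], [pegof])
AM_CONDITIONAL([HAVE_PEGOF], [test -n "$PEGOF"])
AS_IF([test -n "$PEGOF"],
      [AC_DEFINE([HAVE_PEGOF], [1], [Define if pegof is available])])
```

`AM_CONDITIONAL` makes `if HAVE_PEGOF` usable in Makefile.am, while `AC_DEFINE` makes `#ifdef HAVE_PEGOF` work in C sources.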
Pseudo code:
```makefile
if HAVE_PEGOF
.peg.c:
	let-pegof-generate-an-optimized-peg-file "$<" > optimized-"$<" && \
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) optimized-"$<"
.peg.h:
	let-pegof-generate-an-optimized-peg-file "$<" > optimized-"$<" && \
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) optimized-"$<"
else
# As before
.peg.c:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"
.peg.h:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"
endif
```
I'll study the command line interface of pegof.
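A concrete Automake rendering of the pseudo code above might look roughly like this (untested sketch; the pegof option names are assumptions and need to be replaced with the real ones after checking pegof's actual command line interface):

```makefile
# Makefile.am sketch (untested). The pegof options below are
# assumptions, to be corrected against pegof's real CLI.
if HAVE_PEGOF
.peg.c:
	$(PEGOF) --optimize --output $(top_builddir)/peg/$(*F).opt.peg "$<" && \
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) $(top_builddir)/peg/$(*F).opt.peg
.peg.h:
	$(PEGOF) --optimize --output $(top_builddir)/peg/$(*F).opt.peg "$<" && \
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) $(top_builddir)/peg/$(*F).opt.peg
else
.peg.c:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"
.peg.h:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"
endif
```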
It may also be nice to have the following change:
```diff
 #ifdef HAVE_PACKCC
 	/* The test harnesses use this as hints for skipping test cases */
 	{"packcc", "has peg based parser(s)"},
+#ifdef HAVE_PEGOF
+	{"pegof", "has peg based parser(s) optimized by pegof"},
+#endif
 #endif
 	{"optscript", "can use the interpreter"},
 #ifdef HAVE_PCRE2
```
BTW, @dolik-rce don't you run CI/CD tests like https://github.com/universal-ctags/libreadtags/pull/53/files#diff-78a8a19706dbd2a4425dd72bdab0502ed7a2cef16365ab7030a5a0588927bf47 ?
At least, I think you should have a build (and install) test.
Well, extending configure with a HAVE_PEGOF macro would solve the problem with unsupported platforms, but there are other potential problems. The code is not yet stable and I definitely wouldn't call it "production quality". I did some extensive testing for the Kotlin case (since speeding up Kotlin parsing in Geany was my initial reason to develop pegof), but using it for all the other peg-based grammars might be risky. Also, using it only on some systems (pegof doesn't support Windows yet, and I don't plan to work on that in the foreseeable future) could introduce different ctags behavior across platforms, which is not a good idea either.
That is why I'd actually prefer the manual approach. I know it is not usually considered good practice to commit generated code, but it solves the problem of pegof not being available on macOS or Windows. I can supply a script to do the optimization, to make it simpler for other contributors who maintain PEG grammars. We could probably even add a CI job that checks whether the optimized grammars in the repository are up to date. It could even suggest updates when a new version of pegof can generate better optimizations.
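Such a CI job could be sketched along these lines (a hypothetical GitHub Actions fragment; the directory layout, the `optimized/` path, and the pegof command line are all assumptions):

```yaml
# Hypothetical workflow job: re-run pegof on the human-readable
# grammars and fail if the committed optimized copies are stale.
# Paths and pegof options are assumptions, not actual ctags CI code.
check-optimized-grammars:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Build pegof
      run: |
        git clone https://github.com/dolik-rce/pegof.git
        cmake -S pegof -B pegof/build && cmake --build pegof/build
    - name: Re-optimize and compare
      run: |
        for g in peg/*.peg; do
          pegof/build/pegof --optimize --output "$g.new" "$g"
          diff -u "optimized/$(basename "$g")" "$g.new"
        done
```

The `diff -u` output would double as the suggested update when a newer pegof produces better optimizations.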
> BTW, @dolik-rce don't you run CI/CD tests like https://github.com/universal-ctags/libreadtags/pull/53/files#diff-78a8a19706dbd2a4425dd72bdab0502ed7a2cef16365ab7030a5a0588927bf47 ?
> At least, I think you should have a build (and install) test.
It's on my TODO list :slightly_smiling_face: As well as providing packages, at least for Arch and Debian-based distributions (which I'm familiar with).
> Well, using extending configure with HAVE_PEGOF macro would solve the problem with unsupported platforms, but there are other potential problems. [...]
> That is why I'd actually prefer to use the manual approach. I know it is not usually considered good practice to commit generated code, but it solves the problem of not having pegof available on MacOS or windows. [...]
Understandable, but I really want to try pegof.
Years ago, I rewrote the Java parser using packcc. I was disappointed by its performance, which was about five to fifteen times slower than the current hand-crafted Java parser in parsers/c.c. At that time, I needed pegof.
Even if pegof is not mature, I would like to provide a way to build pegof-powered ctags easily. From my own experience, I know how encouraging a well-crafted bug report can be for a software author.
I will make a pull request based on the HAVE_PEGOF approach. I really want to see the peg-based parsers run faster.
I opened #4026. The good news is that I found a bug in pegof while working on #4026. We are enjoying open-source development now.
That's exactly why I said that pegof is far from being stable and reliable 🙂 Receiving input like this is very valuable and will make pegof better. So yes, I will enjoy it.
Instead of merging the optimized version of the Kotlin grammar file, I integrated pegof into our build scripts. So I am closing this.
@dolik-rce, when you think pegof is stable enough, we can turn it on at https://github.com/universal-ctags/ctags-nightly-build, from where we distribute ctags binaries.