ctags icon indicating copy to clipboard operation
ctags copied to clipboard

Use optimized kotlin grammar

Open dolik-rce opened this issue 1 year ago • 6 comments

I have been developing pegof as a hobby project for past few years. Recently it became good enough to optimize Kotlin grammar, making it 3 times faster and about 2.7 times less memory hungry.

With this PR, I'd like to start a discussion on how to best use this improvement in ctags. I'd like to keep the original, "human readable" version of the grammar in the repo to ensure simple updates and readable git history. On the other hand, I don't think it would be good idea to make ctags build to depend on pegof, it is not available on all platforms. So my proposal is to allow only manual optimization and keep both original and optimized grammar in the repository. What do you think?


Here is a simple benchmark using ctags-codebase to prove that it really makes a difference:

echo "# optimized"
git checkout optimized-kotlin-peg &> /dev/null
make ctags > /dev/null
/usr/bin/time -f "time: %E\nmax memory (kb): %M" ./ctags -f optimized.ctags ../ctags-codebase/code/kotlin/libraries/**/*.kt

echo "# master"
git checkout master &> /dev/null
make ctags > /dev/null
/usr/bin/time -f "time: %E\nmax memory (kb): %M" ./ctags -f master.ctags ../ctags-codebase/code/kotlin/libraries/**/*.kt

echo "# output comparison"
diff optimized.ctags master.ctags
wc -l optimized.ctags master.ctags 

Output:

# optimized
time: 0:13.84
max memory (kb): 447792
# master
time: 0:44.18
max memory (kb): 1204216
# output comparison
31c31
< !_TAG_PROGRAM_VERSION 6.1.0   /885afcb3e/
---
> !_TAG_PROGRAM_VERSION 6.1.0   /8976ec3d2/
   92478 optimized.ctags
   92478 master.ctags
  184956 total

dolik-rce avatar Jun 17 '24 18:06 dolik-rce

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 85.43%. Comparing base (8976ec3) to head (885afcb).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4023   +/-   ##
=======================================
  Coverage   85.43%   85.43%           
=======================================
  Files         235      235           
  Lines       56734    56734           
=======================================
  Hits        48468    48468           
  Misses       8266     8266           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jun 17 '24 18:06 codecov[bot]

Incredible work!

At first glance, the following rules in Makefile.am are the parts we must extend:

.peg.c:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"
.peg.h:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"

I assume configure looks for pegof. If it finds pegof, it defines HAVE_PEGOF.

Pseudo code:

if HAVE_PEGOF
.peg.c:
        let-pegof-generate-an-optimized-peg-file "$<" > optimized-"$<" && 
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) optimized-"$<"
.peg.h:
        let-pegof-generate-an-optimized-peg-file "$<" > optimized-"$<" && 
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) optimized-"$<"
else
# As before
.peg.c:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"
.peg.h:
	$(packcc_verbose)$(PACKCC) -o $(top_builddir)/peg/$(*F) "$<"
endif

I'll study the command line interface of pegof.

Having the following change may be nice.

 #ifdef HAVE_PACKCC
 	/* The test harnesses use this as hints for skipping test cases */
 	{"packcc", "has peg based parser(s)"},
+#ifdef HAVE_PEGOF
+	{"pegof", "has peg based parser(s) optimized by pegof"},
+#endif
 #endif
 	{"optscript", "can use the interpreter"},
 #ifdef HAVE_PCRE2

BTW, @dolik-rce don't you run CI/CD tests like https://github.com/universal-ctags/libreadtags/pull/53/files#diff-78a8a19706dbd2a4425dd72bdab0502ed7a2cef16365ab7030a5a0588927bf47 ?

At least, I think you should have a build (and install) test.

masatake avatar Jun 17 '24 18:06 masatake

Well, using extending configure with HAVE_PEGOF macro would solve the problem with unsupported platforms, but there are other potential problems. The code is not yet stable enough and I definitely wouldn't call it "production quality". I did some extensive testing for the kotlin case (since that was my initial reason to develop pegof, to speed up kotlin parsing in Geany), but using it for all other peg based grammars might be risky. Also, using it only on some systems (pegof doesn't support windows yet and I don't plan to work on it in foreseeable future) could introduce different behavior of ctags across platform, which is not good idea either.

That is why I'd actually prefer to use the manual approach. I know it is not usually considered good practice to commit generated code, but it solves the problem of not having pegof available on MacOS or windows. I can supply a script to do the optimization, to make it simpler for other contributors who maintain PEG grammars. We could probably even make a CI job that would check if the optimized grammars in the repository are up-to-date. It could even suggest updates in case that new version of pegof can generate better optimizations.

BTW, @dolik-rce don't you run CI/CD tests like https://github.com/universal-ctags/libreadtags/pull/53/files#diff-78a8a19706dbd2a4425dd72bdab0502ed7a2cef16365ab7030a5a0588927bf47 ?

At least, I think you should have a build (and install) test.

It's on my TODO list :slightly_smiling_face: As well as providing packages, at least for Arch and Debian-based distributions (which I'm familiar with).

dolik-rce avatar Jun 17 '24 19:06 dolik-rce

Well, using extending configure with HAVE_PEGOF macro would solve the problem with unsupported platforms, but there are other potential problems. The code is not yet stable enough and I definitely wouldn't call it "production quality". I did some extensive testing for the kotlin case (since that was my initial reason to develop pegof, to speed up kotlin parsing in Geany), but using it for all other peg based grammars might be risky. Also, using it only on some systems (pegof doesn't support windows yet and I don't plan to work on it in foreseeable future) could introduce different behavior of ctags across platform, which is not good idea either.

That is why I'd actually prefer to use the manual approach. I know it is not usually considered good practice to commit generated code, but it solves the problem of not having pegof available on MacOS or windows. I can supply a script to do the optimization, to make it simpler for other contributors who maintain PEG grammars. We could probably even make a CI job that would check if the optimized grammars in the repository are up-to-date. It could even suggest updates in case that new version of pegof can generate better optimizations.

Understandable but I really want to try pegof.

Years ago, I rewrote the Java parser using packcc. I was disappointed in its performance, which was about five ~ to fifteen times slower than the current crafted Java parser in parsers/c.c. At that time, I needed Pegof.

Even if Pegof is not mature, I would like to provide a way to build Pegof-powered ctags easily. In my experience, I know how the software author is encouraged from the sophisticated bug report of the software.

I will make a pull request based on the HAVE_PEGOF approach. I really want to see the peg-based parsers run faster.

masatake avatar Jun 28 '24 11:06 masatake

I opened #4026. The good news is that I found a bug in pegof during work on #4026. We are enjoying open-source development now.

masatake avatar Jun 28 '24 20:06 masatake

That's exactly why I said that pegof is far from being stable and reliable 🙂 Receiving input like this is very valuable and will make pegof better. So yes, I will enjoy it.

dolik-rce avatar Jun 29 '24 06:06 dolik-rce

Instead of merging the optimized version of the Kotolin grammar file, I integrated pegof into our build scripts. So I close this.

@dolik-rce, when you think pegof is stable enough, we can turn on pegof at https://github.com/universal-ctags/ctags-nightly-build from where we distribute ctags binary.

masatake avatar Jul 10 '24 01:07 masatake