lucene Improve user facing java docs using LLM (or otherwise)

Background

With the recent release of 10.3, the Java documentation for the geoutils package and other classes within the package is significantly clearer and more comprehensive than in the 10.2 release. This improvement was created with the help of an LLM, requiring relatively little effort but yielding much higher-quality documentation.

This suggests an opportunity to improve other Lucene packages in the same way. With LLMs, contributors (including newcomers) can quickly draft stronger package-level docs, which reviewers can then refine.

Proposal

Encourage contributors to use LLMs to draft or refine JavaDocs for under-documented packages. The workflow would be:

Identify a package with minimal or outdated JavaDocs.
Use LLM assistance to generate a clearer package summary (overview, key concepts, common operations, examples).
Submit as a documentation PR following engagement guidelines.
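
As a rough illustration of what the workflow above aims for, here is a minimal sketch of a documented Java class. The class `DistanceExample` and its method are hypothetical and not part of Lucene; they only show a short overview, one non-obvious edge case, and `@param`/`@return`/`@throws` tags kept deliberately brief.

```java
/**
 * Hypothetical helper, not part of Lucene: converts distances between units.
 *
 * <p>Overview: all methods are static and stateless, so the class is thread-safe.
 *
 * <p>Non-obvious edge: negative inputs are rejected rather than silently clamped.
 */
final class DistanceExample {
  private DistanceExample() {}

  /**
   * Converts kilometers to meters.
   *
   * @param kilometers distance in kilometers; must be non-negative and finite
   * @return the same distance in meters
   * @throws IllegalArgumentException if {@code kilometers} is negative, NaN, or infinite
   */
  static double kilometersToMeters(double kilometers) {
    if (Double.isFinite(kilometers) == false || kilometers < 0) {
      throw new IllegalArgumentException("kilometers must be finite and >= 0: " + kilometers);
    }
    return kilometers * 1000.0;
  }
}
```

A real package summary would normally live in the package's `package-info.java`; the same idea applies there: state the purpose, the key concepts, and the edges a reader cannot easily infer from signatures alone.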

Engagement Guidelines for Contributors

Scope your PRs:
- For substantial improvements, limit changes to ~10–15 files or a logical section (e.g., one package) per PR.
- For minor fixes (typos, formatting), larger file counts are acceptable.
Review before submit: Always read through the LLM-generated docs and adjust wording for accuracy and tone.
Tag this issue: Reference this umbrella issue in your PR description (e.g., “Related to #15225”).
Allow review time: Reviewers are effectively vetting technical documentation, so please be patient with feedback.

Benefits

Raises quality of Lucene documentation across the board.
Lowers the barrier to contribution (great entry point for new contributors).
Scales documentation improvements with minimal effort while keeping reviews manageable.

Sep 23 '25 21:09 jainankitk

Hi @jainankitk, I would like to take this task. I am a new contributor.

Sep 25 '25 18:09 Maham802

Thanks @Maham802 for expressing interest! This is a meta issue and does not need to be assigned to a single owner. There should be enough documentation improvement opportunities within Lucene for many contributors.

Feel free to add me as reviewer. Looking forward to your PR!

Sep 25 '25 19:09 jainankitk

I was looking at this issue and to better understand the issue I looked at the original commit for geo. I believe I have found it here: https://github.com/apache/lucene/commit/bb8b1397b014751a588bfbd9a7d51a6d3c429515. I thought the example might help others understand what to do so I am including it here.

Oct 19 '25 03:10 mpbarano

Regarding LLM generated doc, I wonder if this website has been helpful: https://deepwiki.com/apache/lucene. The best feature I like is that it can redirect to the code snippet as source of truth if you ask any questions.

Oct 21 '25 07:10 indexalice

> Regarding LLM generated doc, I wonder if this website has been helpful: https://deepwiki.com/apache/lucene. The best feature I like is that it can redirect to the code snippet as source of truth if you ask any questions.

While resources like https://deepwiki.com/apache/lucene are good, I believe they have not been reviewed by any committer, so I'm not sure they can be relied upon as the source of truth. Hence, imo, javadocs still have their place, at least in the short term, until LLMs improve even further!

Oct 24 '25 06:10 jainankitk

Thanks @mpbarano for adding a reference to the original commit for geo and for making your first contribution to Lucene. Will add my review to the PR shortly!

Oct 24 '25 06:10 jainankitk

I don't see the purpose of generating docs with an LLM. Users can always do this themselves.

Docs should contain information that can't be automatically generated from the source code.

To me this only creates technical debt, sorry.

Oct 24 '25 12:10 rmuir

> I don't see the purpose of generating docs with an LLM. Users can always do this themselves.
>
> Docs should contain information that can't be automatically generated from the source code.
>
> To me this only creates technical debt, sorry.

Using an LLM alone, at least currently, doesn't generate fully accurate documentation. Using LLMs to generate documentation requires careful reading and vetting of the docs.

In my recent experience I have found that LLMs routinely generate errors (I am using the Cursor default, which I believe is Claude). I have been reading everything 3-4 times, testing what I can, and fixing mistakes. Some example mistakes:

The LLM misordered and misnumbered parameters for some methods in its code examples.
I was looking at docs for the HighFreqTerms command line, and it documented that you could view terms by numeric fields, which isn't sensible input and fails.
There were errors in how it documented the command line.

Oct 24 '25 17:10 mpbarano

For javadocs in particular, I am concerned about how these chatbots generate text that is 400% more verbose than it needs to be.

We need to be very careful around this: such javadocs will slow development velocity. Any time the code is refactored, the docs must be fixed to match, or various linters will start to fail.

Code examples need to be kept extremely minimal in the rare cases where it does make sense: the maintenance cost is very high and we don't have great validation for these.

Oct 25 '25 12:10 rmuir

I echo @rmuir's sentiments and concerns. Creating LLM output that is only useful once it's been ingested into another LLM to summarize is very "dead internet".

We should write docs that are useful to humans first (a novel concept!). They should be pointed and focus on the direct use case & possibly complicated/non-obvious edges.

I have read many of these LLM generated docs in other repos (and issues, PR descriptions, etc.). They're bland, tasteless, and have lost much of what makes open source fun and interesting.

I seriously doubt an LLM will come up with something as entertaining as:

/** Codec that tries to use as little ram as possible because he spent all his money on beer */
// TODO: better name :)
// but if we named it "LowMemory" in codecs/ package, it would be irresistible like optimize()!
public class CheapBastardCodec extends FilterCodec {

Oct 27 '25 20:10 benwtrent

this comes pretty close though:

Oct 27 '25 21:10 rmuir

..it's like trying to drain the swamp, and the swamp keeps finding new ways to refill itself. LOL!! Which LLM and what prompt created this @rmuir! Can it generate the audio in @uschindler's voice too?

I find myself being amazed by what these LLMs can do now (generating complex code), and then aghast at the silly mistakes/hallucinations they make ... brain whiplash. Claude recently wrote up a big, helpful response to me, with a list of items, except it numbered all of the items as 1, yet in the text referred to them as items 1, 2, 3. Head scratching...

I think LLMs, with targeted prompts, could be useful for our javadocs? E.g., could we prompt to dig through our existing docs and correct any code examples that are stale? Or maybe to add javadocs to complex methods that are missing their @param explanations? With the right prompting, and the right genai, maybe running in the "just think harder" mode/model, should be useful here.

Can we invoke genai (Copilot?) from GitHub actions? Can it comment on our PRs for silly mistakes like failing to use == false haha. If a javadoc change is made in a PR and the code example is wrong (doesn't compile or run, etc.), it could comment?
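
For readers unfamiliar with the `== false` remark: it appears to refer to a style preference for writing negated conditions as an explicit comparison rather than with a leading `!`, which is easy to miss when reading. A minimal sketch, using a hypothetical `NegationStyle` class that is not part of Lucene:

```java
final class NegationStyle {
  private NegationStyle() {}

  /** Negation with a leading '!', which is easy to overlook in a long condition. */
  static boolean isMissingBang(String value) {
    return !(value != null && !value.isEmpty());
  }

  /** Same logic written with explicit '== false' comparisons. */
  static boolean isMissingExplicit(String value) {
    return (value != null && value.isEmpty() == false) == false;
  }
}
```

Both methods return the same result for every input; only the readability of the negation differs.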

But I agree we should tread carefully, review closely, and if it's collectively taking tons of human time to review short genai efforts overall (sort of an AI denial-of-service-attack on us humans), then that's no good.

Nov 13 '25 12:11 mikemccand

> ..it's like trying to drain the swamp, and the swamp keeps finding new ways to refill itself. LOL!! Which LLM and what prompt created this @rmuir! Can it generate the audio in @uschindler's voice too?

This was just using whatever is on the Android phone (Gemini?): it is actually the only time I have used the functionality. To make it funny, I repeatedly said "ok, do it again, but make it funnier".

> I think LLMs, with targeted prompts, could be useful for our javadocs? E.g., could we prompt to dig through our existing docs and correct any code examples that are stale? Or maybe to add javadocs to complex methods that are missing their @param explanations? With the right prompting, and the right genai, maybe running in the "just think harder" mode/model, should be useful here.

I guess that is more fun than documenting your code. I think if you spent the same amount of time just documenting your code, you'd achieve the same or better result.

> Can we invoke genai (Copilot?) from GitHub actions? Can it comment on our PRs for silly mistakes like failing to use == false haha. If a javadoc change is made in a PR and the code example is wrong (doesn't compile or run, etc.), it could comment?

I don't think we should. These are just hype/scam, the latest cryptocurrency, I don't think they should be used for any serious software development. Mechanical parrots

Nov 20 '25 10:11 rmuir