Negative numbers shown for #Lines and LOC if directory contains a file that was excluded.
Opengrok 1.7.34 & 1.7.31
Hi, I have just noticed the same issue with OG 1.7.35. In the directory where the negative number appears for the first time (I mean at the deepest level), the files that are skipped are .gz files.
Seeing the same issue with OG 1.7.39.
Here is a snapshot of an AOSP instance:

Just an FYI,
The negative value is caused by deleted documents that are still in the Lucene index (multi-segment mode). When removing stale files, OG only checks the UID (however, the document might be empty as it was deleted during a previous update of the index). As a result, the size of the modified file is subtracted from the size of the parent directory over and over again, eventually becoming negative. See https://github.com/oracle/opengrok/blob/d23c7bd6df3698a7c251025721eb4e11932e1cae/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java#L1619
"--reduceSegmentCount" can trigger a "forceMerge" to clean up the empty documents and stale indexes from Lucene database. It seems this option is mandatory (at least for me).
Hope the above information helps.
Hi
I don't know Lucene.
Do you think that the issue related to empty/useless documents not being cleaned could also explain why searches sometimes report the exact same line in a file several times when the indexing option --reduceSegmentCount is not used?
Such behavior would not surprise me, however I cannot replicate this with a repository that exhibits the negative LOC counts and has an index with multiple segments and pending deleted documents. How do the search results look exactly? And what are the queries triggering such results? Does this also happen when you search via the search API?
I am also unable to replicate this issue (otherwise I would have created a ticket). It happens totally randomly, very likely depending on what the new commits are changing in the code and what the user is searching for in that part of the code.
At the moment, I am focusing on whether the usage of --reduceSegmentCount solves the issue. On one of our instances (running 1.7.35), the issue occurred while the option was not set. A user reported it to me. It happened when searching a repository that is heavily used and searched. So, I deleted the index. Then I updated my indexing flow, adding the option. And I am now waiting for the user to face the issue again. This was done on Jan 3rd. Indexing is run nightly. At least one search in the repo is done daily. Still no occurrence.
It is happening whatever the search type (UI and API), although most users search with the UI. The workaround is to delete the index of the project. Once regenerated, the issue no longer occurs, at least for some time.
Example of issue:
I have just talked to my user. He has confirmed that no occurrence has happened since the indexing flow started optimizing the indexes. OpenGrok is a tool he and his colleagues use a lot, searching code very often.
The issue started to occur once I upgraded OG from 1.7.19 to 1.7.35.
What does the repository history of the UcCreatePeriodView.cpp file look like? Lots of updates, I guess?
Yes. More than 47k commits. The commit tree for the branch is indeed quite complex. The file itself has lots of commits.
git log --oneline UcCreatePeriodView.cpp | wc -l
141
Looking at the OG history page for the file, I discovered that 243 commits are reported on 10 pages, again with dups.
So, my workaround to remove dups should target not only the project index but also the project historycache pages :-(
I've been investigating this for a bunch of days. There are 3 issues:
- index traversal using the document uid (QueryBuilder.U) visits deleted documents in a multi-segment index. This is causing the negative LOC counts by calling fileRemove() for the same path repeatedly. There are ways to identify which documents are deleted (using Lucene's MultiBits.getLiveDocs(indexReader)) and avoid those during the traversal (see the sketch at the end of this comment).
- corrupted history cache. I believe this sort of shares the cause with the above issue, i.e. it has to do with how the traversal is done, perhaps a side effect of how the IndexDownArgs is accumulated. Maybe change IndexDownArgs.works from a list to a set?
- repeated paths in search results. By far, this is the most tricky one I think, also because I cannot reproduce it yet. Using a similar technique as for the index traversal would not work because of search result pagination - even if the deleted documents were detected before the search (caching them for a given IndexSearcher instance perhaps), it would collide with the slider construction (using searchHelper.getTotalHits()), so the total hit count reporting would be wrong unless the whole set of results is traversed, which would probably hurt performance badly, and it could actually still produce deleted documents in the search results (filtering out deleted documents vs. page breaks). Someone on SO said that "IndexSearcher.search should not return deleted documents" and hinted that this is caused by how the document fields are set up. My wild guess is that this could be caused by the way the QueryBuilder.PATH field is set up. It is a TextField with setTokenized(true), setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS), setStored(true). I wonder whether changing the index options to just DOCS would change anything, but again with no reproduction scenario this is just an uneducated guess.
Of course, there is still the possibility of reverting the #3983 change and making the segment reduction to 1 implicit once again.
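To make the first point above more concrete, here is a minimal sketch (mine, not actual OpenGrok code) of how deleted documents could be detected during the uid term traversal; it assumes the TermsEnum comes from the top-level reader so that the postings carry index-wide doc IDs:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiBits;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;

final class LiveUidCheck {
    // Returns true if the uid term the iterator is currently positioned on is carried
    // by at least one live (not deleted) document in the possibly multi-segment index.
    static boolean hasLiveDoc(IndexReader reader, TermsEnum uidIter) throws IOException {
        Bits liveDocs = MultiBits.getLiveDocs(reader); // null means the index has no deletions
        if (liveDocs == null) {
            return true;
        }
        PostingsEnum postings = uidIter.postings(null, PostingsEnum.NONE);
        for (int docId = postings.nextDoc();
                docId != DocIdSetIterator.NO_MORE_DOCS;
                docId = postings.nextDoc()) {
            if (liveDocs.get(docId)) {
                return true;
            }
        }
        return false; // every document carrying this uid has been deleted
    }
}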
Thanks @vladak for the analysis and status. Good to read that issue #1 could be fixed, and eventually #2 too. For #3, maybe you could invert the default coming from #3983, ensuring that by default the Lucene index is optimized and providing an option to not optimize it.
I'd say it is all or nothing. Either multi-segment indexes work in all aspects of the indexer and webapp, or a single segment will have to be forced.
It seems I managed to reproduce case number 3. Previously, the approach I took was a carefully staged small sequence of changes (eventually destined to be a test case) over a small set of files combined with:
iwc.setMergePolicy(NoMergePolicy.INSTANCE);
iwc.setMergeScheduler(NoMergeScheduler.INSTANCE);
done in IndexDatabase#update(), however that did not always lead to an index with deleted documents (which by itself is interesting; I also used iwc.setInfoStream(System.out); however I am not used to reading Lucene debug messages so I cannot really tell what is and what is not going on).
Then I took a more brute force approach: using the OpenGrok Git repository, I checked out a historical changeset and then started checking out the repository one changeset at a time and running the indexer (unchanged 1.7.24, so no extra IndexWriter changes) until it reached the HEAD. Eventually it arrived at a point where the following code (which is a subset of what is normally used when performing a search via the UI):
// Try search
IndexSearcher searcher = new IndexSearcher(indexReader);
Query query = new CustomQueryParser("defs").parse("xrefFile");
TopFieldDocs fdocs = searcher.search(query, 2048, Sort.RELEVANCE);
long totalHits = fdocs.totalHits.value;
ScoreDoc[] hits = fdocs.scoreDocs;
// Iterate over the returned ScoreDocs (totalHits may exceed the 2048 retrieved hits).
for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document doc = searcher.doc(docId);
    String rpath = doc.get(QueryBuilder.PATH);
    if (rpath == null) {
        continue;
    }
    System.out.println("path: " + rpath);
}
actually reported hundreds of documents with the same path (.../IndexDatabase.java in this case) in the results.
In the back of my mind I still reserve some probability that this might not be fixable with how the index fields are set up - the "u" field being the central point while its values are not unique in terms (sic!) of the paths embedded therein, at least in the multi-segment index, that is.
Looking into the search results, all the returned documents are actually live, i.e. not deleted. In the index there was just one deleted document with the IndexDatabase.java path and it was not part of the search results. So, after all it seems true that "IndexSearcher.search should not return deleted documents" and therefore we do not have to worry about case number 3 above, as the existence of these documents in the index is likely related to the way the uid traversal got broken (or it might be a pre-existing bug - this reproduction scenario is based on a Git repository, so the quite recently added history based reindex is used). Anyhow, by fixing the traversal, there should be at most one live document for each path in the index.
Once you have a fix, I can test a tar.gz quickly. I have a project that very often faces the issue of dups.
I wondered how the multiple live documents matching a single path come into existence. By using the NoMerge index writer settings and carefully stepping through certain changesets in the OpenGrok Git repository history for a sequence of indexer runs, I was able to replicate the situation. Here's what happens:
At a certain point, a file (in my case I used IndexDatabase.java) is updated and the result is a deleted document and a new document. Due to the segment merge settings, the deleted document stays in the index. Given this is a Git repository, history based reindex is used and therefore IndexDatabase#processFileIncremental() is called. When traversing the index, unlike its sibling processFile() (used for the directory traversal method of indexing), this method does not use a full uid comparison and stops the uid traversal when it encounters a path ordered equal to or higher than the current term: https://github.com/oracle/opengrok/blob/56c26420273d8804b01b4c4de87d0e0fb27a8fcf/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java#L1529-L1533
Given the deleted document is the first one visited during the traversal (due to the uid ordering), the code enters this block: https://github.com/oracle/opengrok/blob/56c26420273d8804b01b4c4de87d0e0fb27a8fcf/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java#L1556-L1569
Specifically, it will try to delete the document matching the uidIter in removeFileDocUid(), however that document is already deleted. Then it adds the path to the work args, which eventually creates a new document and adds it to the index. This is repeated whenever the file in question is updated, leading to multiple live documents matching the same path.
So, history based reindex leads to the negative LOCs, however it also causes these extra documents to be added.
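A loose illustration of the difference between the two checks (my own simplification, using a made-up uid layout of path plus separator plus timestamp; the real OpenGrok uid encoding differs):

final class UidComparisonSketch {
    // processFile-style: only treat the term as matching when the whole uid
    // (path and timestamp) matches, so the stale uid of an already deleted
    // document (older timestamp) is not mistaken for the current one.
    static boolean fullUidMatches(String uidTerm, String path, String timestamp) {
        return uidTerm.equals(path + '\u0000' + timestamp);
    }

    // processFileIncremental-style: stop the traversal as soon as the path portion
    // of the uid term sorts equal to or higher than the current path, which also
    // stops on the stale uid left behind by the deleted document for the same path.
    static boolean traversalStops(String uidTerm, String path) {
        int sep = uidTerm.indexOf('\u0000');
        String termPath = sep >= 0 ? uidTerm.substring(0, sep) : uidTerm;
        return termPath.compareTo(path) >= 0;
    }
}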
Hi @vladak,
Would you have any ETA for the fix of this issue?
I spent a week writing the tests for my changes and they are almost done, but then I had to stop working for a bit. Once I return to work, it will be a matter of days before I submit a PR.
Super, thanks for the fast update! Hope you will come back to work quickly ;-)
Many thanks @vladak for the fix! As soon as the new release is done, I will install it on my UAT envs and will report feedback.