opengrok icon indicating copy to clipboard operation
opengrok copied to clipboard

incremental history sometimes doesn't add proper records

Open tarzanek opened this issue 9 years ago • 6 comments

repos indexed from scratch and then incrementally indexed few times after gets history records missing for certain files we've seen this usually on mercurial repos (might happen on others too) it needs to be investigated how/why can this happen with incremental index and fixed, marking as stopper for now ...

tarzanek avatar Jun 17 '16 08:06 tarzanek

defer, we don't have proper steps to repro, nor root cause

tarzanek avatar Feb 27 '17 13:02 tarzanek

I can still see this problem on our production instance running 1.7.23 in bigger Mercurial repository that has renamed file handling enabled. The sync happens every 4 hours and usually the repository receives only a handful of changesets.

vladak avatar Dec 02 '21 21:12 vladak

Looking at the code in get(): https://github.com/oracle/opengrok/blob/110674a804dc9faa9a665dbac513689aaad3eca2/opengrok-indexer/src/main/java/org/opengrok/indexer/history/FileHistoryCache.java#L674-L686 makes me wonder whether the following situation could cause the problem (assuming Git or Mercurial repository):

  1. repository is updated on disk (assuming it was indexed previously). This will cause a file to be updated.
  2. someone triggers get() for the file (e.g. from the web app). Given the cache for the file is now stale, this will make it updated. The storeFile() is called with mergeHistory = false so the cache file will be overwritten completely.
  3. indexer is run. It will grab the latest indexed changeset (from the OpenGroklatestRev file), the incoming changesets will be extracted, converted to the inverted map and the new history entries will be added to the history cache of the file.

This sort of scenario would produce duplicate entries in the history cache files, however this is not what I am seeing - it is rather truncated history. Anyhow, the above code block is suspicious and I'd rather remove it.

vladak avatar Dec 02 '21 22:12 vladak

Comparing the number of history entries for each cached file in a repo with some 42k files: a checkout that was recently reindexed from scratch had generally more history entries per file than a checkout of the same repo whose history cache has been there for much longer:

find usr/src -type f -name '*.gz' | while read f; do n=$( gzcat $f | grep HistoryEntry |wc -l ); m=$( gzcat ../from-scatch-repo/$f | grep HistoryEntry |wc -l); if [[ $n != $m ]]; then echo "$f $n $m"; fi; done

The number of cache files that had different number of history entries was around 5k which given the arguably light traffic the service receives combined with the fact that the more fresh history cache has always more history entries per file does not support the hypothesis that the problem is in the forced cache refresh done by the get() method.

Interestingly, there are some history cache files in history cache for the repo reindexed from scratch longer time ago that have empty history.

vladak avatar Dec 03 '21 09:12 vladak

Looking at the actual history and history cache of some of the files where there is a discrepancy (i.e. they got reported by the above shell one-liner), I think the renamed file handling is in play here. This is a Mercurial repository, so for one of the problematic files when I do hg log -b default usr/src/foo, this will produce 8 changesets, while hg log -b default -f usr/src/foo will produce 64 entries. The history cache reindexed from scratch has 64 history entries, while the history cache that is older has 7 entries like as if it was created without renamed file handling. Plus the oldest changeset is missing for some reason - likely an artifact of how the history cache is populated.

Makes me wonder if there is some twist where the per-project (and hence per-repository) renamed handling flag gets turned off for some reason. This would in fact correlate with the initial report of this issue - June 2016 - when the renamed file handling was still enabled by default. It was disabled by default in 2017 via e7749aea05c856b1cffc99d68deff7e049d9b6c0.

vladak avatar Dec 03 '21 09:12 vladak

Another potentially notable issue is that the RepositoryInfo copy constructor was not copying the renamed file handling field. This was fixed via #3826.

vladak avatar Dec 03 '21 14:12 vladak