opengrok
opengrok copied to clipboard
incremental history sometimes doesn't add proper records
repos indexed from scratch and then incrementally indexed few times after gets history records missing for certain files we've seen this usually on mercurial repos (might happen on others too) it needs to be investigated how/why can this happen with incremental index and fixed, marking as stopper for now ...
defer, we don't have proper steps to repro, nor root cause
I can still see this problem on our production instance running 1.7.23 in bigger Mercurial repository that has renamed file handling enabled. The sync happens every 4 hours and usually the repository receives only a handful of changesets.
Looking at the code in get(): https://github.com/oracle/opengrok/blob/110674a804dc9faa9a665dbac513689aaad3eca2/opengrok-indexer/src/main/java/org/opengrok/indexer/history/FileHistoryCache.java#L674-L686
makes me wonder whether the following situation could cause the problem (assuming Git or Mercurial repository):
- repository is updated on disk (assuming it was indexed previously). This will cause a file to be updated.
- someone triggers
get()for the file (e.g. from the web app). Given the cache for the file is now stale, this will make it updated. ThestoreFile()is called withmergeHistory = falseso the cache file will be overwritten completely. - indexer is run. It will grab the latest indexed changeset (from the
OpenGroklatestRevfile), the incoming changesets will be extracted, converted to the inverted map and the new history entries will be added to the history cache of the file.
This sort of scenario would produce duplicate entries in the history cache files, however this is not what I am seeing - it is rather truncated history. Anyhow, the above code block is suspicious and I'd rather remove it.
Comparing the number of history entries for each cached file in a repo with some 42k files: a checkout that was recently reindexed from scratch had generally more history entries per file than a checkout of the same repo whose history cache has been there for much longer:
find usr/src -type f -name '*.gz' | while read f; do n=$( gzcat $f | grep HistoryEntry |wc -l ); m=$( gzcat ../from-scatch-repo/$f | grep HistoryEntry |wc -l); if [[ $n != $m ]]; then echo "$f $n $m"; fi; done
The number of cache files that had different number of history entries was around 5k which given the arguably light traffic the service receives combined with the fact that the more fresh history cache has always more history entries per file does not support the hypothesis that the problem is in the forced cache refresh done by the get() method.
Interestingly, there are some history cache files in history cache for the repo reindexed from scratch longer time ago that have empty history.
Looking at the actual history and history cache of some of the files where there is a discrepancy (i.e. they got reported by the above shell one-liner), I think the renamed file handling is in play here. This is a Mercurial repository, so for one of the problematic files when I do hg log -b default usr/src/foo, this will produce 8 changesets, while hg log -b default -f usr/src/foo will produce 64 entries. The history cache reindexed from scratch has 64 history entries, while the history cache that is older has 7 entries like as if it was created without renamed file handling. Plus the oldest changeset is missing for some reason - likely an artifact of how the history cache is populated.
Makes me wonder if there is some twist where the per-project (and hence per-repository) renamed handling flag gets turned off for some reason. This would in fact correlate with the initial report of this issue - June 2016 - when the renamed file handling was still enabled by default. It was disabled by default in 2017 via e7749aea05c856b1cffc99d68deff7e049d9b6c0.
Another potentially notable issue is that the RepositoryInfo copy constructor was not copying the renamed file handling field. This was fixed via #3826.