opengrok
support history traversal based reindex without history cache
As suggested in https://github.com/oracle/opengrok/discussions/4262#discussioncomment-5507017, it should be possible to support history-based reindex even without the history cache. This would nicely complement the economical mode (-e) in reducing the amount of space used in the data root.
This would require some way to store the latest indexed changeset for every repository in a given project. This is currently stored in the data root in the OpenGroklatestRev files (one per repository).
Also, the history traversal checks (namely, making sure that all the repositories of a given project support history traversal) need to be performed.
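To illustrate the check mentioned above, here is a minimal sketch. The RepoInfo and TraversalCheck types are hypothetical stand-ins made up for this example, not the actual OpenGrok classes, and treating a project with no repositories as not eligible is an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical, minimal view of a repository; not the OpenGrok Repository class. */
final class RepoInfo {
    final String name;
    final boolean supportsHistoryTraversal;

    RepoInfo(String name, boolean supportsHistoryTraversal) {
        this.name = name;
        this.supportsHistoryTraversal = supportsHistoryTraversal;
    }
}

/** Hypothetical helper performing the per-project check. */
final class TraversalCheck {
    private TraversalCheck() {
    }

    /** Returns the names of the repositories that do not support history traversal. */
    static List<String> blockingRepositories(Collection<RepoInfo> repositories) {
        List<String> blocking = new ArrayList<>();
        for (RepoInfo repo : repositories) {
            if (!repo.supportsHistoryTraversal) {
                blocking.add(repo.name);
            }
        }
        return blocking;
    }

    /** Assumption: a project qualifies only if it has repositories and none of them blocks. */
    static boolean canUseHistoryTraversal(Collection<RepoInfo> repositories) {
        return !repositories.isEmpty() && blockingRepositories(repositories).isEmpty();
    }
}
```

Returning the list of blocking repositories (rather than just a boolean) would make it possible to log why a project fell back to the regular reindex.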
This depends to some extent on #4063/#4745, because the history cache should be properly tunable per project.
A new indexer option should be introduced to control this, e.g. --historyCache with an on/off value.
If the history cache is disabled for a given project, the question is whether to override the short circuit in HistoryGuru#getHistoryFromRepository() for repositories that support per-directory history. To expand on this: if the history cache is disabled for a project and yet history is enabled, the 2nd phase of indexing will attempt to fetch the history for a given file in AnalyzerGuru#populateDocument() by calling populateDocumentHistory(). This will first try the history cache (HistoryGuru#getHistory() is called with fallback set to true) and, when nothing is found there, fall back to the repository method. That method bails out if the file belongs to a repository capable of getting history for directories (such as Git or Mercurial), because getting the history per file is inefficient for such repositories.
I am inclined towards leaving this as is, mainly for simplicity. The UI should still be capable of displaying the history even when the history cache for a given project is off; it just would not be possible to search the history, because for repositories of this kind it is not indexed, for the reason described above.
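The lookup described above can be modeled roughly as follows. This is a simplified sketch of the decision only, not the actual HistoryGuru code; HistorySource, HistoryLookupModel and the overrideShortCircuit parameter are hypothetical names introduced for illustration.

```java
/** Where the history of a single file ends up coming from (simplified model). */
enum HistorySource { CACHE, REPOSITORY, NONE }

/** Hypothetical model of the per-file history lookup done in the 2nd phase of indexing. */
final class HistoryLookupModel {
    private HistoryLookupModel() {
    }

    /**
     * @param foundInCache            the history cache has an entry for the file
     * @param fallback                the caller allows falling back to the repository
     * @param repoHasDirectoryHistory the repository can get history per directory
     * @param overrideShortCircuit    hypothetical switch that would lift the short circuit
     */
    static HistorySource resolve(boolean foundInCache, boolean fallback,
                                 boolean repoHasDirectoryHistory, boolean overrideShortCircuit) {
        if (foundInCache) {
            return HistorySource.CACHE;
        }
        if (!fallback) {
            return HistorySource.NONE;
        }
        // The short circuit: per-file history is considered too expensive
        // for repositories that can provide history per directory.
        if (repoHasDirectoryHistory && !overrideShortCircuit) {
            return HistorySource.NONE;
        }
        return HistorySource.REPOSITORY;
    }
}
```

With the history cache off, foundInCache is always false, so leaving the short circuit in place means such repositories always end up with NONE during indexing, which is the "history not searchable" trade-off mentioned above.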
Although it might be tempting to reuse the current scheme with OpenGroklatestRev files, the change should be done without making this a special case in history cache creation, as that would make the HistoryGuru/HistoryCache code fragile and more complex due to the added corner cases. Instead, there should be a new abstraction for getting/setting the last indexed revision number of a repository, used by the history cache as well as by the 2nd phase of the indexer. The data storage is yet to be determined; it probably should not be stored in the index, because the 1st phase of indexing has no knowledge of it. It could reuse the OpenGroklatestRev files, with a directory structure matching the history cache directory structure but located in a different directory, just without using the history cache code to manipulate them. The life cycle (repository addition/removal) should be considered as well.
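One possible shape for such an abstraction is sketched below, assuming a file-based store that keeps one OpenGroklatestRev file per repository under a separate base directory. The interface and class names, the method set and the write-then-rename approach are all assumptions made for illustration; none of this exists in OpenGrok as written here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;

/** Hypothetical abstraction for the last indexed revision of a repository. */
interface LastIndexedRevisionStore {
    Optional<String> get(String repositoryPath) throws IOException;

    void set(String repositoryPath, String revision) throws IOException;

    /** Covers the life cycle part: called when a repository is removed. */
    void remove(String repositoryPath) throws IOException;
}

/** Hypothetical file-based implementation, independent of the history cache code. */
final class FileLastIndexedRevisionStore implements LastIndexedRevisionStore {
    private static final String FILE_NAME = "OpenGroklatestRev";
    private final Path baseDir;

    FileLastIndexedRevisionStore(Path baseDir) {
        this.baseDir = baseDir.toAbsolutePath().normalize();
    }

    /** Maps a source-root-relative repository path to its revision file. */
    private Path fileFor(String repositoryPath) {
        String relative = repositoryPath.replaceAll("^/+", "");
        Path file = baseDir.resolve(relative).resolve(FILE_NAME).normalize();
        if (!file.startsWith(baseDir)) {
            throw new IllegalArgumentException("path escapes base directory: " + repositoryPath);
        }
        return file;
    }

    @Override
    public Optional<String> get(String repositoryPath) throws IOException {
        Path file = fileFor(repositoryPath);
        if (!Files.isRegularFile(file)) {
            return Optional.empty();
        }
        String revision = Files.readString(file).trim();
        return revision.isEmpty() ? Optional.empty() : Optional.of(revision);
    }

    @Override
    public void set(String repositoryPath, String revision) throws IOException {
        Path file = fileFor(repositoryPath);
        Files.createDirectories(file.getParent());
        // Write to a temporary file first so that readers never see a partial write.
        Path tmp = Files.createTempFile(file.getParent(), "latestRev", ".tmp");
        Files.writeString(tmp, revision);
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    @Override
    public void remove(String repositoryPath) throws IOException {
        Files.deleteIfExists(fileFor(repositoryPath));
    }
}
```

Both the history cache and the 2nd phase of the indexer could then depend on the interface only, which leaves the storage decision open.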