SOLR-16962: Restore ability to configure tlog directory
See: SOLR-16962
This turned into a bigger change than I'd anticipated, but I really didn't see any other way around it. I found that I had to refactor (for clarity) some method/variable names to be more explicit about the distinction between tlog and ulog. Both concepts are indeed distinct, and are required for different purposes in different places.
ulog dir remains the directory that is configured by the updateLog.dir param, and is the parent directory of the tlog dir. The tlog dir is the directory that contains the transaction log files. Specifying ulog dir as the parent directory allows consistent behavior for the normal case (where we want the tlog dir to be implicitly scoped to the core as a consequence of being a subdirectory within the core data dir (normally) or core instance dir), and the special case (where an external shared top-level ulog dir is configured, within which directories must be created that are explicitly scoped to their associated core).
I opted to track the tlog dirs through the core DirectoryFactory, which makes directory deletion safer (can't delete untracked directories) and more flexible (e.g., supporting (or at least not breaking for) hdfs).
When is this going to be merged?
I am in progress of configuring a Solr Cloud cluster and was baffled the ulog dir is silently ignored by the Solr server.
I believe any serious production setup will keep data and tx logs on separate drives so I think this should be merged ASAP, given the documentation clearly says the ulog dir variable is considered.
Agreed, and thanks for the nudge @jkrauss82. This got too close to the 9.6 release and I didn't want to rush it in, but I'll go ahead and merge this today or tomorrow. I'm personally hoping for a 9.7 soon (esp. thinking also of https://github.com/apache/solr/pull/1557).
Thanks, yes I will definitely keep an eye on the builds.
I'll wait a few days -- @markrmiller if you're able to weigh in that would be much appreciated!
I do have a specific question about this:
UpdateLog existingLog = updateHandler.getUpdateLog();
if (this.ulog != null && this.ulog == existingLog) {
// If we are reusing the existing update log, inform the log that its update handler has
// changed. We do this as late as possible.
this.ulog.init(this, core);
}
... specifically, why do we need to do this "as late as possible"? afaict the only thing the UpdateHandler is used for is to update the UpdateLog's ref to the UpdateHandler -- no fields, etc. consulted. This would be far simpler if we always called ulog.init(this, core) from the super ctor (UpdateHandler), as long as ulog != null.
I'm also particularly interested in feedback on this TODO:
// TODO: why do we need fancy concurrency constructs here? I don't see where this would
// actually be called concurrently. Could `TestHdfsUpdateLog.testFSThreadSafety()`
// (introduced by SOLR-7113) in fact be the only place that requires this? Or perhaps we
// need the `init(UpdateHandler, SolrCore)` method in its entirety (or some large portion of
// it) to synchronize on `this`, to ensure visibility of non-final fields that are
// initialized?
The build failure for 962b99f430c56f7387fbf233950fbc12946aaa70 is legit, but not related to this PR. I've opened an issue/PR for that failure.
For the record, considering the above question about concurrency constructs:
I've dug into SOLR-7113, and I'm increasingly convinced that although it's technically correct that UpdateLog.init(UpdateHandler, SolrCore) is not thread safe, the issue title (filed with initial report) is misleading in calling attention to "thread safety", and the issue fix (in the form of making a portion of the method synchronized on fsLock object) is misleading, because afaict UpdateLog.init(UpdateHandler, SolrCore) is never called concurrently in application code, nor should it be.
The issue with SOLR-7113 was not one of thread safety, so much as one of leaking resources and closing them across different scopes. This issue was effectively fixed by disallowing modification of tlog location across core reloads.
My inclination atm is to leave the AtomicBoolean in place -- we need something to shortcircuit repeated init in application code (which does happen, albeit not in a way that needs to be thread safe), and leaving AtomicBoolean is thread-safe against the test that was written for SOLR-7113; so AtomicBoolean gets the job done for both, even if it's not strictly necessary as far as application code goes.
Ok, I'm planning to merge this tomorrow. The most recent commits add some extra documentation, a test fix, clarity, robustness, and simplify a bit (IMO). I will keep an eye on the builds for any related failures.
Proposed CHANGES.txt:
* SOLR-16962: Restore ability to configure tlog directory. Includes substantial refactoring and clarification
of logic and definitions around `ulog` and `tlog`. Refactoring largely focuses on UpdateLog and
HdfsUpdateLog. There is no anticipated change in existing behavior, other than that the `dir` parameter
to the `<updateLog>` element in `solrconfig.xml`, which has long been silently ignored, will now be
respected (Michael Gibney, David Smiley)
You might leave the "refactoring" aspect to the commit message and just put in CHANGES.txt what users ought to know.
You might leave the "refactoring" aspect to the commit message and just put in CHANGES.txt what users ought to know.
Revised:
* SOLR-16962: Restore ability to configure tlog directory. Changes largely focus on UpdateLog and
HdfsUpdateLog. There is no anticipated change in existing behavior, other than that the `dir` parameter
to the `<updateLog>` element in `solrconfig.xml`, which has long been silently ignored, will now be
respected (Michael Gibney, David Smiley)