Provide getProperties method on Configuration that does not filter
The getProperties method which takes a filter is inefficient when you know that you want to get all properties or you know the names of the properties. Provide a mechanism to get the properties w/out using a filter
User was running tests against 2.1.0-SNAPSHOT and found that getting properties was consuming an unexpected amount of CPU time. User provided stack trace:
"Split/MajC initiator" #51 daemon prio=5 os_prio=0 cpu=8876417.63ms elapsed=70703.21s tid=0x0000561224710000 nid=0x29e1 runnable [0x00007fbceee29000] java.lang.Thread.State: RUNNABLE
at org.apache.accumulo.core.conf.DefaultConfiguration.lambda$getProperties$1(DefaultConfiguration.java:59)
at org.apache.accumulo.core.conf.DefaultConfiguration$$Lambda$247/0x00000008403ce840.test(Unknown Source)
at java.util.stream.ReferencePipeline$2$1.accept([email protected]/ReferencePipeline.java:176)
at java.util.HashMap$EntrySpliterator.forEachRemaining([email protected]/HashMap.java:1764)
at java.util.stream.AbstractPipeline.copyInto([email protected]/AbstractPipeline.java:484)
at java.util.stream.AbstractPipeline.wrapAndCopyInto([email protected]/AbstractPipeline.java:474)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential([email protected]/ForEachOps.java:150)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential([email protected]/ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate([email protected]/AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach([email protected]/ReferencePipeline.java:497)
at org.apache.accumulo.core.conf.DefaultConfiguration.getProperties(DefaultConfiguration.java:60)
at org.apache.accumulo.core.conf.SiteConfiguration.getProperties(SiteConfiguration.java:272)
at org.apache.accumulo.core.conf.SiteConfiguration.getProperties(SiteConfiguration.java:266)
at org.apache.accumulo.server.conf.ZooBasedConfiguration.getProperties(ZooBasedConfiguration.java:127)
at org.apache.accumulo.core.conf.AccumuloConfiguration.get(AccumuloConfiguration.java:84)
at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:62)
at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:98)
at org.apache.accumulo.core.conf.AccumuloConfiguration.getAsBytes(AccumuloConfiguration.java:240)
at org.apache.accumulo.tserver.tablet.Tablet.findSplitRow(Tablet.java:1196)
at org.apache.accumulo.tserver.tablet.Tablet.needsSplit(Tablet.java:1344)
- locked <0x0000000547991f30> (a org.apache.accumulo.tserver.tablet.Tablet)
at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:457)
The DefaultConfiguration.getProperties method is inefficient (O(n)) when you know which properties you want to pull from the map (O(1)).
I agree with your observations here. I'll admit that I don't fully understand the layers of configuration objects here. I do think there is an opportunity to consolidate some of this logic, but that's a different PR possibly for another time.
It's being used in AccumuloConfiguration.get(String) and AccumuloConfiguration.iterator
It's being used in
AccumuloConfiguration.get(String)andAccumuloConfiguration.iterator
That was a big code change for such a small convenience, I think. It doesn't save from having to iterate over the properties in AccumuloConfiguration.get(String) case, either (because we're still iterating in the new method body). And, in the AccumuloConfiguration.iterator case, I think there's a bug, because getProperties(entries) there now seems equivalent to getProperties(entries, x -> false); instead of getProperties(entries, x -> true);, because the varargs is empty (meaning no properties are specified and matched, instead of all of them matching).
In the best case scenario, it seems we're avoiding constructing a Predicate in two places in internal code in favor of constructing a varargs arrays instead. I'm not sure that's better in terms of performance, and there's a lot more code to get the job done. Overall, I don't think this change was better than what was there before.
It doesn't save from having to iterate over the properties in AccumuloConfiguration.get(String) case, either (because we're still iterating in the new method body)
We are iterating over the varargs method parameter and pulling the property value from the map of properties vs testing each entry in the map of properties to see if the Predicate returns true.
in the AccumuloConfiguration.iterator case, I think there's a bug, because getProperties(entries) there now seems equivalent to getProperties(entries, x -> false); instead of getProperties(entries, x -> true);, because the varargs is empty (meaning no properties are specified and matched, instead of all of them matching).
I handled this in DefaultConfiguration.getInstance, but I think you are right in that I missed this in the other classes, I will submit a follow-on to fix this.
In the best case scenario, it seems we're avoiding constructing a Predicate in two places in internal code in favor of constructing a varargs arrays instead. I'm not sure that's better in terms of performance, and there's a lot more code to get the job done. Overall, I don't think this change was better than what was there before.
I disagree. In the case where the caller wants to get the value for a single property (or even a few) with fully qualified names its certainly going to be more efficient to pull the properties from the map by key vs iterating over the entire map to test each key.
We are iterating over the varargs method parameter and pulling the property value from the map of properties vs testing each entry in the map of properties to see if the Predicate returns true.
Oh right. Still, this is a small scalar multiple difference. I don't think it will substantially affect performance, given that looking up a small number of properties this way would be rare anyway.
I disagree. In the case where the caller wants to get the value for a single property (or even a few) with fully qualified names its certainly going to be more efficient to pull the properties from the map by key vs iterating over the entire map to test each key.
Given your correction to my misunderstanding above, I agree with you. However, I don't think it matters that much in terms of performance. This is internal code, and it's unlikely we're hitting this often enough to need to optimize this specific case, especially since the caller can just do (pseudo-code): for x in A,B,C { conf.get(x) }
I don't think it will substantially affect performance, given that looking up a small number of properties this way would be rare anyway.
However, I don't think it matters that much in terms of performance
Well, I would point you to this comment above (https://github.com/apache/accumulo/pull/2834#issuecomment-1198394023). Anything that uses AccumuloConfiguration.get(String) is more efficient with this code. There are several places where get(String) is being used in the internal code. Here's one good example:
RFileOperations.openWriter() -> TableConfiguration.get(Property) -> NamespaceConfiguration.get(Property) -> AccumuloConfiguration.get(String).
I don't think it will substantially affect performance, given that looking up a small number of properties this way would be rare anyway.
However, I don't think it matters that much in terms of performance
Well, I would point you to this comment above (#2834 (comment)). Anything that uses
AccumuloConfiguration.get(String)is more efficient with this code. There are several places where get(String) is being used in the internal code. Here's one good example:
That comment reveals an entirely different issue... that is a bug, where the call to Namespace.get(Property) delegates to it's parent using getParent().get(String) instead of getParent().get(Property), like it should.
Retrieving by Property should be very efficient. Retrieving by arbitrary String should be rare, and almost never used. I believe you've optimized for a case that shouldn't be called instead of fixing the calling code.
I believe you've optimized for a case that shouldn't be called instead of fixing the calling code.
That's a fair observation. I didn't question the existing code. It may very well be that the existing code can be fixed and these two commits reverted.
@ctubbsii - are you working on anything related to this? If not, I can start on the things you mentioned above.
@dlmarion wrote:
@ctubbsii - are you working on anything related to this? If not, I can start on the things you mentioned above.
I haven't started anything yet. Feel free to look into it. I was going to fix the calling code, and then check to see if there were any other callers that needed to be similarly fixed, before even considering whether to revert these two commits. I did notice a test failure in NamespacesIT, after the second commit that was related to configuration property inheritance, and I think is related that needs to be investigated as well.
@ctubbsii - See https://github.com/apache/accumulo/pull/2864. I reverted the two commits and added a 3rd. In the 3rd commit I reduced calls to AccumuloConfiguration.get(String) and optimized that method.