pdb icon indicating copy to clipboard operation
pdb copied to clipboard

Performance: Two database queries are executed for each invocation of query()

Open jprafael opened this issue 9 years ago • 7 comments

When invoking the AbstractDatabaseEngine#query(Expression expr) method two database queries are executed.

The first call is invoked with the following call stack:

AbstractDatabaseEngine#checkConnection()
AbstractDatabaseEngine#getConnection()
AbstractDatabaseEngine#iterator(String query, int fetchSize)

This query is only used to check the aliveness of the connection. Because of this call each query() incurs in twice the round-trip time. It is also somewhat useless since it doesn't guarantee that the intended operation will succeed because the connection might be dropped meanwhile.

jprafael avatar Apr 20 '15 17:04 jprafael

The idea behind this check is that the probability of a failure during the execution of a query (not only iterators) should be very similar to the probability of the failure of the check connection. With this we could retry the connection before starting executing the desired operation.

It is true that the rtt is doubled, the question is that if that impacts for real the performance of your application

diogoeag avatar Apr 21 '15 10:04 diogoeag

For me it seems that makes sense to add an option to control whether you want or not for that check to be performed. That might some scenarios where reconnection is not needed and it's not important to recover.

On 21 April 2015 at 12:11, diogoeag [email protected] wrote:

The idea behind this check is that the probability of a failure during the execution of a query (not only iterators) should be very similar to the probability of the failure of the check connection. With this we could retry the connection before starting executing the desired operation.

It is true that the rtt is doubled, the question is that if that impacts for real the performance of your application

— Reply to this email directly or view it on GitHub https://github.com/feedzai/pdb/issues/14#issuecomment-94730020.

rpvilao avatar Apr 21 '15 10:04 rpvilao

I think it could be optimistic, try the query. If it fails, just do the check, retries and try again.

Then again if the rtt is a problem, there's probably something wrong with the client design anyway because it should be either batching for writes or reading asynchronously while other work completes.

defer avatar Apr 21 '15 11:04 defer

@defer, @diogoeag PDB has no (truly) asynchronous methods. We are mimicking this using a separate thread-pool. But note that query() is not a cheap call. Our profiling shows that on a total of 16 seconds spent inside the checkConnection() function, only 4s are spent blocking for IO. The remaining 12 seconds are spent on the driver's internal functions and will keep a CPU thread busy. This prevents us from simply raising the number of processing threads until it is no longer a problem as it would degrade the performance on other components.

More importantly, the response latency is doubled which is a problem when PDB is used in a realtime environment. In this scenario batching is also sub-optimal because you either use very small batches and the RTT becomes a problem, or you use larger batches and the batch fill time itself becomes a problem.

I agree we should be optimistic and use the available connection. I would actually go a bit further and say that reconnects / retries should not be the responsibility of query() but of its caller (which might have specialized recovery/failure behavior).

jprafael avatar Apr 21 '15 12:04 jprafael

@jprafael these 16s are during what period of profilling? If its 20seconds its one thing, if its one day, I wouldn't care about it. Note that some of our HA capabilities (the ones related to database connections) are strictly tied to the feature of retry / recover of PDB. That will not change for now.

Can you elaborate on the real weight of this on the overall process?

diogoeag avatar Apr 21 '15 12:04 diogoeag

I'm not entirely sure you should be depending on pdb for latency sensitive workloads. However, if completion of the task is not a requirement for the caller to finish then yeah, either pdb or the client should behave asynchronously.

That being said, 12s of internal work seems odd and should probably be looked at. Do you have more granular data on what is taking that long?

defer avatar Apr 21 '15 13:04 defer

@diogoeag the JVM was running for 8 mins but the application was running only on the last 4 mins (wall time) of profiling on the entire application. Total thread time amounts to 15000s but is mostly irrelevant since most threads are blocking/waiting. Total executing time of the application however is 914s. So this amounts to roughly 1.64% of the application time.

@defer It is mostly oracle.jdbc.driver.* related. I can upload an HTML export of the measurements of this particular function. Where should I send it to?

jprafael avatar Apr 21 '15 14:04 jprafael