How to specify or change configuration
I cannot find how to change or specify the configuration, specifically the loading strategy. All the members are private and only a few mutators are exposed. The only place I can see to override it is in the Connection() constructor -- but my existing app uses libraries that call the generic JDBC connect(). It would be useful to either expose all the configuration properties or to allow a global override of the factory.
Currently the loading strategy is not configurable. We've prepared for it to be configurable, which is why it's in ConnectionConfiguration. We just haven't figured out whether it should be configurable and, if so, whether it should only be possible to switch between the current default (bypass GetQueryResults and load straight from S3) and the GetQueryResults strategy, or whether you should be able to plug in your own strategies.
The same goes for the polling strategy, which could be useful to be able to configure, but also useful to be able to replace with your own implementation in some cases.
What's your use case? Do you want to switch to using GetQueryResults or do you want to provide your own loading strategy?
First, it is still unclear to me what the 'current' loading strategy is -- the docs and the implementation seem misaligned. When I went to look I discovered it's not changeable. What led to this was trying to debug why results seemed not to get cached even though I had set the query token.
> By setting a client request token on a query execution you can make Athena reuse a previous result set if the exact same query has already been run. If you run the same query multiple times this can save money and improve performance.
I was rerunning the same query and specifying the query token, but it still took as long. What I believe now is that I (and maybe you?) misunderstood what this does. It's not, as it at first seemed, a way to retrieve the results from cache from a prior query -- rather it seems more like a deduplication solution, much like SQS receipt tokens, to prevent accidentally issuing the same query due to a failure on the client side when the server side had succeeded. In none of the cases I tried does it do what I hoped: bypass the query and go right to the results from the last one that worked.
So, on the path to see if I could implement that, I thought maybe if I could get the S3 URL to the results then I could fetch it myself -- and yes I can -- but it's not so useful without the metadata, and I don't want to parse that when you have a parser already. But I can't use it without also issuing a new query ...
So back to step one: there is no obvious way to cache query results without making another copy first. I was hoping to deduce a way to reuse the polling or results code via the various configurable features, but found that in fact they are not configurable. But that's where I stopped -- I don't know if it is a short step or a long hike from there to being able to reuse previous query results, and didn't see an easy way to tell.
So to answer your question -- my use case is neither, although it might be both if they were configurable; hard to tell.
> First, it is still unclear to me what the 'current' loading strategy is -- the docs and the implementation seem misaligned.

If there is a misalignment, it would be great to fix it, but as far as I can see, the documentation specifies that the default loading strategy is to load from S3, bypassing the Athena API, and the implementation selects the S3 loading strategy, which implements this behavior.
> By setting a client request token on a query execution you can make Athena reuse a previous result set if the exact same query has already been run. If you run the same query multiple times this can save money and improve performance.
> I was rerunning the same query and specifying the query token, but it still took as long. What I believe now is that I (and maybe you?) misunderstood what this does. It's not, as it at first seemed, a way to retrieve the results from cache from a prior query -- rather it seems more like a deduplication solution, much like SQS receipt tokens, to prevent accidentally issuing the same query due to a failure on the client side when the server side had succeeded. In none of the cases I tried does it do what I hoped: bypass the query and go right to the results from the last one that worked.

The feature is indeed intended for, and described as, a way to ensure exactly-once processing, but we are actively using it for the caching benefit described in the documentation. Arguably, the README could be a bit clearer on what the original purpose of the token is, and that it can also be used for caching. We could probably also make it clearer that you must provide the same token every time you execute the query, including the first time. The way it is written now, it sounds as if you could provide a token to gain access to a previously executed query, which is not true.
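To make the "same token every time, including the first" point concrete, here is a minimal sketch that derives the token deterministically from the SQL text, so identical queries always carry identical tokens. The class and method names (`QueryTokens`, `tokenFor`) are hypothetical and not part of the library, and how the token is handed to the driver is deliberately left out. A SHA-256 hex digest is used because it is 64 characters long, which should fit the token length limits Athena documents (32 to 128 characters, as far as I recall).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the library: derives a stable
// client request token from the SQL text.
final class QueryTokens {
    private QueryTokens() {}

    // Returns the lowercase hex SHA-256 digest of the SQL (64 characters).
    static String tokenFor(String sql) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(sql.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

Note that a token derived only from the SQL never changes, so if the underlying data changes you would presumably keep getting the old result set; mixing a date or data version into the hashed string is one way around that.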
Here is a specific use case for overloading configuration:
the application must assume a specific role to access Athena and S3 (which is different from the default role the process is running with).
The way to make it sort of work with 4.0:
- create a custom class `io.burt.athena.configuration.CustomConnectionConfigurationFactory` extending `ConnectionConfigurationFactory`, overriding the `createConnectionConfiguration` method, and inlining the `ConnectionConfiguration` interface there.
- create a custom class `io.burt.athena.CustomDataSource` extending `AthenaDataSource` that takes a `ConnectionConfigurationFactory` as an argument and passes it to the `super` constructor.
- now you can create a `CustomDataSource` instead of an `AthenaDataSource` and pass your custom connection configuration to it.
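The shape of that workaround can be sketched as below. This is not the library's code: `ConnectionConfiguration`, `ConnectionConfigurationFactory` and `AthenaDataSource` are heavily simplified, hypothetical stand-ins so the example compiles on its own, and the real signatures (including the parameters of `createConnectionConfiguration`) will differ. The role ARN and the `describeCredentials` and `configurationFor` methods are likewise made up for illustration.

```java
// Hypothetical, simplified stand-ins for the library types named above.
interface ConnectionConfiguration {
    String describeCredentials();
}

class ConnectionConfigurationFactory {
    ConnectionConfiguration createConnectionConfiguration(String databaseName) {
        return () -> "default credentials for " + databaseName;
    }
}

class AthenaDataSource {
    private final ConnectionConfigurationFactory factory;

    AthenaDataSource(ConnectionConfigurationFactory factory) {
        this.factory = factory;
    }

    ConnectionConfiguration configurationFor(String databaseName) {
        return factory.createConnectionConfiguration(databaseName);
    }
}

// Step 1: a factory that returns a configuration using the assumed role.
class CustomConnectionConfigurationFactory extends ConnectionConfigurationFactory {
    private final String roleArn;

    CustomConnectionConfigurationFactory(String roleArn) {
        this.roleArn = roleArn;
    }

    @Override
    ConnectionConfiguration createConnectionConfiguration(String databaseName) {
        return () -> "assumed role " + roleArn + " for " + databaseName;
    }
}

// Step 2: a data source that passes the custom factory to the super constructor.
class CustomDataSource extends AthenaDataSource {
    CustomDataSource(ConnectionConfigurationFactory factory) {
        super(factory);
    }
}
```

Step 3 is then `new CustomDataSource(new CustomConnectionConfigurationFactory(roleArn))` wherever an `AthenaDataSource` would otherwise have been created.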
The default Athena driver, unfortunately, auto-registers itself with the default configuration on class load and therefore leaves no opportunity to inject a custom configuration. The original non-open-source Athena driver sort of dealt with this problem by having a configuration parameter holding the fully-qualified name of a class that would do the configuration work. I'd argue this is pretty nasty and not a good way to do these things. There are many ways of dealing with configuration injection here, but none of them are decent. I'd say a half-bad solution would be to have a base non-self-registering driver.
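A rough sketch of what a base non-self-registering driver could look like, using only the standard `java.sql` API. Everything specific here is an assumption for illustration: the `BaseExampleDriver` name, the `jdbc:example:` URL prefix, and the string standing in for a real configuration object. Actual connection creation is omitted.

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Properties;
import java.util.logging.Logger;

// Hypothetical base driver. There is deliberately no static initializer
// calling DriverManager.registerDriver, so loading the class has no side effects.
class BaseExampleDriver implements Driver {
    static final String URL_PREFIX = "jdbc:example:";

    // Stand-in for an injected configuration (or configuration factory).
    private final String configurationName;

    BaseExampleDriver(String configurationName) {
        this.configurationName = configurationName;
    }

    String configurationName() {
        return configurationName;
    }

    @Override
    public boolean acceptsURL(String url) {
        return url != null && url.startsWith(URL_PREFIX);
    }

    @Override
    public Connection connect(String url, Properties info) throws SQLException {
        if (!acceptsURL(url)) {
            return null; // the JDBC contract for URLs a driver does not handle
        }
        throw new SQLFeatureNotSupportedException("connections are omitted from this sketch");
    }

    @Override
    public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
        return new DriverPropertyInfo[0];
    }

    @Override
    public int getMajorVersion() {
        return 1;
    }

    @Override
    public int getMinorVersion() {
        return 0;
    }

    @Override
    public boolean jdbcCompliant() {
        return false;
    }

    @Override
    public Logger getParentLogger() throws SQLFeatureNotSupportedException {
        throw new SQLFeatureNotSupportedException("java.util.logging is not used");
    }
}
```

An application would then call `DriverManager.registerDriver(new BaseExampleDriver("custom-role"))` once at startup, after which libraries that go through the generic `DriverManager.getConnection` path reach the custom-configured instance. A thin self-registering subclass could keep today's default behavior for everyone else.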
LMK if you want an MR for this.