aws-sdk-pandas
aws-sdk-pandas copied to clipboard
Proper formatter for Athena parameterized queries
Is your feature request related to a problem? Please describe. The feature is related to #609, the main problem that users still need manually quoted values
Current workaround which I personally use, do not use params
at all for awswrangler.athena.read_sql_query
and awswrangler.athena.start_query_execution
. Instead of use PyAthena and their Formatter with just simple snippet:
from typing import Any, Dict, Optional
from pyathena.formatter import DefaultParameterFormatter
def prepare_query(operation: str, parameters: Optional[Dict[str, Any]] = None) -> str:
"""Dict of parameters that will be used for constructing the SQL query
Uses for prevent SQL Injection attack and help users to format parameter based on type
Note: ``awswrangler.athena.read_sql_query`` and ``awswrangler.athena.start_query_execution``
parameter ``params`` have limitations with formatting and escaping
Args:
operation: SQL operation with DB-API 2.0 paramstyle ``pyformat``: %(param)s
parameters: dictionary of SQL parameters
"""
if not parameters:
return operation
formatter = DefaultParameterFormatter()
return formatter.format(operation, parameters)
If we use snippet above with query
SELECT *
FROM spam.egg
WHERE %(bool_value)s
AND int_value=%(int_value)s
AND string_value=%(string_value)s
AND decimal_value=%(decimal_value)s
AND date_value=%(date_value)s
AND ts_value=%(ts_value)s
and parameters
{
"bool_value": True,
"int_value": 1,
"string_value": "'; --",
"decimal_value": decimal.Decimal("99999999999999999999.333333333333"),
"date_value": datetime.date(1987, 12, 21),
"ts_value": datetime.datetime(2000, 1, 1, 0, 0, 0, 0),
}
turned into
SELECT *
FROM spam.egg
WHERE True
AND int_value=1
AND string_value='''; --'
AND decimal_value=DECIMAL '99999999999999999999.333333333333'
AND date_value=DATE '1987-12-21'
AND ts_value=TIMESTAMP '2000-01-01 00:00:00.000'
In the other case with current aws-data-wrangle==2.9.0
with query and same parameters
SELECT *
FROM spam.egg
WHERE :bool_value;
AND int_value=:int_value;
AND string_value=:string_value;
AND decimal_value=:decimal_value;
AND date_value=:date_value;
AND ts_value=:ts_value;
We got
SELECT *
FROM spam.egg
WHERE True
AND int_value=1
AND string_value=\'; --
AND decimal_value=99999999999999999999.333333333333
AND date_value=1987-12-21
AND ts_value=2000-01-01 00:00:00
Again workaround exists and also users can prepare parameters by their own (escaping quotation) as it mention in documentation.
However It would be nice if this feature exists into aws-data-wrangler
out of the box.
Describe the solution you'd like I don't think there is good solutions (aka silver bullet) for now
Solution 1
Integrate aws-data-wrangler
with PyAthena
pros:
- Easy to implement
cons:
- Additional dependency
- Breaking changes. Users required to change their code: different parameter type with PyAthena, already quoted strings will quoted again
Solution 2
Make some own solution for formatting and escaping values, add to awswrangler.athena.read_sql_query
and awswrangler.athena.start_query_execution
parameter params_format: bool = False
and apply it only if set to True
Sample snippet
def escaped_param(param: str):
param = param.replace("\\", r"\\").replace("'", r"\'").replace("\r", r"\r")\
.replace("\n", r"\n").replace("\t", r"\t")
return f"'{param}'"
def formatted_params(value: Any) -> str:
""" Format values based on type
Parameters
----------
value : Any
parameter value.
Returns
-------
str
Formatted and escaped value
Examples
--------
>>> formatted_params("string")
"'string'"
>>> formatted_params("I\'m string")
"'I\\\\'m string'"
>>> formatted_params(decimal.Decimal("0.999999999"))
"DECIMAL '0.999999999'"
>>> formatted_params(True)
'TRUE'
>>> formatted_params(None)
'NULL'
>>> formatted_params([1, 2, None])
'ARRAY[1, 2, NULL]'
>>> formatted_params(datetime.date(1987,12,21))
"DATE '1987-12-21'"
>>> formatted_params(datetime.datetime(2000,1,1,0,0,0,0))
"TIMESTAMP '2000-01-01 00:00:00.000'"
"""
if isinstance(value, str):
return escaped_param(value)
if isinstance(value, bool):
return str(value).upper()
if isinstance(value, int):
return str(value)
if isinstance(value, float):
return f"{value:f}"
if isinstance(value, decimal.Decimal):
return f"DECIMAL '{value:f}'"
if isinstance(value, datetime.datetime):
if value.tzinfo is None: # timezone aware
return f"TIMESTAMP '{value.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}'"
else:
raise ValueError(f"Support only timezone aware datatype but got {value.tzinfo}")
if isinstance(value, datetime.date):
return f"DATE '{value.strftime('%Y-%m-%d')}'"
if isinstance(value, (list, tuple)):
return f"ARRAY[{', '.join(formatted_params(x) for x in value)}]"
if value is None:
return "NULL"
raise ValueError(f"Unsupported type {value.__class}")
def prepare_sql_params(params: Dict[str, Any]) -> Dict[str, str]:
return {k: formatted_params(v) for k, v in params.items()}
pros:
- independent of
PyAthena
- Previous code will work, as soon as
params_format = False
cons:
- Required additional tests
It might also make sense to directly use the new released parameterized queries support for Athena: https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-athena-adds-parameterized-queries-to-improve-reusability-and-security/?nc1=h_ls
It might also make sense to directly use the new released parameterized queries support for Athena: https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-athena-adds-parameterized-queries-to-improve-reusability-and-security/?nc1=h_ls
I've checked. PREPARE statement pretty nice, however it also required escaping and formatting.
Create some dummy PREPARE statements
import boto3
session = boto3.session.Session()
client = session.client("athena")
client.create_prepared_statement(
StatementName="predicate_int",
WorkGroup="aws_data_wrangler_0",
Description="Use integer as predicate",
QueryStatement="""SELECT * FROM (
VALUES (1, 'a', DATE '2021-01-01'),
(2, 'b', DATE '2021-01-02'),
(3, 'c', DATE '2021-01-03')
) AS t (id, name, dt)
WHERE id = ?"""
)
client.create_prepared_statement(
StatementName="predicate_string",
WorkGroup="aws_data_wrangler_0",
Description="Use string as predicate",
QueryStatement="""SELECT * FROM (
VALUES (1, 'a', DATE '2021-01-01'),
(2, 'b', DATE '2021-01-02'),
(3, 'c', DATE '2021-01-03')
) AS t (id, name, dt)
WHERE name = ?"""
)
client.create_prepared_statement(
StatementName="predicate_date",
WorkGroup="aws_data_wrangler_0",
Description="Use date as predicate",
QueryStatement="""SELECT * FROM (
VALUES (1, 'a', DATE '2021-01-01'),
(2, 'b', DATE '2021-01-02'),
(3, 'c', DATE '2021-01-03')
) AS t (id, name, dt)
WHERE dt = ?"""
)
Still required formatting, it work fine for numeric type like int
and float
predicate_int_exec_id = client.start_query_execution(
WorkGroup="aws_data_wrangler_0",
QueryString="EXECUTE predicate_int USING ({0})".format(1)
)["QueryExecutionId"]
status = client.get_query_execution(
QueryExecutionId=predicate_int_exec_id
)["QueryExecution"]["Status"]
print(status["State"], status.get("StateChangeReason"))
# SUCCEEDED None
However for string it required quotations and escaping
predicate_string_exec_id = client.start_query_execution(
WorkGroup="aws_data_wrangler_0",
QueryString="EXECUTE predicate_string USING ({0})".format("a")
)["QueryExecutionId"]
status = client.get_query_execution(
QueryExecutionId=predicate_string_exec_id
)["QueryExecution"]["Status"]
print(status["State"], status.get("StateChangeReason"))
# FAILED SYNTAX_ERROR: line 1:33: Constant expression cannot contain column references
predicate_string_exec_id = client.start_query_execution(
WorkGroup="aws_data_wrangler_0",
QueryString="EXECUTE predicate_string USING ('{0}')".format("a")
)["QueryExecutionId"]
status = client.get_query_execution(
QueryExecutionId=predicate_string_exec_id
)["QueryExecution"]["Status"]
print(status["State"], status.get("StateChangeReason"))
# SUCCEEDED None
value = "I'm string"
# Will raise: # InvalidRequestException: An error occurred (InvalidRequestException)
# when calling the StartQueryExecution operation: line 1:36: mismatched input 'm'.
# Expecting: <expression>
predicate_string_exec_id = client.start_query_execution(
WorkGroup="aws_data_wrangler_0",
QueryString="EXECUTE predicate_string USING ('{0}')".format(value)
)["QueryExecutionId"]
And also some problem with other types like date
import datetime
predicate_date_exec_id = client.start_query_execution(
WorkGroup="aws_data_wrangler_0",
QueryString="EXECUTE predicate_date USING ('{0}')".format(datetime.date(2021, 1, 1))
)["QueryExecutionId"]
status = client.get_query_execution(
QueryExecutionId=predicate_date_exec_id
)["QueryExecution"]["Status"]
print(status["State"], status.get("StateChangeReason"))
# FAILED SYNTAX_ERROR: line 9:11: '=' cannot be applied to date, varchar(10)
And also I could not find information about quotas for prepared statement
@Taragolis, first of all thank you ever so much for this feature request description. It's very clear and we appreciate the level of detail.
Here are our thoughts on the subject:
- PyAthena is an interesting project and we are considering it as an additional dependency. It will depend on whether we can extract more from it than just the formatter. The fact that it would introduce breaking changes might be an issue though. One option could be to deprecate the existing
params
field in favor of a new one but we would prefer not to. Will update this thread once we have deep dived the library. We would only create our custom formatter as a last resort I believe - It seems that the latest Athena parameterized query feature does not handle formatting so would not address that piece. However, we will include it as separate methods as it could be useful for other purposes
PyAthena solution is still better compared to currently existing parametrization in wrangler, as the latter doesn't seem to do more than str.format(...)
in python
@Taragolis thanks for taking your time to describe the issue and proposing pyathena, saved me a significant time.
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.
It seems that the latest Athena parameterized query feature does not handle formatting so would not address that piece. However, we will include it as separate methods as it could be useful for other purposes
@jaidisido Curious if this was implemented somewhere.
Not yet - we still hope to implement a proper formatter across the library but must say it has slipped through other priorities
Can't speak for everyone, but in my practice parametrized queries are frequently the only approved way to add parameters from code (and IMO the most reliable one).
I had to go with pyathena's formatting because of this issue
Same here!
@jaidisido @cnfait - I found wrangler 3.0 in Milestones So, probably it is a good time to change formatter for Athena queries (breaking changes)? I could work on this feature
Contributions are always welcome @Taragolis, and we could indeed consider a breaking change for 3.0.0 but please let us know in which direction you would like to take this so we can collaborate on it, cheers!
I will try to create built-in formatter just for avoid additional imports
I will try to create built-in formatter just for avoid additional imports
Agreed - that is most likely the right approach, perhaps you can take inspiration from the one in pyathena?