aws-sdk-pandas icon indicating copy to clipboard operation
aws-sdk-pandas copied to clipboard

Proper formatter for Athena parameterized queries

Open Taragolis opened this issue 3 years ago • 13 comments

Is your feature request related to a problem? Please describe. The feature is related to #609, the main problem that users still need manually quoted values

Current workaround which I personally use, do not use params at all for awswrangler.athena.read_sql_query and awswrangler.athena.start_query_execution. Instead of use PyAthena and their Formatter with just simple snippet:

from typing import Any, Dict, Optional

from pyathena.formatter import DefaultParameterFormatter


def prepare_query(operation: str, parameters: Optional[Dict[str, Any]] = None) -> str:
    """Dict of parameters that will be used for constructing the SQL query
    Uses for prevent SQL Injection attack and help users to format parameter based on type

    Note: ``awswrangler.athena.read_sql_query`` and ``awswrangler.athena.start_query_execution``
          parameter ``params`` have limitations with formatting and escaping

    Args:
        operation: SQL operation with DB-API 2.0 paramstyle ``pyformat``: %(param)s
        parameters: dictionary of SQL parameters
    """
    if not parameters:
        return operation

    formatter = DefaultParameterFormatter()
    return formatter.format(operation, parameters)

If we use snippet above with query

SELECT * 
FROM spam.egg 
WHERE %(bool_value)s
  AND int_value=%(int_value)s
  AND string_value=%(string_value)s
  AND decimal_value=%(decimal_value)s
  AND date_value=%(date_value)s
  AND ts_value=%(ts_value)s

and parameters

{
    "bool_value": True,
    "int_value": 1,
    "string_value": "'; --",
    "decimal_value": decimal.Decimal("99999999999999999999.333333333333"),
     "date_value": datetime.date(1987, 12, 21),
    "ts_value": datetime.datetime(2000, 1, 1, 0, 0, 0, 0),
}

turned into

SELECT * 
FROM spam.egg 
WHERE True
  AND int_value=1
  AND string_value='''; --'
  AND decimal_value=DECIMAL '99999999999999999999.333333333333'
  AND date_value=DATE '1987-12-21'
  AND ts_value=TIMESTAMP '2000-01-01 00:00:00.000'

In the other case with current aws-data-wrangle==2.9.0 with query and same parameters

SELECT * 
FROM spam.egg
WHERE :bool_value;
  AND int_value=:int_value;
  AND string_value=:string_value;
  AND decimal_value=:decimal_value;
  AND date_value=:date_value;
  AND ts_value=:ts_value;

We got

SELECT * 
FROM spam.egg 
WHERE True
  AND int_value=1
  AND string_value=\'; --
  AND decimal_value=99999999999999999999.333333333333
  AND date_value=1987-12-21
  AND ts_value=2000-01-01 00:00:00

Again workaround exists and also users can prepare parameters by their own (escaping quotation) as it mention in documentation. However It would be nice if this feature exists into aws-data-wrangler out of the box.

Describe the solution you'd like I don't think there is good solutions (aka silver bullet) for now

Solution 1 Integrate aws-data-wrangler with PyAthena

pros:

  • Easy to implement

cons:

  • Additional dependency
  • Breaking changes. Users required to change their code: different parameter type with PyAthena, already quoted strings will quoted again

Solution 2 Make some own solution for formatting and escaping values, add to awswrangler.athena.read_sql_query and awswrangler.athena.start_query_execution parameter params_format: bool = False and apply it only if set to True

Sample snippet

def escaped_param(param: str):
    param = param.replace("\\", r"\\").replace("'", r"\'").replace("\r", r"\r")\
                 .replace("\n", r"\n").replace("\t", r"\t")
    return f"'{param}'"


def formatted_params(value: Any) -> str:
    """ Format values based on type

    Parameters
    ----------
    value : Any
        parameter value.

    Returns
    -------
    str
        Formatted and escaped value

    Examples
    --------

    >>> formatted_params("string")
    "'string'"
    >>> formatted_params("I\'m string")
    "'I\\\\'m string'"
    >>> formatted_params(decimal.Decimal("0.999999999"))
    "DECIMAL '0.999999999'"
    >>> formatted_params(True)
    'TRUE'
    >>> formatted_params(None)
    'NULL'
    >>> formatted_params([1, 2, None])
    'ARRAY[1, 2, NULL]'
    >>> formatted_params(datetime.date(1987,12,21))
    "DATE '1987-12-21'"
    >>> formatted_params(datetime.datetime(2000,1,1,0,0,0,0))
    "TIMESTAMP '2000-01-01 00:00:00.000'"
    """

    if isinstance(value, str):
        return escaped_param(value)
    if isinstance(value, bool):
        return str(value).upper()
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        return f"{value:f}"
    if isinstance(value, decimal.Decimal):
        return f"DECIMAL '{value:f}'"
    if isinstance(value, datetime.datetime):
        if value.tzinfo is None:  # timezone aware
            return f"TIMESTAMP '{value.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}'"
        else:
            raise ValueError(f"Support only timezone aware datatype but got {value.tzinfo}")
    if isinstance(value, datetime.date):
        return f"DATE '{value.strftime('%Y-%m-%d')}'"
    if isinstance(value, (list, tuple)):
        return f"ARRAY[{', '.join(formatted_params(x) for x in value)}]"
    if value is None:
        return "NULL"

    raise ValueError(f"Unsupported type {value.__class}")

def prepare_sql_params(params: Dict[str, Any]) -> Dict[str, str]:
    return {k: formatted_params(v) for k, v in params.items()}

pros:

  • independent of PyAthena
  • Previous code will work, as soon as params_format = False

cons:

  • Required additional tests

Taragolis avatar Jul 10 '21 11:07 Taragolis

It might also make sense to directly use the new released parameterized queries support for Athena: https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-athena-adds-parameterized-queries-to-improve-reusability-and-security/?nc1=h_ls

maxispeicher avatar Jul 12 '21 11:07 maxispeicher

It might also make sense to directly use the new released parameterized queries support for Athena: https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-athena-adds-parameterized-queries-to-improve-reusability-and-security/?nc1=h_ls

I've checked. PREPARE statement pretty nice, however it also required escaping and formatting.

Create some dummy PREPARE statements

import boto3
session = boto3.session.Session()
client = session.client("athena")

client.create_prepared_statement(
    StatementName="predicate_int",
    WorkGroup="aws_data_wrangler_0",
    Description="Use integer as predicate",
    QueryStatement="""SELECT * FROM (
                      VALUES (1, 'a', DATE '2021-01-01'),
                             (2, 'b', DATE '2021-01-02'),
                             (3, 'c', DATE '2021-01-03')
                      ) AS t (id, name, dt)
                       WHERE id = ?"""
)

client.create_prepared_statement(
    StatementName="predicate_string",
    WorkGroup="aws_data_wrangler_0",
    Description="Use string as predicate",
    QueryStatement="""SELECT * FROM (
                      VALUES (1, 'a', DATE '2021-01-01'),
                             (2, 'b', DATE '2021-01-02'),
                             (3, 'c', DATE '2021-01-03')
                      ) AS t (id, name, dt)
                       WHERE name = ?"""
)

client.create_prepared_statement(
    StatementName="predicate_date",
    WorkGroup="aws_data_wrangler_0",
    Description="Use date as predicate",
    QueryStatement="""SELECT * FROM (
                      VALUES (1, 'a', DATE '2021-01-01'),
                             (2, 'b', DATE '2021-01-02'),
                             (3, 'c', DATE '2021-01-03')
                      ) AS t (id, name, dt)
                       WHERE dt = ?"""
)

Still required formatting, it work fine for numeric type like int and float

predicate_int_exec_id = client.start_query_execution(
    WorkGroup="aws_data_wrangler_0",
    QueryString="EXECUTE predicate_int USING ({0})".format(1)
)["QueryExecutionId"]

status = client.get_query_execution(
    QueryExecutionId=predicate_int_exec_id
)["QueryExecution"]["Status"]

print(status["State"], status.get("StateChangeReason"))
# SUCCEEDED None

However for string it required quotations and escaping

predicate_string_exec_id = client.start_query_execution(
    WorkGroup="aws_data_wrangler_0",
    QueryString="EXECUTE predicate_string USING ({0})".format("a")
)["QueryExecutionId"]

status = client.get_query_execution(
    QueryExecutionId=predicate_string_exec_id
)["QueryExecution"]["Status"]

print(status["State"], status.get("StateChangeReason"))
# FAILED SYNTAX_ERROR: line 1:33: Constant expression cannot contain column references

predicate_string_exec_id = client.start_query_execution(
    WorkGroup="aws_data_wrangler_0",
    QueryString="EXECUTE predicate_string USING ('{0}')".format("a")
)["QueryExecutionId"]

status = client.get_query_execution(
    QueryExecutionId=predicate_string_exec_id
)["QueryExecution"]["Status"]

print(status["State"], status.get("StateChangeReason"))
# SUCCEEDED None

value = "I'm string"

# Will raise: # InvalidRequestException: An error occurred (InvalidRequestException) 
# when calling the StartQueryExecution operation: line 1:36: mismatched input 'm'. 
# Expecting: <expression>
predicate_string_exec_id = client.start_query_execution(
    WorkGroup="aws_data_wrangler_0",
    QueryString="EXECUTE predicate_string USING ('{0}')".format(value)
)["QueryExecutionId"]

And also some problem with other types like date

import datetime

predicate_date_exec_id = client.start_query_execution(
    WorkGroup="aws_data_wrangler_0",
    QueryString="EXECUTE predicate_date USING ('{0}')".format(datetime.date(2021, 1, 1))
)["QueryExecutionId"]

status = client.get_query_execution(
    QueryExecutionId=predicate_date_exec_id
)["QueryExecution"]["Status"]

print(status["State"], status.get("StateChangeReason"))
# FAILED SYNTAX_ERROR: line 9:11: '=' cannot be applied to date, varchar(10)

And also I could not find information about quotas for prepared statement

Taragolis avatar Jul 13 '21 09:07 Taragolis

@Taragolis, first of all thank you ever so much for this feature request description. It's very clear and we appreciate the level of detail.

Here are our thoughts on the subject:

  1. PyAthena is an interesting project and we are considering it as an additional dependency. It will depend on whether we can extract more from it than just the formatter. The fact that it would introduce breaking changes might be an issue though. One option could be to deprecate the existing params field in favor of a new one but we would prefer not to. Will update this thread once we have deep dived the library. We would only create our custom formatter as a last resort I believe
  2. It seems that the latest Athena parameterized query feature does not handle formatting so would not address that piece. However, we will include it as separate methods as it could be useful for other purposes

jaidisido avatar Jul 15 '21 16:07 jaidisido

PyAthena solution is still better compared to currently existing parametrization in wrangler, as the latter doesn't seem to do more than str.format(...) in python

@Taragolis thanks for taking your time to describe the issue and proposing pyathena, saved me a significant time.

arogozhnikov avatar Feb 02 '22 02:02 arogozhnikov

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] avatar May 07 '22 15:05 github-actions[bot]

It seems that the latest Athena parameterized query feature does not handle formatting so would not address that piece. However, we will include it as separate methods as it could be useful for other purposes

@jaidisido Curious if this was implemented somewhere.

arogozhnikov avatar May 07 '22 17:05 arogozhnikov

Not yet - we still hope to implement a proper formatter across the library but must say it has slipped through other priorities

jaidisido avatar May 09 '22 09:05 jaidisido

Can't speak for everyone, but in my practice parametrized queries are frequently the only approved way to add parameters from code (and IMO the most reliable one).

I had to go with pyathena's formatting because of this issue

arogozhnikov avatar May 09 '22 16:05 arogozhnikov

Same here!

olly-writes-code avatar May 09 '22 17:05 olly-writes-code

@jaidisido @cnfait - I found wrangler 3.0 in Milestones So, probably it is a good time to change formatter for Athena queries (breaking changes)? I could work on this feature

Taragolis avatar Jun 10 '22 14:06 Taragolis

Contributions are always welcome @Taragolis, and we could indeed consider a breaking change for 3.0.0 but please let us know in which direction you would like to take this so we can collaborate on it, cheers!

jaidisido avatar Jun 13 '22 08:06 jaidisido

I will try to create built-in formatter just for avoid additional imports

Taragolis avatar Jun 16 '22 21:06 Taragolis

I will try to create built-in formatter just for avoid additional imports

Agreed - that is most likely the right approach, perhaps you can take inspiration from the one in pyathena?

jaidisido avatar Jun 17 '22 10:06 jaidisido