aws-sdk-pandas
Disable Scan selects for parquet files
Feature or Bugfix
- Bugfix
Detail
- S3 Select queries with a scan range don't really make sense for Parquet files, which is why we should fall back to the "default" query in this case as well
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
AWS CodeBuild CI Report
- CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
- Commit ID: 2d4c155b750bd1af5dfc1a5f603fa1aa477b735b
- Result: FAILED
- Build Logs (available for 30 days)
Hi @maxispeicher, what makes you think that scan ranges don't make sense with Parquet? The docs mention that it can be used for parquet:
Amazon S3 Select scan range requests support Parquet, CSV (without quoted delimiters), and JSON objects (in LINES mode only). CSV and JSON objects must be uncompressed. For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the scan range requested are processed.
My tests also show that it works and is faster than reading the entire object.
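For reference, this is roughly what a scan-range request looks like at the boto3 level; it's a minimal sketch, assuming an existing Parquet object (the bucket, key, and byte range below are placeholders, and awswrangler computes the ranges internally):

import boto3

s3 = boto3.client("s3")

# Sketch of an S3 Select request limited to a scan range; bucket, key,
# and the byte range are placeholder assumptions.
resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="data.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.c1 FROM s3object s",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
    # Per the docs quoted above, all row groups that *start* within
    # this byte range are processed.
    ScanRange={"Start": 0, "End": 1024 * 1024},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))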
True, according to the docs it works. I was just looking at the comment in the code and thought it fit nicely with my observations, but then I guess it's more due to the query. I basically wanted to get the max value of a column like this:
import awswrangler as wr

# s3_uri points to an existing Parquet object
df = wr.s3.select_query(
    sql="SELECT MAX(c1) FROM s3object s",
    path=s3_uri,
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=False,
)
In my case (the column has quite a few NaNs), the result is a DataFrame with 40 rows where only the first row is the maximum and all the other rows are NaN. I would need to check whether, if the column contained only valid values, each row would be the maximum of its scan range, or something else.
It also seems to happen when all rows are valid. Reproducible example:
import awswrangler as wr
import pandas as pd

s3_uri = "tbd"

df = pd.DataFrame({"c1": [i for i in range(1_000_000)]})
wr.s3.to_parquet(df=df, path=s3_uri)

df_res = wr.s3.select_query(
    sql="SELECT MAX(c1) FROM s3object s",
    path=s3_uri,
    input_serialization="Parquet",
    input_serialization_params={},
)
print(df_res)
will output:
_1
0 999999
1 None
2 None
3 None
4 None
All in all not a big problem, but maybe it might then make sense to check if there is any aggregate function (AVG, COUNT, MAX, MIN, SUM) in the query, e.g. with a check like the one sketched below.
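A rough sketch of what such a check could look like (contains_aggregate is a hypothetical helper, not an existing awswrangler function):

import re

# Hypothetical helper: detect SQL aggregate functions in a query string.
AGGREGATE_FUNCTIONS = ("AVG", "COUNT", "MAX", "MIN", "SUM")

def contains_aggregate(sql: str) -> bool:
    # Crude check: an aggregate keyword directly followed by "("
    pattern = r"\b(" + "|".join(AGGREGATE_FUNCTIONS) + r")\s*\("
    return re.search(pattern, sql, flags=re.IGNORECASE) is not None

print(contains_aggregate("SELECT MAX(c1) FROM s3object s"))  # True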
Ah I see, I did not realise you were using MAX in the query. So here is what happens:
- The same query request ("SELECT MAX(c1)...") is made to the S3 Select engine, but across multiple scan ranges (this is more performant since those requests are parallelised in threads)
- One of the scan ranges has the actual max for the file, so it returns that. All the others wouldn't have it, so they return null
- We then collect and concatenate the results from all scan ranges and return them in a df, roughly the pattern sketched below
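As a toy illustration of that last step (hard-coded per-range results, not actual awswrangler internals), the concatenation is what produces the multi-row output shown earlier:

import pandas as pd

# Toy reproduction of the collect-and-concatenate step; each scan range
# yields one result row for the aggregate query.
per_range_results = [
    pd.DataFrame({"_1": [999999]}),  # the range containing the file max
    pd.DataFrame({"_1": [None]}),    # ranges without it return null
    pd.DataFrame({"_1": [None]}),
]
print(pd.concat(per_range_results, ignore_index=True))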
I guess there are a few options here: one is to enable the user to disable the scan ranges, the other is to handle aggregate functions separately... The other complication is that in the future we want to implement S3 Select across multiple S3 objects, not just one... and aggregate functions wouldn't work at that point either
Yes, this makes sense to me. Do you know how this actually works internally? Meaning, how do you know it is the maximum of the file if you only look at the specified range? :thinking: It feels like even with a specified scan range you'd need to check more of the file to see if it's really the maximum.
I think for now it might also be fine to just mention that case in the docs. I can do this, if you agree that it's enough. Or you could just drop NaNs when the query is an aggregation query, along the lines of the sketch below.
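For the MAX example above, the client-side workaround could be as simple as this (column name _1 taken from the output shown earlier; this works for MAX and MIN, but partial results for COUNT or SUM would presumably need re-aggregation instead, and AVG couldn't be recombined from per-range averages at all):

# Sketch of the drop-NaN workaround for the MAX example above
clean = df_res.dropna()          # drop the null rows from the other scan ranges
max_value = clean["_1"].iloc[0]  # the single remaining row holds the file max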
I was just somehow puzzled by the return value and didn't invest too much time thinking about it, so thanks for your input :slightly_smiling_face:
Yeah, a good question I asked myself as well. I don't know the internals of the S3 Select engine to be honest, but my guess would be that the MAX query is still resolved on the entire file by the engine, and the results are then sent back across the scan ranges.
I agree, I think a simple mention in the documentation is enough for now...
@jaidisido Or do you think it would be a good idea to just do one query on the complete object when use_threads is set to False?
True, it would indeed work if we make use_threads=False work on an entire file instead of multiple scan ranges. The only concern I have at that point is how unintuitive it would be for users to understand why different values of use_threads return different results for aggregate functions such as MAX. But perhaps clarifying that in the docs would be enough?
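Under that proposal, the two behaviours being discussed would look like this (a sketch of the proposed semantics, not the current ones; s3_uri as in the example above):

import awswrangler as wr

# Proposed semantics (not current behaviour): use_threads=False would run a
# single query over the whole object, so an aggregate returns one row...
df_single = wr.s3.select_query(
    sql="SELECT MAX(c1) FROM s3object s",
    path=s3_uri,
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=False,  # one request, no scan ranges
)

# ...while use_threads=True would keep the parallel scan ranges and return
# one row per range, as shown earlier in the thread.
df_multi = wr.s3.select_query(
    sql="SELECT MAX(c1) FROM s3object s",
    path=s3_uri,
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)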