aws-sdk-pandas
Disable Scan selects for parquet files
Feature or Bugfix
- Bugfix
Detail
- S3 Select queries with a scan range don't really make sense for Parquet files, which is why we should fall back to the "default" query in this case as well
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
AWS CodeBuild CI Report
- CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
- Commit ID: 2d4c155b750bd1af5dfc1a5f603fa1aa477b735b
- Result: FAILED
- Build Logs (available for 30 days)
Hi @maxispeicher, what makes you think that scan ranges don't make sense with Parquet? The docs mention that it can be used for parquet:
Amazon S3 Select scan range requests support Parquet, CSV (without quoted delimiters), and JSON objects (in LINES mode only). CSV and JSON objects must be uncompressed. For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the scan range requested are processed.
My tests also show that it works and is faster than reading the entire object.
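For reference, this is roughly what a scan-range request looks like at the boto3 level; it's a minimal sketch, assuming an existing Parquet object (the bucket, key, and byte range below are placeholders, and awswrangler computes the ranges internally):

import boto3

s3 = boto3.client("s3")

# Sketch of an S3 Select request limited to a scan range; bucket, key,
# and the byte range are placeholder assumptions.
resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="data.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.c1 FROM s3object s",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
    # Per the docs quoted above, all row groups that *start* within
    # this byte range are processed.
    ScanRange={"Start": 0, "End": 1024 * 1024},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))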
True, according to the docs it works. I was just looking at the comment in the code and thought it fit nicely with my observations, but then I guess it's more due to the query. I basically wanted to get the max value of a column like this:
import awswrangler as wr

# s3_uri points to an existing Parquet object
df = wr.s3.select_query(
    sql="SELECT MAX(c1) FROM s3object s",
    path=s3_uri,
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=False,
)
In my case (the column has quite a few NaNs), the result is a DataFrame with 40 rows where only the first row is the maximum and all the other rows are NaN. I would need to check whether, if the column contained only valid values, each row would be the maximum of its scan range, or something else.
It also seems to happen when all rows are valid. Reproducible example:
import awswrangler as wr
import pandas as pd

s3_uri = "tbd"

df = pd.DataFrame({"c1": [i for i in range(1_000_000)]})
wr.s3.to_parquet(df=df, path=s3_uri)

df_res = wr.s3.select_query(
    sql="SELECT MAX(c1) FROM s3object s",
    path=s3_uri,
    input_serialization="Parquet",
    input_serialization_params={},
)
print(df_res)
will output:
_1
0 999999
1 None
2 None
3 None
4 None
All in all not a big problem, but maybe it might then make sense to check if there is any aggregate function (AVG, COUNT, MAX, MIN, SUM) in the query, e.g. with a check like the one sketched below.
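A rough sketch of what such a check could look like (contains_aggregate is a hypothetical helper, not an existing awswrangler function):

import re

# Hypothetical helper: detect SQL aggregate functions in a query string.
AGGREGATE_FUNCTIONS = ("AVG", "COUNT", "MAX", "MIN", "SUM")

def contains_aggregate(sql: str) -> bool:
    # Crude check: an aggregate keyword directly followed by "("
    pattern = r"\b(" + "|".join(AGGREGATE_FUNCTIONS) + r")\s*\("
    return re.search(pattern, sql, flags=re.IGNORECASE) is not None

print(contains_aggregate("SELECT MAX(c1) FROM s3object s"))  # True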
Ah I see, I did not realise you were using MAX in the query. So here is what happens:
- The same query request ("SELECT MAX(c1)...") is made to the S3 Select engine, but across multiple scan ranges (this is more performant since those requests are parallelised in threads)
- One of the scan ranges has the actual max for the file, so it returns that. All the others wouldn't have it, so they return null
- We then collect and concatenate the results from all scan ranges and return them in a df, roughly the pattern sketched below
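As a toy illustration of that last step (hard-coded per-range results, not actual awswrangler internals), the concatenation is what produces the multi-row output shown earlier:

import pandas as pd

# Toy reproduction of the collect-and-concatenate step; each scan range
# yields one result row for the aggregate query.
per_range_results = [
    pd.DataFrame({"_1": [999999]}),  # the range containing the file max
    pd.DataFrame({"_1": [None]}),    # ranges without it return null
    pd.DataFrame({"_1": [None]}),
]
print(pd.concat(per_range_results, ignore_index=True))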
I guess there are a few options here: one is to enable the user to disable the scan ranges, the other is to handle aggregate functions separately... The other complication is that in the future we want to implement S3 Select across multiple S3 objects, not just one... and aggregate functions wouldn't work at that point either
Yes, this makes sense to me. Do you know how this actually works internally? Meaning, how do you know it is the maximum of the file if you only look at the specified range? :thinking: It feels like even with a specified scan range you'd need to check more of the file to see if it's really the maximum.
I think for now it might also be fine to just mention that case in the docs. I can do this, if you agree that it's enough. Or you could just drop NaNs when the query is an aggregation query, along the lines of the sketch below.
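For the MAX example above, the client-side workaround could be as simple as this (column name _1 taken from the output shown earlier; this works for MAX and MIN, but partial results for COUNT or SUM would presumably need re-aggregation instead, and AVG couldn't be recombined from per-range averages at all):

# Sketch of the drop-NaN workaround for the MAX example above
clean = df_res.dropna()          # drop the null rows from the other scan ranges
max_value = clean["_1"].iloc[0]  # the single remaining row holds the file max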
I was just somehow puzzled by the return value and didn't invest too much time thinking about it, so thanks for your input :slightly_smiling_face:
Yeah, a good question I asked myself as well. I don't know the internals of the S3 Select engine to be honest, but my guess would be that the MAX query is still resolved on the entire file by the engine, and the results are then sent back across the scan ranges.
I agree, I think a simple mention in the documentation is enough for now...
@jaidisido Or do you think it would be a good idea to just do one query on the complete object when use_threads is set to False?
True, it would indeed work if we make use_threads=False work on an entire file instead of multiple scan ranges. The only concern I have at that point is how unintuitive it would be for users to understand why different values of use_threads return different results for aggregate functions such as MAX. But perhaps clarifying that in the docs would be enough?
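Under that proposal, the two behaviours being discussed would look like this (a sketch of the proposed semantics, not the current ones; s3_uri as in the example above):

import awswrangler as wr

# Proposed semantics (not current behaviour): use_threads=False would run a
# single query over the whole object, so an aggregate returns one row...
df_single = wr.s3.select_query(
    sql="SELECT MAX(c1) FROM s3object s",
    path=s3_uri,
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=False,  # one request, no scan ranges
)

# ...while use_threads=True would keep the parallel scan ranges and return
# one row per range, as shown earlier in the thread.
df_multi = wr.s3.select_query(
    sql="SELECT MAX(c1) FROM s3object s",
    path=s3_uri,
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)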