amazon-s3-find-and-forget
The library used by FnF to create parquet files is different from the one Spark uses: FnF uses parquet-cpp-arrow version 7.0.0, while Spark uses parquet-mr version 1.10.1.
The schema for timestamp columns changes as shown below.

Pre FnF:

```
############ Column(datetime) ############
name: datetime
path: datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
```

Post FnF:

```
############ Column(datetime) ############
name: datetime
path: datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
```
Do you foresee any issues as newer versions of Spark are released?

@matteofigus : Need your expertise here please @VAIBHAVTARANGE
Hello,
Just to understand your question better, is there any underlying context/issue that led you to look at this? i.e. unexpected changes to the data stored in the parquet file?
@ctd : Thank you for your quick response. Yes, we observed a small issue (which does not impact us badly). Spark 3 uses a different calendar, as per the JIRA below; I believe all dates/timestamps before 1900 are impacted:
https://issues.apache.org/jira/browse/SPARK-31404
We used the workaround below to read the data through Spark:
https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-3-upgrade?view=sql-server-ver15 (please refer to the section "SparkUpgradeException due to calendar mode change")
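For reference, the workaround amounts to setting Spark 3's rebase-mode settings so it no longer raises `SparkUpgradeException` on legacy timestamps. A sketch of the relevant settings (values `LEGACY` or `CORRECTED` depending on which calendar the files were written with; `int96RebaseModeInRead` requires Spark 3.1+):

```
# spark-defaults.conf (Spark 3.x)
spark.sql.legacy.parquet.datetimeRebaseModeInRead  CORRECTED
spark.sql.legacy.parquet.int96RebaseModeInRead     CORRECTED
```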
Hence, we wanted to get your expert opinion on whether other issues like this might pop up, given that parquet-mr and parquet-cpp-arrow are two different libraries.
@matteofigus
Hi @vivek-biradar @VAIBHAVTARANGE, thanks for opening an issue. I am not very familiar with the scenario you mentioned, but I know that manipulating dates and times is indeed risky due to compatibility issues; in fact we mention it in the production readiness docs: https://github.com/awslabs/amazon-s3-find-and-forget/blob/master/docs/PRODUCTION_READINESS_GUIDELINES.md#4-run-your-test-queries
Will there be any other issues? To be honest, I don't know, but I think you are on the right path to find out. For each dataset, my recommendation is to keep a sample in a test account, perform a test deletion, and validate the schema of the output to ensure all systems you use for reading are backward-compatible with the newly created object. After you perform the necessary testing, you can onboard the dataset in production.
@matteofigus : Thank you for the reply. We did test this in our test environment and are running in production based on that. Our question was more forward-looking: will pyarrow and parquet-mr stay in sync on the parquet format as they are now? I know it's a hard question, but it would help if you could get an expert recommendation from within the AWS team (probably the EMR folks).
Again, thank you for such a quick response.
@VAIBHAVTARANGE