aws-sdk-pandas
HIVE_CANNOT_OPEN_SPLIT NoSuchKey Error when ingesting iceberg data in parallel
I'm using `wr.athena.to_iceberg` (awswrangler==3.4.2) to ingest ~100 text files from S3 into an Iceberg table with parallel Lambdas:
```python
wr.athena.to_iceberg(
    df=df,
    database=database,
    table=table_name,
    partition_cols=partition_cols,
    table_location=s3_location,
    temp_path=f"s3://{bucket}/{database}/temp/{table_name}",
    keep_files=False,
)
```
but I've encountered:

```text
[ERROR] QueryFailed: HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket/prefix/temp/events/03a20c57ae9642378ad1f829f390613a.snappy.parquet (offset=0, length=504731): io.trino.hdfs.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: EKWQ310403FNNWKV; S3 Extended Request ID: qIu8bN3U+alq/s
```
The file reported as NoSuchKey actually exists and I can see it in S3, albeit with a delete marker associated with it.
Does anyone have hints about what may be going wrong?
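For context, a delete marker on a versioned bucket can be confirmed with a versions listing; here is a minimal boto3 sketch (bucket and key copied from the error above, so treat them as placeholders):

```python
import boto3

s3 = boto3.client("s3")

# List all versions of the key from the error. In a versioned bucket a
# "deleted" object shows up under DeleteMarkers while its data versions
# remain listed under Versions.
resp = s3.list_object_versions(
    Bucket="bucket",
    Prefix="prefix/temp/events/03a20c57ae9642378ad1f829f390613a.snappy.parquet",
)
print("delete markers:", resp.get("DeleteMarkers", []))
print("versions:", resp.get("Versions", []))
```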
I found in the implementation that, after upserting the rows, it deletes the whole temporary prefix and not just the respective files:
```python
if keep_files is False:
    s3.delete_objects(
        path=temp_path or wg_config.s3_output,  # type: ignore[arg-type]
        boto3_session=boto3_session,
        s3_additional_kwargs=s3_additional_kwargs,
    )
```
so that may explain it. But what would be the best practice here?
The fact that you are using parallel Lambdas makes me think there might be a race condition between them, where an object deleted by one invocation is still required by another. You might want to use separate temporary prefixes to avoid this, or preserve the files with `keep_files=True`.
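A minimal sketch of the separate-prefix suggestion, reusing the call from the original report and assuming a uuid suffix per invocation (the wrapper name is illustrative):

```python
import uuid

import awswrangler as wr

def ingest_partition(df, database, table_name, bucket, partition_cols, s3_location):
    # Give each Lambda invocation its own temporary prefix so that the
    # keep_files=False cleanup only deletes this invocation's staging files.
    run_id = uuid.uuid4().hex
    wr.athena.to_iceberg(
        df=df,
        database=database,
        table=table_name,
        partition_cols=partition_cols,
        table_location=s3_location,
        temp_path=f"s3://{bucket}/{database}/temp/{table_name}/{run_id}",
        keep_files=False,
    )
```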
It looks to me like Iceberg is not very robust with respect to parallel writing; `keep_files=True` leads to duplication on subsequent loads, so I tried appending the nanosecond timestamp to the temporary path to make it unique, but now I get an `ICEBERG_COMMIT_ERROR` because of this.
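Iceberg commits use optimistic concurrency, so simultaneous writers can still collide on the metadata swap even when their staging paths are unique; a common mitigation is to retry the failed write. A hedged sketch (matching on the error message string is an assumption, not a documented contract):

```python
import random
import time

import awswrangler as wr

def to_iceberg_with_retry(max_attempts=5, **to_iceberg_kwargs):
    """Retry wr.athena.to_iceberg when a concurrent Iceberg commit conflicts."""
    for attempt in range(1, max_attempts + 1):
        try:
            wr.athena.to_iceberg(**to_iceberg_kwargs)
            return
        except wr.exceptions.QueryFailed as err:
            # Only retry commit conflicts; re-raise anything else, and give
            # up once the attempt budget is exhausted.
            if "ICEBERG_COMMIT_ERROR" not in str(err) or attempt == max_attempts:
                raise
            # Exponential backoff with jitter before retrying the commit.
            time.sleep(min(2 ** attempt, 30) + random.random())
```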
@khoatrandata, how did you overcome the above error for parallel Lambdas?
@B161851, at first I tried invoking the ingestion Lambdas synchronously, in sequence, for the backfill, but that was too slow. I ended up using a crawler to create a Glue table over the raw data and then running a SQL `INSERT INTO ...` against the Iceberg table, which is quicker.
@khoatrandata, can you post that code snippet? I tried to follow your answer above but couldn't make it work. Is it possible to write to the Iceberg table from a Lambda?
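A hedged sketch of the approach described above, not the original poster's exact code: assume a crawler has already registered the raw files as `raw_events` and the Iceberg table is `events_iceberg` (both names illustrative). Athena then copies the data engine-side, with no client-side staging files to clean up, and the call can run from inside a Lambda:

```python
import awswrangler as wr

# Copy rows from the crawled raw table into the Iceberg table in a
# single Athena query.
wr.athena.start_query_execution(
    sql="INSERT INTO events_iceberg SELECT * FROM raw_events",
    database="my_database",  # illustrative database name
    wait=True,               # block until the INSERT finishes
)
```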