
HIVE_CANNOT_OPEN_SPLIT NoSuchKey Error when ingesting iceberg data in parallel

khoatrandata opened this issue 1 year ago · 6 comments

I'm using wr.athena.to_iceberg from awswrangler==3.4.2 to ingest ~100 text files from S3 into an Iceberg table using parallel Lambdas:

wr.athena.to_iceberg(
    df=df,
    database=database,
    table=table_name,
    partition_cols=partition_cols,
    table_location=s3_location,
    temp_path=f"s3://{bucket}/{database}/temp/{table_name}",
    keep_files=False,
)

but I've encountered the following error:

[ERROR] QueryFailed: HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket/prefix/temp/events/03a20c57ae9642378ad1f829f390613a.snappy.parquet (offset=0, length=504731): io.trino.hdfs.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: EKWQ310403FNNWKV; S3 Extended Request ID: qIu8bN3U+alq/s

The file reported as NoSuchKey actually exists and I can see it in S3, albeit with a delete marker associated with it. [Screenshot, Dec 18 2023: S3 console showing the object with a delete marker]

Does anyone have hints about what may be going wrong?

khoatrandata avatar Dec 18 '23 04:12 khoatrandata

I found in the implementation that, after upserting the rows, it deletes the whole temporary folder rather than just the files from that write:

if keep_files is False:
    s3.delete_objects(
        path=temp_path or wg_config.s3_output,  # type: ignore[arg-type]
        boto3_session=boto3_session,
        s3_additional_kwargs=s3_additional_kwargs,
    )

That may explain it. But what would be the best practice here?

khoatrandata avatar Dec 18 '23 05:12 khoatrandata

The fact that you are using parallel Lambdas makes me think there might be a race condition between them, where an object is deleted by one invocation while another still requires it. You might want to use separate temporary prefixes to avoid this, or preserve the files with keep_files=True.

jaidisido avatar Dec 18 '23 09:12 jaidisido
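The separate-prefix suggestion above can be sketched as follows. This is a minimal illustration, not code from the thread; the helper name and the bucket/database/table arguments are placeholders:

```python
import uuid


def unique_temp_path(bucket: str, database: str, table: str) -> str:
    """Build a per-invocation temp prefix so no Lambda deletes another's staged files."""
    return f"s3://{bucket}/{database}/temp/{table}/{uuid.uuid4().hex}/"


# Usage inside the Lambda handler; the to_iceberg call is otherwise unchanged:
# wr.athena.to_iceberg(
#     df=df,
#     database=database,
#     table=table_name,
#     partition_cols=partition_cols,
#     table_location=s3_location,
#     temp_path=unique_temp_path(bucket, database, table_name),
#     keep_files=False,
# )
```

With a unique prefix per invocation, keep_files=False only removes that invocation's own temporary objects.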

It looks to me like Iceberg is not very robust with respect to parallel writing. keep_files=True leads to duplication on subsequent loads, so I tried appending a nanosecond timestamp to the temporary path to make it unique, but now I get an ICEBERG_COMMIT_ERROR because of this.

khoatrandata avatar Dec 18 '23 10:12 khoatrandata
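ICEBERG_COMMIT_ERROR typically signals an optimistic-concurrency conflict: two writers tried to commit a new table snapshot at the same time. One common mitigation (an assumption here, not a feature of awswrangler) is to retry the write with jittered backoff, as in this sketch where write_fn stands in for the wr.athena.to_iceberg call:

```python
import random
import time


def commit_with_retry(write_fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry write_fn when it fails with an Iceberg commit conflict."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except Exception as exc:  # in practice, narrow this to the QueryFailed exception
            if "ICEBERG_COMMIT_ERROR" not in str(exc) or attempt == max_attempts:
                raise
            # Jittered exponential backoff before the next commit attempt.
            time.sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
```

Retrying helps only with transient commit conflicts; it does not remove the need for unique temporary prefixes per writer.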

@khoatrandata how did you overcome the above error for parallel Lambdas?

B161851 avatar Jan 18 '24 12:01 B161851

@B161851, at first I tried invoking the ingestion Lambdas synchronously, in sequence, for the backfill, but that was too slow. I ended up using a crawler to create a Glue table over the raw data and then running a SQL INSERT INTO ... against the Iceberg table, which is quicker.

khoatrandata avatar Jan 21 '24 22:01 khoatrandata
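The crawler-plus-INSERT approach above can be sketched as follows. This is an illustrative assumption, not the thread author's code; the table and database names are hypothetical, and it assumes a Glue crawler has already catalogued the raw files:

```python
def build_insert_sql(iceberg_table: str, raw_table: str) -> str:
    # Load every row the crawler catalogued into the Iceberg table in one query.
    return f"INSERT INTO {iceberg_table} SELECT * FROM {raw_table}"


# Usage with awswrangler (requires AWS credentials and an Athena workgroup):
# import awswrangler as wr
# wr.athena.start_query_execution(
#     sql=build_insert_sql("events_iceberg", "events_raw"),
#     database="my_database",
#     wait=True,  # block until Athena reports the query finished
# )
```

Because Athena runs the INSERT as a single Iceberg commit, this sidesteps the commit conflicts seen with many parallel to_iceberg writers.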

@khoatrandata can you post that code snippet? I tried to follow your answer above but couldn't make it work. Is it possible to write to the Iceberg table using Lambda?

B161851 avatar Jan 31 '24 13:01 B161851

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] avatar Mar 31 '24 15:03 github-actions[bot]