
Support pyarrow RecordBatch as a valid data source for writing an Iceberg table

Open · djouallah opened this issue 1 year ago · 2 comments

Feature Request / Improvement

Currently I'm using this in an environment with limited RAM:

  df = final.arrow()
  catalog, table_location = connect_catalog(storage)
  catalog.create_table_if_not_exists(
      db + ".scada", schema=df.schema, location=table_location + f"/{db}/scada"
  )
  catalog.load_table(db + ".scada").append(df)

I sometimes get out-of-memory errors because the Arrow table needs to be fully materialized in memory. I can generate RecordBatches from my source system, which would use less memory.
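
Roughly what I have in mind, as a sketch only: the schema and generate_batches() are placeholders for my source system, catalog and db come from the snippet above, and append taking a reader is the requested behaviour, not the current API.

  import pyarrow as pa

  schema = pa.schema([("time", pa.timestamp("us")), ("value", pa.float64())])

  def generate_batches():
      # placeholder for the source system: yield small RecordBatches instead
      # of building one big table
      yield pa.RecordBatch.from_pydict({"time": [], "value": []}, schema=schema)

  reader = pa.RecordBatchReader.from_batches(schema, generate_batches())

  # requested behaviour: append consumes the reader batch by batch instead of
  # requiring a fully materialized pa.Table (this is the feature being asked
  # for, not the current API)
  catalog.load_table(db + ".scada").append(reader)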

djouallah avatar Aug 06 '24 02:08 djouallah

@djouallah Thanks for raising this. To clarify, does the final.arrow() cause an OOM, or the .append operation?

Fokko avatar Aug 06 '24 07:08 Fokko

The append operation causes the OOM.

djouallah avatar Aug 06 '24 07:08 djouallah

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Feb 03 '25 00:02 github-actions[bot]

any news :)

djouallah avatar Feb 25 '25 02:02 djouallah

@djouallah do you have the stack trace? I'm curious which part of append is causing the OOM. BTW, there's a to_arrow_batch_reader() API for getting an Arrow batch reader, which can be more memory-efficient.
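
On the read side that looks roughly like this (untested sketch, reusing the db.scada table from the snippet above; process() is a placeholder for your own handling):

  table = catalog.load_table(db + ".scada")

  # to_arrow_batch_reader() streams pyarrow.RecordBatch objects instead of
  # materializing the whole table at once
  for batch in table.scan().to_arrow_batch_reader():
      process(batch)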

kevinjqliu avatar Feb 25 '25 03:02 kevinjqliu

> @djouallah do you have the stack trace? I'm curious which part of append is causing the OOM. BTW, there's a to_arrow_batch_reader() API for getting an Arrow batch reader, which can be more memory-efficient.

It is not a bug. When you do arrow(), all the data is loaded into memory, so either you add more RAM, or you make the table returned by catalog.load_table accept a RecordBatch as valid input (like delta_rs, for example).
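
With delta_rs the write side already takes a reader, roughly like this (minimal sketch; the table path is made up and reader is a pyarrow.RecordBatchReader like the one in the earlier sketch):

  from deltalake import write_deltalake

  # data can be a RecordBatchReader, so nothing has to be materialized up front
  write_deltalake("/tmp/scada", reader, mode="append")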

djouallah avatar Feb 25 '25 04:02 djouallah

Looking at the above, the 2 critical parts are:

df=final.arrow()
...
tbl.append(df)

I'm surprised that the .arrow() part didn't cause the OOM but the .append() part did.

> accept a RecordBatch as valid input (like delta_rs, for example)

Yep, we've got one side of it already: to_arrow_batch_reader() produces record batches. We just need to check the write path for .append().
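
Right now the write side still needs a fully materialized table, so a round trip looks something like this (sketch, with source_tbl and target_tbl as placeholder pyiceberg tables):

  reader = source_tbl.scan().to_arrow_batch_reader()

  # read_all() pulls every batch into one pa.Table in memory, which is exactly
  # the materialization we want to avoid
  target_tbl.append(reader.read_all())

  # the ask in this issue is to be able to pass the reader directly:
  # target_tbl.append(reader)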

It's kind of difficult to recreate a scenario where the OOM happens, so if you have any other information or a way to reproduce it, that would be very helpful!

kevinjqliu avatar Feb 25 '25 15:02 kevinjqliu