
Add Files metadata table

Open Gowthami03B opened this issue 10 months ago • 1 comments

Gowthami03B avatar Apr 18 '24 04:04 Gowthami03B

Hi @HonahX could we get your help in triggering this workflow to see if the CI succeeds?

sungwy avatar May 06 '24 17:05 sungwy

Sorry for not following up on this, @Gowthami03B. Could you rebase so we can get this in? Thanks!

Fokko avatar May 29 '24 10:05 Fokko

@Gowthami03B gentle ping, this is the last metadata table, and we would love to include this into the release! 🙌

Fokko avatar Jun 26 '24 21:06 Fokko

@Fokko @kevinjqliu @amogh-jahagirdar Can I get a re-review here please? Want to close this asap for the release timeline :)

Gowthami03B avatar Jun 27 '24 12:06 Gowthami03B

LGTM too. @Gowthami03B Thanks for working on this! Thanks everyone for reviewing. Let's get this last metadata table in!

HonahX avatar Jul 04 '24 04:07 HonahX

Hi guys, sorry if it's not the right place to ask this question. Do you know of a viable way to speed up table.inspect.files() for large tables? Maybe something in mind that I could implement and contribute to upstream.

I haven't profiled it yet, but I suspect the gist of the issue is that manifest.fetch_manifest_entry is called synchronously and sequentially in a loop. Offloading this call to a thread-based executor doesn't help much, probably because of the GIL, and a process-based executor is harder to implement because of the unpicklable types involved.

As of now, collecting pyspark's .files metadata table is considerably quicker than doing the same with pyiceberg.
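To illustrate the two obstacles mentioned above, here is a minimal, self-contained sketch. It does not use pyiceberg at all: `decode_entries` is a hypothetical stand-in for CPU-bound manifest decoding, and `ManifestLike` is a hypothetical object that holds an unpicklable member. The real types and call paths in pyiceberg may look quite different.

```python
import pickle
import threading
from concurrent.futures import ThreadPoolExecutor


def decode_entries(manifest_id: int, n: int = 50_000) -> int:
    """Hypothetical stand-in for CPU-bound manifest decoding (pure-Python work)."""
    total = 0
    for i in range(n):
        total += (i * manifest_id) % 7
    return total


def read_all_sequential(manifest_ids):
    # Baseline: one manifest after another, as in a plain loop.
    return [decode_entries(m) for m in manifest_ids]


def read_all_threaded(manifest_ids, max_workers: int = 4):
    # Threads help when the work is I/O-bound. Pure-Python CPU-bound work
    # still takes turns on the GIL in standard CPython builds, so little or
    # no speedup should be expected for this stand-in.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decode_entries, manifest_ids))


class ManifestLike:
    """Hypothetical object holding an unpicklable member (a lock)."""

    def __init__(self, manifest_id: int):
        self.manifest_id = manifest_id
        self.lock = threading.Lock()


def is_picklable(obj) -> bool:
    # A process-based executor has to pickle arguments and results,
    # so this check approximates "can it cross a process boundary?".
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False
```

Both readers return the same results, so the threaded version is a drop-in replacement, just not necessarily a faster one. If objects fail the `is_picklable` check, one possible workaround is to pass only plain, picklable inputs (for example a path string) to worker processes and rebuild the richer objects inside each worker.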

DieHertz avatar Sep 25 '24 20:09 DieHertz

I think there's definitely room for improvement. @DieHertz do you mind opening an issue for this?

kevinjqliu avatar Sep 25 '24 21:09 kevinjqliu

Will do

DieHertz avatar Sep 25 '24 21:09 DieHertz