Batch DBIO markers in listObjects

Open arielshaqed opened this issue 1 year ago • 1 comment

DBIO (Databricks) performs many getFileStatus calls over lakeFSFS. Each of these calls looks for an object or directory marker named _started_* or _committed_*. Looking for a marker starts with a getObject; if the object is not found, we then call listObjects(..., 1) on the same path to discover whether it is a directory marker.[^1]

We already optionally batch getObject calls for these markers. Add an option, controlled by the same config parameters, to also batch the listObjects calls for these markers that have amount==1.
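
To make the request concrete, here is a rough sketch of one way such batching could work. It assumes "batching" means coalescing identical concurrent calls so that they share a single backend request; this issue does not specify the mechanism. `RequestCoalescer`, its `get` method, and the string key format are hypothetical names for illustration, not existing lakeFS code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/**
 * Hypothetical sketch: callers that ask for the same key while a request is
 * still in flight share that request's future instead of issuing a new one.
 */
final class RequestCoalescer<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    /** The loader stands in for the real backend call, e.g. a listObjects with amount==1. */
    CompletableFuture<V> get(K key, Function<K, CompletableFuture<V>> loader) {
        CompletableFuture<V> shared = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, shared);
        if (existing != null) {
            return existing; // an identical request is already running; reuse it
        }
        CompletableFuture<V> backend;
        try {
            backend = loader.apply(key);
        } catch (RuntimeException e) {
            // The loader failed before returning a future: clean up and report.
            inFlight.remove(key, shared);
            shared.completeExceptionally(e);
            return shared;
        }
        backend.whenComplete((value, error) -> {
            // Drop the entry first so later callers start a fresh request.
            inFlight.remove(key, shared);
            if (error != null) {
                shared.completeExceptionally(error);
            } else {
                shared.complete(value);
            }
        });
        return shared;
    }
}
```

In this sketch the key would encode everything that identifies the call (for example repository, ref, prefix, and amount), so only truly identical listObjects calls are merged. Nothing is cached after completion: once the shared request finishes, the next caller triggers a new backend call.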

[^1]: Short story: Hadoop FileSystems need getFileStatus to return a lot of information, including whether that file is a directory. That's actually hard to do correctly, and lakeFSFS performs multiple operations for each getFileStatus.

arielshaqed avatar Jun 14 '24 06:06 arielshaqed

@itaiad200 @arielshaqed Any update on this?

offirc2 avatar Jul 08 '24 11:07 offirc2