lakeFS
Batch DBIO markers in listObjects
DBIO (Databricks) performs many getFileStatus calls over lakeFSFS. Each of these calls looks for an object or directory marker named _started_* or _committed_*. Looking for a marker starts with a getObject; if the object is not found, we then call listObjects(..., 1) on that path to discover whether it is a directory marker.[^1]
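For illustration, the two-step lookup described above might look roughly like the sketch below. Everything in it is hypothetical: `ObjectStore`, `InMemoryStore`, `Kind` and `MarkerLookup` are stand-ins, not the lakeFS client API, and treating "any key under `path/`" as the directory test is an assumption about how the check works.

```java
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for the object store; NOT the lakeFS client API.
interface ObjectStore {
    Optional<Long> statObject(String path);               // size, if the object exists
    List<String> listObjects(String prefix, int amount);  // up to `amount` keys under prefix
}

// In-memory store that counts calls, so the cost of each lookup is visible.
final class InMemoryStore implements ObjectStore {
    private final TreeMap<String, Long> objects = new TreeMap<>();
    int statCalls = 0;
    int listCalls = 0;

    void put(String path, long size) { objects.put(path, size); }

    @Override
    public Optional<Long> statObject(String path) {
        statCalls++;
        return Optional.ofNullable(objects.get(path));
    }

    @Override
    public List<String> listObjects(String prefix, int amount) {
        listCalls++;
        // Keys sharing a prefix are contiguous in sorted order, starting at the prefix.
        return objects.tailMap(prefix, true).keySet().stream()
                .takeWhile(k -> k.startsWith(prefix))
                .limit(amount)
                .collect(Collectors.toList());
    }
}

enum Kind { FILE, DIRECTORY, NOT_FOUND }

final class MarkerLookup {
    // Step 1: look for the object itself. Step 2 (only on a miss): list one key
    // under "path/" to decide whether the path behaves like a directory.
    static Kind lookup(ObjectStore store, String path) {
        if (store.statObject(path).isPresent()) {
            return Kind.FILE;
        }
        String dirPrefix = path.endsWith("/") ? path : path + "/";
        return store.listObjects(dirPrefix, 1).isEmpty() ? Kind.NOT_FOUND : Kind.DIRECTORY;
    }
}
```

In this sketch every miss on the object costs a second round trip, which is why marker lookups that mostly miss make the listObjects calls worth batching as well.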
We already optionally batch getObject calls for these markers. Optionally (under the same config parameters), also batch the listObjects calls for these markers that have amount==1.
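The issue does not spell out what batching means here. One possible reading, shown below purely as a sketch, is coalescing: concurrent listObjects(prefix, 1) calls for the same prefix share a single in-flight backend call. `ListCoalescer` and its `backend` function are hypothetical names, not existing lakeFS code, and the config switch that would enable this is deliberately left out.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical coalescer for amount==1 listings; NOT existing lakeFS code.
final class ListCoalescer {
    // One pending result per prefix that currently has a backend call in flight.
    private final ConcurrentHashMap<String, CompletableFuture<List<String>>> inFlight =
            new ConcurrentHashMap<>();
    // Assumed backend: performs the real listObjects(prefix, 1) asynchronously.
    private final Function<String, CompletableFuture<List<String>>> backend;

    ListCoalescer(Function<String, CompletableFuture<List<String>>> backend) {
        this.backend = backend;
    }

    CompletableFuture<List<String>> listOne(String prefix) {
        CompletableFuture<List<String>> mine = new CompletableFuture<>();
        CompletableFuture<List<String>> existing = inFlight.putIfAbsent(prefix, mine);
        if (existing != null) {
            return existing; // join the call that is already in flight
        }
        try {
            backend.apply(prefix).whenComplete((keys, err) -> {
                // Unregister first, so later callers start a fresh backend call.
                inFlight.remove(prefix, mine);
                if (err != null) {
                    mine.completeExceptionally(err);
                } else {
                    mine.complete(keys);
                }
            });
        } catch (RuntimeException e) {
            // The backend failed before returning a future; do not leave a stuck entry.
            inFlight.remove(prefix, mine);
            mine.completeExceptionally(e);
        }
        return mine;
    }
}
```

Results are shared only while a call is in flight and nothing is cached afterwards, so a marker created later is still seen by the next lookup. Whether that matches the behaviour of the existing getObject batching is an open question for the implementers.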
[^1]: Short story: Hadoop FileSystems need getFileStatus to return a lot of information, including whether that file is a directory. That's actually hard to do correctly, and lakeFSFS performs multiple operations for each getFileStatus.
@itaiad200 @arielshaqed Any update on this?