UnixFs Stat Command Returns Invalid Data
When using `stat`, the correct `dagSize` value is not returned for non-empty directories. It appears that this line uses the `byteLength` of the UnixFS data in the directory block. This causes `stat` calls on non-empty directories to return an invalid value that does not match the value Kubo reports.
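A minimal sketch to reproduce (assuming an in-memory Helia node; the file contents and names are arbitrary):

```ts
import { unixfs } from '@helia/unixfs'
import { createHelia } from 'helia'

// a minimal sketch assuming an in-memory Helia node - add a single small file
// to an otherwise empty directory, then stat the directory
const helia = await createHelia()
const fs = unixfs(helia)

const fileCid = await fs.addBytes(new Uint8Array([0, 1, 2, 3]))
const emptyDirCid = await fs.addDirectory()
const dirCid = await fs.cp(fileCid, emptyDirCid, 'file.bin')

const stats = await fs.stat(dirCid)

// expected: roughly the file DAG size plus the serialized size of the directory block
// actual: only the byteLength of the UnixFS Data field in the directory block
console.log(stats.dagSize)
```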
A unixfs entry should have a metadata block that contains the sizes of its children, so we should be able to use what is in that metadata directly, or sum the sizes of the children's metadata.
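As a rough sketch of that idea (assuming the root block decodes as dag-pb and that the importer populated the optional `Tsize` field on each link; whether that information is actually present is the open question discussed below):

```ts
import * as dagPb from '@ipld/dag-pb'
import type { Blockstore } from 'interface-blockstore'
import type { CID } from 'multiformats/cid'

// sum the Tsize of each child link plus the serialized size of the root block
// itself - only meaningful if the encoder set Tsize, since it is optional in dag-pb
async function dagSizeFromRoot (blockstore: Blockstore, cid: CID): Promise<bigint> {
  const bytes = await blockstore.get(cid)
  const node = dagPb.decode(bytes)

  let size = BigInt(bytes.byteLength)

  for (const link of node.Links) {
    size += BigInt(link.Tsize ?? 0)
  }

  return size
}
```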
I think the API has some rough edges here.
Of note:
- `fileSize`/`localFileSize` - these don't apply to directories
- `dagSize`/`localDagSize` - this is impossible to calculate for directories unless all blocks are present in the blockstore, since the size of a directory is not stored in the root DAG node for that directory - you have to traverse the DAG, calculating block sizes as you go (see the traversal sketch after this list), which can be expensive
- `blocks` - in the equivalent Kubo API call, this seems to be treated as the number of Links in the root dag-pb node that resolve to files (sub-dirs are ignored). If the directory is large enough to become sharded, this number is the number of sub-shards, so it's not terribly useful. Helia just returns `1` for directories since there's only ever one root block, though there could be plenty of sub-shards in a sharded directory - again, not terribly useful.
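For reference, the traversal described in the `dagSize`/`localDagSize` bullet would look roughly like this (a sketch only; it assumes a DAG of dag-pb and raw nodes, and it throws if any block is missing from the local blockstore):

```ts
import * as dagPb from '@ipld/dag-pb'
import type { Blockstore } from 'interface-blockstore'
import type { CID } from 'multiformats/cid'

// walk every reachable block and sum raw block sizes - any block missing from
// the blockstore causes this to throw, which is why it cannot be calculated
// accurately unless the whole DAG is present locally
async function walkDagSize (blockstore: Blockstore, cid: CID): Promise<bigint> {
  const bytes = await blockstore.get(cid)
  let size = BigInt(bytes.byteLength)

  // 0x70 is the dag-pb codec - raw leaves (0x55) have no links to follow
  if (cid.code === 0x70) {
    for (const link of dagPb.decode(bytes).Links) {
      size += await walkDagSize(blockstore, link.Hash)
    }
  }

  return size
}
```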
Perhaps we should break up the UnixFSStats interface?
```ts
import type { UnixFS } from 'ipfs-unixfs'
import type { CID } from 'multiformats/cid'

// these involve traversing a DAG so are expensive to calculate
interface UnixFSDAGStats {
  // if all blocks for this DAG are in the blockstore
  complete: boolean

  // how many blocks make up the DAG - directories, sub-shards, files, leaf nodes, etc
  // - only accurate if `complete` is `true`
  blocks: bigint

  // how many bytes of the DAG are in the blockstore
  localDagSize: bigint

  // how many bytes of the file/directory are in the blockstore
  localSize: bigint
}

// a file is a DAG-PB node with one or more linked nodes that contain file data or links to other nodes
interface UnixFSFileStats {
  type: 'file'
  cid: CID

  // how big the DAG that holds the file is in bytes (e.g. the sum of all dag-link Tsize fields plus the
  // serialized size of the root node)
  dagSize: bigint

  // UnixFS metadata (has mtime/mode/block count/file size)
  unixfs: UnixFS
}

interface ExtendedUnixFSFileStats extends UnixFSFileStats, UnixFSDAGStats {

}

// a directory is a DAG-PB node with links to other DAG-PB or raw nodes
// if the unixfs type is `directory`, each linked node is a file, a raw block or a directory
// if the type is `hamt-sharded-directory`, each linked node is a file, a raw block or a directory
interface UnixFSDirectoryStats {
  type: 'directory'
  cid: CID

  // UnixFS metadata (has mtime/mode)
  unixfs: UnixFS
}

// these involve traversing the DAG so are expensive to calculate
interface ExtendedUnixFSDirectoryStats extends UnixFSDirectoryStats, UnixFSDAGStats {
  // the size of all files in the directory including in subdirectories
  size: bigint
}

// a raw entry is a bare block that contains file data
interface UnixFSRawStats {
  type: 'raw'
  cid: CID

  // how big the block is
  size: bigint
}

type UnixFSStats = UnixFSFileStats | UnixFSDirectoryStats | UnixFSRawStats
type ExtendedUnixFSStats = ExtendedUnixFSFileStats | ExtendedUnixFSDirectoryStats | UnixFSRawStats
```
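As hypothetical consumer code against that proposed union, a caller would narrow on the `type` discriminator, e.g.:

```ts
// hypothetical usage of the proposed union - narrowing on `type` gives access
// to the fields that are valid for each kind of entry
function describe (stats: UnixFSStats): void {
  if (stats.type === 'file') {
    console.log(`file ${stats.cid}, dagSize ${stats.dagSize}`)
  } else if (stats.type === 'directory') {
    console.log(`directory ${stats.cid}`)
  } else {
    console.log(`raw block ${stats.cid}, size ${stats.size}`)
  }
}
```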
The @helia/unixfs interface would end up changing to something like:
```ts
interface StatOptions {
  offline?: boolean

  // ...same as currently
}

interface ExtendedStatOptions {
  extended: true // not sure if this is the right name

  // ...same as currently
}

fs.stat(cid, options?: StatOptions): Promise<UnixFSStats>
fs.stat(cid, options?: ExtendedStatOptions): Promise<ExtendedUnixFSStats>
```
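Hypothetical call sites against that shape (the option name and overload order are still open questions):

```ts
// cheap: reads only the root block
const stats = await fs.stat(cid)

// expensive: traverses the DAG to fill in the UnixFSDAGStats fields
const extended = await fs.stat(cid, { extended: true })

if (extended.type === 'directory') {
  console.log(extended.size, extended.localSize, extended.complete)
}
```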
For `ExtendedUnixFSDirectoryStats`, what would be the difference between `size` and `localSize`?
If I understood the proposed change correctly, it would involve traversing the DAG to get `ExtendedUnixFSDirectoryStats`, but would that also involve fetching missing blocks? How would that affect `size` and `localSize`?