
UnixFs Stat Command Returns Invalid Data

Open jtsmedley opened this issue 1 year ago • 2 comments

When using stat on a non-empty directory, the returned dagSize is incorrect. It appears that this line uses the byteLength of the UnixFS data in the directory block, so stat returns a value that does not match the one Kubo reports.

jtsmedley avatar Aug 08 '24 15:08 jtsmedley

A UnixFS entry should have a metadata block that contains the sizes of its children, so we should be able to use what is in that metadata directly, or sum the sizes from the children's metadata.
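A minimal sketch of the "sum sizes of children's metadata" idea, assuming the sizes in question are the optional Tsize values on the root node's dag-pb links. The `LinkLike`/`NodeLike` shapes and `estimateDagSize` are hypothetical names for illustration, not part of Helia. Tsize is written by whoever created the DAG and is not verified, so the result is an estimate rather than a measured size.

```typescript
// Hypothetical minimal shapes, loosely modelled on a decoded dag-pb node.
interface LinkLike {
  Name?: string
  // cumulative size of the DAG the link points to, if the writer recorded it
  Tsize?: number
}

interface NodeLike {
  Links: LinkLike[]
}

// Estimate the DAG size from the root block alone: the serialized size of the
// root block plus the Tsize of every link. No child blocks are loaded.
function estimateDagSize (rootBlockByteLength: number, node: NodeLike): bigint {
  let total = BigInt(rootBlockByteLength)

  for (const link of node.Links) {
    // Tsize is optional in dag-pb, so a missing value contributes nothing
    total += BigInt(link.Tsize ?? 0)
  }

  return total
}

const dir: NodeLike = {
  Links: [
    { Name: 'a.txt', Tsize: 1200 },
    { Name: 'b.txt', Tsize: 300 }
  ]
}

const estimated = estimateDagSize(110, dir)
console.log(estimated) // 1610n
```

This only reads the root block, so it stays cheap, but it inherits any inaccuracy in the recorded Tsize values.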

SgtPooki avatar Aug 15 '24 14:08 SgtPooki

I think the API has some rough edges here.

Of note:

  • fileSize/localFileSize: these don't apply to directories
  • dagSize/localDagSize: these are impossible to calculate for directories unless all blocks are present in the blockstore, since the size of a directory is not stored in the root DAG node for that directory. You have to traverse the DAG, calculating block sizes as you go, which can be expensive
  • blocks: in the equivalent Kubo API call, this seems to be the number of Links in the root dag-pb node that resolve to files (sub dirs are ignored). If the directory is large enough to become sharded, it is the number of sub-shards instead, so it's not terribly useful. Helia just returns 1 for directories since there's only ever one root block, though a sharded directory could have plenty of sub-shards, so again this is not terribly useful
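To illustrate why the traversal matters, here is a rough sketch of walking a DAG against a local store and reporting whether every block was found. Everything in it is a stand-in: the blockstore is a plain `Map` keyed by string, `BlockLike` and `walkLocalDag` are hypothetical names, and counting a block that is linked more than once only once is an assumption, not established Helia or Kubo behaviour.

```typescript
// Hypothetical stand-ins: a block is its bytes plus the keys of the blocks it
// links to, and the blockstore is a Map keyed by a CID string.
interface BlockLike {
  bytes: Uint8Array
  links: string[]
}

interface DagWalkResult {
  complete: boolean
  blocks: bigint
  localDagSize: bigint
}

function walkLocalDag (root: string, blockstore: Map<string, BlockLike>): DagWalkResult {
  const seen = new Set<string>()
  const pending: string[] = [root]
  let complete = true
  let blocks = BigInt(0)
  let localDagSize = BigInt(0)

  while (pending.length > 0) {
    const key = pending.pop() as string

    // assumption: a block linked from several places is counted once
    if (seen.has(key)) {
      continue
    }
    seen.add(key)

    const block = blockstore.get(key)

    if (block == null) {
      // a missing block means the totals are only a lower bound
      complete = false
      continue
    }

    blocks += BigInt(1)
    localDagSize += BigInt(block.bytes.byteLength)
    pending.push(...block.links)
  }

  return { complete, blocks, localDagSize }
}

// 'b' is linked from the root but absent from the store
const store = new Map<string, BlockLike>([
  ['root', { bytes: new Uint8Array(100), links: ['a', 'b'] }],
  ['a', { bytes: new Uint8Array(40), links: [] }]
])

const result = walkLocalDag('root', store)
console.log(result.complete, result.blocks, result.localDagSize) // false 2n 140n
```

The cost is one blockstore read per block in the DAG, which is the expense mentioned above for large or sharded directories.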

Perhaps we should break up the UnixFSStats interface?

// these involve traversing a DAG so are expensive to calculate
interface UnixFSDAGStats {
  // if all blocks for this DAG are in the blockstore
  complete: boolean

  // how many blocks make up the DAG - directories, sub-shards, files, leaf nodes, etc
  // - only accurate if `complete` is `true`
  blocks: bigint

  // how many bytes of the DAG are in the blockstore
  localDagSize: bigint

  // how many bytes of the file/directory are in the blockstore
  localSize: bigint
}

// a file is a DAG-PB node with one or more linked nodes that contain file data or links to other nodes
interface UnixFSFileStats {
  type: 'file'
  cid: CID

  // how big the DAG that holds the file is in bytes (e.g. the sum of all dag-link Tsize fields plus the
  // serialized size of the root node)
  dagSize: bigint

  // UnixFS metadata (has mtime/mode/block count/file size)
  unixfs: UnixFS
}

interface ExtendedUnixFSFileStats extends UnixFSFileStats, UnixFSDAGStats {
}

// a directory is a DAG-PB node with links to other DAG-PB or raw nodes
// if the unixfs type is `directory`, each linked node is a file, a raw block or a directory
// if the type is `hamt-sharded-directory`, each linked node is a file, a raw block, a directory or a sub-shard
interface UnixFSDirectoryStats {
  type: 'directory'
  cid: CID
  
  // UnixFS metadata (has mtime/mode)
  unixfs: UnixFS
}

// these involve traversing the DAG so are expensive to calculate
interface ExtendedUnixFSDirectoryStats extends UnixFSDirectoryStats, UnixFSDAGStats {
  // the size of all files in the directory including in subdirectories
  size: bigint
}

// a raw entry is a bare block that contains file data
interface UnixFSRawStats {
  type: 'raw'
  cid: CID

  // how big the block is
  size: bigint
}

type UnixFSStats = UnixFSFileStats | UnixFSDirectoryStats | UnixFSRawStats
type ExtendedUnixFSStats = ExtendedUnixFSFileStats | ExtendedUnixFSDirectoryStats | UnixFSRawStats
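One consequence of splitting the interface this way is that callers can narrow on the `type` field. The sketch below uses trimmed, hypothetical copies of the proposed types (`CID` is stubbed as a string and the `unixfs` field is left out) purely to show the narrowing; it is not the actual @helia/unixfs API.

```typescript
// Trimmed, hypothetical copies of the proposed types; CID is stubbed as a string
type CID = string

interface FileStats {
  type: 'file'
  cid: CID
  dagSize: bigint
}

interface DirectoryStats {
  type: 'directory'
  cid: CID
}

interface RawStats {
  type: 'raw'
  cid: CID
  size: bigint
}

type Stats = FileStats | DirectoryStats | RawStats

// The `type` discriminant lets the compiler narrow to the matching interface,
// so size fields are only reachable where they are defined.
function describe (stats: Stats): string {
  switch (stats.type) {
    case 'file':
      return `file ${stats.cid}, DAG is ${stats.dagSize} bytes`
    case 'directory':
      return `directory ${stats.cid}, sizes need an extended stat`
    case 'raw':
      return `raw block ${stats.cid}, ${stats.size} bytes`
  }
}

console.log(describe({ type: 'raw', cid: 'bafy-example', size: BigInt(5) }))
// raw block bafy-example, 5 bytes
```

Accessing `stats.dagSize` in the `'directory'` branch would be a compile-time error, which is the point of dropping size fields from the basic directory stats.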

The @helia/unixfs interface would end up changing to something like:

interface StatOptions {
  offline?: boolean
  // ...same as currently
}

interface ExtendedStatOptions {
  extended: true // not sure if this is the right name
  // ...same as currently
}

fs.stat(cid, options?: StatOptions): Promise<UnixFSStats>
fs.stat(cid, options: ExtendedStatOptions): Promise<ExtendedUnixFSStats>
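As a sketch of how the two signatures could be told apart in TypeScript, the overloads below select the return type from the `extended` flag. The stats types are cut down to a couple of fields, `CID` is stubbed as a string, the function body returns placeholder values, and the `extended?: false` member on `StatOptions` is an added assumption that keeps the two option types from overlapping.

```typescript
// Hypothetical, cut-down types for illustration only
type CID = string

interface BasicStats {
  type: 'directory'
  cid: CID
}

interface ExtendedStats extends BasicStats {
  complete: boolean
  blocks: bigint
}

interface StatOptions {
  offline?: boolean
  extended?: false // assumption: lets the compiler separate the two overloads
}

interface ExtendedStatOptions {
  offline?: boolean
  extended: true
}

// Overload signatures: the options type decides the resolved return type
function stat (cid: CID, options?: StatOptions): Promise<BasicStats>
function stat (cid: CID, options: ExtendedStatOptions): Promise<ExtendedStats>
async function stat (
  cid: CID,
  options: StatOptions | ExtendedStatOptions = {}
): Promise<BasicStats | ExtendedStats> {
  const basic: BasicStats = { type: 'directory', cid }

  if (options.extended === true) {
    // placeholder values; a real implementation would traverse the DAG here
    return { ...basic, complete: true, blocks: BigInt(1) }
  }

  return basic
}

// resolves to ExtendedStats, so `blocks` is available without a cast
stat('bafy-example', { extended: true }).then((stats) => {
  console.log(stats.blocks)
})
```

With this shape, calling `stat(cid)` with no options resolves to the cheap result, and the expensive traversal has to be requested explicitly.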

achingbrain avatar Sep 02 '24 17:09 achingbrain

For ExtendedUnixFSDirectoryStats, what would be the difference between size and localSize?

If I understood the proposed change, it would involve traversing the DAG to get ExtendedUnixFSDirectoryStats, but would that also involve fetching missing blocks? How would that affect the size and localSize?

2color avatar Mar 17 '25 15:03 2color