Proposal: new .sia format
The upcoming renter overhaul necessitates an update to how we store file metadata. The goals for the new format are:
- High throughput for reading and writing
- Small filesize
- Provide metadata expected of traditional files (size, mode bits, etc.)
- Support filesharing
- Improve renter scalability
- Easy to work with and build tools around
Non-goals:
- Be extremely flexible/generic
- Be human-readable
- Scale past 10TB filesize or 1 million .sia files
As a recap, the current .sia format is a flat file encoded with Sia encoding. It consists of a handful of metadata fields, followed by an array of fileContract objects:
type fileContract struct {
	ID          types.FileContractID
	IP          modules.NetAddress
	Pieces      []pieceData
	WindowStart types.BlockHeight
}
type pieceData struct {
	Chunk      uint64
	Piece      uint64
	MerkleRoot crypto.Hash
}
The format has a number of shortcomings. It uses a contract ID and NetAddress to identify a host and contract, instead of the host's public key. It provides no means for addressing less than a full sector, making partial sector downloads impossible. The entire file must be read into memory in order to manipulate it, and indeed the renter currently reads every .sia file into memory on startup. Lastly, decoding the format requires the use of a non-standard encoding protocol.
I have been thinking about a new format for a while, and I think I have a decent design ready. If nothing else, I hope we can use it as a solid base to build off of or take ideas from.
Specification
The new .sia format is a gzipped tar archive. The name of the .sia file is assumed to be the name of the original file. The first file in the archive is called the Index, and subsequent files are called Blobs. In the tar format, the name of the Index is "index" and the name of each Blob is the host public key that it is associated with, as formatted by the types.SiaPublicKey.String() method. Implementations must preserve these names when extracting and archiving .sia files. The tar header of each file in the archive should be otherwise ignored.
Blobs are binary files that represent raw data. They contain only the information necessary to identify a set of bytes stored on a particular host. Each Blob is a flat array of entries, with each entry uniquely identifying a contiguous slice of sector data. The entry object contains a sector Merkle root, an offset within the sector, and a length. The order of the array is significant. In Go syntax, the definition of a Blob is:
type Blob []BlobEntry

type BlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
}
An Index is a JSON object that references one or more Blobs and imbues their raw data with file semantics.
type Index struct {
	Version   int
	Filesize  int64       // original file size
	Mode      os.FileMode // original mode bits
	ModTime   time.Time   // time of last modification
	MasterKey [32]byte    // seed from which blob encryption keys are derived
	MinShards int         // number of shards required to recover file
	Blobs     []IndexBlob // len(Blobs) is the total number of shards
}
type IndexBlob struct {
	HostKey types.SiaPublicKey
}
Discussion
The flow of working with the new .sia format is as follows: the archive is first extracted to a working directory, wherein each Blob is its own file. These files are then modified as needed. For example, when uploading, siad will repeatedly read a chunk of file data, encode it, and spawn a goroutine for each piece that uploads the sector data and appends a BlobEntry to the host's corresponding Blob file. In our current terminology, the BlobEntry for "piece i, chunk j" is written to the file named Index.Blobs[i].HostKey.String() at offset j * sizeof(BlobEntry). When modifications are complete, the Index and Blobs are re-archived and the working directory is deleted. (When downloading, it is not necessary to extract the archive; it can simply be read into memory. Similarly, it is easy to read just the Index without extracting any files or processing any Blob data.)
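To make the write path concrete, here is a minimal sketch. It assumes a fixed 40-byte little-endian encoding for BlobEntry, which the proposal does not mandate; writeBlobEntry and the package name are illustrative:

package sia

import (
	"bytes"
	"encoding/binary"
	"os"
)

// writeBlobEntry writes the entry for chunk j to a host's Blob file at
// offset j * sizeof(BlobEntry). The 40-byte little-endian layout is an
// assumption made for this sketch.
func writeBlobEntry(blobFile *os.File, chunkIndex uint64, e BlobEntry) error {
	const entrySize = 32 + 4 + 4 // sizeof(BlobEntry)
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, e); err != nil {
		return err
	}
	_, err := blobFile.WriteAt(buf.Bytes(), int64(chunkIndex)*entrySize)
	return err
}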
This design satisfies most of the design goals above. It is simple to work with, and the technologies involved (tar, gzip, JSON) are well-supported in all mainstream languages. Compression brings the filesize close to the ideal. Files can be shared with anyone who has contracts with at least Index.MinShards of the hosts. The Index of a file can be stored in memory while leaving the Blobs on disk, allowing the renter to quickly report the metadata of any file without excessive RAM usage or disk I/O. Performance-wise, the design allows for each Blob file to be written in parallel; and since they are separate files, programmers can freely append to them without worrying about overwriting some other part of the format (as would be the case if the .sia format were a flat file). This also means that the metadata lives on disk rather than in RAM, which reduces memory requirements when operating on large files (or many files concurrently). Also note that it is not strictly necessary to fsync the Blob files, because an uncommitted write merely results in one or more BlobEntrys being lost, i.e. the data referenced by those entries will have to be re-uploaded. (The .sia file, on the other hand, should be fsync'd before the working directory is deleted.)
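As an illustration of how cheap Index-only reads are, here is a minimal sketch that decodes just the Index from a .sia archive, assuming (per the spec) that the Index is the first entry; readIndex and the package name are illustrative:

package sia

import (
	"archive/tar"
	"compress/gzip"
	"encoding/json"
	"errors"
	"io"
)

// readIndex decodes only the Index from a .sia archive, without extracting
// or reading any Blob data.
func readIndex(r io.Reader) (Index, error) {
	var index Index
	gz, err := gzip.NewReader(r)
	if err != nil {
		return index, err
	}
	defer gz.Close()
	tr := tar.NewReader(gz)
	hdr, err := tr.Next()
	if err != nil {
		return index, err
	}
	if hdr.Name != "index" {
		return index, errors.New("first archive entry is not the Index")
	}
	err = json.NewDecoder(tr).Decode(&index)
	return index, err
}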
The Index and Blob types are analogous to Unix's "inode" and "block" data structures. Blobs are intended to be fairly generic; no assumptions should be made about the "shape" of the data they reference. For example, the format does not specify that a blob may only reference a given sector Merkle root once, nor does it specify that BlobEntrys may not reference overlapping slices of a sector. The semantics of a set of Blobs are defined by their Index, which contains enough metadata to satisfy the os.FileInfo interface. (This will make it easier to write things like FUSE layers and similar abstractions.) Since the Index is JSON-encoded and has a Version field, it is easier to extend than the Blob type. However, it is not a completely generic object: the specific encryption and redundancy algorithms are implied by the version number, rather than being explicitly configurable.
Shortcomings and possible improvements
Extracting each .sia file into a working directory may strain the filesystem if too many files are being concurrently modified. The exact number likely varies by filesystem. It also means that an ugly directory is left behind if the process is killed before it can clean up. However, in the best-case scenario, this means that a user can recover the .sia file simply by running tar -cz workdir.
The raw nature of the Blob type makes certain calculations difficult. For example, given an offset in the original file, locating the corresponding offset in a Blob is an O(n) operation: each Length field must be added until the sum reaches the offset. Compare this to the current format, where all pieces are assumed to be the same size; this means that an offset calculation is a simple division by the piece size. Another important calculation is determining the upload progress of a given file. Implemented naively, this would be an O(n) operation as well. We will have to employ a caching strategy to keep such calculations fast.
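A sketch of that linear scan, locating which BlobEntry covers a given byte offset within the data a Blob references (the mapping from original-file offsets to shard offsets introduced by erasure coding is ignored here for simplicity):

package sia

// findEntry walks the Blob, summing Length fields, until the running total
// passes the target offset. It returns the entry index and the offset within
// that entry; ok is false if the offset lies past the end of the Blob.
func findEntry(b Blob, offset uint64) (entryIndex int, entryOffset uint64, ok bool) {
	var sum uint64
	for i, e := range b {
		if offset < sum+uint64(e.Length) {
			return i, offset - sum, true
		}
		sum += uint64(e.Length)
	}
	return 0, 0, false
}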
Some aspects of the file, such as its name and the total number of shards, are encoded implicitly rather than explicitly. Typically it's better to be explicit about such things. The only reason I hesitated with regard to the filename is that it makes it annoying to rename files: if the filename were a field in the Index, changing it would require the full sequence of extracting, decoding, encoding, and archiving, instead of a simple mv command. This reveals a more general deficiency of the format, which is that any modification requires extraction and re-archiving. The same would not be true of a flat file with fixed offsets.
I also considered adding a Checksum field to the Blob type. Right now, we use AEAD, so a separate checksum would be redundant. However, other implementations might not use AEAD, in which case they would require a separate checksum. Since the Blobs are (eventually) gzipped, an unused Checksum field shouldn't increase the final filesize very much. Still, I'd rather avoid adding an unused field unless we feel very strongly that it will pay off in the long run.
EDIT 1: Removed the Filename field from the IndexBlob type. Instead, the host key will be used to identify the path of the blob file after extraction. However, IndexBlob will remain a struct, instead of a bare types.SiaPublicKey. This allows us to add fields later without breaking JSON compatibility.
One question here is how exactly we should convert the types.SiaPublicKey to a filename. It does have a String() method, but the returned string has an algorithm prefix (ed25519:a0b1c2d3...). Do we want the prefix in the filename? Does it matter? : is legal in POSIX filenames, but I'd have to double-check on Windows.
EDIT 2: Clarified how implementations should treat the tar headers of the Index and Blob files. (discussion)
I thought the file was previously organized like f[ChunkID][PieceID]. How can we form a layout like that with this new structure?
@lukechampine
That should still be possible. The chunk index translates to the index of a blob entry. The only difference is that currently, "chunk" always means MinPieces * SectorSize bytes, while in the new format the size is arbitrary.
In the typical case (uploading a single file), you will be filling up the whole sector with one file, so the chunks will have the same size they do now. But the size may be smaller if we start packing multiple files into one sector.
so Blobs are like:
[
  {
    "Filename": "fname",
    "HostKey": "key",
    "BlobEntries": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ]
  }
]
I think each IndexBlob represents the chunks stored by that host, but what about the RS pieces? What is the piece index of each BlobEntry?
The Index has a slice of IndexBlobs. Each IndexBlob corresponds to one of the RS pieces. So the piece index is the index in the IndexBlobs slice. To access chunk i of piece j, you would open the file IndexBlobs[j].Filename and seek to the ith BlobEntry.
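As a rough sketch of that lookup (again assuming a fixed 40-byte little-endian encoding, which the proposal doesn't mandate):

package sia

import (
	"bytes"
	"encoding/binary"
	"os"
)

// readBlobEntry opens the Blob file for piece j and reads the ith entry,
// i.e. the entry for chunk i.
func readBlobEntry(blobPath string, chunkIndex int64) (BlobEntry, error) {
	var e BlobEntry
	const entrySize = 32 + 4 + 4 // sizeof(BlobEntry)
	f, err := os.Open(blobPath)
	if err != nil {
		return e, err
	}
	defer f.Close()
	buf := make([]byte, entrySize)
	if _, err := f.ReadAt(buf, chunkIndex*entrySize); err != nil {
		return e, err
	}
	err = binary.Read(bytes.NewReader(buf), binary.LittleEndian, &e)
	return e, err
}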
so it's like:
[
  {
    "file1": {
      "HostKey": "key",
      "BlobEntries": [
        {
          "MerkleRoot": "root",
          "Offset": 10,
          "Length": 20
        }
      ]
    },
    "file2": {
      "HostKey": "key",
      "BlobEntries": [
        {
          "MerkleRoot": "root",
          "Offset": 10,
          "Length": 20
        }
      ]
    }
  }
]
one host would contain all pieces from one RS index? and one tar.gz file would contain info for multiple files?
@lukechampine
No. Each .sia corresponds to one file.
What is the formal language allowed for Filename string values? I am wondering about allowed characters and any other possible syntax restrictions (such as whether an empty filename is allowed, or the maximum filename size). Is the filename necessarily an ASCII string, or could it contain 8-bit or Unicode characters? Does Sia need to know if two filenames are equal? Assuming it does, are filenames compared in a case-sensitive or case-insensitive way? Are filenames necessarily canonical (minimized) by removing redundant ..?
What is the Filename in IndexBlob for, if one .sia file doesn't hold multiple files?
@rudibs: easiest would be to allow them to be any valid Go string (UTF-8). However, they should strictly be interpreted as filenames, not paths. Escaping should be performed in order to prevent filenames that contain ../ from actually traversing a directory.
@tlightsky: ah, I see why it is confusing. IndexBlob.Filename refers to the path of an unextracted Blob file. It's needed so that the Index knows where to find the actual Blob data. However, it may be simpler to mandate that when you extract a .sia, the Blob files must be named <hostkey>.blob. That way you can just use the existing HostKey field and drop the Filename field. This also removes the need to specify rules around legal Filename strings.
I'm thinking about this section above:
Another important calculation is determining the upload progress of a given file. Implemented naively, this would be an O(n) operation as well. We will have to employ a caching strategy to keep such calculations fast.
One way to do this would be to add a field to the Index object for "extended attributes," as seen in various other formats. Then siad could have a special field called Index.X.UploadPercent (for example), which would be incremented each time a BlobEntry was added. Such a field would make it fast+easy for the renter to identify which files need repairing. Other programs that interact with .sia files can ignore the UploadPercent field, and/or use special fields of their own.
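A sketch of one possible shape for such a field, assuming a map of raw JSON values so that attributes written by other programs survive a decode/encode round trip; the name X and the helper below are hypothetical:

package sia

import "encoding/json"

// Xattrs holds implementation-specific extended attributes. Keys that a
// program doesn't understand are preserved as raw JSON.
type Xattrs map[string]json.RawMessage

// setUploadPercent records a hypothetical Index.X.UploadPercent value
// without disturbing attributes written by other programs.
func setUploadPercent(x Xattrs, pct float64) error {
	raw, err := json.Marshal(pct)
	if err != nil {
		return err
	}
	x["UploadPercent"] = raw
	return nil
}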
Another scenario that we should address explicitly is "migrating" -- dropping one host and forming a contract with another. After this happens, we need to iterate through all the .sia files, identify the Blobs stored on the old host, reupload that data to the new host, and update the .sia file accordingly. This will be a time-consuming process (scaling with the number of files), and I don't think there's much that a format can do to speed it up. But we can at least try to make the migration logic as simple as possible.
@lukechampine we may want to be consistent with POSIX, which does not allow / in filenames and I believe disallows other characters as well. That would mean less than ASCII, and it's not the same as escaping, because there is no notion of "escaping filenames" in POSIX.
Discussion with @rudibs convinced me to remove the Filename field from IndexBlob. He also approved of the decision to omit the original filename from the Index. This could cause issues down the line (e.g. when writing a FUSE layer) because in POSIX, an inode may have multiple filenames due to soft links.
@rudibs and I also discussed at length whether to move some of the file metadata into the existing metadata fields supported by the tar format. tar allows you to specify a number of metadata fields on each file in the archive. It would seem, then, that some of the fields in the Index object are redundant. We could move Mode, ModTime, and (maybe) Filesize to the tar header directly, and either keep the rest of the fields in Index or somehow encode them in the header's Xattrs field.
There's a problem with this approach though: in the tar format, each header has an associated file in the archive. Which file should be associated with this special header? The Index, perhaps? But then if you naively extracted the .sia archive of an executable file, the extracted index file would also be executable. Alternatively, the header could be paired with a 0-length archive entry, but this might make the format more difficult for third parties to work with.
My current position is that the redundancy of the tar metadata fields is acceptable. Those fields should be interpreted as applying to the files inside the archive, not the actual file stored on the Sia network. If we repurpose them for something else, we may break other developers' assumptions. And if we need to keep the Index file anyway, I don't see much benefit.
Still, I think there are a few takeaways here: firstly, the specification should be more clear about how to handle the tar metadata of the Index and the Blobs. Secondly, we should strive to ensure that the Index contains everything expected of a POSIX file; part of @rudibs' motivation for reusing the tar header is that it is already known to be POSIX-compliant. Finally, we should examine the equivalent requirements for Windows files to make sure that we aren't lacking any expected fields. The Dokan project may provide us with some guidance here.
Another thought: should the order and/or names of files in the tar archive be specified? We could impose a strict ordering, such as: the Index must always come first, and the order of the Blobs must match the order of Index.Blobs. This would mean that the name of each tar entry could be ignored, which would likely be a Good Thing (since these names could potentially be a point of attack if someone constructed a malicious .sia file). However, doing so would also make it difficult for a third-party archiving program to create a valid .sia file, since the program might order the files differently.
My gut feeling is that the order should be unspecified, and the names of the files should be used to identify the Index and Blobs. The downside of this is that it could hurt performance when processing many .sia files: if the Index is at the end of the tar archive, then implementations would have to seek through a bunch of Blob data before reaching it. But in practice, most .sia files will be created by our software, which will put the Index first, and we can recommend that other implementations do the same.
POSIX has 3 timestamps technically. https://www.unixtutorial.org/2008/04/atime-ctime-mtime-in-unix-filesystems/
a general and safe way to convert an arbitrary string to a filename is by hashing it and using the resulting hex representation, maybe with a .dat extension or something appended.
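A sketch of that approach, using SHA-256 and hex encoding (the .dat extension and the function name are just examples):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// safeFilename derives a filesystem-safe name from an arbitrary string.
func safeFilename(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:]) + ".dat"
}

func main() {
	// e.g. a host key string containing ':' becomes a plain hex name.
	fmt.Println(safeFilename("ed25519:a0b1c2d3"))
}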
"B-Tree" can be used to speed up range type searches for partial reads so you don't have to linearly scan through all blocks. https://en.wikipedia.org/wiki/B-tree "Range-Tree" or K-Tree (aka K-D tree) can be used to generalize to 2 dimension to search e.g. by filename and offset quickly. The B-Tree is particularly well suited to on-disk databases and is therefore among the most popular for tons including Bolt (B+Tree) and BerkeleyDB. There are many Go B-Tree implementations now.
[
  {
    "HostKey1": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ],
    "HostKey2": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ]
  },
  {
    "HostKey3": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ],
    "HostKey4": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ]
  }
]
@lukechampine ok, I think I understand the new structure now
@lukechampine does that mean there will be many "empty index" entries in BlobEntries? Would that be a problem?
I'm not completely sure what you mean by "empty index", but it may be the case that a BlobEntry will be zeroed out to begin with, and filled in later.
For example, if we are uploading in random order, then first we would want to allocate a Blob file that is large enough to hold all the entries; next we would write them to disk as each upload operation finished. If you paused this process in the middle, you would have some BlobEntrys that are zero (empty), and others that are not. If you wanted to resume the upload, you would need to figure out which entries were uploaded and which weren't. I think it is sufficient to define that if BlobEntry.Length == 0, the BlobEntry was not uploaded yet.
However, this scheme assumes that all BlobEntrys are the same size! (Otherwise, you wouldn't know how many to allocate for a given filesize.) This is indeed likely to be the case when performing a single upload of a large file. But if we start doing fancier things later (like having two files share a sector), this assumption may break.
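A sketch of that resume check, treating Length == 0 as "not yet uploaded":

package sia

// missingChunks returns the chunk indices whose entries have not been
// uploaded to this host yet (Length == 0).
func missingChunks(b Blob) []int {
	var missing []int
	for i, e := range b {
		if e.Length == 0 {
			missing = append(missing, i)
		}
	}
	return missing
}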
@lukechampine I mean,
- a BlobEntry array is just an array under one host
- we don't store all RS piece j in one host
- we use BlobEntry[i] to represent chunk i

so there must be many empty BlobEntry[i], like:
{
  "HostKey1": [
    {
      "MerkleRoot": "root1",
      "Offset": 0,
      "Length": 20
    },
    {},
    {
      "MerkleRoot": "root3",
      "Offset": 0,
      "Length": 20
    }
  ],
  "HostKey2": [
    {},
    {
      "MerkleRoot": "root2",
      "Offset": 0,
      "Length": 20
    },
    {}
  ]
}
The process is:
- Read n bytes from the original file. The maximum is SectorSize * index.MinPieces bytes. This is one chunk.
- Run erasureCode(chunk) to get a [][]byte. Each of these is called a "piece" or a "shard," and each goes to a separate host. The length of each []byte is n / index.MinPieces.
- Upload each shard to its corresponding host.
- Create a BlobEntry for each shard that uniquely identifies the shard data on the host.
- Write the BlobEntry to its corresponding Blob file.
- Repeat until the file has been fully read.
So each Blob has a BlobEntry for every chunk.
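A rough sketch of that loop follows; the sector size constant and the erasureCode/uploadShard callbacks are placeholders, and writeBlobEntry is the helper sketched earlier, so only the overall shape follows the proposal:

package sia

import (
	"io"
	"os"
)

const sectorSize = 1 << 22 // assumed sector size, for illustration only

// uploadFile reads the source file one chunk at a time, erasure-codes each
// chunk into shards, uploads each shard to its host, and writes the
// resulting BlobEntry to that host's Blob file.
func uploadFile(
	src io.Reader,
	index Index,
	blobFiles []*os.File,
	erasureCode func(chunk []byte) [][]byte,
	uploadShard func(host IndexBlob, shard []byte) (BlobEntry, error),
) error {
	buf := make([]byte, sectorSize*index.MinShards) // one chunk of original data
	for chunk := uint64(0); ; chunk++ {
		n, err := io.ReadFull(src, buf)
		if err == io.EOF {
			return nil // file fully read
		}
		if err != nil && err != io.ErrUnexpectedEOF {
			return err
		}
		for i, shard := range erasureCode(buf[:n]) {
			e, err := uploadShard(index.Blobs[i], shard)
			if err != nil {
				return err
			}
			if err := writeBlobEntry(blobFiles[i], chunk, e); err != nil {
				return err
			}
		}
	}
}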
IndexBlobs[j] means shard j, and BlobEntry[i] means chunk i. A host may not have every chunk i, so there might be some empty BlobEntry[i].
a host may not have every chunk i

I think this is the source of the confusion, though I'm not quite sure what you mean. Host j will store shard j for every chunk. It's possible for a BlobEntry to be empty if the file is not fully uploaded, but once the upload is complete, none of the BlobEntrys will be empty.
Are we talking about a future upload process? Because the worker (host/contract) selection seems quite random now.
Ah, I understand now. You are right, in the current codebase, hosts are assigned shards at random. When I wrote the first iteration of the renter, host j always stored shard j, so the format was designed with that assumption.
I will try to summarize the pros and cons of each approach:
Strict assignment
Pros:
- Conceptually simple; model is easier to visualize and code is easier to reason about
- Shard index is implicit, so it does not need to be stored in the format or in memory
- Each file is stored on exactly len(shards) hosts, so each host is responsible for storing all chunks and thus the chunk index can be implicit as well
- Some upload/download optimizations may only be possible if strict assignment is assumed
Cons:
- Constrains the design of upload/download algorithms
- Files can't be stored across more than len(shards) hosts, which may impact host replacement strategies
- If we switch to arbitrary assignment later, we need to convert existing .sia files to add the shard index
Arbitrary assignment
Pros:
- More flexible with respect to different upload/download algorithms
- Files can be stored across more than len(shards) hosts
- If we switch back to strict assignment later, we can set all shard indices in the format to 0 and they will be compressed away
Cons:
- Format must store a ShardIndex in each BlobEntry, and possibly a ChunkIndex as well
- Determining which host is storing a given shard is O(n)
- If files are stored across more than len(shards) hosts, then determining which hosts are storing a given chunk is O(n)
- If we switch back to strict assignment later, we need to keep all the arbitrary assignment code for compatibility with existing .sia files
I will add more as I think of them.
Broadly speaking, the performance difference seems like a wash; it's difficult to know in advance which approach will yield the fastest upload/download performance. Format-wise, strict assignment consumes less space, but arbitrary assignment makes compatibility easier if we switch to strict assignment later. When replacing hosts, arbitrary assignment appears more flexible, but the full implications of this are unclear. Lastly, both approaches support file packing (storing chunks from multiple files in one host sector).
I also want to make one point clear in case it isn't already, which is that in strict assignment, the host->shard mapping may still vary from file to file. That is, host j does not necessarily hold shard j for every file being stored by the renter.
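For concreteness, here is a sketch of what an entry might look like under arbitrary assignment; the ShardIndex and ChunkIndex fields are hypothetical and not part of the proposal:

// Sketch only: a possible entry shape if arbitrary assignment is adopted.
type ArbitraryBlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
	ShardIndex uint8  // which RS shard this entry holds
	ChunkIndex uint32 // which chunk of the file it belongs to
}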
We've talked about this some offline, and there is really just one major conclusion: data for the same chunk needs to be nested into the same part of the file, which means the format for the file data / chunk data needs to be adjusted so that all the chunks come in a row. This probably means having some part of the header dedicated to listing the different hosts, ip addresses, and pubkeys, and then each chunk can reference which host it's using with a 1 byte number. At least then it's still fairly well compressed.
@rudibs we would want the data in this file to be more or less POSIX-compliant, though it's not a filesystem; it's just a list of files, like .tar. But it should have all the necessary metadata.
Another assumption that may have been in place when creating this proposal was that this file format would be modified in real time as pieces and chunks get uploaded or modified. With the new database upgrade for the filesystem, this is no longer going to be true. These files are to be extracted from the database.
The photo above shows a chunk layout: 8 bytes for EC data (what type of code, how many pieces, etc.). The next 8 bytes point to an offset in an extension file that contains extra pieces; it's set to 0 if there are no extra pieces, which will be the case for most chunks. Then what follows is a list of pieces, each of which has 4 bytes for a crypto key nonce, 2 bytes that identify the file contract (a lookup table is kept separately to find the full file contract ID), and finally 10 bytes which are the first 10 bytes of the Merkle root that can be used to query the host.
We will need some protocol changes to get this far, so initially we will need 32 bytes for the full root.
Because each chunk is constant size, you can know where in the file to find a chunk in constant time, without keeping a lookup table. For 50 pieces, each chunk is about 2kb on disk for 40MB of data stored in Sia. After we implement partial root lookup, it's less than 1kb per chunk.
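A sketch of that layout as Go structs; the field widths follow the description above, while the names are hypothetical:

// chunkHeader is the fixed 16-byte prefix of each chunk record.
type chunkHeader struct {
	ErasureCode [8]byte // code type, number of pieces, etc.
	ExtOffset   uint64  // offset of extra pieces in the extension file; 0 if none
}

// piece is one entry in the list of pieces that follows the header.
type piece struct {
	KeyNonce      [4]byte  // crypto key nonce
	ContractIndex uint16   // index into a separate file-contract lookup table
	RootPrefix    [10]byte // first 10 bytes of the sector Merkle root
	// Until the protocol supports partial-root lookups, the full 32-byte
	// root is stored here instead.
}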
In the discussion above you mention "Files can be shared with anyone who has contracts with at least Index.MinShards of the hosts." What if the instance importing a shared .sia file is missing contracts? Will the renter automatically form new contracts through an already-implemented mechanism? Or is this addressed in a separate issue (I didn't find anything)?
Also, for discussion: if Sia gets to the point where a renter can restrict host selection (e.g. by region), is there already a plan for handling such a scenario? Say the sharing instance uses worldwide host selection but the receiving instance only allows hosts from a specific region. This might break the "Index.MinShards available" prerequisite, because the renter won't be able to form contracts with the needed hosts due to the region restriction (of course this might apply to other host selection criteria as well, like pricing, availability, ...).
@chrsch good questions. If the recipient doesn't have contracts with at least MinShards hosts storing the file, they won't be able to download it. To rectify this, either the recipient must form new contracts, or the sender must replicate shards of the file to additional hosts.
Forming new contracts is expensive, so renters will avoid doing so if they can. I could see this leading to some degree of centralization: in order to minimize the number of new contracts that need to be formed, the most rational behavior for renters is to pick some subset of hosts (>= MinShards) that almost every renter has contracts with. The region selection you mentioned impacts this as well: it means that this subset can't be limited to a single region.
One way to combat this is a sort of "Lightning Network" for hosts, allowing you to pay hosts even if you don't have contracts with them. In the simplest example, you have a contract with host A, but the file is only stored on host B. However, host A has a payment channel with host B. So instead of paying B directly, you pay A, who then pays B on your behalf. In the general case, host A does not need to have a direct payment channel with host B; it just needs a way to route payments to host B, where the route potentially involves many hops.
This is all very speculative, but the basic principle of using hosts as hubs is sound: hosts by definition have properties that make them good hubs, like high availability and bandwidth.
One more thing: it's quite possible that "true" filesharing on Sia turns out to be uncommon in practice. Instead, a middleman could implement a traditional filesharing platform on top of Sia (e.g. sia.pixeldrain.com) and handle payments out-of-band. In this scenario, direct P2P filesharing would still exist, but would likely only be used by enthusiasts or people with special circumstances.