Proposal: new .sia format
The upcoming renter overhaul necessitates an update to how we store file metadata. The goals for the new format are:
- High throughput for reading and writing
- Small filesize
- Provide metadata expected of traditional files (size, mode bits, etc.)
- Support filesharing
- Improve renter scalability
- Easy to work with and build tools around
Non-goals:
- Be extremely flexible/generic
- Be human-readable
- Scale past 10TB filesize or 1 million .sia files
As a recap, the current .sia format is a flat file encoded with Sia encoding. It consists of a handful of metadata fields, followed by an array of fileContract objects:
type fileContract struct {
	ID          types.FileContractID
	IP          modules.NetAddress
	Pieces      []pieceData
	WindowStart types.BlockHeight
}
type pieceData struct {
	Chunk      uint64
	Piece      uint64
	MerkleRoot crypto.Hash
}
The format has a number of shortcomings. It uses a contract ID and NetAddress to identify a host and contract, instead of the host's public key. It provides no means for addressing less than a full sector, making partial sector downloads impossible. The entire file must be read into memory in order to manipulate it, and indeed the renter currently reads every .sia file into memory on startup. Lastly, decoding the format requires the use of a non-standard encoding protocol.
I have been thinking about a new format for a while, and I think I have a decent design ready. If nothing else, I hope we can use it as a solid base to build off of or take ideas from.
Specification
The new .sia format is a gzipped tar archive. The name of the .sia file is assumed to be the name of the original file. The first file in the archive is called the Index, and subsequent files are called Blobs. In the tar format, the name of the Index is "index" and the name of each Blob is the host public key that it is associated with, as formatted by the types.SiaPublicKey.String() method. Implementations must preserve these names when extracting and archiving .sia files. The tar header of each file in the archive should be otherwise ignored.
Blobs are binary files that represent raw data. They contain only the information necessary to identify a set of bytes stored on a particular host. Each Blob is a flat array of entries, with each entry uniquely identifying a contiguous slice of sector data. The entry object contains a sector Merkle root, an offset within the sector, and a length. The order of the array is significant. In Go syntax, the definition of a Blob is:
type Blob []BlobEntry

type BlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
}
An Index is a JSON object that references one or more Blobs and imbues their raw data with file semantics.
type Index struct {
	Version   int
	Filesize  int64       // original file size
	Mode      os.FileMode // original mode bits
	ModTime   time.Time   // time of last modification
	MasterKey [32]byte    // seed from which blob encryption keys are derived
	MinShards int         // number of shards required to recover file
	Blobs     []IndexBlob // len(Blobs) is the total number of shards
}
type IndexBlob struct {
	HostKey types.SiaPublicKey
}
Discussion
The flow of working with the new .sia format is as follows: the archive is first extracted to a working directory, wherein each Blob is its own file. These files are then modified as needed. For example, when uploading, siad will repeatedly read a chunk of file data, encode it, and spawn a goroutine for each piece that uploads the sector data and appends a BlobEntry to the host's corresponding Blob file. In our current terminology, the BlobEntry for "piece i, chunk j" is written to the file named Index.Blobs[i].HostKey.String() at offset j * sizeof(BlobEntry). When modifications are complete, the Index and Blobs are re-archived and the working directory is deleted. (When downloading, it is not necessary to extract the archive; it can simply be read into memory. Similarly, it is easy to read just the Index without extracting any files or processing any Blob data.)
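To make the write path concrete, here is a minimal sketch. It assumes a fixed 40-byte little-endian encoding for BlobEntry, which the proposal does not mandate; writeBlobEntry and the package name are illustrative:

package sia

import (
	"bytes"
	"encoding/binary"
	"os"
)

// writeBlobEntry writes the entry for chunk j to a host's Blob file at
// offset j * sizeof(BlobEntry). The 40-byte little-endian layout is an
// assumption made for this sketch.
func writeBlobEntry(blobFile *os.File, chunkIndex uint64, e BlobEntry) error {
	const entrySize = 32 + 4 + 4 // sizeof(BlobEntry)
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, e); err != nil {
		return err
	}
	_, err := blobFile.WriteAt(buf.Bytes(), int64(chunkIndex)*entrySize)
	return err
}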
This design satisfies most of the design goals above. It is simple to work with, and the technologies involved (tar, gzip, JSON) are well-supported in all mainstream languages. Compression brings the filesize close to the ideal. Files can be shared with anyone who has contracts with at least Index.MinShards of the hosts. The Index of a file can be stored in memory while leaving the Blobs on disk, allowing the renter to quickly report the metadata of any file without excessive RAM usage or disk I/O. Performance-wise, the design allows for each Blob file to be written in parallel; and since they are separate files, programmers can freely append to them without worrying about overwriting some other part of the format (as would be the case if the .sia format were a flat file). This also means that the metadata lives on disk rather than in RAM, which reduces memory requirements when operating on large files (or many files concurrently). Also note that it is not strictly necessary to fsync the Blob files, because an uncommitted write merely results in one or more BlobEntrys being lost, i.e. the data referenced by those entries will have to be re-uploaded. (The .sia file, on the other hand, should be fsync'd before the working directory is deleted.)
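As an illustration of how cheap Index-only reads are, here is a minimal sketch that decodes just the Index from a .sia archive, assuming (per the spec) that the Index is the first entry; readIndex and the package name are illustrative:

package sia

import (
	"archive/tar"
	"compress/gzip"
	"encoding/json"
	"errors"
	"io"
)

// readIndex decodes only the Index from a .sia archive, without extracting
// or reading any Blob data.
func readIndex(r io.Reader) (Index, error) {
	var index Index
	gz, err := gzip.NewReader(r)
	if err != nil {
		return index, err
	}
	defer gz.Close()
	tr := tar.NewReader(gz)
	hdr, err := tr.Next()
	if err != nil {
		return index, err
	}
	if hdr.Name != "index" {
		return index, errors.New("first archive entry is not the Index")
	}
	err = json.NewDecoder(tr).Decode(&index)
	return index, err
}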
The Index and Blob types are analogous to Unix's "inode" and "block" data structures. Blobs are intended to be fairly generic; no assumptions should be made about the "shape" of the data they reference. For example, the format does not specify that a blob may only reference a given sector Merkle root once, nor does it specify that BlobEntrys may not reference overlapping slices of a sector. The semantics of a set of Blobs are defined by their Index, which contains enough metadata to satisfy the os.FileInfo interface. (This will make it easier to write things like FUSE layers and similar abstractions.) Since the Index is JSON-encoded and has a Version field, it is easier to extend than the Blob type. However, it is not a completely generic object: the specific encryption and redundancy algorithms are implied by the version number, rather than being explicitly configurable.
Shortcomings and possible improvements
Extracting each .sia file into a working directory may strain the filesystem if too many files are being concurrently modified. The exact number likely varies by filesystem. It also means that an ugly directory is left behind if the process is killed before it can clean up. However, in the best-case scenario, this means that a user can recover the .sia file simply by running tar -cz workdir.
The raw nature of the Blob type makes certain calculations difficult. For example, given an offset in the original file, locating the corresponding offset in a Blob is an O(n) operation: each Length field must be added until the sum reaches the offset. Compare this to the current format, where all pieces are assumed to be the same size; this means that an offset calculation is a simple division by the piece size. Another important calculation is determining the upload progress of a given file. Implemented naively, this would be an O(n) operation as well. We will have to employ a caching strategy to keep such calculations fast.
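A sketch of that linear scan, locating which BlobEntry covers a given byte offset within the data a Blob references (the mapping from original-file offsets to shard offsets introduced by erasure coding is ignored here for simplicity):

package sia

// findEntry walks the Blob, summing Length fields, until the running total
// passes the target offset. It returns the entry index and the offset within
// that entry; ok is false if the offset lies past the end of the Blob.
func findEntry(b Blob, offset uint64) (entryIndex int, entryOffset uint64, ok bool) {
	var sum uint64
	for i, e := range b {
		if offset < sum+uint64(e.Length) {
			return i, offset - sum, true
		}
		sum += uint64(e.Length)
	}
	return 0, 0, false
}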
Some aspects of the file, such as its name and the total number of shards, are encoded implicitly rather than explicitly. Typically it's better to be explicit about such things. The only reason I hesitated with regard to the filename is that it makes it annoying to rename files: if the filename were a field in the Index, changing it would require the full sequence of extracting, decoding, encoding, and archiving, instead of a simple mv command. This reveals a more general deficiency of the format, which is that any modification requires extraction and re-archiving. The same would not be true of a flat file with fixed offsets.
I also considered adding a Checksum field to the Blob type. Right now, we use AEAD, so a separate checksum would be redundant. However, other implementations might not use AEAD, in which case they would require a separate checksum. Since the Blobs are (eventually) gzipped, an unused Checksum field shouldn't increase the final filesize very much. Still, I'd rather avoid adding an unused field unless we feel very strongly that it will pay off in the long run.
EDIT 1: Removed the Filename field from the IndexBlob type. Instead, the host key will be used to identify the path of the blob file after extraction. However, IndexBlob will remain a struct, instead of a bare types.SiaPublicKey. This allows us to add fields later without breaking JSON compatibility.
One question here is how exactly we should convert the types.SiaPublicKey to a filename. It does have a String() method, but the returned string has an algorithm prefix (ed25519:a0b1c2d3...). Do we want the prefix in the filename? Does it matter? : is legal in POSIX filenames, but I'd have to double-check on Windows.
EDIT 2: Clarified how implementations should treat the tar headers of the Index and Blob files. (discussion)
I thought the file was previously organized like f[ChunkID][PieceID]. How can we form a layout like that with this new structure?
@lukechampine
That should still be possible. The chunk index translates to the index of a blob entry. The only difference is that currently, "chunk" always means MinPieces * SectorSize bytes, while in the new format the size is arbitrary.
In the typical case (uploading a single file), you will be filling up the whole sector with one file, so the chunks will have the same size they do now. But the size may be smaller if we start packing multiple files into one sector.
so Blobs are like:
[
  {
    "Filename": "fname",
    "HostKey": "key",
    "BlobEntries": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ]
  }
]
I think each IndexBlob represents the chunks stored by that host, but what about the RS pieces? What is the piece index of each BlobEntry?
The Index has a slice of IndexBlobs. Each IndexBlob corresponds to one of the RS pieces. So the piece index is the index in the IndexBlobs slice. To access chunk i of piece j, you would open the file IndexBlobs[j].Filename and seek to the ith BlobEntry.
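As a rough sketch of that lookup (again assuming a fixed 40-byte little-endian encoding, which the proposal doesn't mandate):

package sia

import (
	"bytes"
	"encoding/binary"
	"os"
)

// readBlobEntry opens the Blob file for piece j and reads the ith entry,
// i.e. the entry for chunk i.
func readBlobEntry(blobPath string, chunkIndex int64) (BlobEntry, error) {
	var e BlobEntry
	const entrySize = 32 + 4 + 4 // sizeof(BlobEntry)
	f, err := os.Open(blobPath)
	if err != nil {
		return e, err
	}
	defer f.Close()
	buf := make([]byte, entrySize)
	if _, err := f.ReadAt(buf, chunkIndex*entrySize); err != nil {
		return e, err
	}
	err = binary.Read(bytes.NewReader(buf), binary.LittleEndian, &e)
	return e, err
}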
so it's like:
[
  {
    "file1": {
      "HostKey": "key",
      "BlobEntries": [
        {
          "MerkleRoot": "root",
          "Offset": 10,
          "Length": 20
        }
      ]
    },
    "file2": {
      "HostKey": "key",
      "BlobEntries": [
        {
          "MerkleRoot": "root",
          "Offset": 10,
          "Length": 20
        }
      ]
    }
  }
]
one host would contain all pieces from one RS index? and one tar.gz file would contain info for multiple files?
@lukechampine
No. Each .sia corresponds to one file.
What is the formal language allowed for Filename string values? I am wondering about allowed characters and any other possible syntax restrictions (such as whether an empty filename is allowed, or the maximum filename size). Is the filename necessarily an ASCII string, or could it contain 8-bit or Unicode characters? Does Sia need to know if two filenames are equal? Assuming it does, are filenames compared in a case-sensitive or case-insensitive way? Are filenames necessarily canonical (minimized) by removing redundant ..?
What is the Filename in IndexBlob for, if one .sia file doesn't hold multiple files?
@rudibs: easiest would be to allow them to be any valid Go string (UTF-8). However, they should strictly be interpreted as filenames, not paths. Escaping should be performed in order to prevent filenames that contain ../ from actually traversing a directory.
@tlightsky: ah, I see why it is confusing. IndexBlob.Filename refers to the path of an unextracted Blob file. It's needed so that the Index knows where to find the actual Blob data. However, it may be simpler to mandate that when you extract a .sia, the Blob files must be named <hostkey>.blob. That way you can just use the existing HostKey field and drop the Filename field. This also removes the need to specify rules around legal Filename strings.
I'm thinking about this section above:
Another important calculation is determining the upload progress of a given file. Implemented naively, this would be an O(n) operation as well. We will have to employ a caching strategy to keep such calculations fast.
One way to do this would be to add a field to the Index object for "extended attributes," as seen in various other formats. Then siad could have a special field called Index.X.UploadPercent (for example), which would be incremented each time a BlobEntry was added. Such a field would make it fast+easy for the renter to identify which files need repairing. Other programs that interact with .sia files can ignore the UploadPercent field, and/or use special fields of their own.
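A sketch of one possible shape for such a field, assuming a map of raw JSON values so that attributes written by other programs survive a decode/encode round trip; the name X and the helper below are hypothetical:

package sia

import "encoding/json"

// Xattrs holds implementation-specific extended attributes. Keys that a
// program doesn't understand are preserved as raw JSON.
type Xattrs map[string]json.RawMessage

// setUploadPercent records a hypothetical Index.X.UploadPercent value
// without disturbing attributes written by other programs.
func setUploadPercent(x Xattrs, pct float64) error {
	raw, err := json.Marshal(pct)
	if err != nil {
		return err
	}
	x["UploadPercent"] = raw
	return nil
}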
Another scenario that we should address explicitly is "migrating" -- dropping one host and forming a contract with another. After this happens, we need to iterate through all the .sia files, identify the Blobs stored on the old host, reupload that data to the new host, and update the .sia file accordingly. This will be a time-consuming process (scaling with the number of files), and I don't think there's much that a format can do to speed it up. But we can at least try to make the migration logic as simple as possible.
@lukechampine we may want to be consistent with POSIX, which does not allow / in filenames and I believe disallows other characters as well. That would mean less than ASCII, and it's not the same as escaping, because there is no notion of "escaping filenames" in POSIX.
Discussion with @rudibs convinced me to remove the Filename field from IndexBlob. He also approved of the decision to omit the original filename from the Index. This could cause issues down the line (e.g. when writing a FUSE layer) because in POSIX, an inode may have multiple filenames due to soft links.
@rudibs and I also discussed at length whether to move some of the file metadata into the existing metadata fields supported by the tar format. tar allows you to specify a number of metadata fields on each file in the archive. It would seem, then, that some of the fields in the Index object are redundant. We could move Mode, ModTime, and (maybe) Filesize to the tar header directly, and either keep the rest of the fields in Index or somehow encode them in the header's Xattrs field.
There's a problem with this approach though: in the tar format, each header has an associated file in the archive. Which file should be associated with this special header? The Index, perhaps? But then if you naively extracted the .sia archive of an executable file, the extracted index file would also be executable. Alternatively, the header could be paired with a 0-length archive entry, but this might make the format more difficult for third parties to work with.
My current position is that the redundancy of the tar metadata fields is acceptable. Those fields should be interpreted as applying to the files inside the archive, not the actual file stored on the Sia network. If we repurpose them for something else, we may break other developers' assumptions. And if we need to keep the Index file anyway, I don't see much benefit.
Still, I think there are a few takeaways here: firstly, the specification should be more clear about how to handle the tar metadata of the Index and the Blobs. Secondly, we should strive to ensure that the Index contains everything expected of a POSIX file; part of @rudibs' motivation for reusing the tar header is that it is already known to be POSIX-compliant. Finally, we should examine the equivalent requirements for Windows files to make sure that we aren't lacking any expected fields. The Dokan project may provide us with some guidance here.
Another thought: should the order and/or names of files in the tar archive be specified? We could impose a strict ordering, such as: the Index must always come first, and the order of the Blobs must match the order of Index.Blobs. This would mean that the name of each tar entry could be ignored, which would likely be a Good Thing (since these names could potentially be a point of attack if someone constructed a malicious .sia file). However, doing so would also make it difficult for a third-party archiving program to create a valid .sia file, since the program might order the files differently.
My gut feeling is that the order should be unspecified, and the names of the files should be used to identify the Index and Blobs. The downside of this is that it could hurt performance when processing many .sia files: if the Index is at the end of the tar archive, then implementations would have to seek through a bunch of Blob data before reaching it. But in practice, most .sia files will be created by our software, which will put the Index first, and we can recommend that other implementations do the same.
POSIX has 3 timestamps technically. https://www.unixtutorial.org/2008/04/atime-ctime-mtime-in-unix-filesystems/
a general and safe way to convert an arbitrary string to a filename is by hashing it and using the resulting hex representation, maybe with a .dat extension or something appended.
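A sketch of that approach, using SHA-256 and hex encoding (the .dat extension and the function name are just examples):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// safeFilename derives a filesystem-safe name from an arbitrary string.
func safeFilename(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:]) + ".dat"
}

func main() {
	// e.g. a host key string containing ':' becomes a plain hex name.
	fmt.Println(safeFilename("ed25519:a0b1c2d3"))
}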
"B-Tree" can be used to speed up range type searches for partial reads so you don't have to linearly scan through all blocks. https://en.wikipedia.org/wiki/B-tree "Range-Tree" or K-Tree (aka K-D tree) can be used to generalize to 2 dimension to search e.g. by filename and offset quickly. The B-Tree is particularly well suited to on-disk databases and is therefore among the most popular for tons including Bolt (B+Tree) and BerkeleyDB. There are many Go B-Tree implementations now.
[
  {
    "HostKey1": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ],
    "HostKey2": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ]
  },
  {
    "HostKey3": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ],
    "HostKey4": [
      {
        "MerkleRoot": "root",
        "Offset": 10,
        "Length": 20
      }
    ]
  }
]
@lukechampine ok, I think I understand the new structure now
@lukechampine does that mean there will be many "empty index" entries in BlobEntries? Would that be a problem?
I'm not completely sure what you mean by "empty index", but it may be the case that a BlobEntry will be zeroed out to begin with, and filled in later.
For example, if we are uploading in random order, then first we would want to allocate a Blob file that is large enough to hold all the entries; next we would write them to disk as each upload operation finished. If you paused this process in the middle, you would have some BlobEntrys that are zero (empty), and others that are not. If you wanted to resume the upload, you would need to figure out which entries were uploaded and which weren't. I think it is sufficient to define that if BlobEntry.Length == 0, the BlobEntry was not uploaded yet.
However, this scheme assumes that all BlobEntrys are the same size! (Otherwise, you wouldn't know how many to allocate for a given filesize.) This is indeed likely to be the case when performing a single upload of a large file. But if we start doing fancier things later (like having two files share a sector), this assumption may break.
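A sketch of that resume check, treating Length == 0 as "not yet uploaded":

package sia

// missingChunks returns the chunk indices whose entries have not been
// uploaded to this host yet (Length == 0).
func missingChunks(b Blob) []int {
	var missing []int
	for i, e := range b {
		if e.Length == 0 {
			missing = append(missing, i)
		}
	}
	return missing
}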
@lukechampine I mean,
- a BlobEntry array is just an array under one host
- we don't store all RS piece j in one host
- we use BlobEntry[i] to represent chunk i

so there must be many empty BlobEntry[i], like:
{
  "HostKey1": [
    {
      "MerkleRoot": "root1",
      "Offset": 0,
      "Length": 20
    },
    {},
    {
      "MerkleRoot": "root3",
      "Offset": 0,
      "Length": 20
    }
  ],
  "HostKey2": [
    {},
    {
      "MerkleRoot": "root2",
      "Offset": 0,
      "Length": 20
    },
    {}
  ]
}
The process is:
- Read n bytes from the original file. The maximum is SectorSize * index.MinPieces bytes. This is one chunk.
- Run erasureCode(chunk) to get a [][]byte. Each of these is called a "piece" or a "shard," and each goes to a separate host. The length of each []byte is n / index.MinPieces.
- Upload each shard to its corresponding host.
- Create a BlobEntry for each shard that uniquely identifies the shard data on the host.
- Write the BlobEntry to its corresponding Blob file.
- Repeat until the file has been fully read.
So each Blob has a BlobEntry for every chunk.
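A rough sketch of that loop follows; the sector size constant and the erasureCode/uploadShard callbacks are placeholders, and writeBlobEntry is the helper sketched earlier, so only the overall shape follows the proposal:

package sia

import (
	"io"
	"os"
)

const sectorSize = 1 << 22 // assumed sector size, for illustration only

// uploadFile reads the source file one chunk at a time, erasure-codes each
// chunk into shards, uploads each shard to its host, and writes the
// resulting BlobEntry to that host's Blob file.
func uploadFile(
	src io.Reader,
	index Index,
	blobFiles []*os.File,
	erasureCode func(chunk []byte) [][]byte,
	uploadShard func(host IndexBlob, shard []byte) (BlobEntry, error),
) error {
	buf := make([]byte, sectorSize*index.MinShards) // one chunk of original data
	for chunk := uint64(0); ; chunk++ {
		n, err := io.ReadFull(src, buf)
		if err == io.EOF {
			return nil // file fully read
		}
		if err != nil && err != io.ErrUnexpectedEOF {
			return err
		}
		for i, shard := range erasureCode(buf[:n]) {
			e, err := uploadShard(index.Blobs[i], shard)
			if err != nil {
				return err
			}
			if err := writeBlobEntry(blobFiles[i], chunk, e); err != nil {
				return err
			}
		}
	}
}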
IndexBlobs[j] means shard j, and BlobEntry[i] means chunk i. A host may not have every chunk i, so there might be some empty BlobEntry[i].
a host may not have every chunk i

I think this is the source of the confusion, though I'm not quite sure what you mean. Host j will store shard j for every chunk. It's possible for a BlobEntry to be empty if the file is not fully uploaded, but once the upload is complete, none of the BlobEntrys will be empty.
Are we talking about a future upload process? Because the worker (host/contract) selection seems quite random now.
Ah, I understand now. You are right, in the current codebase, hosts are assigned shards at random. When I wrote the first iteration of the renter, host j always stored shard j, so the format was designed with that assumption.
I will try to summarize the pros and cons of each approach:
Strict assignment
Pros:
- Conceptually simple; model is easier to visualize and code is easier to reason about
- Shard index is implicit, so it does not need to be stored in the format or in memory
- Each file is stored on exactly len(shards) hosts, so each host is responsible for storing all chunks and thus the chunk index can be implicit as well
- Some upload/download optimizations may only be possible if strict assignment is assumed
Cons:
- Constrains the design of upload/download algorithms
- Files can't be stored across more than len(shards) hosts, which may impact host replacement strategies
- If we switch to arbitrary assignment later, we need to convert existing .sia files to add the shard index
Arbitrary assignment
Pros:
- More flexible with respect to different upload/download algorithms
- Files can be stored across more than len(shards) hosts
- If we switch back to strict assignment later, we can set all shard indices in the format to 0 and they will be compressed away
Cons:
- Format must store a ShardIndex in each BlobEntry, and possibly a ChunkIndex as well
- Determining which host is storing a given shard is O(n)
- If files are stored across more than len(shards) hosts, then determining which hosts are storing a given chunk is O(n)
- If we switch back to strict assignment later, we need to keep all the arbitrary assignment code for compatibility with existing .sia files
I will add more as I think of them.
Broadly speaking, the performance difference seems like a wash; it's difficult to know in advance which approach will yield the fastest upload/download performance. Format-wise, strict assignment consumes less space, but arbitrary assignment makes compatibility easier if we switch to strict assignment later. When replacing hosts, arbitrary assignment appears more flexible, but the full implications of this are unclear. Lastly, both approaches support file packing (storing chunks from multiple files in one host sector).
I also want to make one point clear in case it isn't already, which is that in strict assignment, the host->shard mapping may still vary from file to file. That is, host j does not necessarily hold shard j for every file being stored by the renter.
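For concreteness, here is a sketch of what an entry might look like under arbitrary assignment; the ShardIndex and ChunkIndex fields are hypothetical and not part of the proposal:

// Sketch only: a possible entry shape if arbitrary assignment is adopted.
type ArbitraryBlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
	ShardIndex uint8  // which RS shard this entry holds
	ChunkIndex uint32 // which chunk of the file it belongs to
}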
We've talked about this some offline, and there is really just one major conclusion: data for the same chunk needs to be nested into the same part of the file, which means the format for the file data / chunk data needs to be adjusted so that all the chunks come in a row. This probably means having some part of the header dedicated to listing the different hosts, ip addresses, and pubkeys, and then each chunk can reference which host it's using with a 1 byte number. At least then it's still fairly well compressed.
@rudibs we would want the data in this file to be more or less POSIX-compliant, though it's not a filesystem; it's just a list of files, like .tar. But it should have all the necessary metadata.
Another assumption that may have been in place when creating this proposal was that this file format would be modified in real time as pieces and chunks get uploaded or modified. With the new database upgrade for the filesystem, this is no longer going to be true. These files are to be extracted from the database.
The photo above shows a chunk layout: 8 bytes for EC data (what type of code, how many pieces, etc.). The next 8 bytes point to an offset in an extension file that contains extra pieces; it's set to 0 if there are no extra pieces, which will be the case for most chunks. Then what follows is a list of pieces, each of which has 4 bytes for a crypto key nonce, 2 bytes that identify the file contract (a lookup table is kept separately to find the full file contract ID), and finally 10 bytes which are the first 10 bytes of the Merkle root that can be used to query the host.
We will need some protocol changes to get this far, so initially we will need 32 bytes for the full root.
Because each chunk is constant size, you can know where in the file to find a chunk in constant time, without keeping a lookup table. For 50 pieces, each chunk is about 2kb on disk for 40MB of data stored in Sia. After we implement partial root lookup, it's less than 1kb per chunk.
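A sketch of that layout as Go structs; the field widths follow the description above, while the names are hypothetical:

// chunkHeader is the fixed 16-byte prefix of each chunk record.
type chunkHeader struct {
	ErasureCode [8]byte // code type, number of pieces, etc.
	ExtOffset   uint64  // offset of extra pieces in the extension file; 0 if none
}

// piece is one entry in the list of pieces that follows the header.
type piece struct {
	KeyNonce      [4]byte  // crypto key nonce
	ContractIndex uint16   // index into a separate file-contract lookup table
	RootPrefix    [10]byte // first 10 bytes of the sector Merkle root
	// Until the protocol supports partial-root lookups, the full 32-byte
	// root is stored here instead.
}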
In the discussion above you mention "Files can be shared with anyone who has contracts with at least Index.MinShards of the hosts." What if the instance importing a shared .sia file is missing contracts? Will the renter automatically form new contracts through an already-implemented mechanism? Or is this addressed in a separate issue (I didn't find anything)?
Also, for discussion: if Sia gets to the point where a renter can restrict host selection (e.g. by region), is there already a plan for handling such a scenario? Say the sharing instance uses worldwide host selection but the receiving instance only allows hosts from a specific region. This might break the "Index.MinShards available" prerequisite, because the renter won't be able to form contracts with the needed hosts due to the region restriction (of course this might apply to other host selection criteria as well, like pricing, availability, ...).
@chrsch good questions. If the recipient doesn't have contracts with at least MinShards hosts storing the file, they won't be able to download it. To rectify this, either the recipient must form new contracts, or the sender must replicate shards of the file to additional hosts.
Forming new contracts is expensive, so renters will avoid doing so if they can. I could see this leading to some degree of centralization: in order to minimize the number of new contracts that need to be formed, the most rational behavior for renters is to pick some subset of hosts (>= MinShards) that almost every renter has contracts with. The region selection you mentioned impacts this as well: it means that this subset can't be limited to a single region.
One way to combat this is a sort of "Lightning Network" for hosts, allowing you to pay hosts even if you don't have contracts with them. In the simplest example, you have a contract with host A, but the file is only stored on host B. However, host A has a payment channel with host B. So instead of paying B directly, you pay A, who then pays B on your behalf. In the general case, host A does not need to have a direct payment channel with host B; it just needs a way to route payments to host B, where the route potentially involves many hops.
This is all very speculative, but the basic principle of using hosts as hubs is sound: hosts by definition have properties that make them good hubs, like high availability and bandwidth.
One more thing: it's quite possible that "true" filesharing on Sia turns out to be uncommon in practice. Instead, a middleman could implement a traditional filesharing platform on top of Sia (e.g. sia.pixeldrain.com) and handle payments out-of-band. In this scenario, direct P2P filesharing would still exist, but would likely only be used by enthusiasts or people with special circumstances.