funflow icon indicating copy to clipboard operation
funflow copied to clipboard

A general question about the terminology of a content-addressable storage

Open CMCDragonkai opened this issue 5 years ago • 5 comments

The content store used here and in Nix, is a store where the data is addressed by some hash. But this hash is not calculated by the content of the data, but by some reproducible expression that produces the data.

Is it still correct to call it a content-addressable storage, or is there some better name for this or is there some qualifier that qualifies this content-addressable storage as something different? The wikipedia article for CAS only mentions that the hash is built from the content that is being stored, not some input expression that produces the content.

Some other thoughts that I'm hoping others can help put in a more formal way:

Looking at this in a more general way. There is some notion of equivalence between the expression that produces a value, and the value itself. However the mapping is surjective, as many kinds of input expressions may map to the same value (which is why canonicalisation/normalisation of the input expression is useful). We assume that hashing the expression and looking it up is faster than running the expression and hashing the resulting value and looking it up.

CMCDragonkai avatar Mar 12 '19 04:03 CMCDragonkai

The hash in funflow is precisely calculated from the output data. The hash in nix is calculated from just the input expression. This is sometimes known as the difference between an "intensional" and "extensional" store (see section 6 of the thesis) and is one of the things that differentiates funflow's store from nix's store.

mpickering avatar Mar 12 '19 08:03 mpickering

@mpickering Oh in that case, how come it says this on your README:

Funflow's caching is based around the content store. This stores all artifacts according to the hash of the inputs and computations which produced them. Funflow uses this to know when exactly a computation must be rerun and when previous results can be re-used.

CMCDragonkai avatar Mar 12 '19 09:03 CMCDragonkai

@CMCDragonkai funflow uses two kinds of hashes. It hashes the inputs to a task and links them to the resulting output so that we can cache computations. It also uses content hashes to deduplicate the store.

The next paragraph in the README points to the CAS aspect

The content store also acts as a CAS system. This means that, if multiple inputs produce the same output, that output will be stored only once on disk (and subsequent computations will not be rerun).

aherrmann avatar Mar 12 '19 10:03 aherrmann

I've reread the nix phd thesis on intensional store, I understand what you mean now. @aherrmann Can you point out in the code where is the table/datastructure that links the extensional hash with the intensional hash?

CMCDragonkai avatar Mar 18 '19 04:03 CMCDragonkai

@CMCDragonkai Sure, the links are stored on the file system as symbolic links. These are created here. On lookup we check for such a symbolic link and if it exists resolve it to determine the CAS item here.

aherrmann avatar Mar 18 '19 09:03 aherrmann