DataSets.jl
Using data handles with Distributed.jl
DataSets.jl is currently awkward to use with Distributed.jl.
The main usability problem is that handles like Blob or BlobTree can't be deserialized on another node when they rely on local resources, such as disk caches, which are only available on the main node. This forces less natural usage patterns such as sending keys between the nodes.
It would be nice to make this more natural.
"Ideally" you'd like to have the dataset open on all nodes, and to transparently hook up any serialized Blob to the local data cache during deserialization. I'm not sure this is really possible, but it's something to aspire to!
The "natural" way to do this is to allow Blob and BlobTree to refer to data on a different node than the local one, perhaps transferring them as RemoteBlob and RemoteBlobTree (or using a parametric type, as RemoteData{Blob} and RemoteData{BlobTree}). However, that might not work well if large amounts of data end up being transferred whenever a node accesses data held on a different node.
Well, Blob is already a lazy handle to potentially-remote data, where the actual data fetching is delegated to a storage driver instance.
So technically Blob is fine to serialize. The storage driver is the problem, because it manages local resources like a cache directory and a list of files which have been downloaded so far.
For typical data parallel workloads there's probably not much overlap between the data that each node needs, so they should fetch it independently from storage.
Can you transfer the Blob and then detect on the remote side that the Blob doesn't have everything it needs and refetch it on first use or on deserialization?
"refetch it on first use or on deserialization?"
Yes, that's what I meant by "transparently hook up any serialized Blob to the local data cache during deserialization".
This is kind of how modules are handled during deserialization. Only a module tag is sent, no module data. The deserializer then uses the tag to look up a local reference to the module which is expected to exist on the receiving side.
I'm not sure Serialization can be extended in this way. Needs some investigation.
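As a starting point for that investigation, here is a minimal sketch of the module-style "send a tag, look it up locally" idea using method overloads on the Serialization stdlib. Everything named here is hypothetical: FakeDriver and FakeBlob are stand-ins rather than the real DataSets.jl types, and LOCAL_DRIVERS is an invented per-process registry. It also leans on Serialization.writetag and Serialization.OBJECT_TAG, which are internals of the stdlib rather than documented API, so this may not be a stable approach.

```julia
using Serialization

# Hypothetical stand-in for a storage driver: it holds node-local state
# (here just a cache directory) that should not be sent to other nodes.
struct FakeDriver
    cache_dir::String
end

# Hypothetical stand-in for Blob: a lazy handle made of a driver plus a path.
struct FakeBlob
    driver::FakeDriver
    dataset_name::String
    path::String
end

# Hypothetical per-process registry of open drivers, keyed by dataset name.
# Each node would fill this in itself when it opens the dataset.
const LOCAL_DRIVERS = Dict{String,FakeDriver}()

# Write only the "tag" (dataset name and path); the driver is left out.
# writetag and OBJECT_TAG are Serialization internals (an assumption that
# they stay available).
function Serialization.serialize(s::AbstractSerializer, b::FakeBlob)
    Serialization.writetag(s.io, Serialization.OBJECT_TAG)
    serialize(s, FakeBlob)
    serialize(s, b.dataset_name)
    serialize(s, b.path)
end

# Rebuild the handle against whichever driver is registered on this process.
function Serialization.deserialize(s::AbstractSerializer, ::Type{FakeBlob})
    name = deserialize(s)
    path = deserialize(s)
    driver = get(LOCAL_DRIVERS, name) do
        error("dataset $name is not open on this process")
    end
    FakeBlob(driver, name, path)
end

# Simulate a sender and a receiver in one process by swapping the registry
# entry between serialization and deserialization.
LOCAL_DRIVERS["mydata"] = FakeDriver("/tmp/cache-sender")
blob = FakeBlob(LOCAL_DRIVERS["mydata"], "mydata", "a/b.csv")

io = IOBuffer()
serialize(io, blob)

LOCAL_DRIVERS["mydata"] = FakeDriver("/tmp/cache-receiver")
seekstart(io)
blob2 = deserialize(io)
```

In this sketch blob2 keeps the dataset name and path of the original but points at the receiver's driver, so any fetching and caching would happen against local resources. Whether the same overloads are picked up by the serializer Distributed.jl uses between workers, and how a real driver would be registered on each node, still needs checking.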