cachey
Distributed cache strategy
Hello fellows, I am looking at using dask for distributed computation, and I was wondering: what is your strategy for turning cachey into a horizontally scalable cache and a common calculation space?
Regards, Alex
Are you already familiar with dask's distributed scheduler at http://distributed.readthedocs.org/en/latest/ ?
Hello Matt. Yes, I have seen distributed. What I am after is some sort of key-value storage for results that is available locally on multiple nodes. Is there something like that inside distributed?
The distributed scheduler maintains distributed data, indexed by key, in the active memory of multiple nodes. It doesn't persist to disk, but it easily could by running some function against the data (see the sketch after the session below). I would not recommend using distributed as a persistent key-value store or as a database, but rather as a staging ground for distributed computations. For persistence I would generally recommend some sort of parallel storage solution like HDFS, Ceph, or a more traditional POSIX file system. It tends to be fairly easy to write ingest functions that read from these with distributed in a data-local way.
In [1]: from distributed import Executor
In [2]: e = Executor('127.0.0.1:8786')
In [3]: futures = e.scatter(range(10))
In [4]: futures
Out[4]:
[<Future: status: finished, key: f1b37646b1331806c57001fca979724e>,
<Future: status: finished, key: 8068024105615db5ba4cbd263d42a6db>,
<Future: status: finished, key: d1d6feb3b70239a6081e64b7272d18c8>,
<Future: status: finished, key: 613f9ec13ceecaed260a81ade5f37311>,
<Future: status: finished, key: 205604040f767e399cb8b3c9322abdaa>,
<Future: status: finished, key: 7b4b5c735acdd7ee3b2fafd41bc5be78>,
<Future: status: finished, key: 3ab29504c46cb29e851081c84b5e4249>,
<Future: status: finished, key: f7aa9090f9d4af3b10066e98993388de>,
<Future: status: finished, key: 25e70ec51f367dbbff88d67a9684a5a6>,
<Future: status: finished, key: 42cb136fea0b4175909749d31c9a6c84>]
In [5]: e.has_what()
Out[5]:
{'192.168.1.141:33088': ['d1d6feb3b70239a6081e64b7272d18c8',
'25e70ec51f367dbbff88d67a9684a5a6',
'8068024105615db5ba4cbd263d42a6db',
'613f9ec13ceecaed260a81ade5f37311',
'f7aa9090f9d4af3b10066e98993388de',
'205604040f767e399cb8b3c9322abdaa',
'7b4b5c735acdd7ee3b2fafd41bc5be78',
'3ab29504c46cb29e851081c84b5e4249'],
'192.168.1.141:52787': ['f1b37646b1331806c57001fca979724e',
'42cb136fea0b4175909749d31c9a6c84']}
In [6]: def inc(x):
   ...:     return x + 1
   ...:
In [7]: futures2 = e.map(inc, futures)
In [8]: e.has_what()
Out[8]:
{'192.168.1.141:33088': ['inc-6a9bbed6c5c042aac23ac3739b3080ab',
'inc-80a8293859f4e85484b3fa9f5ae37e66',
'3ab29504c46cb29e851081c84b5e4249',
'inc-a684480530c8117e48025ba71b62a8fb',
'25e70ec51f367dbbff88d67a9684a5a6',
'd1d6feb3b70239a6081e64b7272d18c8',
'inc-9824ed6eb573348151189779d5fa084a',
'8068024105615db5ba4cbd263d42a6db',
'613f9ec13ceecaed260a81ade5f37311',
'inc-585e542b4025afe01b51bbab1e738cf6',
'f7aa9090f9d4af3b10066e98993388de',
'inc-b7ea7bc3d40dfc9dbaaab71d5a9f7a0a',
'inc-7348f0bb324a145903cbe7c249e062e4',
'205604040f767e399cb8b3c9322abdaa',
'7b4b5c735acdd7ee3b2fafd41bc5be78',
'inc-4ade3c6e95e45a1afc31f887a57733f5'],
'192.168.1.141:52787': ['f1b37646b1331806c57001fca979724e',
'42cb136fea0b4175909749d31c9a6c84',
'inc-b9caf9d500d82c4b712e016a1263da32',
'inc-80d3fd1f19423d863391d5d468b29e54']}
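For the persistence point above, here is a minimal sketch of "running some function against the data". It is not a definitive recipe: it reuses the e and futures2 objects from the session, and it assumes a shared directory that every worker can write to (the /mnt/shared/results path is made up).

import os
import pickle

def save(key, value, directory='/mnt/shared/results'):
    # each worker pickles the piece of data it already holds
    with open(os.path.join(directory, key + '.pkl'), 'wb') as f:
        pickle.dump(value, f)
    return key

# the futures are resolved on the workers, so the values never travel to the client
saved = e.map(save, [f.key for f in futures2], futures2)
e.gather(saved)  # block until every result has been written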
Thank you. I was thinking about plugging in something like Tarantool or Riak.
That could be a very fun combination.
Matt, I am interested in your opinion. I was inspired by http://mxnet.readthedocs.org/en/latest/distributed_training.html#how-to-write-a-distributed-program-on-mxnet and I think Tarantool is an interesting store for high-cost values.
Sorry to necro this old thread, but I figured this might be worth mentioning. Zarr recently added a Redis store ( https://github.com/zarr-developers/zarr/pull/372 ), which is an in-memory key-value store that implements the MutableMapping API, so it should be a nice thing to use with Cachey.
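As a rough sketch of how that might be wired up (not a definitive recipe): it assumes a zarr version that ships the RedisStore from the PR above, a Redis server on localhost, and that Cachey's Cache keeps its entries in a dict-like data attribute, which is an internal detail rather than a documented hook.

from cachey import Cache
from zarr.storage import RedisStore  # added in the PR linked above

# MutableMapping backed by Redis; keyword arguments are passed through to redis.Redis
store = RedisStore(prefix='cachey', host='localhost', port=6379)

c = Cache(1e9)   # roughly 1 GB budget for cached values
c.data = store   # swap the in-process dict for the shared Redis mapping

# Redis holds bytes, so store serialized values
c.put('x', b'some expensive result', cost=10)
print(c.get('x'))  # now visible to any process pointed at the same Redis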