Distributed cache strategy

Open AlexMikhalev opened this issue 9 years ago • 7 comments

Hello fellows, I am looking at using dask for distributed computation, and I was wondering what your strategy is for making cachey into a horizontally scalable cache and common calculation space.

Regards, Alex

AlexMikhalev avatar Mar 15 '16 20:03 AlexMikhalev

Are you already familiar with dask's distributed scheduler at http://distributed.readthedocs.org/en/latest/ ?

mrocklin avatar Mar 15 '16 20:03 mrocklin

Hello Matt. Yes, I have seen distributed. What I am after is some sort of key-value storage for results that is available locally on multiple nodes. Is that something inside distributed?

AlexMikhalev avatar Mar 18 '16 10:03 AlexMikhalev

The distributed scheduler maintains distributed data, indexed by key, in the active memory of multiple nodes. It doesn't persist to disk, but it easily could if you run some function against the data. I would not recommend using distributed as a persistent key-value store or as a database, but rather as a staging ground for distributed computations. For persistence I would generally recommend some sort of parallel storage solution like HDFS, Ceph, or a more traditional POSIX file system. It tends to be fairly easy to write ingest functions that read from these with distributed in a data-local way.

In [1]: from distributed import Executor

In [2]: e = Executor('127.0.0.1:8786')

In [3]: futures = e.scatter(range(10))

In [4]: futures
Out[4]: 
[<Future: status: finished, key: f1b37646b1331806c57001fca979724e>,
 <Future: status: finished, key: 8068024105615db5ba4cbd263d42a6db>,
 <Future: status: finished, key: d1d6feb3b70239a6081e64b7272d18c8>,
 <Future: status: finished, key: 613f9ec13ceecaed260a81ade5f37311>,
 <Future: status: finished, key: 205604040f767e399cb8b3c9322abdaa>,
 <Future: status: finished, key: 7b4b5c735acdd7ee3b2fafd41bc5be78>,
 <Future: status: finished, key: 3ab29504c46cb29e851081c84b5e4249>,
 <Future: status: finished, key: f7aa9090f9d4af3b10066e98993388de>,
 <Future: status: finished, key: 25e70ec51f367dbbff88d67a9684a5a6>,
 <Future: status: finished, key: 42cb136fea0b4175909749d31c9a6c84>]

In [5]: e.has_what()
Out[5]: 
{'192.168.1.141:33088': ['d1d6feb3b70239a6081e64b7272d18c8',
  '25e70ec51f367dbbff88d67a9684a5a6',
  '8068024105615db5ba4cbd263d42a6db',
  '613f9ec13ceecaed260a81ade5f37311',
  'f7aa9090f9d4af3b10066e98993388de',
  '205604040f767e399cb8b3c9322abdaa',
  '7b4b5c735acdd7ee3b2fafd41bc5be78',
  '3ab29504c46cb29e851081c84b5e4249'],
 '192.168.1.141:52787': ['f1b37646b1331806c57001fca979724e',
  '42cb136fea0b4175909749d31c9a6c84']}

In [6]: def inc(x):
   ...:     return x + 1
   ...: 

In [7]: futures2 = e.map(inc, futures)

In [8]: e.has_what()
Out[8]: 
{'192.168.1.141:33088': ['inc-6a9bbed6c5c042aac23ac3739b3080ab',
  'inc-80a8293859f4e85484b3fa9f5ae37e66',
  '3ab29504c46cb29e851081c84b5e4249',
  'inc-a684480530c8117e48025ba71b62a8fb',
  '25e70ec51f367dbbff88d67a9684a5a6',
  'd1d6feb3b70239a6081e64b7272d18c8',
  'inc-9824ed6eb573348151189779d5fa084a',
  '8068024105615db5ba4cbd263d42a6db',
  '613f9ec13ceecaed260a81ade5f37311',
  'inc-585e542b4025afe01b51bbab1e738cf6',
  'f7aa9090f9d4af3b10066e98993388de',
  'inc-b7ea7bc3d40dfc9dbaaab71d5a9f7a0a',
  'inc-7348f0bb324a145903cbe7c249e062e4',
  '205604040f767e399cb8b3c9322abdaa',
  '7b4b5c735acdd7ee3b2fafd41bc5be78',
  'inc-4ade3c6e95e45a1afc31f887a57733f5'],
 '192.168.1.141:52787': ['f1b37646b1331806c57001fca979724e',
  '42cb136fea0b4175909749d31c9a6c84',
  'inc-b9caf9d500d82c4b712e016a1263da32',
  'inc-80d3fd1f19423d863391d5d468b29e54']}
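
As a rough illustration of "running some function against the data" to persist it, here is a minimal sketch that uses only the standard library. The names `persist` and `load`, the pickle format, and the temporary directory are all assumptions made for this example, not part of distributed. On a real cluster, a function like `persist` would be run on the workers against the futures, and the target directory would need to be on storage that every node can reach.

```python
import pickle
import tempfile
from pathlib import Path


def persist(key, value, directory):
    """Pickle ``value`` to ``<directory>/<key>.pkl`` and return the file path.

    Hypothetical helper: the sort of function one might run against
    in-memory results to write them out.
    """
    path = Path(directory) / f"{key}.pkl"
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return str(path)


def load(path):
    """Read back a value written by ``persist``."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Local stand-in for results held in worker memory (keys are made up).
results = {"inc-0": 1, "inc-1": 2, "inc-2": 3}

out_dir = tempfile.mkdtemp()  # assumption: shared storage on a real cluster
paths = {key: persist(key, value, out_dir) for key, value in results.items()}
restored = {key: load(path) for key, path in paths.items()}
```

Keeping the key in the file name means a later ingest step can find a result again from its key alone.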

mrocklin avatar Mar 18 '16 14:03 mrocklin

Thank you. I was thinking about plugging in something like Tarantool or Riak.

AlexMikhalev avatar Mar 18 '16 15:03 AlexMikhalev

That could be a very fun combination

mrocklin avatar Mar 18 '16 22:03 mrocklin

Matt, I am interested in your opinion. I was inspired by http://mxnet.readthedocs.org/en/latest/distributed_training.html#how-to-write-a-distributed-program-on-mxnet and I think Tarantool is an interesting store for high-cost values.

AlexMikhalev avatar Mar 19 '16 20:03 AlexMikhalev

Sorry to necro this old thread, but I figured this might be worth mentioning. Zarr recently added a store backed by Redis, an in-memory key-value store ( https://github.com/zarr-developers/zarr/pull/372 ). It implements the MutableMapping API, so it should be a nice thing to use with Cachey.
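
To show what "implements the MutableMapping API" buys you, here is a minimal sketch of a mapping that serializes values into a pluggable backend. `PickledStore` is a hypothetical class written for this example; it is not Zarr's Redis store and not part of Cachey. A plain `dict` stands in for the remote key-value service so the snippet runs without a server.

```python
import pickle
from collections.abc import MutableMapping


class PickledStore(MutableMapping):
    """Hypothetical mapping that pickles values into a bytes-oriented backend.

    ``backend`` is any object with dict-style item access; a plain dict is
    used here as a stand-in for a remote key-value store.
    """

    def __init__(self, backend):
        self.backend = backend

    def __getitem__(self, key):
        # A missing key raises KeyError from the backend, as a mapping should.
        return pickle.loads(self.backend[key])

    def __setitem__(self, key, value):
        self.backend[key] = pickle.dumps(value)

    def __delitem__(self, key):
        del self.backend[key]

    def __iter__(self):
        return iter(self.backend)

    def __len__(self):
        return len(self.backend)


store = PickledStore({})
store["inc-1"] = [1, 2, 3]
store.update({"inc-2": {"a": 1}})   # update() comes for free from MutableMapping
cached = store.get("inc-1")         # so does get()
missing = store.get("nope", "miss")
del store["inc-2"]
```

Only the five abstract methods are written by hand; `get`, `update`, `pop`, `in`, and the rest are supplied by `MutableMapping`, which is why any code that expects a dict-like object can use such a store unchanged.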

jakirkham avatar Mar 05 '19 06:03 jakirkham