gokart icon indicating copy to clipboard operation
gokart copied to clipboard

[Feature Request] add support for `.lmdb`

Open S-aiueo32 opened this issue 2 years ago • 4 comments

I would like to add .lmdb to the file formats supported by TaskOnKart.make_target(). .lmdb is the format used by several popular datasets, and is actually suitable for handling large datasets (especially images).

For those who are not familiar with .lmdb, here is a brief explanation of its usage. .lmdb is a Key-Value store whose values can be retrieved via the lmdb.Environment object:

import lmdb

# open environment
env = lmdb.open('foo')

# put item
key: bytes = ...
value: bytes = ...
with env.begin(write=True) as txn:
    txn.put(key, value)

# get item
with env.begin() as txn:
    _value = txn.put(key)
    assert value == _value

The following changes are necessary to support .lmdb:

  1. Add LmdbFileProcessor:
class LmdbFileProcessor(FileProcessor):

    def __init__(self, file_path: str) -> None:
        self.__file_path = file_path
        super(LmdbFileProcessor, self).__init__()

    @property
    def env(self) -> lmdb.Environment:
        # To avoid creating `env` in the constructor for paths other than .lmdb. 
        if not hasattr(self, '__env'):
            self.__env = lmdb.open(self.__file_path, subdir=False, lock=True)
        return self.__env

    def load(self, file):
        with self.env.begin(write=False) as txn:
            for k, v in txn.cursor():
                yield k, v

    def dump(self, obj):
        key, value = obj
        with self.env.begin(write=True) as txn:
            txn.put(key, value)

...

def make_file_processor(file_path: str, store_index_in_feather: bool) -> FileProcessor:
    extension2processor = {
        ...
        '.lmdb': LmdbFileProcessor(file_path)
    }
  1. Prevent SingleFileTarget from opening files before dumping.
    def _dump(self, obj) -> None:
        if self._processor.__class__.__name__ == 'LmdbFileProcessor':
            self._processor.dump(obj)
        else:
            with self._target.open('w') as f:
                self._processor.dump(obj, f)
  1. Write tasks!
class DownloadImages(GokartTask):
    def run(self):
        image_ids = self.load()
        for image_id in image_ids:
            image: bytes = imread(image_id)
            self.dump((image_id.encode(), image))

    def output(self):
        return self.make_target(f'{self.__class__.__name__}.lmdb')

class LoadImages(GokartTask):
    def requires(self):
        return DownloadImages()

    def run(self):
        for key, value in self.load():  # load keys and values of previous db
            # do something with the images
            ...

They work perfectly in my environment, but there are a few untested concerns:

  1. It may violate the design philosophy of gokart and luigi to call dump() only once at the end of run().
  2. It may cause problems with parallelization.

I would like to commit what I have sorted out here as a Pull Request, and I would like to discuss it in this issue.

S-aiueo32 avatar Dec 17 '21 11:12 S-aiueo32

Cool feature :)

Mostly I agree to this proposal.

It may violate the design philosophy of gokart and luigi to call dump() only once at the end of run().

I think this is not a problem. With current implementation, we can dump multiple times with named targets as follows

class Task(GokartTask):
    def requires(self):
        return dict(
           a=self.make_target("a.pkl"),
           b=self.make_target("b.pkl"),
       )

    def run(self):
      self.dump(1, "a.pkl")
      self.dump(2, "b.pkl")

Another discussion is whether lmdb should be implemented in SingleFileTarget or create another LmdbTarget :)

Hi-king avatar Dec 17 '21 15:12 Hi-king

Thank you for your reply.

For now, I have implemented a way to use SingleFileTarget to support lmdb, and passed the test locally. https://github.com/S-aiueo32/gokart/tree/4d65e5f359a44d4113945dd82fc2c3171fa973da In LocalTargetTest, there was an error around the locking of lmdb, so I made sure not to lock it when opening lmdb.environment (probably not a problem since it is locked in other parts of gokart).

Another discussion is whether lmdb should be implemented in SingleFileTarget or create another LmdbTarget :)

I see. It is indeed awkward to have a conditional branch depending on the processor type in SingleFileTarget. It would be better to create a new LmdbTarget that inherits from SingleFileTarget, and rewrite load and dump.

If we create a new LmdbTarget, do you have an opinion on whether it is preferable to branch in make_target or create a new function like make_lmdb_target?

S-aiueo32 avatar Dec 21 '21 04:12 S-aiueo32

@S-aiueo32 Thank you for proposing great feature!

It would be better to create a new LmdbTarget that inherits from SingleFileTarget

I think LmdbTarget should inherite TargetOnKart just like SingleFileTarget or ModelTarget, not to make overcomplicating inheritance dependencies.

do you have an opinion on whether it is preferable to branch in make_target or create a new function like make_lmdb_target

I think creating new function make_lmdb_target is better, just like existing make_model_target.

mski-iksm avatar Dec 21 '21 14:12 mski-iksm

@S-aiueo32 Hi, This is just a friendly reminder (not urgent). What's the status of this issue?

hirosassa avatar Apr 08 '22 00:04 hirosassa