gokart
gokart copied to clipboard
[Feature Request] add support for `.lmdb`
I would like to add .lmdb
to the file formats supported by TaskOnKart.make_target(). .lmdb
is the format used by several popular datasets, and is actually suitable for handling large datasets (especially images).
For those who are not familiar with .lmdb
, here is a brief explanation of its usage. .lmdb
is a Key-Value store whose values can be retrieved via the lmdb.Environment
object:
import lmdb
# open environment
env = lmdb.open('foo')
# put item
key: bytes = ...
value: bytes = ...
with env.begin(write=True) as txn:
txn.put(key, value)
# get item
with env.begin() as txn:
_value = txn.put(key)
assert value == _value
The following changes are necessary to support .lmdb
:
- Add
LmdbFileProcessor
:
class LmdbFileProcessor(FileProcessor):
def __init__(self, file_path: str) -> None:
self.__file_path = file_path
super(LmdbFileProcessor, self).__init__()
@property
def env(self) -> lmdb.Environment:
# To avoid creating `env` in the constructor for paths other than .lmdb.
if not hasattr(self, '__env'):
self.__env = lmdb.open(self.__file_path, subdir=False, lock=True)
return self.__env
def load(self, file):
with self.env.begin(write=False) as txn:
for k, v in txn.cursor():
yield k, v
def dump(self, obj):
key, value = obj
with self.env.begin(write=True) as txn:
txn.put(key, value)
...
def make_file_processor(file_path: str, store_index_in_feather: bool) -> FileProcessor:
extension2processor = {
...
'.lmdb': LmdbFileProcessor(file_path)
}
- Prevent
SingleFileTarget
from opening files before dumping.
def _dump(self, obj) -> None:
if self._processor.__class__.__name__ == 'LmdbFileProcessor':
self._processor.dump(obj)
else:
with self._target.open('w') as f:
self._processor.dump(obj, f)
- Write tasks!
class DownloadImages(GokartTask):
def run(self):
image_ids = self.load()
for image_id in image_ids:
image: bytes = imread(image_id)
self.dump((image_id.encode(), image))
def output(self):
return self.make_target(f'{self.__class__.__name__}.lmdb')
class LoadImages(GokartTask):
def requires(self):
return DownloadImages()
def run(self):
for key, value in self.load(): # load keys and values of previous db
# do something with the images
...
They work perfectly in my environment, but there are a few untested concerns:
- It may violate the design philosophy of gokart and luigi to call
dump()
only once at the end ofrun()
. - It may cause problems with parallelization.
I would like to commit what I have sorted out here as a Pull Request, and I would like to discuss it in this issue.
Cool feature :)
Mostly I agree to this proposal.
It may violate the design philosophy of gokart and luigi to call dump() only once at the end of run().
I think this is not a problem. With current implementation, we can dump multiple times with named targets as follows
class Task(GokartTask):
def requires(self):
return dict(
a=self.make_target("a.pkl"),
b=self.make_target("b.pkl"),
)
def run(self):
self.dump(1, "a.pkl")
self.dump(2, "b.pkl")
Another discussion is whether lmdb should be implemented in SingleFileTarget
or create another LmdbTarget
:)
Thank you for your reply.
For now, I have implemented a way to use SingleFileTarget
to support lmdb, and passed the test locally.
https://github.com/S-aiueo32/gokart/tree/4d65e5f359a44d4113945dd82fc2c3171fa973da
In LocalTargetTest
, there was an error around the locking of lmdb, so I made sure not to lock it when opening lmdb.environment
(probably not a problem since it is locked in other parts of gokart).
Another discussion is whether lmdb should be implemented in SingleFileTarget or create another LmdbTarget :)
I see. It is indeed awkward to have a conditional branch depending on the processor type in SingleFileTarget
. It would be better to create a new LmdbTarget
that inherits from SingleFileTarget
, and rewrite load
and dump
.
If we create a new LmdbTarget
, do you have an opinion on whether it is preferable to branch in make_target
or create a new function like make_lmdb_target
?
@S-aiueo32 Thank you for proposing great feature!
It would be better to create a new LmdbTarget that inherits from SingleFileTarget
I think LmdbTarget should inherite TargetOnKart
just like SingleFileTarget
or ModelTarget
, not to make overcomplicating inheritance dependencies.
do you have an opinion on whether it is preferable to branch in make_target or create a new function like make_lmdb_target
I think creating new function make_lmdb_target
is better, just like existing make_model_target
.
@S-aiueo32 Hi, This is just a friendly reminder (not urgent). What's the status of this issue?