
[FEATURE] Option to disable auto commit after data ingestion

Open HyunggyuJang opened this issue 1 year ago • 1 comment

Description

Currently, a new version is created on every data ingestion by the following code: https://github.com/activeloopai/deeplake/blob/2ad84c139e91fe6f81da9c8b6d4f48c9d3ee8e73/deeplake/core/vectorstore/deeplake_vectorstore.py#L294

It seems that every commit captures the full current state of the dataset as its own version. So if the user commits often, the storage consumed by the versions grows rapidly.

This becomes a problem when the user ingests small amounts of data incrementally: the dataset is almost identical between consecutive versions, so the space is used inefficiently.

The canonical solution would be to store only the diff for each version, but as I'm not familiar with the codebase, I don't know whether that is feasible.
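To make the concern concrete, here is a rough back-of-the-envelope sketch. It assumes (as guessed above, not verified against the codebase) that each commit stores a full copy of the dataset, and compares that with a diff-only scheme. The function names `snapshot_rows` and `diff_rows` are made up for illustration and are not part of deeplake.

```python
def snapshot_rows(batch_sizes):
    """Total rows stored if every commit copies the whole dataset (assumed behavior)."""
    total = 0
    current = 0
    for n in batch_sizes:
        current += n      # dataset size after this ingestion
        total += current  # a full copy is stored for this version
    return total


def diff_rows(batch_sizes):
    """Total rows stored if every commit keeps only the newly added rows."""
    return sum(batch_sizes)


# 100 incremental ingestions of 10 rows each
batches = [10] * 100
full_copy_total = snapshot_rows(batches)
diff_only_total = diff_rows(batches)
```

Under that assumption, the full-copy total grows quadratically with the number of ingestions, while the diff-only total grows linearly.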

So, instead, I suggest offering an option that lets users choose whether or not to "auto-commit" when ingesting data.
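A minimal sketch of what I have in mind, using a toy in-memory class rather than the real vector store. Everything here is hypothetical: `ToyVectorStore`, its `add` and `commit` methods, and the `auto_commit` parameter are illustrative names only and do not reflect deeplake's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVectorStore:
    """Hypothetical stand-in for a vector store, for illustration only."""
    rows: list = field(default_factory=list)
    versions: list = field(default_factory=list)

    def commit(self, message=""):
        # Record a version; here we only store the message and the row count.
        self.versions.append((message, len(self.rows)))

    def add(self, new_rows, auto_commit=True):
        # Ingest data, committing afterwards only if the caller wants it.
        self.rows.extend(new_rows)
        if auto_commit:
            self.commit("auto commit after ingestion")


store = ToyVectorStore()
for batch in ([1, 2], [3], [4, 5]):
    store.add(batch, auto_commit=False)  # no version per small batch
store.commit("after bulk ingestion")     # one explicit version at the end
```

Keeping `auto_commit=True` as the default would preserve the current behavior, while users who ingest many small batches could opt out and commit once at the end.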

Use Cases

No response

HyunggyuJang, Aug 07 '23 03:08

Hey @HyunggyuJang, thanks a lot for raising the issue. We're already working on this, and I'll be sure to let you know when the updates are released.

FayazRahman, Aug 07 '23 18:08