
Split data into blocks

Open efiop opened this issue 6 years ago • 28 comments

As @shcheklein suggested, we should consider splitting data into small blocks to track data changes more efficiently. Example: giant file that has one line appended to it.

efiop avatar Jun 30 '18 01:06 efiop

For a private data sharing/deduplication tool I've used a buzhash/rolling hash for chunking big files into small chunks pretty successfully. The borg backup tool has a nice implementation of a buzhash chunker. Maybe it's worth checking out.
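
For illustration, a minimal Python sketch of content-defined chunking with a Rabin-Karp-style rolling hash. This is not borg's buzhash implementation; the window size, modulus, mask and the 2 KiB/64 KiB chunk bounds are arbitrary values chosen for the example.

from typing import Iterator

WINDOW = 48                          # bytes in the rolling window
PRIME, MOD = 31, 1 << 32
MASK = (1 << 13) - 1                 # a cut point roughly every 8 KiB on average
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
POW = pow(PRIME, WINDOW - 1, MOD)    # weight of the oldest byte in the window


def chunks(data: bytes) -> Iterator[bytes]:
    """Yield variable-sized chunks whose boundaries depend on local content,
    so an insertion near the start of a file only disturbs nearby chunks."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD   # slide the window forward
        h = (h * PRIME + byte) % MOD
        size = i - start + 1
        if size >= MAX_CHUNK or (size >= MIN_CHUNK and h & MASK == 0):
            yield data[start:i + 1]                  # hash hit a boundary pattern
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]                           # trailing partial chunk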

sotte avatar Nov 14 '18 15:11 sotte

@Witiko noted that there is ficlonerange that we could use to create a file by reflinking smaller files into it. https://github.com/iterative/dvc/issues/2073#issuecomment-497864614
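
For reference, a rough sketch of what that could look like from Python on Linux via the FICLONERANGE ioctl (reflinking a byte range of one file into another). The ioctl request number and struct layout below are my assumptions for a typical x86-64 Linux build; both files must live on a reflink-capable filesystem (Btrfs, XFS, ...), and offsets/lengths generally have to be block-aligned.

import fcntl
import struct

# Assumed value of _IOW(0x94, 13, struct file_clone_range) on x86-64 Linux.
FICLONERANGE = 0x4020940D


def reflink_range(src_path: str, dst_path: str, src_offset: int,
                  length: int, dst_offset: int) -> None:
    """Clone `length` bytes of src into dst at dst_offset without copying data."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        # struct file_clone_range { s64 src_fd; u64 src_offset, src_length, dest_offset; }
        arg = struct.pack("=qQQQ", src.fileno(), src_offset, length, dst_offset)
        fcntl.ioctl(dst.fileno(), FICLONERANGE, arg)

In principle, a large workspace file could then be assembled by reflinking its cached chunks into place, keeping block-level deduplication on disk.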

efiop avatar Jun 01 '19 04:06 efiop

This is a pretty good solution to the chunking deduplication problem. I've used it with success for backing things up: https://duplicacy.com/. If anything you could use the algorithm.

salotz avatar Dec 12 '19 03:12 salotz

The way Git handles this is that it

looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next.

See https://git-scm.com/book/en/v2/Git-Internals-Packfiles

So that would be another possible approach, equivalent to chunking down to the byte. Mentioned in #1487 BTW:

  1. Diffs for dataset versions (see 2.1.): which files were added/deleted/modified.

Either approach complicates the file linking from cache to workspace anyway.

jorgeorpinel avatar Sep 02 '20 21:09 jorgeorpinel

As @shcheklein suggested, we should consider splitting data into small blocks to track data changes more efficiently. Example: giant file that has one line appended to it.

@efiop There is a problem with this approach if the new line is inserted at the top of the file, or in the middle.

karajan1001 avatar Sep 27 '20 10:09 karajan1001

There is a problem with this approach if the new line is inserted at the top of the file, or in the middle.

There are methods that can deal with inserts at various positions. I mentioned this earlier: https://github.com/iterative/dvc/issues/829#issuecomment-438705585

Most of the chunks/blocks of a file stay the same; only the chunks/blocks that change are actually added. Many backup tools use similar methods.

sotte avatar Sep 27 '20 11:09 sotte

For the record: another thing to consider is revisiting the .dir format so that it can be split into smaller files. E.g. 123456.dir could look like this (it might still be JSON or not):

{
   "dir": "654321.dir",
   "foo": "...",
   "bar": "..."
}

so each .dir file contains a listing of one directory level (e.g. os.listdir), which lets us optimize cases where you are adding, say, 1 file to a giant dataset. Currently, that results in a whole new .dir file, which might be something like a 30M JSON file and could be larger than the file you've added.
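
One possible reading of that layout, sketched below: each entry whose value ends in .dir points at the listing object of a subdirectory (similar to Git tree objects), so adding one file only rewrites the small .dir objects along its path instead of the full 30M listing. load_object() and the naming convention here are hypothetical placeholders, not DVC internals.

import json
import os


def load_object(cache_dir, name):
    """Hypothetical helper: read one .dir listing object from the local cache."""
    with open(os.path.join(cache_dir, name), encoding="utf-8") as f:
        return json.load(f)


def resolve_listing(cache_dir, name, prefix=""):
    """Recursively flatten nested .dir objects into a {relative_path: hash} dict."""
    flat = {}
    for entry, value in load_object(cache_dir, name).items():
        path = os.path.join(prefix, entry)
        if value.endswith(".dir"):      # subdirectory: recurse into its listing
            flat.update(resolve_listing(cache_dir, value, path))
        else:                           # regular file: value is its content hash
            flat[path] = value
    return flat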

CC @pmrowla , since we've talked about it previously.

efiop avatar Oct 18 '20 08:10 efiop

In case this is useful, there's Zarr for Python:

Zarr is a format for the storage of chunked, compressed, N-dimensional arrays.

  • Create N-dimensional arrays with any NumPy dtype.
  • Store arrays in memory, on disk, inside a Zip file, on S3, …
  • Organize arrays into hierarchies via groups.
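
A minimal usage sketch, assuming the zarr package is installed: data is stored as an N-dimensional array split into fixed-size chunks, so modifying part of the array only rewrites the affected chunk files.

import numpy as np
import zarr

# 1M x 128 float32 array on disk, stored in 10k-row chunks.
z = zarr.open("features.zarr", mode="a", shape=(1_000_000, 128),
              chunks=(10_000, 128), dtype="f4")
z[:10_000] = np.random.rand(10_000, 128)  # touches only the first chunk on disk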

jorgeorpinel avatar Jan 15 '21 02:01 jorgeorpinel

Nobody mentioned IPFS yet. :-p IPFS also stores files in chunks and supports different chunkers, and they're about to change (or have already changed?) the default chunker to a data-dependent one.

hmeine avatar Mar 22 '21 14:03 hmeine

For a private data sharing/deduplication tool I've used a buzhash/rolling hash for chunking big files into small chunks pretty successfully. The borg backup tool has a nice implementation of a buzhash chunker. Maybe it's worth checking out.

I just want to +1 this method. I have been doing some tests on a personal project using the restic chunker implementation, though the specific algorithm is not important. Testing several different datasets and large binary files, I had some great results with this method.

Other benefits are that these blocks are ~1-2 MB and work very nicely with protocols like HTTP, and the resulting data structure is still a flat set of chunks that can be consumed without any knowledge of the generation process.

I am a little nervous about the other diff-based approaches mentioned and their implications for garbage collection and the inter-file dependencies they create. Is there any current work on this, and/or are you leaning towards a specific approach? I am doing personal research into external tooling that could be used on top of DVC, but it would be great if generating this type of data structure were included in DVC itself.

bobertlo avatar Nov 10 '21 17:11 bobertlo

@bobertlo Great info! We haven't really picked a specific approach yet, as we've been mostly busy with rebuilding our data management architecture to suit new features like this. Right now the internals are somewhat ready for working on chunking specifically (e.g. objects/checkout/staging/db/etc in dvc/objects), but we need to finish up a few other things, so we are planning to start implementing chunking in Q1 2022.

Regarding 1-2MB chunks: that might get too slow to transfer unless we pack them into some kind of pack files. This problem is exactly what we are seeing with large image datasets, where you have something like a million 1-2MB images, each of which currently requires at least one API call to upload/download to/from the cloud, which is slow. This is why we plan on tackling both problems at the same time: big-file chunking and big-dataset packing. Overall there is an (unconfirmed) feeling that for our typical data sizes we probably need something bigger than a 1-2MB chunk or pack size. We will be researching it more closely a bit later.

Btw, happy to jump on a call again to chat about it ;)

efiop avatar Nov 10 '21 18:11 efiop

@efiop Great! I'm excited to follow this development :) RE: the chunk size, it is fairly arbitrary. Most existing implementations have interfaces for adjusting the match threshold on the hash and/or concatenating/truncating adjacent chunks.

I just want to share some proof of concept tests I ran on simulated datasets. To synthesize these, I pulled some (related) docker images, installed packages to create a couple more and then dumped them all to tarballs.

$ ls -sh images/
total 9.6G
670M base.tar  2.7G dvc.tar  2.5G librosa.tar  1.4G minimal.tar  2.5G scipy.tar

In this test, the files are chunked; the list of chunks is stored as [original file hash].chunks, and each chunk is stored as [chunk file hash]. I did one run with plain chunks and another with zstd streaming compression on each chunk, for fun.

$ du -sh testrepo*
3.3G    testrepo
855M    testrepo-zstd

$ find testrepo -type f | wc -l
3871

I realize it is a BIG deal (especially with garbage collection) to actually implement this, but the results here look very promising and have great implications for possible workflows.
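
For the curious, a rough sketch of the storage layout described above (not my actual test code): chunk a file, store each chunk under its own hash, optionally zstd-compressed, and write a <file hash>.chunks manifest listing the chunk hashes in order. It assumes a content-defined chunks() generator like the rolling-hash sketch earlier in this thread, plus the zstandard package.

import hashlib
import os
import zstandard as zstd


def store_file(path, repo, compress=True):
    """Chunk `path` into the `repo` directory and return the original file's hash."""
    with open(path, "rb") as src:
        data = src.read()                             # fine for a sketch; stream in practice
    file_hash = hashlib.sha256(data).hexdigest()
    cctx = zstd.ZstdCompressor()
    manifest = []
    for chunk in chunks(data):                        # content-defined chunking
        chunk_hash = hashlib.sha256(chunk).hexdigest()
        manifest.append(chunk_hash)
        out = os.path.join(repo, chunk_hash)
        if not os.path.exists(out):                   # dedup: identical chunks stored once
            with open(out, "wb") as obj:
                obj.write(cctx.compress(chunk) if compress else chunk)
    with open(os.path.join(repo, file_hash + ".chunks"), "w") as listing:
        listing.write("\n".join(manifest))
    return file_hash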

bobertlo avatar Nov 11 '21 14:11 bobertlo

zstd: yes, this reminds me that previously we didn't use compression for data storage because it would disable the current quick checkout. As splitting the data will also break quick checkout, maybe it is time to reconsider compression, as it would save storage and accelerate data transfer?

karajan1001 avatar Nov 12 '21 09:11 karajan1001

Hi, is there a more formal definition of this feature? I have some questions, especially about what is not included in this feature.

I suppose "chunking" is just splitting the files into smaller pieces so that you do not have to store/download/upload duplicate pieces.

And I suppose this does not have to do with downloading/uploading chunks in parallel from different remote storages, right?

  1. Does DVC support transferring files in parallel from the same remote storage? (current state without chunks)
  2. Does DVC support transferring files in parallel from all remote storages? (current state without chunks)
  3. Does DVC support transferring chunks in parallel from the same remote storage?
  4. Does DVC support transferring chunks in parallel from all remote storages?

I assume the answer is NO for all of them.

And I suppose that adding this feature will make it impossible to implement this one: Cloud versioning.

josecelano avatar Jul 11 '22 08:07 josecelano

@bobertlo [W]e are planning to start implementing chunking in Q1 2022.

Can you give a short update on how high or low this issue is prioritized right now, please? Even a rough guess at the time schedule would be highly appreciated. Or what are the results of your research in the current milestone?

At the moment, chunking is crucial for our further usage of DVC. We are going to automate our MLOps pipeline, which will slightly update a dataset of approx. 20 GB based on customer feedback every week. Without chunking, re-uploading every file entirely would produce an overhead of about 1 TB per year!

Chickenmarkus avatar Jul 11 '22 11:07 Chickenmarkus

@josecelano

  1. yes
  2. no, we still process remotes one-by-one, but everything from one remote is transferred in parallel
  3. yes, same as 1
  4. no, same as 2

Correct about cloud versioning. It could still be used as an optimization (we could use a version id for a particular chunk file), but that complicates it. So this could be an option, but we'll need to take a closer look.

efiop avatar Jul 11 '22 11:07 efiop

@Chickenmarkus We've been working on some prerequisites like https://github.com/iterative/dvc-data and https://github.com/iterative/dvc-objects, which will host the aforementioned functionality. We might get to this in Q4, but I'm not 100% sure, as we have other, more product-oriented priorities.

efiop avatar Jul 11 '22 11:07 efiop

At the moment, chunking is crucial for our further usage of DVC. We are going to automate our MLOps pipeline, which will slightly update a dataset of approx.

@Chickenmarkus I'd appreciate it if you can share more details about your use case:

  1. What is the type and format of the data: CSV, images, or something else?
  2. What does a typical data update look like: appending data to a file, or changes in any part of the file?

dmpetrov avatar Jul 11 '22 15:07 dmpetrov

@dmpetrov Yes, sure. Every piece of real feedback you get will ultimately serve us users. :smile:

  1. i) type & ii) format
    1. We have records which consist of multi-line text, numbers, some lists of numbers representing encoded tags, and discrete labels. For our use case, the records are unordered (no ordering by ID or timestamp is possible).
    2. For now, we use and are fine with CSV. The multi-line text can extend a single record to several lines. Due to the size of the dataset, it is split into several CSV files. Some binary formats such as Parquet or TFRecords are under discussion in order to ensure types or optimize lazy loading if the dataset no longer fits into memory.
  2. We have two typical patterns of change:
    1. Append new records. This is by far the most frequent one. Creating a new file every time we append new records, so that only the new file is uploaded, becomes a mess when it comes to ii.
    2. Update the label of an existing record. Basically, that's changing a few characters in any single line of the CSV.

In the end, it is very challenging to efficiently chunk this randomized data with a generic approach. :see_no_evil: However, the overhead of versioning big files would already be significantly reduced if at least 2.i were covered.

Chickenmarkus avatar Jul 12 '22 14:07 Chickenmarkus

@Chickenmarkus some feedback on your setup is below:

First, "Update the label of an existing record" can be a root cause of multiple issues. Data and labels have different lifecycles. The best practice is to separate labels from data and join them when needed.

We are working on a DVC-sister project, LDB/LabelDB, to solve the problem of managing labels. It is at an early stage, but we're happy to sync up and give you early access if there is interest (please share your email).

In the simple case, you can just store data CSVs and label CSVs separately.

Second, Parquet or TFRecords is the right direction for optimizing evolving tabular data formats. However, versioning may still not be easy to get out of the box.

Delta Lake looks like a great way of doing versioning on top of Parquet. Luckily there is a Rust implementation with Python bindings (https://github.com/delta-io/delta-rs#python), which makes it possible to integrate Delta into DVC (without dependencies on Spark and the JVM 😅) and manage tabular data efficiently.
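
A small sketch of what that could look like with the deltalake Python bindings (exact APIs may differ between versions; the path and columns below are made up):

import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2], "label": ["cat", "dog"]})
write_deltalake("data/labels", df, mode="append")       # each write creates a new table version

dt = DeltaTable("data/labels")
print(dt.version())                                     # latest version number
old = DeltaTable("data/labels", version=0).to_pandas()  # "time travel" back to version 0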

I'd love to hear your feedback.

dmpetrov avatar Jul 26 '22 10:07 dmpetrov

@Chickenmarkus some feedback on your setup is below: [...] I'd love to hear your feedback.

Sorry for the late reply; first, the notification did not work, and second, I was on vacation. :see_no_evil:

Also, I would really like to thank you for your feedback! It is very helpful, and I absolutely agree with you. The separation of data and labels is completely new to me, but it seems to be the right way; I only see the reasons you already mentioned, and nothing comes to mind where this would break the workflow. I will forward this proposal within the company. Also, thank you for your offer of early access to LabelDB, but I already have too many projects in progress. :face_exhaling: I also already know Delta Lake - a really fancy technology. However, we also have to care about how to read the data into TensorFlow batch-wise if it does not fit into memory. TensorFlow comes with already-implemented importers for CSV or TFRecords, which saves a lot of development work on our side. :angel:

Chickenmarkus avatar Aug 26 '22 14:08 Chickenmarkus

Another motivation for implementing data chunking:

When we manage image data for, e.g., image classification, we have a folder for each class which contains the images associated with that class. Thus, we have as many files as we have samples in the dataset, which can be anywhere between O(10,000) and millions. In our case, we use an HTTP storage backend for DVC which has a rate limit of 1000 requests per 15 seconds. With highly parallel download, we are bounded by this rate limit, which means that for, e.g., 1,000,000 samples it would take (1,000,000 ÷ 1000) × 15 s ≈ 4 h. For comparison, let's assume the image size on average is 100 KB. Then 1,000,000 images would amount to 100 GB. With a conservative bandwidth of 100 MBit/s, downloading 100 GB would take 100 GB ÷ 0.0125 GB/s ≈ 2 h. And if we assumed a bandwidth of 1 GBit/s, the download time would drop to roughly 13 min.

So, given our rate limit, our effective bandwidth is severely limited. But if DVC supported data chunking, we could significantly improve our throughput with probably some but not a total loss of deduplication.

sisp avatar Mar 23 '23 08:03 sisp

@sisp thank you for the feedback!

This problem can be solved at different levels. I have some concerns about using data chunking in this particular case because it requires splitting the data into smaller chunks. For example, for 1M images we would need to split them into X (1M <= X < 100M) smaller chunks, depending on image sizes, and we would hit the throughput limits even harder.

Instead, I believe a better solution could be the exact opposite approach, where we use a single archive file (per class, or just per subset of images) that can be fetched partially. This way, we could "batch" requests. For example, we can obtain all the images with just one request, or retrieve a subset of N images with M <= N queries (the more images you ask for, the smaller M gets). We are currently working on a solution using the webdataset format, which is based on uncompressed tar files and allows partial retrieval of files from S3/GCP/ADLS/NFS.
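
To illustrate the idea (this is not the webdataset implementation itself): index an uncompressed tar once, then fetch a single member with an HTTP range request instead of downloading the whole archive. The URL and helper names are made up for the example; offset_data is a long-standing, if lightly documented, tarfile attribute.

import tarfile
import requests


def build_index(local_tar):
    """Map member name -> (offset of its data block, size) inside the tar."""
    with tarfile.open(local_tar) as tar:
        return {m.name: (m.offset_data, m.size) for m in tar.getmembers()}


def fetch_member(url, index, name):
    """Download just one member of a remote tar via an HTTP range request."""
    offset, size = index[name]
    resp = requests.get(url, headers={"Range": f"bytes={offset}-{offset + size - 1}"},
                        timeout=30)
    resp.raise_for_status()
    return resp.content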

I would love to discuss this further with you in our next meeting. Thanks again for your suggestion!

dmpetrov avatar Apr 05 '23 02:04 dmpetrov

@dmpetrov Thanks for your feedback! :pray: Indeed, splitting/chunking image files even further would make the problem worse. TBH, I haven't read into the algorithmic details of some of the above-mentioned chunkers (like the one BorgBackup is using), but I imagined that chunking wouldn't necessarily mean splitting files; e.g. for small files, it could also mean combining files. I had a picture of HDFS' 64 MB blocks in my mind somehow. But this might not make sense at all; I can't actually say without diving deeper into the algorithmic details.

Could you elaborate just a little on your work towards partial retrieval of data from an archive? I'm absolutely open to managing data at a less granular level, e.g. all samples per class gathered in a single file, but currently (a) I'd lose deduplication when only some samples change because the whole file gets re-uploaded, (b) I'd need to retrieve the whole file even when I only want a subset (which might be addressed by the work you've mentioned?), and (c) I'd need to decide on the granularity and content slice of the file/archive manually (which might be unavoidable because the optimal strategy might depend on data access and mutation patterns, and I don't know whether there is a generic near-optimal solution to this problem; it feels a bit related to RDBMS index creation).

I'm afraid I won't be able to join our meeting today because I'm already on Easter vacation. But I think the main topic is a different one anyway. I'd be happy to continue discussing here, or if you think it makes sense to also chat about it synchronously, we could schedule another meeting. 🙂

sisp avatar Apr 05 '23 06:04 sisp

Could you elaborate just a little

Sure! First, the webdataset-based approach works well only for "append-only" use cases. For example, you would add each new batch of images as a new tar file to a class directory but not delete the images. Physical file deletion has a huge overhead; logical deletion (without touching the archives) is not a problem, but it lies on the shoulders of the user.

(a) I'd lose deduplication

It depends on how you "package" the archive. It is possible not to lose it, and even to get local file caching (per image file).

(b) I'd need to retrieve the whole file even when I only want a subset (which might be addressed by the work you've mentioned?)

You can retrieve only the files you need. Example: a tar with 4 files f1, f2, f3, f4, 1 MB each. If you "know" the archive structure, you can download only the required files f2 and f3 - only 2 MB out of the 4 MB tar file. Moreover, if you are lucky, it will also optimize the number of requests - f2 and f3 can be downloaded in one request if they are next to each other in the archive. The bigger the percentage of files you ask for, the more luck you get 🍀
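
The "f2 and f3 in one request" point boils down to coalescing adjacent byte ranges before issuing requests; a tiny sketch:

def coalesce(ranges, gap=0):
    """Merge (start, end) byte ranges separated by at most `gap` bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# f2 at bytes [1 MB, 2 MB) and f3 at [2 MB, 3 MB) collapse into one range,
# i.e. a single HTTP range request covers both files.
print(coalesce([(1_048_576, 2_097_152), (2_097_152, 3_145_728)]))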

(c) I'd need to decide on the granularity and content slice of the file/archive manually

Right. It is doable and you can address files by the archive name and image file name in it.

it feels a bit related to RDBMS index creation

Yes. "know the archive structure" from the above means "index".

I'm already on Easter vacation

Let me chat with your folks; that should clarify the best way for us to communicate 🙂

Happy Easter 🐰

dmpetrov avatar Apr 05 '23 06:04 dmpetrov

Thanks for your detailed reply; this is very helpful. 🙏

Managing files in an archive, including batched download of adjacent files, etc., is something you're working on adding to DVC, but it's not yet possible. Right?

The append-only property sounds like a reasonable requirement to me. Typically, new data becomes available, gets appended to a dataset, and becomes available in a new release. For structured data, when the data structure changes (new field/column, changed representation of a field/column, etc.), all samples might need to be updated and uploaded without an opportunity for deduplication anyway. The only case I'm still a bit concerned about is fixing some samples in a dataset, e.g. when NaN values remained in v1.0.0 and were fixed (by whatever method) in v1.0.1. In this case, only some samples would be updated. With one sample per file, only the updated files would get uploaded again, but with archives I imagine the whole archive would have to be uploaded although most samples might not have changed. Right?

Let me chat with your folks; that should clarify the best way for us to communicate 🙂

Happy Easter 🐰

Sounds good. 👌 Happy Easter to you, too. 🐰

sisp avatar Apr 05 '23 07:04 sisp

adding to DVC, but it's not yet possible. Right?

Well, it is actually a DVC-sister project that we are working on. The new product is designed for genAI use cases.

The append-only property sounds like a reasonable requirement to me. ... fixing some samples in a dataset

Great! And you are right about the data fixes. However, the majority of fixes happen at the metadata/label level rather than on the files/images themselves. Metadata fixes are not an issue at all if you manage them separately from the data (which is a fairly common pattern).

with archives I imagine the whole archive would have to be uploaded although most samples might not have changed.

It depends on the use case. In some cases, virtually removing a subset of images from a dataset is enough, without any changes to the files or archives. All of this can be done at the metadata level.

PS: I had a great chat with your teammate recently. I hope to see you both in the next meeting 🙂

dmpetrov avatar Apr 07 '23 18:04 dmpetrov

Hello, is there any news about this feature?

Rob174 avatar Feb 27 '24 14:02 Rob174