How to add custom compression / decompression?

Open dotphoton-ziad opened this issue 1 year ago • 15 comments

Hi all! I have all my data stored in an S3 bucket and I would like to use Hub to load my data from S3, yet my data on the cloud is compressed using Jetraw by Dotphoton and I would like it to decompress the images as I am pulling them from the cloud. I would be ready to write code to make this happen, but as I am new to the code base I would like to know where this would fit in the most and where I should start. Thank you all in advance!

dotphoton-ziad avatar Jul 15 '22 13:07 dotphoton-ziad

Hey @ziadomalik, we would love to accept a contribution that adds support for your compression method. Most of our compression code is written here: https://github.com/activeloopai/Hub/blob/main/hub/core/compression.py.

Feel free to join our Slack community at https://slack.activeloop.ai (#develop channel) to discuss in more detail how to complete the contribution.

Looking forward to it.

davidbuniat avatar Jul 16 '22 22:07 davidbuniat

Hey @ziadomalik. The list of supported compressions can be found in hub/compression.py. Make sure to add your new format there. The decompression code can be found at hub/core/compression.py. Import the required libraries and write your decompression function (something like _decompress_jetraw). Also see the decompress_array function in hub/core/compression.py - that's where your function will be called. Let me know if you need more help.
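The three steps above (register the format name, write the decompression function, dispatch to it) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Hub code: `SUPPORTED_COMPRESSIONS`, `_DECOMPRESSORS`, and the simplified `decompress_array` signature are hypothetical stand-ins, and `zlib` takes the place of the real Jetraw decoder so the example runs with the standard library alone.

```python
import zlib

# Hypothetical stand-in for the list of supported compressions
# (the real list lives in hub/compression.py and looks different).
SUPPORTED_COMPRESSIONS = ["png", "jpeg"]

# Step 1: add the new format name to the supported list.
SUPPORTED_COMPRESSIONS.append("jetraw")


# Step 2: write the decompression function.
def _decompress_jetraw(buffer: bytes) -> bytes:
    """Placeholder decoder.

    Assumption: real code would call the Jetraw library here.
    zlib is used only so that this sketch is runnable.
    """
    return zlib.decompress(buffer)


# Hypothetical mapping from compression name to decoder function.
_DECOMPRESSORS = {"jetraw": _decompress_jetraw}


# Step 3: dispatch to the new function from the general entry point.
# Simplified stand-in; the real decompress_array takes more arguments.
def decompress_array(buffer: bytes, compression: str) -> bytes:
    if compression not in SUPPORTED_COMPRESSIONS:
        raise ValueError(f"Unsupported compression: {compression}")
    decoder = _DECOMPRESSORS.get(compression)
    if decoder is None:
        raise NotImplementedError(f"No decoder in this sketch for: {compression}")
    return decoder(buffer)


if __name__ == "__main__":
    raw = b"pixel data " * 4
    compressed = zlib.compress(raw)  # stands in for a Jetraw-compressed sample
    assert decompress_array(compressed, "jetraw") == raw
```

Keeping the decoders in a mapping is just one way to keep the dispatch short; a chain of `if`/`elif` branches keyed on the compression name would follow the same idea.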

FayazRahman avatar Jul 17 '22 03:07 FayazRahman

@ziadomalik I would like to work on this. Kindly assign it to me.

h20200051 avatar Sep 12 '22 09:09 h20200051

Hey @h20200051, we can only assign one issue per contributor. Which one would you like to take on?

mikayelh avatar Sep 12 '22 11:09 mikayelh

Hey @mikayelh @davidbuniat is this issue still open? I would like to try and contribute to this issue.

Hussain0520 avatar Sep 15 '22 19:09 Hussain0520

Hi all, I've been assigned to work on other projects, so this issue is on hold (for now). It's still something we are actively discussing, and if it's appropriate, we could close this issue and reopen it once it becomes relevant again. I still need the green light from my Project Manager. I hope you understand, and thank you for your patience.

dotphoton-ziad avatar Sep 15 '22 20:09 dotphoton-ziad

@ziadomalik if you'd like, we can assign this issue to @Hussain0520 to work on it in the meantime, but if you want to be the one who writes this particular code, I'm not against putting this on hold. Maybe the best solution could be allowing someone else to take a stab and then improving on their contribution later on?

mikayelh avatar Sep 15 '22 20:09 mikayelh

@mikayelh Sounds like a plan! As a starting point, you guys could check out our Python documentation here and learn about the technology itself here. We also have a C++ API. Whenever you have questions, code review requests or anything I could help with, feel free to ping me! cc: @Hussain0520

dotphoton-ziad avatar Sep 15 '22 20:09 dotphoton-ziad

That's awesome!

hey @Hussain0520! I've just assigned you this issue - feel free to check in with us and @ziadomalik in case you need any help! Thanks for following up, @ziadomalik :)

mikayelh avatar Sep 15 '22 20:09 mikayelh

Hi, so I spoke with my project manager. Normally, we wanted to postpone this all the way to January because internally, we are still experimenting with the cloud and how our compression fits best into that context. If you guys would like to discuss, we could hop on a call so we could figure out the best way we can integrate Jetraw into the Activeloop Hub. cc: @mikayelh

dotphoton-ziad avatar Sep 16 '22 07:09 dotphoton-ziad

Thank you @mikayelh @ziadomalik . I'll surely contact you guys for help.

Hussain0520 avatar Sep 16 '22 10:09 Hussain0520

Please allow me to sneak in here... I was looking for a way to compress my 1 million nifti stacks. On one hand, "nifti" is not yet supported by deeplake (but dicom is). On the other hand, I was looking for a way to use the general dtype and add a custom compression on top. I was even more surprised to see someone from dotphoton here (@ziadomalik). You guys have been on my list for more than a year. The stars seem to align :-)

St3V0Bay avatar Nov 22 '22 08:11 St3V0Bay

Hi @St3V0Bay. Thx for following up on this thread! Adding custom compression is quite tricky, because even if it's implemented in Deep Lake OSS, it won't work in our visualizer or the optimized C++ dataloader, because they are not in the OSS repo.

We're also happy to add support for your nifti data directly. Are you working with dicom files that are combined into nifti stacks? If you're able to provide us with example data, we can implement support for it across our stacks.

Regarding dotphoton, are you using any of their compressions currently, or is this something you're excited about for future work?

istranic avatar Nov 22 '22 22:11 istranic

Hi @istranic, thanks for the swift response. I see - so custom compressions are a bit tricky to handle.

Re nifti: In the medical imaging domain most open-source data is offered as nifti (bioimaging has their own preference however). That's the best format for data scientists to get started. However, DICOM is the true standard that is actually used in the clinic (you have that already integrated, which is great). To pool DICOM data with open-sourced nifti files, the dicom files are converted (e.g. https://github.com/rordenlab/dcm2niix). The other way (from nifti > dicom) is a lot more complicated.

Example nifti files can be pulled using this repo (https://github.com/neheller/kits19). After installation it is just a one-liner. You can look at the data using ITK-SNAP (for example; http://www.itksnap.org/pmwiki/pmwiki.php), and the files can be opened in Python with the nibabel library (https://nipy.org/nibabel/). Another huge nifti repository is here: http://medicaldecathlon.com/

Re dotphoton: we are not using it. But their value proposition is really appealing: lower storage costs and faster data transfer. In projects of a certain size this really starts to matter, because things add up quickly if you have literally millions of data points.

St3V0Bay avatar Nov 25 '22 15:11 St3V0Bay

Thanks for the info @St3V0Bay. We'll keep you in the loop regarding our decision making around nifti support.

istranic avatar Nov 26 '22 14:11 istranic