Better compatibility for databricks

Open Meijian opened this issue 2 years ago • 11 comments

Feature

Solve compatibility issues with the Databricks platform. For example, zip file restrictions.

Pitch

Databricks has some unique restrictions that have caused compatibility issues with slideflow. A more Databricks-compatible version of slideflow would benefit this user group.

Alternatives

Additional context

Start with solving zip file generation problems on Databricks.

Meijian avatar Apr 13 '23 15:04 Meijian

Thanks - I've created the branch databricks for development. I added a commit that should address the DatasetFeatures.to_torch() issue you encountered previously. Let me know if that works.

There are a couple of other functions that save data as ZIP files, including:

  • Heatmap.save_npz()
  • SlideMap.save()
  • MIL attention export during validation or evaluation
  • Slide QC mask saving/loading

Would we need to extend functionality for all of these, as well?

jamesdolezal avatar Apr 13 '23 17:04 jamesdolezal

Hi James, thanks, that was quick! I will try it now. For the other functions, yes, please address them if possible. I believe I will need to use some of these functions as well!

Meijian avatar Apr 13 '23 17:04 Meijian

Hi James, I tried your fix. It took some time to run. I was still not able to run it through. Below is the error. It looks similar to previous ones but not exactly the same. Thanks!

features.to_torch(rootpath + '/imagenet/bag_directory/')
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
Traceback (most recent call last):
  File "", line 1, in
  File "/databricks/driver/slideflow/slideflow/model/features.py", line 556, in to_torch
    tfrecord2idx.save_index(
  File "/databricks/driver/slideflow/slideflow/util/tfrecord2idx.py", line 70, in save_index
    np.savez(index_file, index_array)
  File "<array_function internals>", line 5, in savez
  File "/databricks/python/lib/python3.10/site-packages/numpy/lib/npyio.py", line 618, in savez
    _savez(file, args, kwds, False)
  File "/databricks/python/lib/python3.10/site-packages/numpy/lib/npyio.py", line 721, in _savez
    with zipf.open(fname, 'w', force_zip64=True) as fid:
  File "/usr/lib/python3.10/zipfile.py", line 1180, in close
    self._fileobj.seek(self._zipfile.start_dir)
OSError: [Errno 95] Operation not supported

Meijian avatar Apr 14 '23 00:04 Meijian
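Editorial note: the traceback ends with zipfile calling seek() on the output file, which the target filesystem rejects with Errno 95. One general way around this kind of restriction is to build the archive on local disk and then copy the finished file to the destination in a single sequential write. The sketch below illustrates that idea only; it is not what the databricks branch does, and the helper names save_via_local_temp and write_example_zip are hypothetical.

```python
import os
import shutil
import tempfile
import zipfile


def save_via_local_temp(write_fn, dest_path):
    """Run write_fn against a local temp file, then copy the result to dest_path.

    write_fn receives a local path and may seek freely while writing;
    the destination only receives a plain sequential copy.
    """
    # Keep the destination's extension so writers that care about it behave the same.
    suffix = os.path.splitext(dest_path)[1]
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        write_fn(tmp_path)
        shutil.copyfile(tmp_path, dest_path)
    finally:
        os.remove(tmp_path)


def write_example_zip(path):
    # Hypothetical stand-in for a seek-heavy writer such as np.savez(path, index_array).
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("arr_0.txt", "1,2,3")
```

Calling save_via_local_temp(write_example_zip, dest) with dest on the restricted mount would keep all seeking on the local disk. Whether a plain copy is permitted on a given Databricks mount is an assumption to verify.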

Hmm... just to confirm, did you set the environment variable SF_ALLOW_ZIP=0? I'm not sure how this error could be encountered if that variable is set.

jamesdolezal avatar Apr 14 '23 00:04 jamesdolezal
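Editorial note: in a notebook, one way to set this variable is from Python itself. This is a minimal sketch; the variable name comes from the comment above, while the assumption that it should be set before slideflow is imported is ours and is not confirmed in this thread.

```python
import os

# Assumption: the flag is read when slideflow is imported or first used,
# so it is set before the import to be safe.
os.environ["SF_ALLOW_ZIP"] = "0"

# import slideflow as sf  # import only after the variable is set
```

Setting the variable in the cluster or shell environment instead would avoid any dependence on import order.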

It's likely. I just started rerunning and will let you know tomorrow morning once it's finished. Thanks!

Meijian avatar Apr 14 '23 03:04 Meijian

Hi James, it worked. I probably missed the environment variable. I'm going to train MIL models next. Attention-based MIL won't work yet because of the zip file issue, correct?

Meijian avatar Apr 14 '23 13:04 Meijian

I've just added a possible solution for attention-based MIL - give it a try and let me know if it works!

jamesdolezal avatar Apr 14 '23 13:04 jamesdolezal

Hi James, it worked like a charm! I did notice, though, that a few packages needed for training, such as fastai, were not included in the installation.

Meijian avatar Apr 14 '23 17:04 Meijian

Glad to hear it!

Re: dependencies - as you are aware, Slideflow is seeking to support a diverse set of deep learning tasks (segmentation, image generation, self-supervised learning, classification) and training paradigms. Some of these tasks have specific version requirements (e.g. StyleGAN requires PyTorch < 1.12) or dependencies (fastai for MIL; cellpose for cell segmentation), and we have entirely separate backends for TensorFlow and PyTorch, each with its own dependencies.

Rather than requiring all users to install all dependencies, the approach we have taken is to limit the auto-installed dependencies to only what all users will use, and then users can install additional dependencies based on their needs. For example, this will install only the base requirements of slideflow:

pip install slideflow

This will install dependencies for cell segmentation:

pip install slideflow[cellpose]

This will install all of the PyTorch-associated dependencies, including FastAI:

pip install slideflow[torch]

and so on. The installation instructions at https://slideflow.dev/installation/ do note that PyTorch users should install with pip install slideflow[torch], so this should have installed the FastAI dependency, as well.

We're definitely open to hearing suggestions for alternative approaches. We could also expand the discussion of this in the installation instructions.

jamesdolezal avatar Apr 14 '23 17:04 jamesdolezal
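Editorial note: with need-based extras, it can help to check up front whether the optional packages for a given workflow are importable, especially after a source install. The following is a small sketch using only the standard library; the helper name missing_modules is hypothetical, and the module names are taken from the packages mentioned above.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the extras mentioned above; adjust to your workflow.
missing = missing_modules(["fastai", "cellpose"])
if missing:
    print("Missing optional packages:", ", ".join(missing))
```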

Got it, makes sense to make it need-based. I think it was also because I installed it from source so it might be a different experience if I use other methods like pip. I will definitely let you know if I have more thoughts about this. Thanks!

Meijian avatar Apr 14 '23 17:04 Meijian

Hi @jamesdolezal, I encountered the zip file issue again when running slide_map.save_umap('path'), even after I defined the environment variable. I thought you might have missed this one. [Screenshot 2023-04-25 142053]

Meijian avatar Apr 25 '23 18:04 Meijian