webknossos-libs icon indicating copy to clipboard operation
webknossos-libs copied to clipboard

Add option to add remote mags and remote layer to an existing dataset

Open MichaelBuessemeyer opened this issue 1 year ago • 9 comments

Description:

This PR adds option to add remote mags and remote layer to an existing dataset via dataset.add_remote_layer(layer / URL, layer_name and `layer.add_remote_mag(mag / URL).

When using add_remote_mag make sure that the layer to which the mag is added has matching dtype, category and so on.

Testing:

  • I wrote tests for this, but the tests working with the plain URLs somehow do not work. Maybe I need to configure / code the tests differently :see_no_evil: I used this script for local testing. Run a local wk to make this work. Else change the local host URLs to webknossos.org urls.
small_dataset = Dataset.open(<local dataset path>)
new_mag = MagView._ensure_mag_view(
    "http://localhost:9000/data/zarr/sample_organization/l4_sample/segmentation/1"
)
new_layer = small_dataset.add_layer(
    "test_remote_layer",
    "segmentation",
    data_format=new_mag.info.data_format,
    dtype_per_channel=new_mag.get_dtype(),
)
new_layer.add_remote_mag(new_mag)
new_mag = MagView._ensure_mag_view(
    "http://localhost:9000/data/zarr/sample_organization/l4_sample/segmentation/2-2-1/"
)
new_layer.add_remote_mag(new_mag)
new_layer.add_remote_mag(
    "http://localhost:9000/data/zarr/sample_organization/l4_sample/segmentation/4-4-1/"
)
small_dataset.add_remote_layer(
    "http://localhost:9000/data/zarr/sample_organization/l4_sample/segmentation",
    "another_segmentation",
)

Issues:

  • fixes #https://github.com/scalableminds/voxelytics/issues/3469 partially only

Todos:

Make sure to delete unnecessary points or to check all before merging:

  • [ ] Updated Changelog
  • [ ] Updated Documentation
  • [ ] Added / Updated Tests
  • [ ] Considered adding this to the Examples

MichaelBuessemeyer avatar Sep 04 '24 14:09 MichaelBuessemeyer

It seems to me like the libs already support adding new mags / layers to remote datasets. Therefore, maybe calling the new methods add_foreign_mag/layer might be better suited.

MichaelBuessemeyer avatar Sep 05 '24 15:09 MichaelBuessemeyer

Therefore, maybe calling the new methods add_foreign_mag/layer might be better suited.

@daniel-wer I am now renaming the methods to *_foreign_* as this makes more sense as wklibs already supports handling remote datasets, but the new feature itself is having a layer / mag from another foreign dataset

MichaelBuessemeyer avatar Sep 10 '24 12:09 MichaelBuessemeyer

Should @markbader review this, maybe?

daniel-wer avatar Sep 11 '24 09:09 daniel-wer

Should @markbader review this, maybe?

Yeah maybe. to me it looks like a state for a first review :)

MichaelBuessemeyer avatar Sep 11 '24 11:09 MichaelBuessemeyer

It seems to me like the libs already support adding new mags / layers to remote datasets. Therefore, maybe calling the new methods add_foreign_mag/layer might be better suited.

Can you clarify why you renamed this? As far as I understand, the libs support opening remote datasets and it's possible to add layers and mags to them. Do you think add_remote_mag could be misinterpreted to mean that this is the method to add a mag to a remote dataset? I would not expect this to be an issue since the method is not called add_mag_to_remote_.... Instead, I think it might be confusing users what "foreign" mags or layers are. Using the term "remote" would be more clear as that terminology is used in other parts of the API as well (i.e. remote datasets). What do you and @markbader think? :)

daniel-wer avatar Sep 11 '24 13:09 daniel-wer

Using the term "remote" would be more clear as that terminology is used in other parts of the API as well (i.e. remote datasets). What do you and @markbader think? :)

Hmmm :thinking:, maybe you are right. My reasoning behind this was that imo the word foreign expresses that the layer / mag has a different dataset from where it originates and does not reside inside the dataset's directory structure. This was also motivated because code-wise it is necessary to know whether a mag / layer is within the dataset's directory hierarchy or from a different dataset (foreign). => Therefore, I introduced is_foreign_path on the layer class. Because is_remote_layer would be semantically incorrect in the following scenario: I have a dataset on s3 at s3://my/first/dataset. Now I link a locally stored layer /srv/data/datasets/second_dataset/layer1 into that dataset. In this scenario, only the first datasets should be considered remote, as it is stored on s3. But the layer is stored locally in the wk instance and there having it as "remote" would be stange imo. Thus, calling it "foreign" makes more sense / is more correct imo. => And now, in case the method with which one would add the "foreign" layer would be called add_remote_layer would be false naming wise. And in case the layer would have a property called "is_remote" would be false for the added foreign layer.

At least that's what I think is making more sense. I know that I might be breaking wklibs terminology here, but being able to link layers and mags of complete other datasets into a dataset is also a completely new feature.

But yeah maybe @markbader also has a good option on this

MichaelBuessemeyer avatar Sep 11 '24 14:09 MichaelBuessemeyer

I'm not sure whether I fully understood, but are you saying the method you introduced add_remote/foreign_layer can be used to add a layer from the local file system to a remote dataset? I would have expected it to not matter where a dataset is actually saved. Doesn't wklibs usually abstract from that (using upath etc.). Instead, I would have expected add_remote/foreign_layer only to be used when actually adding a remote layer :thinking: Or did I misunderstand you?

daniel-wer avatar Sep 12 '24 10:09 daniel-wer

I'm not sure whether I fully understood, but are you saying the method you introduced add_remote/foreign_layer can be used to add a layer from the local file system to a remote dataset?

Yes, this is one (untested) way the new methods can be used.

I would have expected it to not matter where a dataset is actually saved.

Yes, that's the idea. But my reasoning was in a specific scenario where the dataset itself is remote and the added layer is a local (from disk) one.

Ok I think we have two different images of what is remote in mind: I think this image describes it very well: I think your understanding of remote is from the datasets perspective while mine is from the perspective of where wklibs is running. (sorry for the not consistent color coding, wanted to make the stretch pretty quick without wasting time)

image

Does that make sense? Honestly, I do not have a strong preference for which mental model to choose. Which was the one used before? yours @daniel-wer ?

MichaelBuessemeyer avatar Sep 12 '24 11:09 MichaelBuessemeyer

Thank you for explaining! I'll think out loud.

I think the current "mental model" is that remote datasets/layers/mags are those that are not on the "same server" (accessed via file system) as the webknossos datastore or where webknossos-libs runs. At least that is my understanding and reading of the docs. See for example, this section or this example. The use case we want to enable in voxelytics is to be able to predict on remote datasets/layers/mags which in turn requires to be able to add the remote layers/mags to local datasets. To do so, I still find the term add_remote_layer/mag well aligned with the current mental model. You want to enable another use case, namely adding a local layer to a remote dataset. I guess that's probably a little more difficult than the first use case, because the remote dataset does not have a representation on the local file system, but rather the layer is added in the webknossos database, right? Does that require to label the layer as remote or foreign? Sorry I'm not familiar with that code.

In any case, maybe we don't overcomplicate things and focus on the first use case or is there an issue for the second one as well?

daniel-wer avatar Sep 12 '24 13:09 daniel-wer

@MichaelBuessemeyer What do you think about my last comment?

If possible, I'd like to get this merged and pulled into voxelytics in the next week, before I'm out of office :) Let me know if there are any open discussions or other blockers for you!

daniel-wer avatar Sep 19 '24 20:09 daniel-wer

To do so, I still find the term add_remote_layer/mag well aligned with the current mental model.

In this specific scenario, I agree with you. But I think the add remote /foreign layer / mag is more mighty than this. At least it has the potential add layers located on any server or accessible file system.

You want to enable another use case, namely adding a local layer to a remote dataset. I guess that's probably a little more difficult than the first use case, because the remote dataset does not have a representation on the local file system, but rather the layer is added in the webknossos database, right? Does that require to label the layer as remote or foreign? Sorry I'm not familiar with that code.

Yeah that's definitely more complicated. What happens is that wklibs updates the datasource-properties.json which include the path to each mag. Thus, the wk server & wklibs should be able to access layers from other "remote" datasets.

In any case, maybe we don't overcomplicate things and focus on the first use case or is there an issue for the second one as well?

sure :+1: Done

MichaelBuessemeyer avatar Sep 20 '24 10:09 MichaelBuessemeyer

@MichaelBuessemeyer @markbader Thank you for your work on this! I just saw that webknossos.utils includes an is_fs_path method. Could you clarify whether that's the exact opposite of the is_remote_path method (meaning the one returns true iff the other returns false and vice versa)? If so, one could implement one using the other to make that more clear, and it would also simplify/unify some imports in voxelytics if one of the two methods could be used exclusively.

daniel-wer avatar Sep 24 '24 19:09 daniel-wer

@MichaelBuessemeyer @markbader Thank you for your work on this! I just saw that webknossos.utils includes an is_fs_path method. Could you clarify whether that's the exact opposite of the is_remote_path method (meaning the one returns true iff the other returns false and vice versa)? If so, one could implement one using the other to make that more clear, and it would also simplify/unify some imports in voxelytics if one of the two methods could be used exclusively.

I'd say they are not strictly the opposite of each other as there are more implementations of upath which are not checked in either of the methods. See the list of official upath implementations from the gh repo

image

But I think we only use the few implementations that are checked in is_fs_path and is_remote_path. So I'll give it a try to define is_remote_path as the bool'sche negation of is_fs_path and lets see what the ci says. (Although the ci currently fails due to unknown reasons to me :/

@markbader Maybe you have an Idea why this is failing :see_no_evil:?

MichaelBuessemeyer avatar Sep 25 '24 06:09 MichaelBuessemeyer

But I think we only use the few implementations that are checked in is_fs_path and is_remote_path. So I'll give it a try to define is_remote_path as the bool'sche negation of is_fs_path and lets see what the ci says. (Although the ci currently fails due to unknown reasons to me :/

Ok this should work (at least the ci still fails at the same point as before) Here is the CI error:

FAILED tests/dataset/test_dataset.py::test_ome_ngff_metadata[output_path1] - FileNotFoundError: https://ngff.openmicroscopy.org/0.4/schemas/image.schema

Strangely tests/dataset/test_dataset.py::test_ome_ngff_metadata[output_path0] does not fail :woozy_face:

MichaelBuessemeyer avatar Sep 25 '24 06:09 MichaelBuessemeyer

I get the same CI error in https://github.com/scalableminds/webknossos-libs/pull/1196 since yesterday, Both output_path0 and output_path1 fail but they are splitted in the CI (so output_path0 fails in webknossos_linux(3.12,1) and output_path1 in webknossos_linux(3.12, 3)). I don't know where this error comes from, but I think it is unrelated to the failing tests that you wrote. I had this json decode error before and thought they might be caused by vcr-py. But I didn't found a good way for debugging this.

markbader avatar Sep 25 '24 07:09 markbader

I also think, this might be related to vcr-py. All the more reason to get rid of it. Can we skip these tests for now and move on with merging?

normanrz avatar Sep 25 '24 11:09 normanrz

@MichaelBuessemeyer @markbader Instead of commenting out the tests you can use this pytest decorator and also include the reason for the skip :)

@pytest.mark.skip(
    reason="The test is flaky on the CI <for some reason>. Disable it for now."
)

daniel-wer avatar Sep 25 '24 11:09 daniel-wer

I also think, this might be related to vcr-py. All the more reason to get rid of it. Can we skip these tests for now and move on with merging?

ok on it :)

MichaelBuessemeyer avatar Sep 25 '24 12:09 MichaelBuessemeyer

Now FAILED tests/dataset/test_add_layer_from_images.py::test_test_images[8909407-embedded_NCI_mono_matrigelcollagen_docetaxel_day10_sample10.czi-kwargs4-uint16-1-size4] fails. I tested locally and it works :shrug:

I'll give it a retry and if it doesn't work I'll have to disable this as well I guess :/

MichaelBuessemeyer avatar Sep 25 '24 14:09 MichaelBuessemeyer

@MichaelBuessemeyer Sorry, I would have wanted to add a changelog entry before merging, but bulldozer was too eager. We've already spoken to Norman to reconfigure bulldozer to only merge when the automerge label is active. This is somehow misconfigured in this repo.

daniel-wer avatar Sep 26 '24 11:09 daniel-wer