webknossos-libs
webknossos-libs copied to clipboard
Add option to add remote mags and remote layer to an existing dataset
Description:
This PR adds option to add remote mags and remote layer to an existing dataset via dataset.add_remote_layer(layer / URL, layer_name and `layer.add_remote_mag(mag / URL).
When using add_remote_mag make sure that the layer to which the mag is added has matching dtype, category and so on.
Testing:
- I wrote tests for this, but the tests working with the plain URLs somehow do not work. Maybe I need to configure / code the tests differently :see_no_evil: I used this script for local testing. Run a local wk to make this work. Else change the local host URLs to webknossos.org urls.
small_dataset = Dataset.open(<local dataset path>)
new_mag = MagView._ensure_mag_view(
"http://localhost:9000/data/zarr/sample_organization/l4_sample/segmentation/1"
)
new_layer = small_dataset.add_layer(
"test_remote_layer",
"segmentation",
data_format=new_mag.info.data_format,
dtype_per_channel=new_mag.get_dtype(),
)
new_layer.add_remote_mag(new_mag)
new_mag = MagView._ensure_mag_view(
"http://localhost:9000/data/zarr/sample_organization/l4_sample/segmentation/2-2-1/"
)
new_layer.add_remote_mag(new_mag)
new_layer.add_remote_mag(
"http://localhost:9000/data/zarr/sample_organization/l4_sample/segmentation/4-4-1/"
)
small_dataset.add_remote_layer(
"http://localhost:9000/data/zarr/sample_organization/l4_sample/segmentation",
"another_segmentation",
)
Issues:
- fixes #https://github.com/scalableminds/voxelytics/issues/3469 partially only
Todos:
Make sure to delete unnecessary points or to check all before merging:
- [ ] Updated Changelog
- [ ] Updated Documentation
- [ ] Added / Updated Tests
- [ ] Considered adding this to the Examples
It seems to me like the libs already support adding new mags / layers to remote datasets. Therefore, maybe calling the new methods add_foreign_mag/layer might be better suited.
Therefore, maybe calling the new methods add_foreign_mag/layer might be better suited.
@daniel-wer I am now renaming the methods to *_foreign_* as this makes more sense as wklibs already supports handling remote datasets, but the new feature itself is having a layer / mag from another foreign dataset
Should @markbader review this, maybe?
Should @markbader review this, maybe?
Yeah maybe. to me it looks like a state for a first review :)
It seems to me like the libs already support adding new mags / layers to remote datasets. Therefore, maybe calling the new methods add_foreign_mag/layer might be better suited.
Can you clarify why you renamed this? As far as I understand, the libs support opening remote datasets and it's possible to add layers and mags to them. Do you think add_remote_mag could be misinterpreted to mean that this is the method to add a mag to a remote dataset? I would not expect this to be an issue since the method is not called add_mag_to_remote_.... Instead, I think it might be confusing users what "foreign" mags or layers are. Using the term "remote" would be more clear as that terminology is used in other parts of the API as well (i.e. remote datasets). What do you and @markbader think? :)
Using the term "remote" would be more clear as that terminology is used in other parts of the API as well (i.e. remote datasets). What do you and @markbader think? :)
Hmmm :thinking:, maybe you are right. My reasoning behind this was that imo the word foreign expresses that the layer / mag has a different dataset from where it originates and does not reside inside the dataset's directory structure. This was also motivated because code-wise it is necessary to know whether a mag / layer is within the dataset's directory hierarchy or from a different dataset (foreign). => Therefore, I introduced is_foreign_path on the layer class. Because is_remote_layer would be semantically incorrect in the following scenario:
I have a dataset on s3 at s3://my/first/dataset. Now I link a locally stored layer /srv/data/datasets/second_dataset/layer1 into that dataset. In this scenario, only the first datasets should be considered remote, as it is stored on s3. But the layer is stored locally in the wk instance and there having it as "remote" would be stange imo. Thus, calling it "foreign" makes more sense / is more correct imo.
=> And now, in case the method with which one would add the "foreign" layer would be called add_remote_layer would be false naming wise. And in case the layer would have a property called "is_remote" would be false for the added foreign layer.
At least that's what I think is making more sense. I know that I might be breaking wklibs terminology here, but being able to link layers and mags of complete other datasets into a dataset is also a completely new feature.
But yeah maybe @markbader also has a good option on this
I'm not sure whether I fully understood, but are you saying the method you introduced add_remote/foreign_layer can be used to add a layer from the local file system to a remote dataset?
I would have expected it to not matter where a dataset is actually saved. Doesn't wklibs usually abstract from that (using upath etc.). Instead, I would have expected add_remote/foreign_layer only to be used when actually adding a remote layer :thinking:
Or did I misunderstand you?
I'm not sure whether I fully understood, but are you saying the method you introduced add_remote/foreign_layer can be used to add a layer from the local file system to a remote dataset?
Yes, this is one (untested) way the new methods can be used.
I would have expected it to not matter where a dataset is actually saved.
Yes, that's the idea. But my reasoning was in a specific scenario where the dataset itself is remote and the added layer is a local (from disk) one.
Ok I think we have two different images of what is remote in mind: I think this image describes it very well: I think your understanding of remote is from the datasets perspective while mine is from the perspective of where wklibs is running. (sorry for the not consistent color coding, wanted to make the stretch pretty quick without wasting time)
Does that make sense? Honestly, I do not have a strong preference for which mental model to choose. Which was the one used before? yours @daniel-wer ?
Thank you for explaining! I'll think out loud.
I think the current "mental model" is that remote datasets/layers/mags are those that are not on the "same server" (accessed via file system) as the webknossos datastore or where webknossos-libs runs. At least that is my understanding and reading of the docs. See for example, this section or this example. The use case we want to enable in voxelytics is to be able to predict on remote datasets/layers/mags which in turn requires to be able to add the remote layers/mags to local datasets. To do so, I still find the term add_remote_layer/mag well aligned with the current mental model. You want to enable another use case, namely adding a local layer to a remote dataset. I guess that's probably a little more difficult than the first use case, because the remote dataset does not have a representation on the local file system, but rather the layer is added in the webknossos database, right? Does that require to label the layer as remote or foreign? Sorry I'm not familiar with that code.
In any case, maybe we don't overcomplicate things and focus on the first use case or is there an issue for the second one as well?
@MichaelBuessemeyer What do you think about my last comment?
If possible, I'd like to get this merged and pulled into voxelytics in the next week, before I'm out of office :) Let me know if there are any open discussions or other blockers for you!
To do so, I still find the term add_remote_layer/mag well aligned with the current mental model.
In this specific scenario, I agree with you. But I think the add remote /foreign layer / mag is more mighty than this. At least it has the potential add layers located on any server or accessible file system.
You want to enable another use case, namely adding a local layer to a remote dataset. I guess that's probably a little more difficult than the first use case, because the remote dataset does not have a representation on the local file system, but rather the layer is added in the webknossos database, right? Does that require to label the layer as remote or foreign? Sorry I'm not familiar with that code.
Yeah that's definitely more complicated. What happens is that wklibs updates the datasource-properties.json which include the path to each mag. Thus, the wk server & wklibs should be able to access layers from other "remote" datasets.
In any case, maybe we don't overcomplicate things and focus on the first use case or is there an issue for the second one as well?
sure :+1: Done
@MichaelBuessemeyer @markbader Thank you for your work on this!
I just saw that webknossos.utils includes an is_fs_path method. Could you clarify whether that's the exact opposite of the is_remote_path method (meaning the one returns true iff the other returns false and vice versa)? If so, one could implement one using the other to make that more clear, and it would also simplify/unify some imports in voxelytics if one of the two methods could be used exclusively.
@MichaelBuessemeyer @markbader Thank you for your work on this! I just saw that webknossos.utils includes an is_fs_path method. Could you clarify whether that's the exact opposite of the is_remote_path method (meaning the one returns true iff the other returns false and vice versa)? If so, one could implement one using the other to make that more clear, and it would also simplify/unify some imports in voxelytics if one of the two methods could be used exclusively.
I'd say they are not strictly the opposite of each other as there are more implementations of upath which are not checked in either of the methods. See the list of official upath implementations from the gh repo
But I think we only use the few implementations that are checked in is_fs_path and is_remote_path. So I'll give it a try to define is_remote_path as the bool'sche negation of is_fs_path and lets see what the ci says. (Although the ci currently fails due to unknown reasons to me :/
@markbader Maybe you have an Idea why this is failing :see_no_evil:?
But I think we only use the few implementations that are checked in is_fs_path and is_remote_path. So I'll give it a try to define is_remote_path as the bool'sche negation of is_fs_path and lets see what the ci says. (Although the ci currently fails due to unknown reasons to me :/
Ok this should work (at least the ci still fails at the same point as before) Here is the CI error:
FAILED tests/dataset/test_dataset.py::test_ome_ngff_metadata[output_path1] - FileNotFoundError: https://ngff.openmicroscopy.org/0.4/schemas/image.schema
Strangely tests/dataset/test_dataset.py::test_ome_ngff_metadata[output_path0] does not fail :woozy_face:
I get the same CI error in https://github.com/scalableminds/webknossos-libs/pull/1196 since yesterday, Both output_path0 and output_path1 fail but they are splitted in the CI (so output_path0 fails in webknossos_linux(3.12,1) and output_path1 in webknossos_linux(3.12, 3)). I don't know where this error comes from, but I think it is unrelated to the failing tests that you wrote. I had this json decode error before and thought they might be caused by vcr-py. But I didn't found a good way for debugging this.
I also think, this might be related to vcr-py. All the more reason to get rid of it. Can we skip these tests for now and move on with merging?
@MichaelBuessemeyer @markbader Instead of commenting out the tests you can use this pytest decorator and also include the reason for the skip :)
@pytest.mark.skip(
reason="The test is flaky on the CI <for some reason>. Disable it for now."
)
I also think, this might be related to vcr-py. All the more reason to get rid of it. Can we skip these tests for now and move on with merging?
ok on it :)
Now
FAILED tests/dataset/test_add_layer_from_images.py::test_test_images[8909407-embedded_NCI_mono_matrigelcollagen_docetaxel_day10_sample10.czi-kwargs4-uint16-1-size4] fails. I tested locally and it works :shrug:
I'll give it a retry and if it doesn't work I'll have to disable this as well I guess :/
@MichaelBuessemeyer Sorry, I would have wanted to add a changelog entry before merging, but bulldozer was too eager. We've already spoken to Norman to reconfigure bulldozer to only merge when the automerge label is active. This is somehow misconfigured in this repo.