nextflow
nextflow copied to clipboard
Improve charliecloud support
I have tried to address issues with charliecloud.
https://github.com/nextflow-io/nextflow/issues/3566 Should be adressed by adding a lockfile to the parent directory of the cache. Adding lockfiles to the cache is prohibited by charliecloud 7c50942
https://github.com/nextflow-io/nextflow/issues/4463
I suggest to create a temporary container in the work directory from the storage 60539d5. This container can then be used with -w
. It would be nicer if it was a squashfs instead of a directory, but I have not managed to build charliecloud with squashfs support.
nf-core moved to a container-registry and container-name approach a while ago, charliecloud did not support this (which broke nf-core with charliecloud). This should be addressed in cb3fb02
I have compiled nextflow with these commits and did a successful test run with one of my pipelines. I did not test the registry functionality with nf-core since the current nf-core configs do not support charliecloud.
Deploy Preview for nextflow-docs-staging ready!
Name | Link |
---|---|
Latest commit | a550e52f3261e7c85da51ea56bd2d36647d8c9fc |
Latest deploy log | https://app.netlify.com/sites/nextflow-docs-staging/deploys/660d9e5919e7fb0008bcd771 |
Deploy Preview | https://deploy-preview-4712--nextflow-docs-staging.netlify.app |
Preview on mobile | Toggle QR Code...Use your smartphone camera to open QR code link. |
To edit notification comments on pull requests, go to your Netlify site configuration.
The tests fail because the local container is not called e.g. busybox
or ubuntu
but container_busybox
or container_ubuntu
. This could be changed, I was mainly worried about problems arising from having a local container with the same name as a container in the storage. I do not know if that is a real problem though.
I have also managed to build [email protected], but not older versions, with squashfs. With this, one could have the task specific container as a .sqfs file, instead of a directory. I think having a squashfs container in the work directory would be preferable to having a directory in the work directory.
Do you have any opinions on this?
Edit: Since my HPC environment is running an old kernel, using squashfs is not an option for me (similar to --write-fake), so I would prefer to stick with a directory.
I do not really know why the tests for the Cache fail. The only changes I made were adding the registry, and changing where the lockfile is placed, but I can't tell which of theses is causing problems.
I think I fixed the tests for the Builder, but the Cache test will probably still fail. @phue do you maybe have an idea why this happens, I assume you wrote the cache test? In my nextflow build based on this branch the container pulls work fine, so I am not really sure whats happening here..
91 files changed. This PR looks messed
91 files changed. This PR looks messed
Yeah, this happened when I rebased this morning, I think the nextflow master branch changed? I can close and reopen a cleaner PR if you prefer.
A push -f
should fix, provided you have solved the problem in your branch
Thanks, I think it is back to what it should be now.
I realized that the previous commits did not handle cases where $CH_IMAGE_STORAGE
was unset. This is better now.
My problem is that I do not know how to properly adapt the test.
I have explored the reason for the failing tests some more locally.
Builder
should get run command
and should get the ch-run command line
I think the builder test fail because I am expecting a path to an image, to be able to run convert, but what the test provides is an image name. I am not sure how to fix this. Even if I understood what is happening well enough to fix this, the test would still fail: I am extracting the cache directory from the image path when running convert. I do not know how to make sure that the test contains the exact directory that will be in the builder.
Cache
should return cached image
and should run ch-image pull command
Problem is: Cannot invoke "java.util.Map.get(Object)" because "this.config" is null
. I assume that this is because I have added registry
which is not in the config, I did not change much else. I do not really know where the config comes from, so I am having trouble resolving this.
I would be very happy for some guidance on how to solve this. Overall the proposed changes lead to a functional charliecloud driver, irrespective of $CH_IMAGE_STORAGE
.
For me personally I can drop the addition of registry, but it would be nice for compatibility with nf-core pipelines.
#3566 is not fixed by my changes. This still leads to a java.nio.channels.OverlappingFileLockException
I assume that this relates to nextflow.file.FileMutex
. I assume that multiple processes manage to lock the file, which I guess counters the idea of a FileMutex. Why this even possible I do not know.
I am unable to come up with a solution using a lock file. However, what seems to work well in my tests is simply waiting and retrying if the pull fails. Currently I am waiting for 30 seconds and retrying a maximum of 5 times per image. This is of course less elegant than using a lockfile, but it does the job.
I have now added in optional squashfs (useSquash) image support for the process-specific image. Since I have no squashfs compatible system I cant test this, but it is a very simple switch.
To summarize:
- Concurrent pulls no longer happen (https://github.com/nextflow-io/nextflow/issues/3566)
- Task-specific images are created for each task from the storage directory, by default these are directories, but an option for squashfs images was added (https://github.com/nextflow-io/nextflow/issues/4463). This circumvents the issue of running a stored image with
-w
. I think this is approach is generally better, since it actually completely isolates the individual nextflow processes. When nextflow processes run in the same writeable image / container one process could alter the image, affecting the others. In addition any changes to the image are permanently affecting the stored image, which can cause problem in cases where something needs to be done to the image as part of the process. - Added support for
registry
for compatibility with nf-core pipelines
I need help to adapt the charliecloud tests, the required changes are probably minor but beyond my abilities.
I think this is now complete.
charliecloudBuilder now understands additional parameters:
registry
: see earlier comments, adds support for registry as used by nf-core
writeFake
: make use of --write-fake
useSquash
: use squashFs for the process image, default behaviour is creating a directory.
I have fixed the tests for the CharliecloudBuilder, they pass now. I have added tests for writeFake
and useSquash
.
Is there anything I should add to this PR? @pditommaso
Thanks for running the tests @bentsherman . I do not know why test_docs
fails. I don't see how the step that fails (https://github.com/nextflow-io/nextflow/actions/runs/7887508884/job/21614649213#step:5:465) would be related to the changes I made to fix charliecloud support. If I am missing something and its somehow related to my changes please let me know.
It was a recently added test, must be flaky because now it passed
Since this PR has now been open for more than a month, I am wondering if there is anything I should add or change to facilitate merging?
These failing tests appear to come from plugins:nf-google
, I don't think they are related to the changes to the charliecloud driver, but if they are I would probably need some advice on whats going on...
Indeed , very strange error
groovy.json.JsonException: Unable to determine the current character, it is not a string, number, array, or object
The current character read is '?' with an int value of 0
Unable to determine the current character, it is not a string, number, array, or object
line number 1
index number 255
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
...............................................................................................................................................................................................................................................................^
at org.apache.groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:214)
at org.apache.groovy.json.internal.JsonParserCharArray.decodeValue(JsonParserCharArray.java:165)
at org.apache.groovy.json.internal.JsonParserCharArray.decodeFromChars(JsonParserCharArray.java:45)
at org.apache.groovy.json.internal.JsonParserCharArray.parse(JsonParserCharArray.java:395)
at org.apache.groovy.json.internal.BaseJsonParser.parse(BaseJsonParser.java:139)
at org.apache.groovy.json.internal.BaseJsonParser.parse(BaseJsonParser.java:165)
at groovy.json.JsonSlurper.parseFile(JsonSlurper.java:386)
at groovy.json.JsonSlurper.parse(JsonSlurper.java:369)
at nextflow.cloud.google.GoogleOpts.getProjectIdFromCreds(GoogleOpts.groovy:96)
at nextflow.cloud.google.GoogleOpts.create(GoogleOpts.groovy:122)
at nextflow.cloud.google.batch.client.BatchConfig.create(BatchConfig.groovy:67)
at nextflow.cloud.google.batch.client.BatchConfigTest.should create batch config(BatchConfigTest.groovy:41)
I have merged recent master into this, maybe that solves the failing tests..
Do you think it would be worth opening a new pull request from a new branch, including only the changes for charliecloud? I am not sure if one of the unrelated commits that somehow ended up in this PR on Feb 12 could be causing the problem, or are the tests generally failing right now?
Really not understanding google tests are failing here. @bentsherman any idea ?
not entirely sure, but BatchConfigTest
was created recently:
https://github.com/nextflow-io/nextflow/blob/master/plugins/nf-google/src/test/nextflow/cloud/google/batch/client/BatchConfigTest.groovy#L41
and from the file history looks like you tried to fix some google tests last week, likely something is still broken with that
also does BatchConfigTest
actually need credentials?
Start to think if it's caused by the lack of GOOGLE_SECRET
https://github.com/nextflow-io/nextflow/blob/c9c7032c2e34132cf721ffabfea09d893adf3761/.github/workflows/build.yml#L91-L91
that would do it if true
Google tests are still failing 🤦♂️
I have created a new branch from current master. If you think it makes sense to try with a new PR I can open a new one and close this. I am unsure what exactly the many commits I introduced from master on Feb 12 were and if those may have created some non-obvious problems with the tests, although the tests passed on Feb 22..
You can try but I don't think it will make a difference. You are up to date with master and three are no extraneous changes in your PR, so a new PR will have the same problem
@pditommaso as a temp solution I would be keen to disable the failing google tests so that these other PRs are not blocked