nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

Improve charliecloud support

Open nschan opened this issue 1 year ago • 16 comments

I have tried to address issues with charliecloud.

https://github.com/nextflow-io/nextflow/issues/3566 Should be adressed by adding a lockfile to the parent directory of the cache. Adding lockfiles to the cache is prohibited by charliecloud 7c50942

https://github.com/nextflow-io/nextflow/issues/4463 I suggest to create a temporary container in the work directory from the storage 60539d5. This container can then be used with -w. It would be nicer if it was a squashfs instead of a directory, but I have not managed to build charliecloud with squashfs support.

nf-core moved to a container-registry and container-name approach a while ago, charliecloud did not support this (which broke nf-core with charliecloud). This should be addressed in cb3fb02

I have compiled nextflow with these commits and did a successful test run with one of my pipelines. I did not test the registry functionality with nf-core since the current nf-core configs do not support charliecloud.

nschan avatar Feb 02 '24 13:02 nschan

Deploy Preview for nextflow-docs-staging ready!

Name Link
Latest commit a550e52f3261e7c85da51ea56bd2d36647d8c9fc
Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/660d9e5919e7fb0008bcd771
Deploy Preview https://deploy-preview-4712--nextflow-docs-staging.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Feb 02 '24 13:02 netlify[bot]

The tests fail because the local container is not called e.g. busybox or ubuntu but container_busybox or container_ubuntu. This could be changed, I was mainly worried about problems arising from having a local container with the same name as a container in the storage. I do not know if that is a real problem though. I have also managed to build [email protected], but not older versions, with squashfs. With this, one could have the task specific container as a .sqfs file, instead of a directory. I think having a squashfs container in the work directory would be preferable to having a directory in the work directory. Do you have any opinions on this?

Edit: Since my HPC environment is running an old kernel, using squashfs is not an option for me (similar to --write-fake), so I would prefer to stick with a directory.

nschan avatar Feb 06 '24 09:02 nschan

I do not really know why the tests for the Cache fail. The only changes I made were adding the registry, and changing where the lockfile is placed, but I can't tell which of theses is causing problems.

nschan avatar Feb 06 '24 09:02 nschan

I think I fixed the tests for the Builder, but the Cache test will probably still fail. @phue do you maybe have an idea why this happens, I assume you wrote the cache test? In my nextflow build based on this branch the container pulls work fine, so I am not really sure whats happening here..

nschan avatar Feb 06 '24 13:02 nschan

91 files changed. This PR looks messed

pditommaso avatar Feb 06 '24 13:02 pditommaso

91 files changed. This PR looks messed

Yeah, this happened when I rebased this morning, I think the nextflow master branch changed? I can close and reopen a cleaner PR if you prefer.

nschan avatar Feb 06 '24 13:02 nschan

A push -f should fix, provided you have solved the problem in your branch

pditommaso avatar Feb 06 '24 13:02 pditommaso

Thanks, I think it is back to what it should be now.

nschan avatar Feb 06 '24 13:02 nschan

I realized that the previous commits did not handle cases where $CH_IMAGE_STORAGE was unset. This is better now. My problem is that I do not know how to properly adapt the test.

nschan avatar Feb 07 '24 10:02 nschan

I have explored the reason for the failing tests some more locally.

Builder should get run command and should get the ch-run command line

I think the builder test fail because I am expecting a path to an image, to be able to run convert, but what the test provides is an image name. I am not sure how to fix this. Even if I understood what is happening well enough to fix this, the test would still fail: I am extracting the cache directory from the image path when running convert. I do not know how to make sure that the test contains the exact directory that will be in the builder.

Cache should return cached image and should run ch-image pull command Problem is: Cannot invoke "java.util.Map.get(Object)" because "this.config" is null. I assume that this is because I have added registry which is not in the config, I did not change much else. I do not really know where the config comes from, so I am having trouble resolving this.

I would be very happy for some guidance on how to solve this. Overall the proposed changes lead to a functional charliecloud driver, irrespective of $CH_IMAGE_STORAGE. For me personally I can drop the addition of registry, but it would be nice for compatibility with nf-core pipelines.

nschan avatar Feb 07 '24 16:02 nschan

#3566 is not fixed by my changes. This still leads to a java.nio.channels.OverlappingFileLockException I assume that this relates to nextflow.file.FileMutex. I assume that multiple processes manage to lock the file, which I guess counters the idea of a FileMutex. Why this even possible I do not know.

nschan avatar Feb 08 '24 12:02 nschan

I am unable to come up with a solution using a lock file. However, what seems to work well in my tests is simply waiting and retrying if the pull fails. Currently I am waiting for 30 seconds and retrying a maximum of 5 times per image. This is of course less elegant than using a lockfile, but it does the job.

nschan avatar Feb 08 '24 16:02 nschan

I have now added in optional squashfs (useSquash) image support for the process-specific image. Since I have no squashfs compatible system I cant test this, but it is a very simple switch.

To summarize:

  • Concurrent pulls no longer happen (https://github.com/nextflow-io/nextflow/issues/3566)
  • Task-specific images are created for each task from the storage directory, by default these are directories, but an option for squashfs images was added (https://github.com/nextflow-io/nextflow/issues/4463). This circumvents the issue of running a stored image with -w. I think this is approach is generally better, since it actually completely isolates the individual nextflow processes. When nextflow processes run in the same writeable image / container one process could alter the image, affecting the others. In addition any changes to the image are permanently affecting the stored image, which can cause problem in cases where something needs to be done to the image as part of the process.
  • Added support for registry for compatibility with nf-core pipelines

I need help to adapt the charliecloud tests, the required changes are probably minor but beyond my abilities.

nschan avatar Feb 09 '24 12:02 nschan

I think this is now complete.

charliecloudBuilder now understands additional parameters: registry: see earlier comments, adds support for registry as used by nf-core writeFake: make use of --write-fake useSquash: use squashFs for the process image, default behaviour is creating a directory.

I have fixed the tests for the CharliecloudBuilder, they pass now. I have added tests for writeFake and useSquash.

nschan avatar Feb 12 '24 10:02 nschan

Is there anything I should add to this PR? @pditommaso

nschan avatar Feb 15 '24 08:02 nschan

Thanks for running the tests @bentsherman . I do not know why test_docs fails. I don't see how the step that fails (https://github.com/nextflow-io/nextflow/actions/runs/7887508884/job/21614649213#step:5:465) would be related to the changes I made to fix charliecloud support. If I am missing something and its somehow related to my changes please let me know.

nschan avatar Feb 16 '24 08:02 nschan

It was a recently added test, must be flaky because now it passed

bentsherman avatar Feb 22 '24 14:02 bentsherman

Since this PR has now been open for more than a month, I am wondering if there is anything I should add or change to facilitate merging?

nschan avatar Mar 21 '24 12:03 nschan

These failing tests appear to come from plugins:nf-google, I don't think they are related to the changes to the charliecloud driver, but if they are I would probably need some advice on whats going on...

nschan avatar Mar 21 '24 14:03 nschan

Indeed , very strange error

groovy.json.JsonException: Unable to determine the current character, it is not a string, number, array, or object

The current character read is '?' with an int value of 0
Unable to determine the current character, it is not a string, number, array, or object
line number 1
index number 255
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
...............................................................................................................................................................................................................................................................^
	at org.apache.groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:214)
	at org.apache.groovy.json.internal.JsonParserCharArray.decodeValue(JsonParserCharArray.java:165)
	at org.apache.groovy.json.internal.JsonParserCharArray.decodeFromChars(JsonParserCharArray.java:45)
	at org.apache.groovy.json.internal.JsonParserCharArray.parse(JsonParserCharArray.java:395)
	at org.apache.groovy.json.internal.BaseJsonParser.parse(BaseJsonParser.java:139)
	at org.apache.groovy.json.internal.BaseJsonParser.parse(BaseJsonParser.java:165)
	at groovy.json.JsonSlurper.parseFile(JsonSlurper.java:386)
	at groovy.json.JsonSlurper.parse(JsonSlurper.java:369)
	at nextflow.cloud.google.GoogleOpts.getProjectIdFromCreds(GoogleOpts.groovy:96)
	at nextflow.cloud.google.GoogleOpts.create(GoogleOpts.groovy:122)
	at nextflow.cloud.google.batch.client.BatchConfig.create(BatchConfig.groovy:67)
	at nextflow.cloud.google.batch.client.BatchConfigTest.should create batch config(BatchConfigTest.groovy:41)

pditommaso avatar Mar 21 '24 15:03 pditommaso

I have merged recent master into this, maybe that solves the failing tests..

nschan avatar Mar 25 '24 15:03 nschan

Do you think it would be worth opening a new pull request from a new branch, including only the changes for charliecloud? I am not sure if one of the unrelated commits that somehow ended up in this PR on Feb 12 could be causing the problem, or are the tests generally failing right now?

nschan avatar Mar 27 '24 12:03 nschan

Really not understanding google tests are failing here. @bentsherman any idea ?

pditommaso avatar Mar 27 '24 12:03 pditommaso

not entirely sure, but BatchConfigTest was created recently:

https://github.com/nextflow-io/nextflow/blob/master/plugins/nf-google/src/test/nextflow/cloud/google/batch/client/BatchConfigTest.groovy#L41

and from the file history looks like you tried to fix some google tests last week, likely something is still broken with that

also does BatchConfigTest actually need credentials?

bentsherman avatar Mar 27 '24 16:03 bentsherman

Start to think if it's caused by the lack of GOOGLE_SECRET

https://github.com/nextflow-io/nextflow/blob/c9c7032c2e34132cf721ffabfea09d893adf3761/.github/workflows/build.yml#L91-L91

pditommaso avatar Mar 27 '24 16:03 pditommaso

that would do it if true

bentsherman avatar Mar 27 '24 16:03 bentsherman

Google tests are still failing 🤦‍♂️

bentsherman avatar Mar 28 '24 13:03 bentsherman

I have created a new branch from current master. If you think it makes sense to try with a new PR I can open a new one and close this. I am unsure what exactly the many commits I introduced from master on Feb 12 were and if those may have created some non-obvious problems with the tests, although the tests passed on Feb 22..

nschan avatar Mar 28 '24 14:03 nschan

You can try but I don't think it will make a difference. You are up to date with master and three are no extraneous changes in your PR, so a new PR will have the same problem

bentsherman avatar Mar 28 '24 15:03 bentsherman

@pditommaso as a temp solution I would be keen to disable the failing google tests so that these other PRs are not blocked

bentsherman avatar Mar 28 '24 15:03 bentsherman