Add Universc
Add the UniverSC single-cell pipeline
Implements a completely open-source single-cell RNA pipeline that supports > 30 technologies. Uses a publicly available Docker container with all dependencies installed.
PR checklist
Closes #1644
- [x] This comment contains a description of changes (with reason).
- [x] If you've fixed a bug or added code that should be tested, add tests!
- [x] If you've added a new tool - have you followed the module conventions in the contribution docs
- [x] If necessary, include test data in your PR.
- [x] Remove all TODO statements.
- [x] Emit the `versions.yml` file.
- [x] Follow the naming conventions.
- [x] Follow the parameters requirements.
- [x] Follow the input/output options guidelines.
- [x] Add a resource `label`
- [ ] Use BioConda and BioContainers if possible to fulfil software requirements.
- Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
  - [x] `PROFILE=docker pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware`
  - [x] `PROFILE=singularity pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware`
  - [ ] `PROFILE=conda pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware`
- [x]
I'll share updates to the UniverSC module for feedback. It still needs tests but inputs and documentation have been updated.
Note this module requires a docker container and a conda environment is not configured yet.
Hi @TomKellyGenetics, thank you for this PR! We are merging as many modules as possible now due to an impending restructuring of the entire repo. This will mean you will need to update the module to reflect these changes before it can be merged in the future. It appears that this module isn't ready to be merged so, if applicable, we are converting it to draft and adding the WIP label. If this isn't the case please let us know and we will try to get the module in before the changes. Thanks again!
Hi @JoseEspinosa thanks for letting me know. I'm ok to mark it as Draft/WIP as it is but it is almost ready to review I think.
What is the timeframe for the restructuring? Our publication has been accepted and is at the proofing stage, so it would be good timing to add the tool to nf-core at the same time.
I still need to test the code but it is very similar to Cell Ranger so I wouldn't anticipate many problems there. The main reason for slow progress is I've changed jobs and I don't have much time to spend on this project anymore. I'd like to complete it anyway if possible. Thanks for your patience.
No worries at all! 😄 The restructuring was planned for the end of this week or the beginning of next week. If you can't do it you can always adapt the module to this change after the changes. The change consists mainly in a change in the structure of the directories of the repository so that subworkflows can be tested consistently, so should not be a big deal anyway.
Thanks for clarifying. I was wondering if templates would change making it problematic to use older ones. I'll try resolving the linting syntax errors and testing it. If the finishing touches are not done in time, it should be no problem to move files on my branch as necessary to avoid conflicts.
I think the only change will be where modules are included from in the test main.nf. Anyhow, as soon as you fix the linting and the tests, please tag it as ready for review so that we are aware and it can eventually be included in the repo before the restructuring takes place.
Thanks for the explanation @JoseEspinosa, I'll update tests on the current version first and then pull updates and merge the next version from nf-core.
As you can see above, I've resolved linting issues and GitHub Actions passes on these. I've updated the test jobs on our podman node and it now starts without syntax errors. I checked the following commands in the repo root directory with nextflow version 22.04.5.
nextflow run ./tests/modules/cellranger/count -entry test_cellranger_count -c ./tests/config/nextflow.config -c ./tests/modules/cellranger/count/nextflow.config
nextflow run tests/modules/universc -entry test_cellranger_10x -c tests/config/nextflow.config
nextflow run tests/modules/universc -entry test_universc_10x -c tests/config/nextflow.config
Currently with local testing, the UniverSC module gives the same "OCI runtime error" (when running with podman) as the already merged Cell Ranger tests. I've pushed it to GitHub to pull to another node with rootless Docker installed. This should allow testing without this error.
I built the UniverSC container with podman build --format docker on another server but it seems to still have problems with podman. Do you know if podman can be supported in addition to docker and conda for nextflow? I'll continue testing but I think it is almost ready to review.
I confirmed that the Cell Ranger test can pass without errors running locally on the Docker node. I'll need more time to set up the UniverSC test runs but it should be easier now I have an environment set up to run successful test jobs on similar tools.
It seems one of the linting jobs that previously passed is now failing. Is this due to the reorganised directory structure? https://github.com/nf-core/modules/actions/runs/3202912488/jobs/5232397083
I would say yes. Note that all the modules have been moved from modules/foo/main.nf to modules/nf-core/foo/main.nf. See here. So the lint is trying to find universc in modules/nf-core/universc/main.nf and that is why it is failing. You will need to restructure your module too and check that all the includes point to the new path.
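To illustrate the move, here is a minimal shell sketch of the path mapping. The helper name `new_module_path` is hypothetical and not part of nf-core tooling, and the `tests/modules/...` case is an assumption based on the test paths used elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core tooling): print where a module or
# test path lives after the restructuring, by inserting an "nf-core" level.
new_module_path() {
    local old="$1"
    case "$old" in
        modules/nf-core/*|tests/modules/nf-core/*)
            printf '%s\n' "$old" ;;   # already in the new layout
        tests/modules/*)
            printf 'tests/modules/nf-core/%s\n' "${old#tests/modules/}" ;;
        modules/*)
            printf 'modules/nf-core/%s\n' "${old#modules/}" ;;
        *)
            printf '%s\n' "$old" ;;   # unrelated path, left unchanged
    esac
}
```

For example, `new_module_path modules/universc/main.nf` prints `modules/nf-core/universc/main.nf`. Include statements in the test `main.nf` would need the same adjustment by hand.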
@JoseEspinosa Thank you for your advice. I suspect this is what is causing issues with automated checks. I'll try merging the restructured master branch and moving files for the new module to resolve this. This should also ease review and merging later.
As expected Linting seems to pass checks again. However, "check changes" on GitHub actions seems to have trouble detecting which modules to run tests on in the new restructured repository.
> As expected Linting seems to pass checks again. However, "check changes" on GitHub actions seems to have trouble detecting which modules to run tests on in the new restructured repository.
What do you mean? I think the tests for the module are running, e.g. here for docker. Or am I misunderstanding what you are referring to?
Sorry for the confusion. I was able to fix this issue here: https://github.com/nf-core/modules/pull/1706/commits/a79e689429ec6aa262a75028272c1f15a5113b08
Tests now start correctly. I am debugging why they fail currently.
I think automated tests passing is the only remaining step to complete this PR. I'll update the input arguments for UniverSC as needed to troubleshoot tests. The inputs are very similar to Cell Ranger so I anticipate this will not be difficult.
I've managed to run tests locally for CELLRANGER_COUNT_OS. This is an open-source implementation of Cell Ranger (not subject to the 10X EULA) that is called by UniverSC. UniverSC is a wrapper which sanitises inputs and adjusts data formats from various technologies, so testing both is required.
I checked that linting runs without errors with the latest version of nf-core installed.
conda activate nf-core
mamba update -n nf-core -c conda-forge -c bioconda nf-core
nf-core modules lint universc -d .
The test job fails due to this error when calling Cell Ranger.
[error] Your reference does not contain the expected files, or they are not readable
However, running the latest version of Cell Ranger installed locally in the working directory of the aborted job works. This confirms that the input arguments are correct. The directory also contains symbolic links to the reference and FASTQ files generated by previous test steps.
/home/DATA/software/cellranger-7.0.1/bin/cellranger \
count \
--id='sample-test' \
--fastqs=. \
--transcriptome=homo_sapiens_chr22_reference \
--sample=test_10x \
--localcores=2 \
--localmem=3 \
--chemistry='SC3Pv3'
Note the chemistry argument must be set by default as test data is too small for automated chemistry detection. UniverSC controls chemistry parameters and adjusts the barcode whitelists used in this step so it is not a problem.
I discovered why the test jobs fail with the current reference generated by the nf-core cellranger module. CELLRANGER_MKREF runs Cell Ranger version 7.0.0, and the reference output parameters show that STAR version 2.7.1a was used to generate it. This reference is incompatible with the open-source implementation of Cell Ranger 3.0.2, which runs STAR 2.5.1b.
$ docker run -it tomkellygenetics/universc:latest cellranger count --version
/cellranger-3.0.2.9001/cellranger-cs/3.0.2.9001/bin
cellranger count (3.0.2.9001)
$ docker run -it tomkellygenetics/universc:latest STAR --version
STAR_2.5.1b
$ grep "version" ./homo_sapiens_chr22_reference/star/genomeParameters.txt
versionGenome 2.7.1a
This means that UniverSC is not compatible with references built by cellranger/mkref v7.0.0 or star/genomegenerate v2.7.10a.
I'll need to generate a new STAR reference transcriptome with the older version to get tests to pass without errors. It should be possible to do this with the containerised STAR v2.5.1b installed in tomkellygenetics/universc:1.2.3 but it will take more time.
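A minimal sketch of checking an index before running the count step, assuming the `versionGenome` field in `star/genomeParameters.txt` (as in the `grep` output above) is a usable compatibility marker. The function names are hypothetical, and comparing against a value recorded from an index built inside the UniverSC container is an assumption of this sketch, not something UniverSC or Cell Ranger does itself.

```shell
#!/usr/bin/env bash
# Print the versionGenome value recorded in a STAR genomeParameters.txt file.
genome_version() {
    awk '$1 == "versionGenome" { print $2; exit }' "$1"
}

# Succeed only if the index reports the expected versionGenome value.
# "expected" should be taken from a known-good index built with the STAR
# binary in the container (assumption: it is supplied by the caller).
index_is_compatible() {
    local params="$1" expected="$2" found
    found="$(genome_version "$params")"
    if [ "$found" != "$expected" ]; then
        echo "Index versionGenome is '${found}', expected '${expected}'" >&2
        return 1
    fi
}
```

It could be called as `index_is_compatible ./homo_sapiens_chr22_reference/star/genomeParameters.txt "$expected"` before launching the count process, so that a mismatched reference fails early with a clear message.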
Tests for CELLRANGER_COUNT_OS pass locally. I checked that the container supports an open-source implementation of cellranger mkref.
$ nextflow run tests/modules/nf-core/universc -entry test_cellranger_10x -c tests/config/nextflow.config
N E X T F L O W ~ version 22.04.5
Launching `tests/modules/nf-core/universc/main.nf` [crazy_williams] DSL2 - revision: a24dbee69b
executor > local (3)
[fc/f99e89] process > test_cellranger_10x:CELLRANGER_MKGTF (genome.gtf) [100%] 1 of 1 ✔
[40/dfe599] process > test_cellranger_10x:CELLRANGER_MKREF (genome.fasta) [100%] 1 of 1 ✔
[21/82a5b3] process > test_cellranger_10x:CELLRANGER_COUNT_OS (test) [100%] 1 of 1 ✔
Completed at: 13-Oct-2022 16:03:04
Duration : 6m 25s
CPU hours : 0.2
Succeeded : 3
I'll refactor this to run cellranger mkref as a separate CELLRANGER_MKREF_OS process to avoid repeating STAR indexing for each run (which will be troublesome to scale to larger references).
Cell Ranger "auto" chemistry detection fails on small input files, so I have altered the test jobs for when technology="10x". Other technologies configure chemistry settings automatically. This requires write permissions to the cellranger install directory, so I use these parameters. https://github.com/nextflow-io/nextflow/issues/1295#issuecomment-535000561
container 'tomkellygenetics/universc:1.2.3'
containerOptions = "--user root"
I resolved the reference index version conflicts, first by rebuilding indexes with the Cell Ranger 3.0.2 version bundled in the UniverSC container, then by migrating these steps to separate processes. Using the nf-core cellranger module as a template, I set up new processes running "cellranger mkfastq/mkref/mkgtf" with the older open-source implementation. This version of Cell Ranger is not subject to the 10X Genomics EULA so it can be run without restrictions on other data.
Tests pass locally.
$ nextflow run tests/modules/nf-core/universc -entry test_cellranger_10x -c tests/config/nextflow.config
N E X T F L O W ~ version 22.04.5
Launching `tests/modules/nf-core/universc/main.nf` [curious_mayer] DSL2 - revision: 9801713b87
executor > local (3)
[14/fad574] process > test_cellranger_10x:CELLRANGER_MKGTF_OS (genome.gtf) [100%] 1 of 1 ✔
[54/f87309] process > test_cellranger_10x:CELLRANGER_MKREF_OS (genome.fasta) [100%] 1 of 1 ✔
[b6/61335f] process > test_cellranger_10x:CELLRANGER_COUNT_OS (test) [100%] 1 of 1 ✔
Completed at: 14-Oct-2022 16:25:19
Duration : 6m 26s
CPU hours : 0.2
Succeeded : 3
$ nextflow run tests/modules/nf-core/universc -entry test_universc_10x -c tests/config/nextflow.config
N E X T F L O W ~ version 22.04.5
Launching `tests/modules/nf-core/universc/main.nf` [goofy_shirley] DSL2 - revision: 9801713b87
executor > local (3)
[72/3c4041] process > test_universc_10x:CELLRANGER_MKGTF_OS (genome.gtf) [100%] 1 of 1 ✔
[2c/19242d] process > test_universc_10x:CELLRANGER_MKREF_OS (genome.fasta) [100%] 1 of 1 ✔
[81/ce69ea] process > test_universc_10x:UNIVERSC (123) [100%] 1 of 1 ✔
Completed at: 14-Oct-2022 16:25:14
Duration : 6m 54s
CPU hours : 0.2
Succeeded : 3
$ cat output/*/*versions*
"test_cellranger_10x:CELLRANGER_COUNT_OS":
cellranger: 3.0.2.9001
"test_universc_10x:UNIVERSC":
cellranger: 3.0.2.9001
universc: 1.2.3
Tests on GitHub Action for docker and singularity start without syntax errors. I've had some trouble getting checks to pass but it is running an updated version of test.yml now. https://github.com/nf-core/modules/actions/runs/3248199182/jobs/5329138146
As with the original cellranger module, no conda environments are available (open to suggestions to build one). There is a known issue with singularity but surprisingly it seems to be resolved by the above changes. https://github.com/minoda-lab/universc/issues/2
Any advice on how to troubleshoot problems in the GitHub Actions results would help. Since the expected outputs are generated on local testing, this is the last step to address on the checklist I think.
Since only minor issues with testing remain, I will open this PR for review. Your feedback is appreciated.
@Emiller88 thank you for taking the time to review! It shouldn't be difficult to make the recommended changes. Let me explain a few general points.
UniverSC has several advantages over Cell Ranger but gives similar results. Most importantly, it runs an open-source implementation under MIT and GPL3 licenses. It is not subject to the 10X Genomics End User License Agreement (which forbids use on data from other platforms) and thus enables processing single-cell data from many different platforms with the same unified pipeline. See our preprint (recently accepted for publication) for more details.
The current Cell Ranger LICENSE has the following notice. UniverSC and all software in our container do not.
- The above rights granted in the Software may be exercised only in connection with a 10x Genomics Product, rightfully purchased from 10x Genomics or an authorized reseller, or data generated using such a 10x Genomics Product.
UniverSC calls an open-source version of Cell Ranger 3.0.2 (newer versions are not available under this license). The current Cell Ranger container uses version 7.0.0 and a newer version of the STAR aligner. Therefore it is necessary for UniverSC to run a separate Cell Ranger process using the open-source container to ensure:
- a reference index compatible with the same version of STAR is used
- all software used in the UniverSC module is open-source and permits use of data other than 10X Genomics libraries
I've named the open-source Cell Ranger processes differently to distinguish them from the existing module. I can update these to conform to nf-core naming conventions. To use the newer version of Cell Ranger on 10X Genomics data, the current "cellranger" module is complementary. The UniverSC container runs older open-source versions with greater flexibility. Unfortunately this currently leads to some code redundancy in the modules. I am open to suggestions on how to deal with this.
Sorry I may have misunderstood the use of "Dockerfile" in other modules. If this is only for nf-core containers then I agree it is best to remove it in this case as the UniverSC container already has source maintained here.
I'll remove podman from the GitHub Actions workflows. This was a test with automated testing configuration but it is more complicated than I anticipated. I agree it is best discussed separately.
I'm still having trouble getting automated testing with "PROFILE=docker" to work, despite generating results in output/universc in local tests. Suggestions on how to debug this would be really appreciated.
I've updated the module. Note that all tests with PROFILE=docker now pass successfully. The only module that does not run without errors is universc/mkfastq. As with cellranger/mkfastq (which uses a "stub"), this test job takes considerably longer.
UniverSC requires a compatible reference with the same software versions (universc/mkref is needed) but universc/mkfastq could be removed. This function is supported by the open-source implementation but it is specific to 10x Genomics adapter sequences (demultiplexing with a combination of indexes for each sample). This is not essential to the main benefit of the new module which allows running cellranger count on other technologies.
The main components are:
- `universc/count`, which runs `cellranger count` directly to test the open-source version with 10X Genomics parameters
- `universc`, which runs `bash launch_universc.sh` with a "technology" input parameter and adjusts the run to be compatible with it
The only things I still see are whether the testing strategy is going to run the tests twice, and possibly changing the name of the main module.
Thanks for checking again. I noticed this issue myself, so I had been considering moving the universc module (it is now universc/launch, run as UNIVERSC_LAUNCH). I checked that some tools in the repo (e.g., chromap, happy, kaiju, kraken2, srst2, stringtie) already have a nested module/module directory structure, but I think this ambiguity could cause confusion or technical problems.
I've updated the PR to address issues raised above. Specifically redundant unit tests have been removed. This should speed up testing for the module in the future.
Hi @TomKellyGenetics - I really will need quite a bit of time to review this - will see how fast I can get to it 😓
What would your proposal for scrnaseq be - having this as a separate subworkflow? Looks like that would be the easiest way forward 👍🏻
@apeltzer Thank you for taking time to review it. I understand it is difficult to find time (see my proposals to implement this on the 'demultiplex' pipeline https://github.com/nf-core/demultiplex/issues/14 as long as 18 months ago...). I'm not in a hurry to merge this and I understand that reviews take time though. We have an accepted article in press about it so it would be nice to release the module to nf-core users soon if possible.
As discussed above, it has complementary functionality to the existing STAR and Cell Ranger modules, running in a new container with exclusively open source dependencies. It supports a wider range of technologies than Cell Ranger and is not subject to the 10X Genomics EULA. My recommendation would be to add UniverSC as a separate tool/aligner option to retain existing workflows for 10X data with the newer proprietary version of Cell Ranger (7.0.0).
Of course, our tool is compatible with 10X Genomics data (as shown by the unit tests) but I think it should be the user's choice to switch to UniverSC (which rolls back to an older Cell Ranger 3.0.2).
It should be fairly straightforward to add UniverSC to pipelines running Cell Ranger as the inputs are almost the same. There is one additional option --technology used to configure which single-cell instrument or protocol (e.g., 10X Genomics, DropSeq, inDrops-v3, SmartSeq3, etc) was used to generate the data.
In general I am fine with this. Please make sure the tests pass, e.g. by adding md5sums or by updating the pytest YAML to remove md5sum checks for files that do not stay the same (e.g. due to an unstable sorting order or similar things).
Thanks for taking time to review @apeltzer. I've updated test.yml files accordingly. As mentioned, some output files are slightly different in each run due to containing copies of the input call with temporary file paths for example. I've removed files from the YAML files if the correct output cannot be checked. It is still possible to confirm a successful test run from reproducible output files and summary statistics.
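One way to decide which files can keep an md5sum entry is to run the same test twice and compare checksums. A minimal sketch, assuming GNU `md5sum` is available and that the two runs wrote their outputs to separate directories; the function name `stable_files` is hypothetical.

```shell
#!/usr/bin/env bash
# List files (relative paths) that exist in both run directories and have
# identical md5 checksums, i.e. candidates for an md5sum entry in test.yml.
stable_files() {
    local run_a="$1" run_b="$2" rel
    ( cd "$run_a" && find . -type f | sort ) | while IFS= read -r rel; do
        # Skip files that the second run did not produce.
        [ -f "$run_b/$rel" ] || continue
        if [ "$(md5sum < "$run_a/$rel")" = "$(md5sum < "$run_b/$rel")" ]; then
            printf '%s\n' "${rel#./}"
        fi
    done
}
```

Files that embed temporary paths or timestamps will not appear in the output of `stable_files run1/output run2/output`; those can be listed in the YAML by path only, without a checksum.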
I think this update addresses all concerns raised. Let me know if anything else needs to be discussed or changed in order to meet criteria to merge.
One tiny test missing 🤞🏻 - the web_summary.html in both universc count and the cellranger count part. Can you try to fix this?
Once the tests are fixed (last one!) , I think the code is fine. Hosting the docker images is another story - we tend to host other images (e.g. demux, some other 10X containers) on our main nf-core account to make sure that a.) these are there and are not at some point deleted and b.) are "kind of trustworthy" / under control of the community and not a single individual - would you be fine to mirror the images you provide to the nf-core account to keep things the same for universc here too @TomKellyGenetics ?
Also, once the modules are in: Would make sense to use the upcoming subworkflow feature and create a subworkflow out of universc modules that can then only be imported to the nf-core/scrnaseq pipeline - other pipelines could also consider importing the workflow, e.g. ATACseq could potentially import it to do scATACseq ...? Or does that not make sense to also have the subworkflow shared via a central module repository? Leaving that bit up to you ;-)
> One tiny test missing 🤞🏻 - the web_summary.html in both universc count and the cellranger count part. Can you try to fix this?
@apeltzer I've returned from holidays so now I can modify the tests for the web summary files. This shouldn't be much trouble.
Updating permissions in the container is more troublesome but I am open to a mirror being hosted by nf-core. As discussed above, I misunderstood how nf-core containers are managed and removed references to it before.
If you wish to build a new container, the Dockerfile can be found here: https://github.com/minoda-lab/universc/blob/master/Dockerfile Let me know if I can help do this.
Note UniverSC is now published in Nature Communications and we will continue to ensure the supplemental data, including the image to run a container is available. https://www.nature.com/articles/s41467-022-34681-z
I can support building a subworkflow and integrating it into the scRNA-Seq pipeline. Unfortunately scATAC-Seq is not currently supported due to some complications.
I looked into mirroring the docker image at nf-core/universc. I am a member of the GitHub organisation but do not have permissions to update the Docker images here: https://hub.docker.com/u/nfcore
If anyone who administers this account is willing to do it, I can share the docker build configuration file. All dependencies are open-source and available to install freely. It may be possible to change permissions on the cellranger install directory to avoid running the container as root user but I have not tested this yet.
@apeltzer Should I add the "subworkflow" for UniverSC as a separate PR or include it in this one? I noticed there are subworkflows included with this modules repo and also specific for each pipeline:
https://github.com/nf-core/modules/tree/master/subworkflows/nf-core https://github.com/nf-core/scrnaseq/tree/master/subworkflows/local
As you suggest, this would be specific to scRNA-Seq analyses, so I agree to submitting it there (after universc is merged into "modules") if it is more appropriate. It has very similar functionality to the "align_cellranger.nf" subworkflow so putting it in the same directory makes sense to me. Thanks for your advice.