modules icon indicating copy to clipboard operation
modules copied to clipboard

parabricks/fq2bam needs the fasta in the same folder as the BWAIndex

Open famosab opened this issue 2 months ago • 14 comments

This leads to troubles when running on a cluster because a simple symlinking logic does not work there. However we want to move away from copying files which is not effective.

@gburnett-nvidia can you bring this up again with your team? Because we need fq2bam to be able to handle the reference differently. Maybe two command-line flags? One for fasta and one for the index?

We can update the module as soon as that works :)

See #9229 for the reverting to copying.

famosab avatar Oct 23 '25 07:10 famosab

If I use any symlinking written in the script part of the process, I run into this error:

ln: failed to create symbolic link 'Homo_sapiens_assembly38.fasta': Permission denied

This is the cluster I am using: https://wiki.m3c.uni-tuebingen.de/ but from a quick search it seems to me that symlinking is something that might not easily work with the way certain filesystems are setup.

These are the commands for symlinking that I tried out:

    fasta_basename=\$(basename ${fasta})
    cd ${index} && \
        ln -s ../\$fasta_basename \$fasta_basename && \
        cd -

and

cd ${index} && ln -sf ../${fasta} ${fasta} && cd ..

famosab avatar Oct 23 '25 07:10 famosab

Thank you for the extra information! I will talk to the wider team about other solutions.

gburnett-nvidia avatar Oct 24 '25 17:10 gburnett-nvidia

@famosab I did some more looking into this and it seems like you're getting this error because my original solution involved creating a symlink in the index folder, which you might not have write permissions for. I propose a new solution (see PR) that instead creates a new ref folder with symlinks for all the reference and index files. I'm hoping that this new folder will not have the same permissions issues.

gburnett-nvidia avatar Oct 30 '25 15:10 gburnett-nvidia

I tried something similar to this and that did not work :/ I think its just not possible to do this symlinking in this way on (our) cluster. It would be best if the input structure of fq2bam would not need this shuffling :)

famosab avatar Oct 30 '25 16:10 famosab

I have made some edits to the PR.

It removes the requirement for a reference file when writing BAMs (by creating a dummy reference file in the correct location so PB doesn't error out and removing it after). I added an extra nf-test that shows this process working without a reference file given.

Writing CRAMs, however, does require a reference file in the index directory (which is also a requirement of bwa, so this is just the standard). Note that the bwa mem module does not test for CRAM, so we have nothing to compare against there. There are only two options to achieve this if the reference is not already in the correct location: copying the file or symlinking it.

gburnett-nvidia avatar Oct 31 '25 20:10 gburnett-nvidia

@gburnett-nvidia are there any restrictions on the parabricks side to implement two flags --bwa_index and --fasta or similar? If I understand correctly the main issue here is not that we need to provide both the bwa index and fasta file, but that parabricks expects everything to be in the exact same folder.

FriederikeHanssen avatar Nov 06 '25 13:11 FriederikeHanssen

This problem is not unique to Parabricks, but the standard imposed by BWA that Parabricks is following. Even when you're running BWA you need to provide an index and a fasta that are in the same folder if you're generating CRAMs (for BAMs you only need to point to the index and that fasta is not needed at all), so this mimics that behavior.

Implementing two flags won't make a difference because they would both need to point to the same location.

gburnett-nvidia avatar Nov 06 '25 14:11 gburnett-nvidia

To me it seems as if the bwa/mem and bwamem2 implementation both use piping into samtools which then leads to the cram output. There you can add the fasta with a different command line flag. See the module here https://github.com/nf-core/modules/blob/master/modules/nf-core/bwa/mem/main.nf

I am unsure if bwamem or bwamem2 are even able to output crams directly and I never saw the requirement for the fasta being in the same folder (at least its not a requirement for the nf-core modules because they use samtools --reference). So I think two flags would make perfect sense?

You are right though, that the crams are not properly tested. I will open an issue for that! See #9354.

famosab avatar Nov 06 '25 15:11 famosab

The way bwa uses these flags is not safe. You can pass a reference file to bwa that doesn't exist and it will run anyways, provided it doesn't end up needing that file. This is how the bwa mem module works, it infers a fasta path from the index files and runs bwa with this non existent file. Parabricks checks that all the input file paths exist before running anything, and that's why we are seeing errors here and they are not.

In terms of remedying this on nf-core, I will discuss this two flag option with the team.

gburnett-nvidia avatar Nov 06 '25 22:11 gburnett-nvidia

Update: We will look into it more. Parabricks updates come once every 6 months so it will be some time before a nore permanent fix can be implemented.

For now we will have to keep copying files since the symlinking does not work on your cluster.

gburnett-nvidia avatar Nov 07 '25 00:11 gburnett-nvidia

But how does that work then? Because bwa/index never puts the fasta in that folder.

famosab avatar Nov 07 '25 10:11 famosab

bwa doesn't need to put the fasta in the index folder. It just points to a path in that folder (whether that path exists or not) and then just takes the index files from the folder, ignoring the path you passed in. So how does it find the fasta if you never provided one? That's a mystery to me.

gburnett-nvidia avatar Nov 07 '25 15:11 gburnett-nvidia

@famosab @FriederikeHanssen I made some updates to clean up this PR (filling out the PR checklist and adding better handle the cleanup of the dummy files). Do you support this PR now?

Releasing a new version of Parabricks with two separate flags would require too many engineering cycles right now and my proposed fix passes all the nf-tests. Similarly, bwameth/align implements symlinking to handle their index files, so this solution is being used in other places and should be sufficient to merge this PR.

gburnett-nvidia avatar Nov 10 '25 19:11 gburnett-nvidia

Hey @gburnett-nvidia I understand the extra effort puts an extra strain on this. However, I tried multiple declarations of manually symlinking and none of them work on our cluster and I think they will not work on the other clusters either. Maybe its sensible to change the way the bwa/index module works in respect to parabricks such that it adds the fasta to the output folder?

famosab avatar Nov 14 '25 15:11 famosab