Add command-line options for custom reference genomes
Add command-line options for custom reference genomes
Changes
This PR adds the ability to specify custom reference genome files directly via command-line options, making Yleaf more flexible and easier to use in a pipeline, with pre-specified reference genomes.
Added features:
- New command-line options:
-fg/--full_genome_reference: Allows users to specify a custom full genome reference file-yr/--y_chromosome_reference: Allows users to specify a custom Y chromosome reference file
- New utility script
extract_y_chromosome.py: For extracting just the Y chromosome from a full genome reference - Updated README with documentation and examples for the new features
Motivation
Previously, users were required to modify the config.txt file to use custom reference genomes. This approach had several limitations:
- It wasn't easily scriptable (required file edits)
- It didn't allow for different references in the same pipeline
- It made it harder to use Yleaf in automated workflows
With these changes, users can specify reference files directly on the command line, making it easier to:
- Use custom reference files for specific analyses
- Integrate Yleaf into automated pipelines
- Run Yleaf with different references without modifying configuration files
Testing
The changes have been tested with:
- Standard pre-existing reference download flow (backward compatibility)
- Custom full genome and Y chromosome references
- Various file formats and real-world datasets
Example of successful execution with no custom references and a VCF output by nf-core/eager:
python3 -m yleaf.Yleaf -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -rg hg38 -o test_output -t 16 -dh
INFO 10:32:02.076430 (0.001 s) - Running Yleaf with command: /home/a/Yleaf/yleaf/Yleaf.py -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -rg hg38 -o test_output -t 16 -dh
INFO 10:32:02.076593 (0.001 s) - No reference genome version was found. Downloading the hg38 reference genome. This should be a one time thing.
INFO 10:32:02.076702 (0.001 s) - Starting with preparing hg38...
INFO 10:47:19.166804 (917.091 s) - Finished downloading the reference genome.
/home/a/Yleaf/yleaf/Yleaf.py:188: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_fmf = pd.concat([df_belowzero, df_readsthreshold, df_basemajority, df_discordantgenotype], axis=0, sort=True)
INFO 10:47:24.294138 (922.218 s) - Finished extracting genotypes for ancient0003.chrY.filtered.vcf.gz
INFO 10:47:24.304648 (922.229 s) - Starting haplogroup prediction...
INFO 10:47:24.403965 (922.328 s) - Starting with drawing haplogroups
INFO 10:47:24.485426 (922.410 s) - Finished drawing haplogroups
INFO 10:47:24.485505 (922.410 s) - Done!
╰─ cat test_output/hg_prediction.hg
Sample_name Hg Hg_marker Total_reads Valid_markers QC-score QC-1 QC-2 QC-3
ancient0003.chrY.filtered Q-Z770 CTS11007 VCF 128 1.0 1.0 1.0 1.0
Example of successful execution with custom references and a VCF output by nf-core/eager:
python3 -m yleaf.Yleaf -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -fg /references/reference_genomes/hg38.analysisSet.fa -yr /references/reference_genomes/hg38.chrY.analysisSet.fa -rg hg38 -o test_output -t 16
INFO 10:16:56.748596 (0.001 s) - Running Yleaf with command: /home/a/Yleaf/yleaf/Yleaf.py -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -fg /references/reference_genomes/hg38.analysisSet.fa -yr /references/reference_genomes/hg38.chrY.analysisSet.fa -rg hg38 -o test_output -t 16
INFO 10:16:56.748752 (0.001 s) - Using custom full genome reference: /references/reference_genomes/hg38.analysisSet.fa
INFO 10:16:56.748841 (0.001 s) - Using custom Y chromosome reference: /references/reference_genomes/hg38.chrY.analysisSet.fa
/home/a/Yleaf/yleaf/Yleaf.py:188: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_fmf = pd.concat([df_belowzero, df_readsthreshold, df_basemajority, df_discordantgenotype], axis=0, sort=True)
INFO 10:17:01.835815 (5.088 s) - Finished extracting genotypes for ancient0003.chrY.filtered.vcf.gz
INFO 10:17:01.844207 (5.096 s) - Starting haplogroup prediction...
INFO 10:17:01.931244 (5.183 s) - Done!
> cat test_output/hg_prediction.hg
Sample_name Hg Hg_marker Total_reads Valid_markers QC-score QC-1 QC-2 QC-3
ancient0003.chrY.filtered Q-Z770 CTS11007 VCF 128 1.0 1.0 1.0 1.0
@dionzand @dmontielg @bramvanwersch I have an improvement I'd like to offer to make Yleaf easier to integrate into pipelines: just the ability to specify the reference genomes on the command line.
I hope the pull request is clear, but if not, please ask questions!
Thanks @trianglegrrl for your PR! Please give me some time to review. Hopefully next week
Thank you @dionzand ! I'm building a Nextflow module for Yleaf and we're going to integrate it into Eager, so I'll be excited to get your comments! Happy to make any changes you request.
Hi again @dionzand ! I'm wondering if you've had the opportunity to review this?
We're hoping to integrate this into our pipelines and make it available to the larger nextflow/nf-core community as an nf-core module. The draft pull request for the module is here: https://github.com/nf-core/modules/pull/8210)
For testing, I've pinned my nf-core module to my local 3.3.0 version of Yleaf (from this branch). It's working great.
To release the module, though, we'll need the Yleaf 3.3.0 version on bioconda. This PR does bump the version to 3.3.0, but you'll have to tag the release yourself, I believe!
I can take care of the bioconda/biocontainer stuff for you and just link it here. I am also happy to commit to keeping bioconda up to date for you. At this point, now that Yleaf 3.2.1 is in bioconda, it's easy to create update PRs in bioconda-recipes for version upgrades.
@dionzand Hate to bug you again with this, but any chance you'll have some time to review it? I'm sure you have a thousand other things to do, too!
I'm sorry, there were indeed some other things that required priority. I requested two small changes in your PR. Otherwise I am happy to merge the changes, but will first have to check with @ArwinRalf tomorrow.
I'm sorry, there were indeed some other things that required priority. I requested two small changes in your PR. Otherwise I am happy to merge the changes, but will first have to check with @ArwinRalf tomorrow.
Of course, no worries at all, and thank you! I will get those changes in and tag you again for review.
Hi @dionzand, @ArwinRalf
@trianglegrrl is working hard on adding MT and Y haplotyping functionality for the upcoming 3.0.0 release of nf-core/eager (Github).
The addition of Yleaf on bioconda (#24 thanks!) and the added flexibility from this PR would help us greatly to include Yleaf in the pipeline.
Users of this functionality in nf-core/eager would then need cite your original paper.
Did you have a chance to discuss the changes? We're happy to make changes if necessary. :)