Add command-line options for custom reference genomes

Changes

This PR adds the ability to specify custom reference genome files directly via command-line options, making Yleaf more flexible and easier to use in a pipeline, with pre-specified reference genomes.

Added features:

New command-line options:
- -fg/--full_genome_reference: Allows users to specify a custom full genome reference file
- -yr/--y_chromosome_reference: Allows users to specify a custom Y chromosome reference file
New utility script extract_y_chromosome.py: For extracting just the Y chromosome from a full genome reference
Updated README with documentation and examples for the new features

Motivation

Previously, users were required to modify the config.txt file to use custom reference genomes. This approach had several limitations:

It wasn't easily scriptable (required file edits)
It didn't allow for different references in the same pipeline
It made it harder to use Yleaf in automated workflows

With these changes, users can specify reference files directly on the command line, making it easier to:

Use custom reference files for specific analyses
Integrate Yleaf into automated pipelines
Run Yleaf with different references without modifying configuration files

Testing

The changes have been tested with:

Standard pre-existing reference download flow (backward compatibility)
Custom full genome and Y chromosome references
Various file formats and real-world datasets

Example of successful execution with no custom references and a VCF output by nf-core/eager:

python3 -m yleaf.Yleaf -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -rg hg38 -o test_output -t 16 -dh

INFO 10:32:02.076430 (0.001 s) - Running Yleaf with command: /home/a/Yleaf/yleaf/Yleaf.py -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -rg hg38 -o test_output -t 16 -dh
INFO 10:32:02.076593 (0.001 s) - No reference genome version was found. Downloading the hg38 reference genome. This should be a one time thing.
INFO 10:32:02.076702 (0.001 s) - Starting with preparing hg38...
INFO 10:47:19.166804 (917.091 s) - Finished downloading the reference genome.
/home/a/Yleaf/yleaf/Yleaf.py:188: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df_fmf = pd.concat([df_belowzero, df_readsthreshold, df_basemajority, df_discordantgenotype], axis=0, sort=True)
INFO 10:47:24.294138 (922.218 s) - Finished extracting genotypes for ancient0003.chrY.filtered.vcf.gz
INFO 10:47:24.304648 (922.229 s) - Starting haplogroup prediction...
INFO 10:47:24.403965 (922.328 s) - Starting with drawing haplogroups
INFO 10:47:24.485426 (922.410 s) - Finished drawing haplogroups
INFO 10:47:24.485505 (922.410 s) - Done!

╰─ cat test_output/hg_prediction.hg 
Sample_name     Hg      Hg_marker       Total_reads     Valid_markers   QC-score        QC-1    QC-2    QC-3
ancient0003.chrY.filtered       Q-Z770  CTS11007        VCF     128     1.0     1.0     1.0     1.0

Example of successful execution with custom references and a VCF output by nf-core/eager:

python3 -m yleaf.Yleaf -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -fg /references/reference_genomes/hg38.analysisSet.fa -yr /references/reference_genomes/hg38.chrY.analysisSet.fa -rg hg38 -o test_output -t 16

INFO 10:16:56.748596 (0.001 s) - Running Yleaf with command: /home/a/Yleaf/yleaf/Yleaf.py -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -fg /references/reference_genomes/hg38.analysisSet.fa -yr /references/reference_genomes/hg38.chrY.analysisSet.fa -rg hg38 -o test_output -t 16
INFO 10:16:56.748752 (0.001 s) - Using custom full genome reference: /references/reference_genomes/hg38.analysisSet.fa
INFO 10:16:56.748841 (0.001 s) - Using custom Y chromosome reference: /references/reference_genomes/hg38.chrY.analysisSet.fa
/home/a/Yleaf/yleaf/Yleaf.py:188: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df_fmf = pd.concat([df_belowzero, df_readsthreshold, df_basemajority, df_discordantgenotype], axis=0, sort=True)
INFO 10:17:01.835815 (5.088 s) - Finished extracting genotypes for ancient0003.chrY.filtered.vcf.gz
INFO 10:17:01.844207 (5.096 s) - Starting haplogroup prediction...
INFO 10:17:01.931244 (5.183 s) - Done!

> cat test_output/hg_prediction.hg
Sample_name     Hg      Hg_marker       Total_reads     Valid_markers   QC-score        QC-1    QC-2    QC-3
ancient0003.chrY.filtered       Q-Z770  CTS11007        VCF     128     1.0     1.0     1.0     1.0

Mar 28 '25 14:03 trianglegrrl

@dionzand @dmontielg @bramvanwersch I have an improvement I'd like to offer to make Yleaf easier to integrate into pipelines: just the ability to specify the reference genomes on the command line.

I hope the pull request is clear, but if not, please ask questions!

Mar 28 '25 14:03 trianglegrrl

Thanks @trianglegrrl for your PR! Please give me some time to review. Hopefully next week

Mar 28 '25 21:03 dionzand

Thank you @dionzand ! I'm building a Nextflow module for Yleaf and we're going to integrate it into Eager, so I'll be excited to get your comments! Happy to make any changes you request.

Mar 29 '25 20:03 trianglegrrl

Hi again @dionzand ! I'm wondering if you've had the opportunity to review this?

We're hoping to integrate this into our pipelines and make it available to the larger nextflow/nf-core community as an nf-core module. The draft pull request for the module is here: https://github.com/nf-core/modules/pull/8210)

For testing, I've pinned my nf-core module to my local 3.3.0 version of Yleaf (from this branch). It's working great.

To release the module, though, we'll need the Yleaf 3.3.0 version on bioconda. This PR does bump the version to 3.3.0, but you'll have to tag the release yourself, I believe!

I can take care of the bioconda/biocontainer stuff for you and just link it here. I am also happy to commit to keeping bioconda up to date for you. At this point, now that Yleaf 3.2.1 is in bioconda, it's easy to create update PRs in bioconda-recipes for version upgrades.

Apr 04 '25 10:04 trianglegrrl

@dionzand Hate to bug you again with this, but any chance you'll have some time to review it? I'm sure you have a thousand other things to do, too!

Apr 11 '25 10:04 trianglegrrl

I'm sorry, there were indeed some other things that required priority. I requested two small changes in your PR. Otherwise I am happy to merge the changes, but will first have to check with @ArwinRalf tomorrow.

Apr 14 '25 08:04 dionzand

I'm sorry, there were indeed some other things that required priority. I requested two small changes in your PR. Otherwise I am happy to merge the changes, but will first have to check with @ArwinRalf tomorrow.

Of course, no worries at all, and thank you! I will get those changes in and tag you again for review.

Apr 23 '25 22:04 trianglegrrl

Hi @dionzand, @ArwinRalf

@trianglegrrl is working hard on adding MT and Y haplotyping functionality for the upcoming 3.0.0 release of nf-core/eager (Github). The addition of Yleaf on bioconda (#24 thanks!) and the added flexibility from this PR would help us greatly to include Yleaf in the pipeline.

Users of this functionality in nf-core/eager would then need cite your original paper.

Did you have a chance to discuss the changes? We're happy to make changes if necessary. :)

May 23 '25 10:05 TCLamnidis