Bed file option for Ordered-Histgrowth
Hi again,
I have a few more burning questions:
- In the example of ordered-histgrowth from this link (https://github.com/marschall-lab/panacus/blob/main/examples/ordered_pangenome_growth_statistics.md), it only uses a single chromosome. That’s why the total size shows around 175M after adding NA21309. If that’s correct, should the plot for the full genome (22 autosomes and 2 sex chromosomes) reach 3.1B after adding the final assembly?
- I noticed that the -s and -e options allow the use of a BED file.
-s, --subset
If I want to identify which genomic regions contribute to graph growth, is there a way to investigate this via Panacus using CHM13’s BED file with the -s and -e options?
Looking forward to your insights!
Cheers,
Taek
Hi Taek,
for this plot all 24 chromosomes where taken into account. As described in the example, only the fraction of the graph were quantified that are not touched by the reference genome GRCh38. In other words, each node through which the reference genome traverses, is ignored. The subset option was used to quantify only HPRC haplotypes (this excludes CHM13).
If you want to know what genomic regions contributed to the plot, take the graph, subtract all GRCh38 paths and extract all remaining nodes that are covered by HG* and NA* paths.
Thank you for your reply. If that’s the case, I’d like to clarify a few more things.
I followed these steps. Step 1: Establish the order of samples in the growth statistics echo 'HG01891 HG02109 HG02145 HG02257 …… HG02080 HG005' | tr ' ' '\n' > ${OUTPUT_DIR}/pggb_hprc.order.samples.txt
Step 2: Establish the exclude of samples in the growth statistics echo 'chm13 grch38' | tr ' ' '\n' > ${OUTPUT_DIR}/pggb_hprc.exculde.samples.txt
Step 3: Exclude paths from reference genome CHM13 and GRCH38 grep '^P' ${INPUT_GFA} | cut -f2 | grep -iE 'chm13|grch38' > ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt
Step 4: panacus histgrowth to calculate coverage and pangenome growth for nodes RUST_LOG=info panacus ordered-histgrowth -c bp -t${PBS_NCPUS} -l 1,2,4,49 -S -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0.ordhstgrw.bp.tsv
Step 5: Visualize coverage histogram and pangenome growth curve with estimated growth parameters panacus-visualize -e ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0.ordhstgrw.bp.tsv > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0.ordhstgrw.bp.pdf
In Step 4, the file *pggb.ng0.ordhstgrw.bp.tsv will be used to generate the graph. Attached is the TSV file from chr1 in my trial.
As you can see, even a single chromosome (chr1) already shows a size comparable to the total chromosome size described in the Panacus example. I’ve cross-checked the steps and parameters with the manual in the provided link. Did I miss anything?
Regarding genomic region contributions, I’m not fully clear on your explanation. When you say "take the graph," do you mean the PGGB graph GFA file? And how exactly should I subtract or extract regions from it? Could you please share any relevant steps, scripts, or manuals I could follow?
Looking forward to your guidance.
Kind regards,
Taek
Hi again,
I have also tried using the same approach explained in the manual (https://github.com/marschall-lab/panacus/blob/main/examples/ordered_pangenome_growth_statistics.md).
Step 4 (followed manual): panacus histgrowth to calculate coverage and pangenome growth for nodes RUST_LOG=info panacus ordered-histgrowth -c bp -t${PBS_NCPUS} -l 1,2,4,49 -S -e ${OUTPUT_DIR}/${BASEFILE}.pggb_hprc.exculde.samples.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0.ordhstgrw.bp.tsv
Eventually, the -e ${OUTPUT_DIR}/${BASEFILE}.pggb_hprc.exculde.samples.txt and -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt generated the same TSV file and PDF from chr1 in my trial. Both approaches generated an abnormally large chromosome size even with a single chromosome (chr1).
I’ve tried both v0.2.4 from Bioconda and the binary release from GitHub. It seems likely that Step 4 didn’t generate the correct values for the TSV file.
I have also tried the -s parameter using a bed file from CHM13v2. This attempt generated all values “0” in the TSV and PDF.
Step 4 (bed file) RUST_LOG=info panacus ordered-histgrowth -c bp -t${PBS_NCPUS} -l 1,2,4,49 -s ${CHM13_BED} -S -e ${OUTPUT_DIR}/pggb_hprc.exculde.samples.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0bed.ordhstgrw.bp.tsv
Looking forward to your insights.
Kind regards,
Take
Hi Taek,
the differing base pair counts are explained by the differing graphs. The example was generated using the Minigraph-Cactus graph, while your results are from the PGGB graph.
The Minigraph-Cactus graph does not contain any heterochromatic regions, so it already starts with fewer bases than the PGGB graph. The PGGB graph does include these regions but the sequences in these regions are seriously under-aligned meaning that we will have lots of bases in the graph where there would be only one when using a single sequence. The size in bps in the last line is not the size of the chromosome but instead the size of the graph of the chromosome, which can be in reality quite different from each other. For example, the human chr1 has a length of 248,387,328 bp (CHM13) while the PGGB graph for it contains 1,117,392,094 bps.
With regards of extracting the non-GRCh38 parts of the graph: Unfortunately, there are no tools available (at least to my knowledge) that can do this for you. My approach would be to take all the nodes in the GRCh38 path and removing them from all the other paths manually (e.g. using a bash/python script).
Best regards Peter
Thank you for your reply all the time!
So, if the difference between Minigraph-Cactus and PGGB is accepted, your explanation confirms that the TSV and graph generated from PGGB are correct. Is that right?
Additionally, regarding the chr1 example: A GFA file from PGGB with 56 samples (112 haplotypes) excluding CHM13 and GRCh38 was used in Panacus. If I want to retrieve or calculate the added sequence length for chr1 from the PGGB graph, which file or information should I refer to?
Kind regards,
Taek
Yes, the tsv and the plots for the PGGB graph should be correct.
About the second question: What do you mean with added sequence length? The number of base pairs that the non-GRCh38/CHM13 sequences add to the graph?
Best regards Peter
That’s correct. In Steps 2 and 3, it should reflect the number of base pairs added to the graph by non-GRCh38/CHM13 sequences. Cheers, Taek
This should be in the last line of the .tsv file generated in Step 4. Using the -e-Parameter you tell panacus to not count all the bps that intersect with the paths in the file (i.e. in this case GRCh38 and CHM13). Thus, the last line contains the number of bps in the graph without all the bps of GRCh38/CHM13 or in other words all the bps added by non-GRCh38/CHM13 sequences.
Best regards Peter
Thanks.
Based on your explanation and my current PGGB trial, the total accumulated base pairs added by non-GRCh38/CHM13 sequences (with the -l 1,2,4,49 option) would be 1,376,115,016 from HG005: 1086959399 + 224525351 + 63802404 + 827862.
Or is it just 1,086,959,399 from the first -l 1 information?
Cheers,
Taek
It is just the first number, e.g. 1,086,959,399.
The other numbers apply to the other coverages/quora respectively. So for example, if you would like to only take into account nodes that appear at least in two paths (because they might be errors, etc.), you would take the second value (in this case) 224,525,351 (since it is for a coverage of 2).
Best regards Peter
Thank you for your reply and clarification.
Based on the manual and my trials, I can generate the Panacus growth plot for specific assemblies using Steps 1, 2, and 3. However, I’m wondering if there’s a way to extract only specific assemblies from the PGGB-generated GFA file. For example, is it possible to extract just the paths covered by ^HH* and ^NA* to make a subset of a GFA file?
Kind regards,
Taek
Yes, it is possible. This could be done manually or by creating a list of all the paths one wants to keep and using trim-graph on it. This will remove all node/edges not covered by the paths in the list. After that you might want to run odgi unchop on the graph to merge all the nodes that can now be merged since they are not separated anymore.
Best regards Peter
Thank you so much for your help! I’ll proceed with the programs as you suggested.
I have another question regarding the use of bed files in Panacus.
FYI, I retrieved the bed file from CHM13 JHU RefSeqv110 + Liftoff v5.2 (https://github.com/marbl/CHM13/tree/master) and converted the GFF3 file into a six-column bed file.
Step 4 (bed file) RUST_LOG=info panacus ordered-histgrowth -c bp -t${PBS_NCPUS} -l 1,2,4,49 -s ${CHM13_BED} -S -e ${OUTPUT_DIR}/pggb_hprc.exculde.samples.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0bed.ordhstgrw.bp.tsv
However, the generated TSV file contains only “0” values. Do I need to provide a separate bed file for each assembly? Is there a more effective way to use the -s parameter in Panacus?
Looking forward to your insights.
Kind regards,
Taek
Yes, there is definitely an issue there. I'm not sure about what exactly the problem is with the bed file. Can you maybe send me the first few lines of the bed file? Likely there is a mismatch between the first field of the bed file and the path names in the graph.
Best regards Peter
This is the bed file I used.
chr1 6046 13941 transcript_id . - chr1 6046 6420 transcript_id . - chr1 12077 12982 transcript_id . - chr1 13444 13579 transcript_id . - chr1 13679 13941 transcript_id . - chr1 15079 21429 transcript_id . + chr1 15079 15564 transcript_id . + chr1 20565 21429 transcript_id . + chr1 20528 37628 transcript_id . - chr1 20528 21087 transcript_id . -
cheers,
Taek
Thank you for the file.
Okay, so the problem is that panacus cannot match the "chr1" in the first field of the bed file to any path in the graph. If you replace the "chr1" everywhere in the bed file with "chm13#chr1" (name of the corresponding path in the graph) it should work.
However, this will give you a hist/growth file having only two lines, since the graph is then subsetted to only have the chm13 path. If you wanted to subset all paths correspondingly you would have to liftover the annotations to other paths and include all of them in the bed file.
Best regards Peter
Thank you for your reply.
For the single merged subset bed file, I created a dummy bed file (see below) to test the -s parameter, and it worked.
chm13#chr1 6046 13941 transcript_id . - chm13#chr1 12077 12982 transcript_id . - HG01891#chr1 13444 13579 transcript_id . - HG02109#chr1 15079 21429 transcript_id . + . . .
Hope this is the correct approach.
cheers,
Taek
Yes, this should be the correct approach.
Best regards Peter
Happy New Year!
I have another burning question. When calculating ordered (accumulated) growth, how does the program handle relative deletions/insertions compared to the reference or other samples?
Happy new year to you too!
I'm not 100% sure I understand the question completely, but let me try to explain the ordered growth.
Panacus itself has no concept of insertions/deletions, it just works on the graph itself. So everything that is not included in the first path in a ordered growth plot counts towards the growth of the graph. Since deletions are represented not by additional nodes, but by additional edges, they are only represented in growth curves that look at the growth in edges. Panacus also has no concept of a reference, so it can be used with reference-free graphs. In the ordered histgrowth plot only the order is relevant.
Best regards Peter
Hi again,
I hope you're doing well. I’d like to verify a few points regarding how the -e and -s options work in Panacus (v0.3.3) for calculating pangenome growth.
Based on the manual:
- -e is used to exclude node IDs from the growth calculation.
- -s is used to subset (i.e., include only specified node IDs).
For both -e and -s, I used 3-column BED files with format:
- NodeID, Start, End
- e.g. chm13#2#chr1 200000 5000000
Does Panacus’s -e option interpret the first column of the BED as a list of path names and then exclude all node IDs that appear on any of those paths, regardless of the start and end coordinates?
I’d like to exclude only the intervals specified in -e ${REF_BED} (not the entire node/path), while still including other masked features from -s.
Thank you for your help!
Best regards,
Taek
Hi Taek,
panacus should be able to exclude parts of the path depending on the coordinates. This seems to be some kind of issue. Can you send me what command you are running and how the files look? E.g. path names in the GFA/the BED file?
Best regards Peter
@OZTaekOppa Hi Taek, it seems you ran into some panacus bug that we'd like to get hold of. Can you please send us a mock dataset for which the error occurs?