Advanced options

Open melnel000 opened this issue 4 years ago • 10 comments

Hi Egor

What is the difference between the dag-aligner and path-aligner and the seeking and streaming analysis modes under the advanced options? I would like to understand when it may be useful to specify these options.

Thanks, Melissa

melnel000 avatar Jan 05 '21 10:01 melnel000

Hi Melissa,

Thanks for the question! The advanced options correspond to settings with very narrow use cases that are mainly useful for testing the program.

The --aligner option selects the read alignment algorithm used by the program. dag-aligner is the better of the two: it is faster than path-aligner and supports affine gap penalties.

The --analysis-mode option determines whether ExpansionHunter analyzes repeat regions sequentially (seeking) or all at once (streaming). Streaming mode can significantly speed up the analysis of large repeat catalogs, but it also requires a lot of memory.

I hope this helps! Please let me know if you have any other questions. Egor

PS: In case you'd like to try it out, we just released a new visualization tool for STRs.

egor-dolzhenko avatar Jan 05 '21 16:01 egor-dolzhenko
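
For readers who want to try the two options described in the reply above, here is a minimal sketch of how they might be passed on the command line. It is a small Python helper that only builds the argument list and does not run the program. The --aligner and --analysis-mode flags and their values come from this thread; the --reads, --reference, --variant-catalog, and --output-prefix flags are taken from the ExpansionHunter usage documentation rather than from this thread. The helper name build_command, its defaults, and the file names are hypothetical.

```python
import shlex

# Values named in the thread for the two advanced options.
ALIGNERS = ("dag-aligner", "path-aligner")
ANALYSIS_MODES = ("seeking", "streaming")


def build_command(reads, reference, catalog, output_prefix,
                  aligner="dag-aligner", analysis_mode="seeking"):
    """Build an ExpansionHunter argument list (hypothetical helper).

    The defaults are assumptions of this sketch, not a statement about
    the program's own defaults.
    """
    if aligner not in ALIGNERS:
        raise ValueError(f"unknown aligner: {aligner!r}")
    if analysis_mode not in ANALYSIS_MODES:
        raise ValueError(f"unknown analysis mode: {analysis_mode!r}")
    return [
        "ExpansionHunter",
        "--reads", reads,
        "--reference", reference,
        "--variant-catalog", catalog,
        "--output-prefix", output_prefix,
        "--aligner", aligner,
        "--analysis-mode", analysis_mode,
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute your own inputs.
    cmd = build_command("sample.cram", "reference.fa", "catalog.json",
                        "sample_out", analysis_mode="streaming")
    # Print a shell-quoted command line for inspection.
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the binary is on the PATH. Validating the two option values up front keeps a typo from reaching the program.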

Hello Egor,

Is there any difference in accuracy between the seeking and streaming values of --analysis-mode? Also, when you mentioned a lot of memory, could you give an example comparing the two in terms of CPU/RAM requirements and running time for N repeats x N individuals? Thank you

wsproviero avatar Feb 20 '21 13:02 wsproviero

Thanks for the questions! There is no difference in accuracy between seeking and streaming analysis modes. And I will get back to you about the latest runtime/memory benchmarks. But overall, the streaming mode is currently only practical for up to 10-20K repeats.

I am working with @yjqiu, @felixschlesinger, and @kscheffler on significantly reducing the memory requirements of the streaming mode. @yjqiu is also getting ready to release a fairly large repeat catalog. All this should be done in 2-3 months and possibly much sooner.

Also, if you'd like to discuss the analysis you are planning to do, please feel free to reach out by email.

egor-dolzhenko avatar Feb 21 '21 05:02 egor-dolzhenko
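
The "10-20K repeats" figure in the reply above suggests a quick size check on a catalog before choosing streaming mode. The following is a rough sketch under stated assumptions: that the variant catalog is a JSON array with one object per locus (as in the catalogs shipped with ExpansionHunter), and that the number of entries is a reasonable proxy for the number of repeats, even though a single locus entry may describe more than one repeat. The function names and the 10,000 cutoff (the lower end of the quoted range) are hypothetical choices, not part of the program.

```python
import json

# Lower end of the "10-20K repeats" range quoted in the thread;
# treat it as a rough guide, not a hard limit.
STREAMING_PRACTICAL_LIMIT = 10_000


def count_catalog_entries(catalog_path):
    """Return the number of top-level entries in a variant catalog.

    Assumes the catalog is a JSON array of locus records.
    """
    with open(catalog_path) as handle:
        catalog = json.load(handle)
    if not isinstance(catalog, list):
        raise ValueError("expected the catalog to be a JSON array")
    return len(catalog)


def streaming_looks_practical(num_entries, limit=STREAMING_PRACTICAL_LIMIT):
    """Heuristic only: True if the catalog is within the assumed limit."""
    return num_entries <= limit
```

Since accuracy is the same in both modes according to the reply above, a check like this only informs the speed versus memory trade-off.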

Hello Egor,

This is excellent. Thank you very much for any further information you can give me. I am also trying to use the streaming mode but I have this error. Could you please let me know what I am doing wrong? Thank you again.

2021-02-25T12:06:12,[Starting ExpansionHunter v4.0.1]
2021-02-25T12:06:12,[Analyzing sample HG00479.final]
2021-02-25T12:06:12,[Initializing reference GRCh38_full_analysis_set_plus_decoy_hla.fa]
2021-02-25T12:06:12,[Loading variant catalog from disk TEMPLATE2_a.json]
2021-02-25T12:06:14,[Running sample analysis in streaming mode]
Failed to populate reference for id 0
Unable to fetch reference #0 9996..29231
Failure to decode slice
2021-02-25T12:06:19,[Failed to extract a record from HG00479.final.cram]

wsproviero avatar Feb 25 '21 12:02 wsproviero

Hello William,

Sorry about the error. It looks like there is a CRAM parsing bug in the streaming mode. Could you please check if this Linux binary works? (Let me know if you need a binary for a different platform.)

ExpansionHunter-v4.0.3.gz

egor-dolzhenko avatar Feb 25 '21 21:02 egor-dolzhenko

Dear Egor,

The binary works and runs smoothly and fast. From what I can tell, it uses a total of 23 GB of RAM. It uses only one CPU thread, which runs at 100%.

Many Thanks for your help!

wsproviero avatar Feb 25 '21 22:02 wsproviero

Glad to hear it, William! And thanks for checking the memory usage. We will work on reducing memory consumption in the future releases.

egor-dolzhenko avatar Feb 25 '21 23:02 egor-dolzhenko

Good day. Could someone please clarify whether --analysis-mode is needed only because of the large size of variant catalogs, or whether it can also be used to process several input files (I mean .bam files)? Thank you.

katerinaoleynikova avatar Mar 02 '21 12:03 katerinaoleynikova

Thanks for the question. That's right, the "streaming" analysis mode is meant for the analysis of larger variant catalogs. It has no other purpose. In future versions of the program, the streaming mode will be much more efficient, making it possible to analyze large catalogs.

egor-dolzhenko avatar Mar 03 '21 04:03 egor-dolzhenko

Big thanks and good luck!

katerinaoleynikova avatar Mar 03 '21 09:03 katerinaoleynikova