graphtyper icon indicating copy to clipboard operation
graphtyper copied to clipboard

regenotyped samples produces all Genptype (GT) with ./.

Open Solyris83 opened this issue 3 years ago • 8 comments

Hi, I have regenotyped 9000+ samples and gotten a small set of 72 samples with all genotype (GT) results to unknown (./.)

The other 9000 samples have around 2400 structural variant (SV) which are having unknown genotype out of a grand total of ~200,000 structural variant (SV) loci

I have included herein a list of log file for 1 sample with -vverbose in Grapyhtyper run from a scatter-gather approach. work_Graphtyper2.zip

The command run is found below with shell variables to the respective needed items graphtyper genotype_sv $REF_FA $MERGED_VCF
--region $region --sam $INPUT_CRAM --output $output_dir --log $output_dir/log.txt
--vverbose

These samples have a relevant Manta processed VCF too which managed to call out Genotypes.

I am unable to provide the VCF file due to non-disclosure clause for these samples.

Please do advise on what other files are useful in diagnosing the problem.

Solyris83 avatar Feb 08 '21 05:02 Solyris83

Hi, I looked at one log file (it was for chr13) which had the message "No reads found in BAM" thrown in every region. graphtyper has several read filters set which might accidentally be filtering out every read for some reason. Hard to say why without the files. Can you show me a handful of reads (i.e. using samtools view $INPUT_CRAM chr13:20000000-20100000 | head) so I can get an idea why the reads are being filtered? You may remove the DNA sequence if you need to due to non-disclosure clause. Could you also tell which version of graphtyper you are using?

Best, Hannes

hannespetur avatar Feb 08 '21 12:02 hannespetur

Hi Hannes,

Thanks for the quick reply, the Graphtyper version used is VERSION 2.5.1.

Reads captured with the below line can be found above, as said the actual bases are replaced with “BASE”. samtools view --reference Homo_sapiens_assembly38.fasta sample.bqsr.cram chr13:20000001-21000000 | head -20

Regards Zhihui

From: Hannes Pétur Eggertsson [email protected] Sent: Monday, February 8, 2021 8:35 PM To: DecodeGenetics/graphtyper [email protected] Cc: LI Zhihui [email protected]; Author [email protected] Subject: Re: [DecodeGenetics/graphtyper] regenotyped samples produces all Genptype (GT) with ./. (#67)

Hi, I looked at one log file (it was for chr13) which had the message "No reads found in BAM" thrown in every region. graphtyper has several read filters set which might accidentally be filtering out every read for some reason. Hard to say why without the files. Can you show me a handful of reads (i.e. using samtools view $INPUT_CRAM chr13:20000000-20100000 | head) so I can get an idea why the reads are being filtered? You may remove the DNA sequence if you need to due to non-disclosure clause. Could you also tell which version of graphtyper you are using?

Best, Hannes

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/DecodeGenetics/graphtyper/issues/67#issuecomment-775116673, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AP2SLWVM4ZB6VFW5DPRQ6HDS57LA3ANCNFSM4XILEORQ.

This e-mail and any attachments are only for the use of the intended recipient and may contain material that is confidential, privileged and/or protected by the Official Secrets Act. If you are not the intended recipient, please delete it or notify the sender immediately. Please do not copy or use it for any purpose or disclose the contents to any other person.

Solyris83 avatar Feb 09 '21 05:02 Solyris83

Where above? I don't see the output.

hannespetur avatar Feb 09 '21 12:02 hannespetur

Hmmm, I sent it in the email, must have been cut off. Sorry about that, attached herein from Github page example.zip

Solyris83 avatar Feb 10 '21 01:02 Solyris83

Okay thanks, these reads look normal to me.. I don't see any reason why graphtyper would ignore those. The problem must be something else, maybe it is having some issues retrieving the reads from the CRAM.

I see that you have set a custom reference in your samtools view command, was this necessary to open the CRAM? If so, it is possibly the issue. This means the reference is not specified in the CRAM header and also not specified in either REF_PATH or REF_CACHE environment variables, and then htslib can't tell which reference FASTA it should use. This should have triggered an error message from htslib but these would not be in the graphtyper logs. graphtyper v2.5.x uses the same search order for a reference as samtools view with no custom reference, which is essentially

  1. The REF_PATH and,
  2. the REF_CACHE environment variables. These ref caches can be created with misc/seq_cache_populate.pl from htslib. See more info: http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES
  3. Reference FASTA from the CRAM header.

In graphtyper v2.6.x I introduced also the option of using the input FASTA to use for CRAM reading in genotype_sv with the advanced option --force_use_input_ref_for_cram_reading. This essentially does the same as using a custom FASTA reference in samtools view. Can you try either creating these ref caches and specify them in REF_CACHE or upgrade to graphtyper v2.6.x and use the option --force_use_input_ref_for_cram_reading?

Best, Hannes

hannespetur avatar Feb 11 '21 16:02 hannespetur

Hi Hannes,

Thanks for the comment above, I had a quick look at the CRAM file header and indeed it is as what you have said the header reference sequence's location are missing as they are deleted already. This deletion is due to the analysis being done with Nextflow which has the work directory "cleaned-up" after analysis, and the CRAM header points the ref sequence to items in the work directory.

Following that, we have tried 2.6.2 as suggested and it indeed managed to give a genotype call.

But then we would like to keep the version number the same compared to the rest of the cohort, and the tries at REF_PATH setting has not been successful so far. We have the downloaded ref sequences in a local directory and setting REF_PATH="directory:" in my bashrc does not work. I will probably try a few other combination, as the directory currently contains a soft-link to the actual fasta file, or even the actual fasta file itself and see which works.

But if say for example, I did not manage to get it to work on version 2.5 may I know if the 2 version (2.5 vs 2.6.2) has any significant changes in terms of the regenotype calling algorithm?

Regards Zhihui

Solyris83 avatar Feb 22 '21 07:02 Solyris83

But then we would like to keep the version number the same compared to the rest of the cohort, and the tries at REF_PATH setting has not been successful so far. We have the downloaded ref sequences in a local directory and setting REF_PATH="directory:" in my bashrc does not work. I will probably try a few other combination, as the directory currently contains a soft-link to the actual fasta file, or even the actual fasta file itself and see which works.

The REF_PATH isn't expecting to be a fasta file but a file containing directory with raw sequences (no newlines) named by their md5sum. You can create those sequences with this script https://github.com/samtools/samtools/blob/develop/misc/seq_cache_populate.pl . Download the script and populate a directory with your input FASTA. I don't think I can explain the process better than the samtools manual entry for it so please refer to the manual for more details: http://www.htslib.org/doc/samtools.html#REFERENCE_SEQUENCES

There are definately differences in the genotyping algorithm between the two versions. The raw genotyping calls should be very similar of course, but there are multiple changes to INFO fields so, for example, I think "graphtyper vcf_merge" is likely to break if the inputs have different versions.

Best, Hannes

hannespetur avatar Feb 22 '21 10:02 hannespetur

Hi Hannes,

Thanks for the recommendation, I had a harder look at the samtools documentation and realised the mistake.

Following that, I have done the following but Graphtyper2 is still producing GT with ./.

  1. /seq_cache_populate.pl -root /ref_cache Homo_sapiens_assembly38.fasta
  2. To confirm that samtools without REF_PATH is not working $samtools view WHB3537.bqsr.cram | head [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0 9998..44245 [E::cram_next_slice] Failure to decode slice [main_samview] truncated file.
  3. Add REF_PATH and REF_CACHE to bashrc and source
  4. Test samtools to confirm REF_PATH and REF_CACHE works after adding to environment $samtools view WHB3537.bqsr.cram | head Mapped reads output successfully
  5. Run Graphtyper2

Do you have any other suggestion?

Reagrds Zhihui

Solyris83 avatar Feb 23 '21 09:02 Solyris83