merqury icon indicating copy to clipboard operation
merqury copied to clipboard

Question: a way to extract read-only k-mers

Open bmansfeld opened this issue 4 years ago • 14 comments

Hi again Dr. Rhie, I'm having a great time using merqury to asses quality and completeness of different versions of our assembly. It's an awesome tool and it is allowing us to make more informed decisions about haplotig purging and other parameter tweaks. So thanks again!

As I was examining our spectra I noticed our assembly is missing some heterozygosity ie a black (read-only) peak at 1-copy.

I was wondering if there is a way to pull out those kmers specifically and then the reads they came from. My thought is that we could then assemble at least part of the missing heterozygosity using the short reads then add that to our haplotigs to get a more complete assembly.

Can you suggest a way to extract those kmers and their sequence for such a purpose?

Thanks in advance, Ben

bmansfeld avatar Jun 26 '20 19:06 bmansfeld

Hi Ben,

Glad to hear that merqury is being useful!

1. Collect all k-mers in your assembly

meryl count k=$k asm1.fasta output asm1.meryl
# If you have 2 haplotype assemblies, do the next as well
meryl count k=$k asm2.fasta output asm2.meryl
meryl union-sum asm1.meryl asm2.meryl output asm1_and_asm2.meryl

2. Subtract them from the read set

meryl difference read.meryl asm1.meryl output read.0.meryl
# or
meryl difference read.meryl asm1_and_asm2.meryl output read.0.meryl

read.0.meryl contains k-mer counts in the reads, not observed in your assemblies. You can double check if the histogram matches what you saw in the spectra-cn or spectra-asm plot. meryl histogram generates the histogram.

3. Extract reads containing the read-only mers

meryl-lookup -include is designed for this purpose.

# For paired-end reads (which I assume you are planning to apply):
meryl-lookup -include -sequence my_R1.fastq.gz -mers read.0.meryl -sequence2 R2.fastq.gz -r2 read_only_R2.fastq.gz | pigz -c > read_only_R1.fastq.gz

# For single-end reads (useful if you have long-reads):
meryl-lookup -include -sequence my_read.fastq.gz -mers read.0.meryl | pigz -c > read_only.fastq.gz

Hope this helps! Arang

arangrhie avatar Jun 26 '20 19:06 arangrhie

Awesome! thanks so much for the quick and thoroughly laid out answer!

bmansfeld avatar Jun 26 '20 19:06 bmansfeld

Forgot to add the -c in pigz. Fixed in my above comment. Feel free to come back in case there are any other errors. Will leave this issue open so I remember to document this somewhere.

arangrhie avatar Jun 26 '20 20:06 arangrhie

Arang, I realize that this is out of the scope of Merqury issues, but I'm having some memory issues running the read extraction steps.

Meryl-lookup was using something like 260G of RAM (out of 400+ available) but crashed with a cannot allocate memory error:

-- Loading kmers from 'read.0.meryl' into lookup table.

 p       prefixes             bits gigabytes (allowed: 472 GB)
-- -------------- ---------------- ---------
22        4194304     137555580373    16.014
23        8388608     134303832626    15.635
24       16777216     131320520335    15.288
25       33554432     128874078956    15.003
26       67108864     127501379401    14.843 (smallest)
27      134217728     128276163494    14.933
28      268435456     133345914883    15.524
29      536870912     147005600864    17.114
30     1073741824     177845156029    20.704
31     2147483648     243044449562    28.294
32     4294967296     376963219831    43.884
33     8589934592     648320943572    75.474
34    17179869184    1194556574257   139.065
35    34359738368    2290548018830   266.655 (used)
36    68719476736    4486051091179   522.245
37   137438953472    8880577419080  1033.835
38   274877906944   17673150258085  2057.425
39   549755813888   35261816119298  4105.016
-- -------------- ---------------- ---------

For 3520183203 distinct 20-mers (with 35 bits used for indexing and 5 bits for tags):
  266.655 GB memory
  256.000 GB memory for index (34359738368 elements 64 bits wide)
    2.049 GB memory for tags  (3520183203 elements 5 bits wide)
    8.606 GB memory for data  (3520183203 elements 21 bits wide)

Will load 3520183203 kmers.  Skipping 0 (too low) and 0 (too high) kmers.
Allocating space for 3520183203 suffixes of 5 bits each -> 17600916015 bits (2.049 GB) in blocks of 32.000 MB
                     3520183203 values   of 21 bits each -> 73923847263 bits (8.606 GB) in blocks of 32.000 MB
Loaded 3520183203 kmers.  Skipped 0 (too low) and 0 (too high) kmers.
-- Opening sequences in 'R1.uniq.fastq.gz'.
ERROR:  Failed to open input file 'R1.uniq.fastq.gz': Cannot allocate memory

I'm using meryl from here: https://github.com/marbl/meryl

Any advice? Thanks in advance! Ben

bmansfeld avatar Jul 03 '20 19:07 bmansfeld

Arang, I've just upped the memory requested to 512G and seems to be running well. I'll update if something still fails. Thanks in anycase Ben

bmansfeld avatar Jul 06 '20 19:07 bmansfeld

Hi Ben,

Sorry for the delayed reply. You can set -memory option in meryl-lookup. For your example, you could use -memory 20 which will limit to use at maximum 20G.

Arang

arangrhie avatar Jul 07 '20 14:07 arangrhie

Not a delayed response at all - Nothing to apologize for. Thanks I found that -memory option and raised it to match my requested memory from the scheduler and all seems to run well now but rather slow. I wonder if we are wasting time extracting a lot of "error-kmer" reads (ie those w/ less multiplicity than the 1-copy mers)? I'm looking in to first clipping the data base using meryl greater-than N and then running lookup. What do you think? is this a smart strategy or am I going to loose a lot of data that might be useful for the purpose of assembling those reads? Thanks in advance, Ben

bmansfeld avatar Jul 07 '20 14:07 bmansfeld

I think that's a fair strategy. I'd have done the same thing. From the histogram (meryl histogram read.0.meryl), I'll pick the multiplicity N (first col) which has the minimum count (2nd col). All k-mers below this N are likely erroneous anyway.

You could also limit the high-copy k-mers, with meryl less-than, to only include the single copy k-mers to avoid reads from repetitive regions not assembled.

Arang

arangrhie avatar Jul 07 '20 14:07 arangrhie

Awesome thanks!

bmansfeld avatar Jul 07 '20 14:07 bmansfeld

Hi there,

Has the syntax for meryl-lookup changed? I updated to v1.3 and ran the same commands as I have in the past that worked fine but now it seems parts to the command aren't being recognized. I have attached a screenshot of the error. Any help would be appreciated. Thanks. meryl error

kneubehl avatar Aug 25 '21 14:08 kneubehl

Hello @kneubehl , it seems like your meryl is not updated, or used from the previous path. Could you double check this is the path where v1.3 was installed?

arangrhie avatar Sep 15 '21 12:09 arangrhie

I checked the version of both meryl and meryl-lookup and they are v1.3. Meryl-1.3 is located in /usr/local/bin/ in path.

Regards,

Alex Kneubehl

PhD Candidate

Laboratory of Dr. Job Lopez Vector Biology and Bacterial Pathogens Lab

Department of Pediatrics

National School of Tropical Medicine at Baylor College of Medicine

Twitter: @AlexKneubehl


From: Arang Rhie @.> Sent: Wednesday, September 15, 2021 7:58 AM To: marbl/merqury @.> Cc: Kneubehl, Alexander Robert @.>; Mention @.> Subject: Re: [marbl/merqury] Question: a way to extract read-only k-mers (#9)

CAUTION: This email is not from a BCM Source. Only click links or open attachments you know are safe.


Hello @kneubehlhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_kneubehl&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=1uCrRAd4twH1SiJW__K4hyA_zjt6hoGJEPLTrig080I&e= , it seems like your meryl is not updated, or used from the previous path. Could you double check this is the path where v1.3 was installed?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_marbl_merqury_issues_9-23issuecomment-2D919995260&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=Le9LdB8fp51INPoZi9EH0mbMceeK9vmoWvIzuLRPOF8&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AOBL37DUOH76MPJZ3ZVK5K3UCCJ7DANCNFSM4OJTDNIA&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=AWNlUsmBzOQIczPYBm4Woz_lIxhdDVWweC1_R54G4Wg&e=. Triage notifications on the go with GitHub Mobile for iOShttps://urldefense.proofpoint.com/v2/url?u=https-3A__apps.apple.com_app_apple-2Dstore_id1477376905-3Fct-3Dnotification-2Demail-26mt-3D8-26pt-3D524675&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=q5Xj_oQVXLG45JyYm8EpRajrgIxOcEllXPfImVJin4Y&e= or Androidhttps://urldefense.proofpoint.com/v2/url?u=https-3A__play.google.com_store_apps_details-3Fid-3Dcom.github.android-26referrer-3Dutm-5Fcampaign-253Dnotification-2Demail-2526utm-5Fmedium-253Demail-2526utm-5Fsource-253Dgithub&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=_wd1MVrVilyIjJXohbw57iX2LkbqwMUFXL1b8sBP1gw&e=.

kneubehl avatar Sep 15 '21 19:09 kneubehl

Alex, this is the help message for meryl-lookup in v1.3:

usage: ./meryl-lookup <report-type> \
         -sequence <input1.fasta> [<input2.fasta>] \
         -output   <output1>      [<output2>] \
         -mers     <input1.meryl> [<input2.meryl>] [...] [-estimate] \
         -labels   <input1name>   [<input2name>]   [...]

  Compare kmers in input sequences against kmers in input meryl databases.

  Input sequences (-sequence) can be FASTA or FASTQ, uncompressed, or
  compressed with gzip, xz, or bzip2.

  Report types:

  -bed:
     Generate a BED format file showing the location of kmers in
     any input database on each sequence in 'input1.fasta'.

  -wig-count:
     Generate a WIGGLE format file showing the multiplicity of the
     kmer starting at each position in the sequence, if it exists in
     an input kmer database.

  -wig-depth:
     Generate a WIGGLE format file showing the number of kmers in
     any input database that cover each position in the sequence.

  -existence:
     Generate a tab-delimited line for each input sequence with the
     number of kmers in the sequence, in the database and common to both.

  -include:
  -exclude:
     Copy sequences from 'input1.fasta' (and 'input2.fasta') to the
     corresponding output file if the sequence has at least one kmer
     present (include) or no kmers present (exclude) in 'input1.meryl'.

Run `./meryl-lookup <report-type> -help` for details on each method.

No report-type (-bed, -wig-count, -wig-depth, -existence, -include, -exclude) supplied.

How did you install meryl?

arangrhie avatar Sep 15 '21 19:09 arangrhie

Also please use this script for excluding reads having kmers from a given kmer db. The syntax is:

meryl-lookup -exclude -sequence $read1 $read2 -mers $meryl -output ${out_1} ${out_2}

for paired-end (short) reads

or

meryl-lookup -exclude -sequence $read1 -mers $meryl -output ${out_1}

for single-end (long) reads.

arangrhie avatar Sep 15 '21 19:09 arangrhie