merqury
merqury copied to clipboard
Question: a way to extract read-only k-mers
Hi again Dr. Rhie, I'm having a great time using merqury to asses quality and completeness of different versions of our assembly. It's an awesome tool and it is allowing us to make more informed decisions about haplotig purging and other parameter tweaks. So thanks again!
As I was examining our spectra I noticed our assembly is missing some heterozygosity ie a black (read-only) peak at 1-copy.
I was wondering if there is a way to pull out those kmers specifically and then the reads they came from. My thought is that we could then assemble at least part of the missing heterozygosity using the short reads then add that to our haplotigs to get a more complete assembly.
Can you suggest a way to extract those kmers and their sequence for such a purpose?
Thanks in advance, Ben
Hi Ben,
Glad to hear that merqury is being useful!
1. Collect all k-mers in your assembly
meryl count k=$k asm1.fasta output asm1.meryl
# If you have 2 haplotype assemblies, do the next as well
meryl count k=$k asm2.fasta output asm2.meryl
meryl union-sum asm1.meryl asm2.meryl output asm1_and_asm2.meryl
2. Subtract them from the read set
meryl difference read.meryl asm1.meryl output read.0.meryl
# or
meryl difference read.meryl asm1_and_asm2.meryl output read.0.meryl
read.0.meryl
contains k-mer counts in the reads, not observed in your assemblies.
You can double check if the histogram matches what you saw in the spectra-cn or spectra-asm plot. meryl histogram
generates the histogram.
3. Extract reads containing the read-only mers
meryl-lookup -include
is designed for this purpose.
# For paired-end reads (which I assume you are planning to apply):
meryl-lookup -include -sequence my_R1.fastq.gz -mers read.0.meryl -sequence2 R2.fastq.gz -r2 read_only_R2.fastq.gz | pigz -c > read_only_R1.fastq.gz
# For single-end reads (useful if you have long-reads):
meryl-lookup -include -sequence my_read.fastq.gz -mers read.0.meryl | pigz -c > read_only.fastq.gz
Hope this helps! Arang
Awesome! thanks so much for the quick and thoroughly laid out answer!
Forgot to add the -c
in pigz. Fixed in my above comment. Feel free to come back in case there are any other errors.
Will leave this issue open so I remember to document this somewhere.
Arang, I realize that this is out of the scope of Merqury issues, but I'm having some memory issues running the read extraction steps.
Meryl-lookup was using something like 260G of RAM (out of 400+ available) but crashed with a cannot allocate memory error:
-- Loading kmers from 'read.0.meryl' into lookup table.
p prefixes bits gigabytes (allowed: 472 GB)
-- -------------- ---------------- ---------
22 4194304 137555580373 16.014
23 8388608 134303832626 15.635
24 16777216 131320520335 15.288
25 33554432 128874078956 15.003
26 67108864 127501379401 14.843 (smallest)
27 134217728 128276163494 14.933
28 268435456 133345914883 15.524
29 536870912 147005600864 17.114
30 1073741824 177845156029 20.704
31 2147483648 243044449562 28.294
32 4294967296 376963219831 43.884
33 8589934592 648320943572 75.474
34 17179869184 1194556574257 139.065
35 34359738368 2290548018830 266.655 (used)
36 68719476736 4486051091179 522.245
37 137438953472 8880577419080 1033.835
38 274877906944 17673150258085 2057.425
39 549755813888 35261816119298 4105.016
-- -------------- ---------------- ---------
For 3520183203 distinct 20-mers (with 35 bits used for indexing and 5 bits for tags):
266.655 GB memory
256.000 GB memory for index (34359738368 elements 64 bits wide)
2.049 GB memory for tags (3520183203 elements 5 bits wide)
8.606 GB memory for data (3520183203 elements 21 bits wide)
Will load 3520183203 kmers. Skipping 0 (too low) and 0 (too high) kmers.
Allocating space for 3520183203 suffixes of 5 bits each -> 17600916015 bits (2.049 GB) in blocks of 32.000 MB
3520183203 values of 21 bits each -> 73923847263 bits (8.606 GB) in blocks of 32.000 MB
Loaded 3520183203 kmers. Skipped 0 (too low) and 0 (too high) kmers.
-- Opening sequences in 'R1.uniq.fastq.gz'.
ERROR: Failed to open input file 'R1.uniq.fastq.gz': Cannot allocate memory
I'm using meryl from here: https://github.com/marbl/meryl
Any advice? Thanks in advance! Ben
Arang, I've just upped the memory requested to 512G and seems to be running well. I'll update if something still fails. Thanks in anycase Ben
Hi Ben,
Sorry for the delayed reply. You can set -memory
option in meryl-lookup.
For your example, you could use -memory 20
which will limit to use at maximum 20G.
Arang
Not a delayed response at all - Nothing to apologize for.
Thanks I found that -memory
option and raised it to match my requested memory from the scheduler and all seems to run well now but rather slow.
I wonder if we are wasting time extracting a lot of "error-kmer" reads (ie those w/ less multiplicity than the 1-copy mers)?
I'm looking in to first clipping the data base using meryl greater-than N
and then running lookup.
What do you think? is this a smart strategy or am I going to loose a lot of data that might be useful for the purpose of assembling those reads?
Thanks in advance,
Ben
I think that's a fair strategy. I'd have done the same thing.
From the histogram (meryl histogram read.0.meryl
), I'll pick the multiplicity N (first col) which has the minimum count (2nd col).
All k-mers below this N are likely erroneous anyway.
You could also limit the high-copy k-mers, with meryl less-than
, to only include the single copy k-mers to avoid reads from repetitive regions not assembled.
Arang
Awesome thanks!
Hi there,
Has the syntax for meryl-lookup
changed? I updated to v1.3 and ran the same commands as I have in the past that worked fine but now it seems parts to the command aren't being recognized. I have attached a screenshot of the error. Any help would be appreciated. Thanks.
Hello @kneubehl , it seems like your meryl is not updated, or used from the previous path. Could you double check this is the path where v1.3 was installed?
I checked the version of both meryl and meryl-lookup and they are v1.3. Meryl-1.3 is located in /usr/local/bin/ in path.
Regards,
Alex Kneubehl
PhD Candidate
Laboratory of Dr. Job Lopez Vector Biology and Bacterial Pathogens Lab
Department of Pediatrics
National School of Tropical Medicine at Baylor College of Medicine
Twitter: @AlexKneubehl
From: Arang Rhie @.> Sent: Wednesday, September 15, 2021 7:58 AM To: marbl/merqury @.> Cc: Kneubehl, Alexander Robert @.>; Mention @.> Subject: Re: [marbl/merqury] Question: a way to extract read-only k-mers (#9)
CAUTION: This email is not from a BCM Source. Only click links or open attachments you know are safe.
Hello @kneubehlhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_kneubehl&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=1uCrRAd4twH1SiJW__K4hyA_zjt6hoGJEPLTrig080I&e= , it seems like your meryl is not updated, or used from the previous path. Could you double check this is the path where v1.3 was installed?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_marbl_merqury_issues_9-23issuecomment-2D919995260&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=Le9LdB8fp51INPoZi9EH0mbMceeK9vmoWvIzuLRPOF8&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AOBL37DUOH76MPJZ3ZVK5K3UCCJ7DANCNFSM4OJTDNIA&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=AWNlUsmBzOQIczPYBm4Woz_lIxhdDVWweC1_R54G4Wg&e=. Triage notifications on the go with GitHub Mobile for iOShttps://urldefense.proofpoint.com/v2/url?u=https-3A__apps.apple.com_app_apple-2Dstore_id1477376905-3Fct-3Dnotification-2Demail-26mt-3D8-26pt-3D524675&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=q5Xj_oQVXLG45JyYm8EpRajrgIxOcEllXPfImVJin4Y&e= or Androidhttps://urldefense.proofpoint.com/v2/url?u=https-3A__play.google.com_store_apps_details-3Fid-3Dcom.github.android-26referrer-3Dutm-5Fcampaign-253Dnotification-2Demail-2526utm-5Fmedium-253Demail-2526utm-5Fsource-253Dgithub&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=_wd1MVrVilyIjJXohbw57iX2LkbqwMUFXL1b8sBP1gw&e=.
Alex, this is the help message for meryl-lookup in v1.3:
usage: ./meryl-lookup <report-type> \
-sequence <input1.fasta> [<input2.fasta>] \
-output <output1> [<output2>] \
-mers <input1.meryl> [<input2.meryl>] [...] [-estimate] \
-labels <input1name> [<input2name>] [...]
Compare kmers in input sequences against kmers in input meryl databases.
Input sequences (-sequence) can be FASTA or FASTQ, uncompressed, or
compressed with gzip, xz, or bzip2.
Report types:
-bed:
Generate a BED format file showing the location of kmers in
any input database on each sequence in 'input1.fasta'.
-wig-count:
Generate a WIGGLE format file showing the multiplicity of the
kmer starting at each position in the sequence, if it exists in
an input kmer database.
-wig-depth:
Generate a WIGGLE format file showing the number of kmers in
any input database that cover each position in the sequence.
-existence:
Generate a tab-delimited line for each input sequence with the
number of kmers in the sequence, in the database and common to both.
-include:
-exclude:
Copy sequences from 'input1.fasta' (and 'input2.fasta') to the
corresponding output file if the sequence has at least one kmer
present (include) or no kmers present (exclude) in 'input1.meryl'.
Run `./meryl-lookup <report-type> -help` for details on each method.
No report-type (-bed, -wig-count, -wig-depth, -existence, -include, -exclude) supplied.
How did you install meryl?
Also please use this script for excluding reads having kmers from a given kmer db. The syntax is:
meryl-lookup -exclude -sequence $read1 $read2 -mers $meryl -output ${out_1} ${out_2}
for paired-end (short) reads
or
meryl-lookup -exclude -sequence $read1 -mers $meryl -output ${out_1}
for single-end (long) reads.