pbbioconda
pbbioconda copied to clipboard
isoseq refine
Hi,
I am running the isoseq pipeline documented here: https://isoseq.how/umi/cli-workflow.html
Using the single-cell iso-seq workflow.
I start my pipeline with hifi reads generated from 10x 3' chemistry and have run the first few steps: primer removal, tag removal and now am refining the reads
isoseq refine ${tag_removed_bam_path} /lustre/scratch126/humgen/projects/isogut/qc/primers.fasta \$sample_name.fltnc.bam --require-polya
For some reason this is eliminating nearly all my reads across all my samples -- I am left with just a handful of reads (for context my bam file size pre and post processing goes from 30GB to 8Kb)
Is there something I am doing wrong here?
Thanks!
Does you data have polyA tails? Otherwise, please upload a tiny example BAM with reads that get removed in refine
.
Hi, thank you for your very quick response!
I'm not sure I'll be able to upload a bam due to data governance/restrictions but let me see what I can do.
However, looking at my reads in the flt.bam I can see evidence of polyAtails.
I tried refine with and without the polyatail setting and the resulting reads didn't change all that much (still lost nearly all my data)
Please paste the footer of the log output if you run it with --log-level INFO
and also check the reports
|> 20240321 19:54:37.853 -|- INFO -|- ParsePositionalArgs -|- 0x7fb0766860c0|| -|- Input Barcode file: /lustre/scratch126/humgen/projects/isogut/qc/primers.fasta |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Primer prefixes used to detect concatemers: |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AAGCAGTGGTATCAACGCAGAGTACATGGG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AGATCGGAAGAGCGTCGTGTAG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CCCATGTACTCTGCGTTGATACCACTGCTT |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CTACACGACGCTCTTCCGATCT |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output FLNC bam: test.fltnc.bam |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output summary json: test.fltnc.filter_summary.report.json
It is running now but here is what I see so far
Here is the contents of my fasta file too (comes from the 10x 3' chemistry that is available on the isoseq website)
5p AAGCAGTGGTATCAACGCAGAGTACATGGG 3p AGATCGGAAGAGCGTCGTGTAG
|> 20240321 19:54:37.853 -|- INFO -|- ParsePositionalArgs -|- 0x7fb0766860c0|| -|- Input Barcode file: /lustre/scratch126/humgen/projects/isogut/qc/primers.fasta |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Primer prefixes used to detect concatemers: |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AAGCAGTGGTATCAACGCAGAGTACATGGG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AGATCGGAAGAGCGTCGTGTAG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CCCATGTACTCTGCGTTGATACCACTGCTT |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CTACACGACGCTCTTCCGATCT |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output FLNC bam: test.fltnc.bam |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output summary json: test.fltnc.filter_summary.report.json
|> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Input reads : 7264302 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output reads : 19 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output bases : 24777 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Filtered RQ reads : 0 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Missing RQ reads : 0 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Run Time : 11m 944us |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CPU Time : 9h 52m |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Peak RSS : 0.88908 GB
Here is my full log file
And my reports: id,strand,fivelen,threelen,polyAlen,insertlen,primer m84093_240202_144742_s1/263197381/ccs,+,0,8,25,477,5p--3p m84093_240202_144742_s1/219156810/ccs,+,8,0,28,1451,5p--3p m84093_240202_144742_s1/204345334/ccs,+,8,0,25,1561,5p--3p m84093_240202_144742_s1/123997463/ccs,+,0,8,31,1619,5p--3p m84093_240202_144742_s1/112922188/ccs,+,0,8,41,375,5p--3p m84093_240202_144742_s1/109251907/ccs,+,0,9,31,2371,5p--3p m84093_240202_144742_s1/91559779/ccs,+,0,8,27,786,5p--3p m84093_240202_144742_s1/92542161/ccs,+,0,8,26,1685,5p--3p m84093_240202_144742_s1/75962115/ccs,+,8,0,26,1060,5p--3p m84093_240202_144742_s1/87888614/ccs,+,7,8,102,532,5p--3p m84093_240202_144742_s1/46536294/ccs,+,8,0,28,1906,5p--3p m84093_240202_144742_s1/25367936/ccs,+,8,0,29,1614,5p--3p m84093_240202_144742_s1/34669647/ccs,+,0,8,60,587,5p--3p m84093_240202_144742_s1/35783133/ccs,+,0,8,29,1043,5p--3p m84093_240202_144742_s1/65668734/ccs,+,8,0,31,1578,5p--3p m84093_240202_144742_s1/118557153/ccs,+,8,0,74,1614,5p--3p m84093_240202_144742_s1/119408775/ccs,+,0,8,27,847,5p--3p m84093_240202_144742_s1/111478958/ccs,+,8,0,22,1484,5p--3p m84093_240202_144742_s1/174523972/ccs,+,8,0,45,2187,5p--3p
{ "_comment": "Created by pbcopper v2.3.99", "attributes": [ { "id": "sample_name", "name": "Sample Name", "value": "Isogut14548275" }, { "id": "num_reads_fl", "name": "Full-Length Reads", "value": 7264302 }, { "id": "num_reads_flnc", "name": "Full-Length Non-Chimeric Reads", "value": 22 }, { "id": "num_reads_flnc_polya", "name": "Full-Length Non-Chimeric Reads with Poly-A Tail", "value": 19 } ], "dataset_uuids": [], "id": "isoseq_refine", "plotGroups": [], "tables": [], "title": "Iso-Seq Refine Report", "uuid": "c745eac5-a13f-4fdb-bd0f-2445e78ecb6b", "version": "1.0.1" }
Hi @GelatinousGiant -
Can you please give a bit more context?
Is this using the MAS-Seq for 10x Single Cell 3' kit?
Is this 10x 3' cDNA?
Did you first run skera to get the segmented reads (S-reads) and use that as input to lima
and isoseq refine
?
Closing due to lack of input
Hi @Magdoll and @armintoepfer,
I am new to PacBio long read sequencing. I received hifi_reads.fastq, and hifi_reads.bam files which I assume are the unaligned CCS reads, output of CCS tool. I believe our run was on MAS-Seq for 10x Single Cell 3' kit and the reads are the 10X cDNAs. I am facing the similar issue with @GelatinousGiant. I started off with 23G flt.bam ending up with 170K fltnc.bam. My CL:
$ isoseq refine m64410e.flt.bam primers.fasta m64410e.fltnc.bam --require-polya
I see that my flt.bam has about ~12K bases long reads with polyA tracks. Should I have ran skera split first to segment the reads? I went straight to lima with the hifi_reads.bam.
Thanks.