pbbioconda isoseq refine

Hi,

I am running the isoseq pipeline documented here: https://isoseq.how/umi/cli-workflow.html

Using the single-cell iso-seq workflow.

I start my pipeline with hifi reads generated from 10x 3' chemistry and have run the first few steps: primer removal, tag removal and now am refining the reads

isoseq refine ${tag_removed_bam_path} /lustre/scratch126/humgen/projects/isogut/qc/primers.fasta \$sample_name.fltnc.bam --require-polya

For some reason this is eliminating nearly all my reads across all my samples -- I am left with just a handful of reads (for context my bam file size pre and post processing goes from 30GB to 8Kb)

Is there something I am doing wrong here?

Thanks!

Mar 21 '24 19:03 marcus-tutert

Does you data have polyA tails? Otherwise, please upload a tiny example BAM with reads that get removed in refine.

Mar 21 '24 19:03 armintoepfer

Hi, thank you for your very quick response!

I'm not sure I'll be able to upload a bam due to data governance/restrictions but let me see what I can do.

However, looking at my reads in the flt.bam I can see evidence of polyAtails.

I tried refine with and without the polyatail setting and the resulting reads didn't change all that much (still lost nearly all my data)

Mar 21 '24 19:03 marcus-tutert

Please paste the footer of the log output if you run it with --log-level INFO and also check the reports

Mar 21 '24 19:03 armintoepfer

|> 20240321 19:54:37.853 -|- INFO -|- ParsePositionalArgs -|- 0x7fb0766860c0|| -|- Input Barcode file: /lustre/scratch126/humgen/projects/isogut/qc/primers.fasta |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Primer prefixes used to detect concatemers: |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AAGCAGTGGTATCAACGCAGAGTACATGGG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AGATCGGAAGAGCGTCGTGTAG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CCCATGTACTCTGCGTTGATACCACTGCTT |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CTACACGACGCTCTTCCGATCT |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output FLNC bam: test.fltnc.bam |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output summary json: test.fltnc.filter_summary.report.json

It is running now but here is what I see so far

Mar 21 '24 20:03 marcus-tutert

Here is the contents of my fasta file too (comes from the 10x 3' chemistry that is available on the isoseq website)

5p AAGCAGTGGTATCAACGCAGAGTACATGGG 3p AGATCGGAAGAGCGTCGTGTAG

Mar 21 '24 20:03 marcus-tutert

|> 20240321 19:54:37.853 -|- INFO -|- ParsePositionalArgs -|- 0x7fb0766860c0|| -|- Input Barcode file: /lustre/scratch126/humgen/projects/isogut/qc/primers.fasta |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Primer prefixes used to detect concatemers: |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AAGCAGTGGTATCAACGCAGAGTACATGGG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- AGATCGGAAGAGCGTCGTGTAG |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CCCATGTACTCTGCGTTGATACCACTGCTT |> 20240321 19:54:37.853 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CTACACGACGCTCTTCCGATCT |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output FLNC bam: test.fltnc.bam |> 20240321 19:54:37.855 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output summary json: test.fltnc.filter_summary.report.json

|> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Input reads : 7264302 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output reads : 19 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Output bases : 24777 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Filtered RQ reads : 0 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Missing RQ reads : 0 |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Run Time : 11m 944us |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- CPU Time : 9h 52m |> 20240321 20:05:38.595 -|- INFO -|- Runner -|- 0x7fb0766860c0|| -|- Peak RSS : 0.88908 GB

Here is my full log file

Mar 21 '24 20:03 marcus-tutert

And my reports: id,strand,fivelen,threelen,polyAlen,insertlen,primer m84093_240202_144742_s1/263197381/ccs,+,0,8,25,477,5p--3p m84093_240202_144742_s1/219156810/ccs,+,8,0,28,1451,5p--3p m84093_240202_144742_s1/204345334/ccs,+,8,0,25,1561,5p--3p m84093_240202_144742_s1/123997463/ccs,+,0,8,31,1619,5p--3p m84093_240202_144742_s1/112922188/ccs,+,0,8,41,375,5p--3p m84093_240202_144742_s1/109251907/ccs,+,0,9,31,2371,5p--3p m84093_240202_144742_s1/91559779/ccs,+,0,8,27,786,5p--3p m84093_240202_144742_s1/92542161/ccs,+,0,8,26,1685,5p--3p m84093_240202_144742_s1/75962115/ccs,+,8,0,26,1060,5p--3p m84093_240202_144742_s1/87888614/ccs,+,7,8,102,532,5p--3p m84093_240202_144742_s1/46536294/ccs,+,8,0,28,1906,5p--3p m84093_240202_144742_s1/25367936/ccs,+,8,0,29,1614,5p--3p m84093_240202_144742_s1/34669647/ccs,+,0,8,60,587,5p--3p m84093_240202_144742_s1/35783133/ccs,+,0,8,29,1043,5p--3p m84093_240202_144742_s1/65668734/ccs,+,8,0,31,1578,5p--3p m84093_240202_144742_s1/118557153/ccs,+,8,0,74,1614,5p--3p m84093_240202_144742_s1/119408775/ccs,+,0,8,27,847,5p--3p m84093_240202_144742_s1/111478958/ccs,+,8,0,22,1484,5p--3p m84093_240202_144742_s1/174523972/ccs,+,8,0,45,2187,5p--3p

{ "_comment": "Created by pbcopper v2.3.99", "attributes": [ { "id": "sample_name", "name": "Sample Name", "value": "Isogut14548275" }, { "id": "num_reads_fl", "name": "Full-Length Reads", "value": 7264302 }, { "id": "num_reads_flnc", "name": "Full-Length Non-Chimeric Reads", "value": 22 }, { "id": "num_reads_flnc_polya", "name": "Full-Length Non-Chimeric Reads with Poly-A Tail", "value": 19 } ], "dataset_uuids": [], "id": "isoseq_refine", "plotGroups": [], "tables": [], "title": "Iso-Seq Refine Report", "uuid": "c745eac5-a13f-4fdb-bd0f-2445e78ecb6b", "version": "1.0.1" }

Mar 21 '24 20:03 marcus-tutert

Hi @GelatinousGiant - Can you please give a bit more context? Is this using the MAS-Seq for 10x Single Cell 3' kit? Is this 10x 3' cDNA? Did you first run skera to get the segmented reads (S-reads) and use that as input to lima and isoseq refine?

Mar 25 '24 20:03 Magdoll

Closing due to lack of input

Jun 24 '24 15:06 armintoepfer

Hi @Magdoll and @armintoepfer,

I am new to PacBio long read sequencing. I received hifi_reads.fastq, and hifi_reads.bam files which I assume are the unaligned CCS reads, output of CCS tool. I believe our run was on MAS-Seq for 10x Single Cell 3' kit and the reads are the 10X cDNAs. I am facing the similar issue with @GelatinousGiant. I started off with 23G flt.bam ending up with 170K fltnc.bam. My CL:

$ isoseq refine m64410e.flt.bam primers.fasta m64410e.fltnc.bam --require-polya

I see that my flt.bam has about ~12K bases long reads with polyA tracks. Should I have ran skera split first to segment the reads? I went straight to lima with the hifi_reads.bam.

Thanks.

Aug 27 '24 20:08 TheCatalyticMicroChip

pbbioconda pbbioconda copied to clipboard

isoseq refine

pbbioconda
pbbioconda copied to clipboard