perbase icon indicating copy to clipboard operation
perbase copied to clipboard

weird encoding for reference name

Open twaddlac opened this issue 2 years ago • 8 comments

I'm trying to run perbase depth-only in.bam and I am getting the following output:

image

The bam file isn't corrupt (according to samtools) and works with other parts of my pipeline.

This reproduced when running base-depth as well, and replicated by other users.

Running perbase v0.8.5

Please let me know if I can provide any more information!

twaddlac avatar Jun 26 '23 16:06 twaddlac

Hi! could you share the bamfile / part of the bamfile that leads to this? Or paste in the bam header output from samtools when used on that bamfile as well.

sstadick avatar Oct 09 '23 19:10 sstadick

I'm experiencing the same issue with a few different references I'm trying to get perbase to work with:

REF     POS     DEPTH   A       C       G       T       N       INS     DEL     REF_SKIP        FAIL    NEAR_MAX_DEPTH
^@^@^@^@^@^@^@^P<9B>^C<F0><FF>^?^@^@!   9       26      0       0       22      0       4       0       0       0       0       false
^@^@^@^@^@^@^@<E0>}^A<F0><FF>^?^@^@!    10      98      0       0       0       83      15      0       0       0       0       false
^@^@^@^@^@^@^@^P~^A<F0><FF>^?^@^@!      11      154     0       123     0       0       31      0       0       0       0       false
^@^@^@^@^@^@^@p<E1>^@<F0><FF>^?^@^@!    12      161     0       0       0       123     38      0       0       0       0       false
^@^@^@^@^@^@^@<80><CC>^@<F0><FF>^?^@^@! 13      167     134     0       0       0       33      0       0       0       0       false
^@^@^@^@^@^@^@`%^E<F0><FF>^?^@^@!       14      675     0       546     0       1       128     0       0       0       0       false
^@^@^@^@^@^@^@^@<B9>^C<F0><FF>^?^@^@!   15      716     0       0       581     0       135     0       0       0       0       false
^@^@^@^@^@^@^@<E0>:^E<F0><FF>^?^@^@!    16      782     0       0       1       588     193     0       0       0       0       false
^@^@^@^@^@^@^@^P;^E<F0><FF>^?^@^@!      17      870     0       1       556     0       313     0       0       0       0       false
^@^@^@^@^@^@^@^P^U^L<F0><FF>^?^@^@!     18      940     0       0       710     0       230     0       0       0       0       false
^@^@^@^@^@^@^@^P^F^L<F0><FF>^?^@^@!     19      973     832     0       0       0       141     0       0       0       0       false

Here is my bam file header:

@HD	VN:1.6	SO:coordinate
@SQ	SN:DENV3-E-W_Min_preMVS_P3_CDX430_v2	LN:10708
@PG	ID:bwa	PN:bwa	VN:0.7.17-r1188	CL:bwa mem -t 96 /references/E3093-3_S3_ref.fasta /bbmerge_out/E3093-3_S3_merged_perfect.fastq
@PG	ID:samtools	PN:samtools	PP:bwa	VN:1.18	CL:samtools view -bS -
@PG	ID:samtools.1	PN:samtools	PP:samtools	VN:1.18	CL:samtools sort -@ 96 --output-fmt BAM -o ../bwa-mem_out/bbmerged/E3093-3_S3.sorted.bam ../bwa-mem_out/bbmerged/E3093-3_S3.bam
@PG	ID:MarkDuplicates	VN:3.1.0	CL:MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=10000 INPUT=[../bwa-mem_out/bbmerged/E3093-3_S3.sorted.bam] OUTPUT=../bwa-mem_out/bbmerged/E3093-3_S3.sorted.markdup.bam METRICS_FILE=../bwa-mem_out/bbmerged/E3093-3_S3.sorted.markdup.metrics REMOVE_DUPLICATES=false ASSUME_SORTED=true TMP_DIR=[/gpfs/projects/GenomicsCore/codagenix/scripts/tmp] VALIDATION_STRINGENCY=LENIENT    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false FLOW_MODE=false FLOW_QUALITY_SUM_STRATEGY=false USE_END_IN_UNPAIRED_READS=false USE_UNPAIRED_CLIPPED_END=false UNPAIRED_END_UNCERTAINTY=0 FLOW_SKIP_FIRST_N_FLOWS=0 FLOW_Q_IS_KNOWN_END=false FLOW_EFFECTIVE_QUALITY_THRESHOLD=15 ADD_PG_TAG_TO_READS=true DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false	PN:MarkDuplicates
@PG	ID:samtools.2	PN:samtools	PP:samtools.1	VN:1.18	CL:samtools view -H E3093-3_S3.sorted.markdup.bam

I've also noticed that if I supply the reference fasta file in my perbase command, I get the following error:

[2023-11-13T19:38:56Z INFO  perbase_lib::par_granges] Using 94 worker threads.
[2023-11-13T19:38:56Z INFO  perbase_lib::par_granges] Creating channel of length 189246910 (* 120 bytes to get mem)
[2023-11-13T19:39:04Z INFO  perbase_lib::par_granges] Reading from "bwa-mem_out/bbmerged/E3093-3_S3.sorted.markdup.bam"
[2023-11-13T19:39:04Z INFO  perbase_lib::par_granges] Processing TID 0:0-10708
thread '<unnamed>' panicked at 'Fetched reference sequence: Unknown sequence name: ���!.', src/commands/base_depth.rs:275:30
stack backtrace:
   0:     0x55555575ea10 - std::backtrace_rs::backtrace::libunwind::trace::ha9053a9a07ca49cb
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55555575ea10 - std::backtrace_rs::backtrace::trace_unsynchronized::h9c2852a457ad564e
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55555575ea10 - std::sys_common::backtrace::_print_fmt::h457936fbfaa0070f
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x55555575ea10 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5779d7bf7f70cb0c
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x5555556b9bfe - core::fmt::write::h5a4baaff1bcd3eb5
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/fmt/mod.rs:1232:17
   5:     0x555555738ea4 - std::io::Write::write_fmt::h4bc1f301cb9e9cce
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/io/mod.rs:1684:15
   6:     0x55555576023f - std::sys_common::backtrace::_print::h5fcdc36060f177e8
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:47:5
   7:     0x55555576023f - std::sys_common::backtrace::print::h54ca9458b876c8bf
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:34:9
   8:     0x55555575fe3f - std::panicking::default_hook::{{closure}}::hbe471161c7664ed6
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:271:22
   9:     0x555555760e58 - std::panicking::default_hook::ha3500da57aa4ac4f
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:290:9
  10:     0x555555760e58 - std::panicking::rust_panic_with_hook::h50c09d000dc561d2
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:692:13
  11:     0x555555760914 - std::panicking::begin_panic_handler::{{closure}}::h9e2b2176e00e0d9c
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:583:13
  12:     0x555555760876 - std::sys_common::backtrace::__rust_end_short_backtrace::h5739b8e512c09d02
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:150:18
  13:     0x555555760861 - rust_begin_unwind
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
  14:     0x5555555d60b2 - core::panicking::panic_fmt::hf33a1475b4dc5c3e
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
  15:     0x5555555d63e2 - core::result::unwrap_failed::hdff5465d74574b44
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/result.rs:1750:5
  16:     0x5555555f669a - <core::iter::adapters::flatten::FlatMap<I,U,F> as core::iter::traits::iterator::Iterator>::next::h4ec7ffc6086dfa1a
  17:     0x555555624d34 - rayon::iter::plumbing::bridge_producer_consumer::helper::h40d19706cb30b716
  18:     0x5555555ef2ca - rayon_core::join::join_context::{{closure}}::he92c561eced12757
  19:     0x5555555e6b56 - rayon_core::thread_pool::ThreadPool::install::{{closure}}::h185c0d76b416e80d
  20:     0x555555632daf - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb3e400f8a6fd84fd
  21:     0x5555555da5ed - rayon_core::registry::WorkerThread::wait_until_cold::h441bdd8e1d8196e3
  22:     0x5555556dc9ba - std::sys_common::backtrace::__rust_begin_short_backtrace::h933f37f9b851e73f
  23:     0x5555556dc74d - core::ops::function::FnOnce::call_once{{vtable.shim}}::h1b5aaf29acd6a35f
  24:     0x555555761b45 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h39990b24eedef2ab
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:1987:9
  25:     0x555555761b45 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h01a027258444143b
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:1987:9
  26:     0x555555761b45 - std::sys::unix::thread::Thread::new::thread_start::ha4f1cdd9c25884ba
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys/unix/thread.rs:108:17
  27:     0x7ffff79af1cf - start_thread
  28:     0x7ffff7094e73 - clone
  29:                0x0 - <unknown>

As far as I can tell, there is nothing wrong with either my reference fasta file or the input bam file. Other programs (samtools, bwa, bbmap, etc.) don't report any issues with them.

Any idea what the problem could be? I'm happy to share the full fasta and/or bam files if it would be useful.

Best, Dave

davidecarlson avatar Nov 13 '23 19:11 davidecarlson

If you had a minimal example with a bamfile and reference that would be awesome!

I'm really not sure what the issue is. This function is what would generate the ref_seq name, which in these outputs looks very mangled (breadcrumb for myself looking at this again later)

sstadick avatar Nov 13 '23 21:11 sstadick

Sure thing. The reference is very small (~10 Kb) and I've extracted a subset of 10k BAM entries that produce the weird results.

Thanks for taking a look!

Best, Dave example.zip

davidecarlson avatar Nov 13 '23 21:11 davidecarlson

Quick update - I reinstalled the latest version of perbase, and I'm no longer seeing the same errors on the exact same reference fastas and bam files.

I don't know what caused the issue in the first place, but it's working now, so I'm happy! :)

Thanks! Dave

davidecarlson avatar Nov 14 '23 14:11 davidecarlson

@sstadick Sorry for the delay here! @davidecarlson thanks for chiming in as well. I installed 0.9.0 with no avail. I'll share the data here once I have clearance to do so (hopefully I will soon).

twaddlac avatar Nov 17 '23 20:11 twaddlac

@davidecarlson I'm glad that worked! @twaddlac 👍

sstadick avatar Nov 17 '23 22:11 sstadick