sc_long_pipeline.py--> ValueError: invalid contig `chr1`
Hi, both my genome.fa and gff3 files use contig chr1. Is there support for this format or parameters I can set to solve this error?
Traceback (most recent call last): File "PATH/TO/FLAMES/python/sc_long_pipeline.py", line 240, in
sc_long_pipeline(args) File "PATH/TO/FLAMES/python/sc_long_pipeline.py", line 193, in sc_long_pipeline raw_gff3=raw_splice_isoform if config_dict["global_parameters"]["generate_raw_isoform"] else None) File "PATH/TO/FLAMES/python/sc_longread.py", line 1123, in group_bam2isoform it_region = bamfile.fetch(ch, bl.s, bl.e) File "pysam/libcalignmentfile.pyx", line 1081, in pysam.libcalignmentfile.AlignmentFile.fetch File "pysam/libchtslib.pyx", line 686, in pysam.libchtslib.HTSFile.parse_region ValueError: invalid contig `chr1
Oh I think it is a pysam error which happens when reading the bam file. It is likely that your bam file is generated with a different reference without chr1 or there is no chr1 reads in your bam files so it give this error. you can check your bam file header using samtools view -H $your_bam_file.
Here's my output. There seem to be chr1 reads.
@HD VN:1.6 SO:coordinate @SQ SN:chr1 LN:248956422 @SQ SN:chr10 LN:133797422 @SQ SN:chr11 LN:135086622 @SQ SN:chr12 LN:133275309 @SQ SN:chr13 LN:114364328 @SQ SN:chr14 LN:107043718 @SQ SN:chr15 LN:101991189 @SQ SN:chr16 LN:90338345 @SQ SN:chr17 LN:83257441 @SQ SN:chr18 LN:80373285 @SQ SN:chr19 LN:58617616 @SQ SN:chr2 LN:242193529 @SQ SN:chr20 LN:64444167 @SQ SN:chr21 LN:46709983 @SQ SN:chr22 LN:50818468 @SQ SN:chr3 LN:198295559 @SQ SN:chr4 LN:190214555 @SQ SN:chr5 LN:181538259 @SQ SN:chr6 LN:170805979 @SQ SN:chr7 LN:159345973 @SQ SN:chr8 LN:145138636 @SQ SN:chr9 LN:138394717 @SQ SN:chrM LN:16569 @SQ SN:chrX LN:156040895 @SQ SN:chrY LN:57227415 @SQ SN:KI270728.1 LN:1872759 @SQ SN:KI270727.1 LN:448248 @SQ SN:KI270442.1 LN:392061 @SQ SN:KI270729.1 LN:280839 @SQ SN:GL000225.1 LN:211173 @SQ SN:KI270743.1 LN:210658 @SQ SN:GL000008.2 LN:209709 @SQ SN:GL000009.2 LN:201709 @SQ SN:KI270747.1 LN:198735 @SQ SN:KI270722.1 LN:194050 @SQ SN:GL000194.1 LN:191469 @SQ SN:KI270742.1 LN:186739 @SQ SN:GL000205.2 LN:185591 @SQ SN:GL000195.1 LN:182896 @SQ SN:KI270736.1 LN:181920 @SQ SN:KI270733.1 LN:179772 @SQ SN:GL000224.1 LN:179693 @SQ SN:GL000219.1 LN:179198 @SQ SN:KI270719.1 LN:176845 @SQ SN:GL000216.2 LN:176608 @SQ SN:KI270712.1 LN:176043 @SQ SN:KI270706.1 LN:175055 @SQ SN:KI270725.1 LN:172810 @SQ SN:KI270744.1 LN:168472 @SQ SN:KI270734.1 LN:165050 @SQ SN:GL000213.1 LN:164239 @SQ SN:GL000220.1 LN:161802 @SQ SN:KI270715.1 LN:161471 @SQ SN:GL000218.1 LN:161147 @SQ SN:KI270749.1 LN:158759 @SQ SN:KI270741.1 LN:157432 @SQ SN:GL000221.1 LN:155397 @SQ SN:KI270716.1 LN:153799 @SQ SN:KI270731.1 LN:150754 @SQ SN:KI270751.1 LN:150742 @SQ SN:KI270750.1 LN:148850 @SQ SN:KI270519.1 LN:138126 @SQ SN:GL000214.1 LN:137718 @SQ SN:KI270708.1 LN:127682 @SQ SN:KI270730.1 LN:112551 @SQ SN:KI270438.1 LN:112505 @SQ SN:KI270737.1 LN:103838 @SQ SN:KI270721.1 LN:100316 @SQ SN:KI270738.1 LN:99375 @SQ SN:KI270748.1 LN:93321 @SQ SN:KI270435.1 LN:92983 @SQ SN:GL000208.1 LN:92689 @SQ SN:KI270538.1 LN:91309 @SQ SN:KI270756.1 LN:79590 @SQ SN:KI270739.1 LN:73985 @SQ SN:KI270757.1 LN:71251 @SQ SN:KI270709.1 LN:66860 @SQ SN:KI270746.1 LN:66486 @SQ SN:KI270753.1 LN:62944 @SQ SN:KI270589.1 LN:44474 @SQ SN:KI270726.1 LN:43739 @SQ SN:KI270735.1 LN:42811 @SQ SN:KI270711.1 LN:42210 @SQ SN:KI270745.1 LN:41891 @SQ SN:KI270714.1 LN:41717 @SQ SN:KI270732.1 LN:41543 @SQ SN:KI270713.1 LN:40745 @SQ SN:KI270754.1 LN:40191 @SQ SN:KI270710.1 LN:40176 @SQ SN:KI270717.1 LN:40062 @SQ SN:KI270724.1 LN:39555 @SQ SN:KI270720.1 LN:39050 @SQ SN:KI270723.1 LN:38115 @SQ SN:KI270718.1 LN:38054 @SQ SN:KI270317.1 LN:37690 @SQ SN:KI270740.1 LN:37240 @SQ SN:KI270755.1 LN:36723 @SQ SN:KI270707.1 LN:32032 @SQ SN:KI270579.1 LN:31033 @SQ SN:KI270752.1 LN:27745 @SQ SN:KI270512.1 LN:22689 @SQ SN:KI270322.1 LN:21476 @SQ SN:GL000226.1 LN:15008 @SQ SN:KI270311.1 LN:12399 @SQ SN:KI270366.1 LN:8320 @SQ SN:KI270511.1 LN:8127 @SQ SN:KI270448.1 LN:7992 @SQ SN:KI270521.1 LN:7642 @SQ SN:KI270581.1 LN:7046 @SQ SN:KI270582.1 LN:6504 @SQ SN:KI270515.1 LN:6361 @SQ SN:KI270588.1 LN:6158 @SQ SN:KI270591.1 LN:5796 @SQ SN:KI270522.1 LN:5674 @SQ SN:KI270507.1 LN:5353 @SQ SN:KI270590.1 LN:4685 @SQ SN:KI270584.1 LN:4513 @SQ SN:KI270320.1 LN:4416 @SQ SN:KI270382.1 LN:4215 @SQ SN:KI270468.1 LN:4055 @SQ SN:KI270467.1 LN:3920 @SQ SN:KI270362.1 LN:3530 @SQ SN:KI270517.1 LN:3253 @SQ SN:KI270593.1 LN:3041 @SQ SN:KI270528.1 LN:2983 @SQ SN:KI270587.1 LN:2969 @SQ SN:KI270364.1 LN:2855 @SQ SN:KI270371.1 LN:2805 @SQ SN:KI270333.1 LN:2699 @SQ SN:KI270374.1 LN:2656 @SQ SN:KI270411.1 LN:2646 @SQ SN:KI270414.1 LN:2489 @SQ SN:KI270510.1 LN:2415 @SQ SN:KI270390.1 LN:2387 @SQ SN:KI270375.1 LN:2378 @SQ SN:KI270420.1 LN:2321 @SQ SN:KI270509.1 LN:2318 @SQ SN:KI270315.1 LN:2276 @SQ SN:KI270302.1 LN:2274 @SQ SN:KI270518.1 LN:2186 @SQ SN:KI270530.1 LN:2168 @SQ SN:KI270304.1 LN:2165 @SQ SN:KI270418.1 LN:2145 @SQ SN:KI270424.1 LN:2140 @SQ SN:KI270417.1 LN:2043 @SQ SN:KI270508.1 LN:1951 @SQ SN:KI270303.1 LN:1942 @SQ SN:KI270381.1 LN:1930 @SQ SN:KI270529.1 LN:1899 @SQ SN:KI270425.1 LN:1884 @SQ SN:KI270396.1 LN:1880 @SQ SN:KI270363.1 LN:1803 @SQ SN:KI270386.1 LN:1788 @SQ SN:KI270465.1 LN:1774 @SQ SN:KI270383.1 LN:1750 @SQ SN:KI270384.1 LN:1658 @SQ SN:KI270330.1 LN:1652 @SQ SN:KI270372.1 LN:1650 @SQ SN:KI270548.1 LN:1599 @SQ SN:KI270580.1 LN:1553 @SQ SN:KI270387.1 LN:1537 @SQ SN:KI270391.1 LN:1484 @SQ SN:KI270305.1 LN:1472 @SQ SN:KI270373.1 LN:1451 @SQ SN:KI270422.1 LN:1445 @SQ SN:KI270316.1 LN:1444 @SQ SN:KI270340.1 LN:1428 @SQ SN:KI270338.1 LN:1428 @SQ SN:KI270583.1 LN:1400 @SQ SN:KI270334.1 LN:1368 @SQ SN:KI270429.1 LN:1361 @SQ SN:KI270393.1 LN:1308 @SQ SN:KI270516.1 LN:1300 @SQ SN:KI270389.1 LN:1298 @SQ SN:KI270466.1 LN:1233 @SQ SN:KI270388.1 LN:1216 @SQ SN:KI270544.1 LN:1202 @SQ SN:KI270310.1 LN:1201 @SQ SN:KI270412.1 LN:1179 @SQ SN:KI270395.1 LN:1143 @SQ SN:KI270376.1 LN:1136 @SQ SN:KI270337.1 LN:1121 @SQ SN:KI270335.1 LN:1048 @SQ SN:KI270378.1 LN:1048 @SQ SN:KI270379.1 LN:1045 @SQ SN:KI270329.1 LN:1040 @SQ SN:KI270419.1 LN:1029 @SQ SN:KI270336.1 LN:1026 @SQ SN:KI270312.1 LN:998 @SQ SN:KI270539.1 LN:993 @SQ SN:KI270385.1 LN:990 @SQ SN:KI270423.1 LN:981 @SQ SN:KI270392.1 LN:971 @SQ SN:KI270394.1 LN:970 @PG ID:minimap2 PN:minimap2 VN:2.24-r1122 CL:minimap2 -ax splice -k14 -t 1 --junc-bed /PATH/TO/gencode.v39.annotation.sorted.bed --secondary=no --sam-hit-only /PATH/TO/genome.fa PATH/TO//RPL41_204_205_5.fq.gz @PG ID:samtools PN:samtools PP:minimap2 VN:1.15.1 CL:samtools view -hb PATH/TO//RPL41_204_205_5.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.15.1 CL:samtools sort -o PATH/TO//RPL41_204_205_5.sort.bam PATH/TO//RPL41_204_205_5.bam
Hello,
Here's the structure of our input fastq from the match_cell_barcode step:

Here's the BAM file the full pipeline starts to output:

The transcript ID replaced the chromosome ID in the name file, and there's no chromosome contig. Also there's no read name, just a cell barcode as a read name in the bam file.
Usage for sc_long_pipeline.py:
~/FLAMES/python/sc_long_pipeline.py --gff3 ~/gencode.v39.annotation.gff3 --infq ~/out_pcr5_hq_fastq.gz --outdir ~/flames_test_MYL6/ --genomefa ~/gencode.v39.transcripts.fa --minimap2_dir ~/anaconda3/envs/FLAMES3/bin/
Original fastq before barcode match:

The genomefa should be the fasta file of genome instead of transcript. For human the most recent primary assembly should be: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/GRCh38.primary_assembly.genome.fa.gz. other versions have similar naming convention.
The FLAMES software will generate the transcript fasta file, which contains the known transcript and the new transcript called from the bam file.
I see, thanks. With the updated input, I get a new error:
Traceback (most recent call last): File "~FLAMES_analysis/FLAMES/python/count_tr.py", line 159, in parse_realigned_bam bc, umi = r.split("#")[0].split("_") # assume cleaned barcode ValueError: not enough values to unpack (expected 2, got 1)
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "~FLAMES/python/sc_long_pipeline.py", line 240, in
sc_long_pipeline(args) File "~FLAMES/python/sc_long_pipeline.py", line 225, in sc_long_pipeline realign_bam, transcript_fa_idx, config_dict["isoform_parameters"]["Min_sup_cnt"], config_dict["transcript_counting"]["min_tr_coverage"], config_dict["transcript_counting"]["min_read_coverage"]) File "~FLAMES/python/count_tr.py", line 163, in parse_realigned_bam "Please check if barcode and UMI are delimited by "_"") ValueError: Please check if barcode and UMI are delimited by "__"
Our barcode is the first 16nt and the umi is the next 10nt that follows.
The fastq I provided for --infq is the binary fastq file from the first barcode matching step. Can you also confirm this should be a binary file generated from match_cell_barcode?
Yes, it is expecting fastq files outputted by match_cell_barcode, where the identifier field has the format of @[barcode]_[umi]#[original id]
But this fastq file couldn't be viewed by less, it's a biniary file. And it couldn't be aligned with minimap2 either, here is the error log:
Input parameters:
gene annotation: /home/reference/annotation/human/gencode.v39.annotation.gff3
genome fasta: /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa
input fastq: /home/FLAMES/out.fastq
output directory: /home/FLAMES/
directory contains minimap2: /home/software/minimap2-2.24_x64-linux/
### align reads to genome using minimap2 2023-02-03 11:25:22
b''
Traceback (most recent call last):
File "./python/sc_long_pipeline.py", line 240, in <module>
sc_long_pipeline(args)
File "./python/sc_long_pipeline.py", line 168, in sc_long_pipeline
seed=config_dict["alignment_parameters"]["seed"])
File "/home/software/FLAMES/python/minimap2_align.py", line 41, in minimap2_align
shell=True, stderr=subprocess.STDOUT))
File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/software/minimap2-2.24_x64-linux/minimap2 -ax splice -t 12 --junc-bed /home/FLAMES/tmp.splice_anno.bed12 --junc-bonus 1 -k14 --secondary=no --seed 2022 /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa /home/FLAMES//out.fastq | samtools view -bS -@ 4 -m 2G -o /home/FLAMES/tmp.alig
n.bam - ']' returned non-zero exit status 1.
Here is the output statistic for match_cell_barcode, which may help:
###total read: 26292
###barcode hm match: 13191
###barcode fuzzy match: 2127
###barcode not match: 10974
###too short: 0
@dontwantcode there seem to be an extra / in the minimap code: /home/FLAMES//out.fastq
subprocess.CalledProcessError: Command '['/home/software/minimap2-2.24_x64-linux/minimap2 -ax splice -t 12 --junc-bed /home/FLAMES/tmp.splice_anno.bed12 --junc-bonus 1 -k14 --secondary=no --seed 2022 /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa /home/FLAMES//out.fastq | samtools view -bS -@ 4 -m 2G -o /home/FLAMES/tmp.alig
n.bam - ']'
Hi Luyi,
I don't think it should be the problem of \\, which is equivalent as \ in linux path.
Here is a verification:
Input parameters:
gene annotation: /home/reference/annotation/human/gencode.v39.annotation.gff3
genome fasta: /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa
input fastq: /home/FLAMES/out.fastq
output directory: /home/FLAMES
directory contains minimap2: /home/software/minimap2-2.24_x64-linux/
### align reads to genome using minimap2 2023-02-11 22:09:24
b''
Traceback (most recent call last):
File "./python/sc_long_pipeline.py", line 240, in <module>
sc_long_pipeline(args)
File "./python/sc_long_pipeline.py", line 168, in sc_long_pipeline
seed=config_dict["alignment_parameters"]["seed"])
File "/home/software/FLAMES/python/minimap2_align.py", line 41, in minimap2_align
shell=True, stderr=subprocess.STDOUT))
File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/software/minimap2-2.24_x64-linux/minimap2 -ax splice -t 12 --junc-bed /home/FLAMES/tmp.splice_anno.bed12 --junc-bonus 1 -k14 --secondary=no --seed 2022 /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa /home/FLAMES/out.fastq | samtools view -bS -@ 4 -m 2G -o /home/FLAMES/tmp.align
.bam - ']' returned non-zero exit status 1.
But this fastq file couldn't be viewed by
less, it's a biniary file. And it couldn't be aligned withminimap2either, here is the error log:Input parameters: gene annotation: /home/reference/annotation/human/gencode.v39.annotation.gff3 genome fasta: /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa input fastq: /home/FLAMES/out.fastq output directory: /home/FLAMES/ directory contains minimap2: /home/software/minimap2-2.24_x64-linux/ ### align reads to genome using minimap2 2023-02-03 11:25:22 b'' Traceback (most recent call last): File "./python/sc_long_pipeline.py", line 240, in <module> sc_long_pipeline(args) File "./python/sc_long_pipeline.py", line 168, in sc_long_pipeline seed=config_dict["alignment_parameters"]["seed"]) File "/home/software/FLAMES/python/minimap2_align.py", line 41, in minimap2_align shell=True, stderr=subprocess.STDOUT)) File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 411, in check_output **kwargs).stdout File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 512, in run output=stdout, stderr=stderr) subprocess.CalledProcessError: Command '['/home/software/minimap2-2.24_x64-linux/minimap2 -ax splice -t 12 --junc-bed /home/FLAMES/tmp.splice_anno.bed12 --junc-bonus 1 -k14 --secondary=no --seed 2022 /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa /home/FLAMES//out.fastq | samtools view -bS -@ 4 -m 2G -o /home/FLAMES/tmp.alig n.bam - ']' returned non-zero exit status 1.Here is the output statistic for
match_cell_barcode, which may help:###total read: 26292 ###barcode hm match: 13191 ###barcode fuzzy match: 2127 ###barcode not match: 10974 ###too short: 0
Should be a gzipped fastq, could you check with zcat? Or maybe post the first line of xxd out.fastq