kb_python icon indicating copy to clipboard operation
kb_python copied to clipboard

Potential problem with kb-python;

Open fernando-pierucci-alves opened this issue 1 year ago • 6 comments

Describe the issue I am executing this command as per 'box 2' in Nat. Biot. 2024 article: ~$ kb ref -i fpa_index.idx -g fpa_t2g.txt -f1 fpa_cdna.fasta cds_from_genomic.fasta genomic.gtf

What is the exact command that was run?

kb count ...

Command output (with --verbose flag)

Please copy and paste *VERBOSE* output here.

fernando-pierucci-alves avatar Oct 30 '24 21:10 fernando-pierucci-alves

?

Yenaled avatar Oct 30 '24 22:10 Yenaled

Hi Yenaled,

I made a mistake in sending before completing. Again:

I am executing this command as per 'box 2' in Nat. Biot. 2024 article: ~$ kb ref -i fpa_index.idx -g fpa_t2g.txt -f1 fpa_cdna.fasta cds_from_genomic.fasta genomic.gtf

The fasta and gtf are the latest human NCBI files.

At the end of the indexing process, I am getting this output in terminal:

4. [2024-10-30 13:18:19,235] WARNING [main] Gene KIR3DP1_34has no transcripts. The entire gene will be marked as a transcript and an exon with IDKIR3DP1_34. [2024-10-30 13:18:19,235] WARNING [main] Gene KIR2DP1_36has no transcripts. The entire gene will be marked as a transcript and an exon with IDKIR2DP1_36`. [2024-10-30 13:18:23,996] INFO [ref] Splitting genome cds_from_genomic.fasta into cDNA at /home/fpa6/tmp/tmpxjhvjtxe [2024-10-30 14:53:01,887] INFO [ref] Concatenating 1 cDNAs to fpa_cdna.fasta [2024-10-30 14:53:01,887] INFO [ref] Creating transcript-to-gene mapping at fpa_t2g.txt [2024-10-30 14:53:01,888] INFO [ref] Indexing fpa_cdna.fasta to fpa_index.idx [2024-10-30 14:53:02,989] ERROR [ref] Warning: you asked for 8, but only 4 cores on the machine [build] loading fasta file fpa_cdna.fasta [build] k-mer length: 31 KmerStream::KmerStream(): Input file tmp/kallisto.XEaFUK6CL0C03cU43RFyOti does not exist, is ill-formed or is not in FASTA/FASTQ/GFA format. [2024-10-30 14:53:02,989] ERROR [main] An exception occurred Traceback (most recent call last): File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/main.py", line 1926, in main COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir) File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/main.py", line 359, in parse_ref ref( File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/ngs_tools/logging.py", line 62, in inner return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/ref.py", line 689, in ref ) if n > 1 else kallisto_index( ^^^^^^^^^^^^^^^ File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/ref.py", line 291, in kallisto_index run_executable(command) File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/dry/init.py", line 25, in inner return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/utils.py", line 203, in run_executable raise sp.CalledProcessError(p.returncode, ' '.join(command)) subprocess.CalledProcessError: Command '/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/bins/linux/kallisto/kallisto index -i fpa_index.idx -k 31 -t 8 -d cds_from_genomic.fasta fpa_cdna.fasta' returned non-zero exit status 1.

The three output files are produced but they are empty.

Can you please provide insight? Thank you in advance and congrats on this overall scientific work.

Fernando

fernando-pierucci-alves avatar Oct 30 '24 22:10 fernando-pierucci-alves

cds_from_genomic.fasta is not the whole genome; it's just the CDS. You need to supply a whole genome FASTA.

Yenaled avatar Oct 30 '24 22:10 Yenaled

The article states...

"The default is --workflow=standard, which creates an index suitable for bulk and single-cell RNA-seq quantification. It creates an index built from only the coding DNA sequences (the usage of coding DNA here follows that of Ensembl30, i.e., the sequences of the mature transcripts wherein introns are not included as they have been spliced out)."

Am I misinterpreting the above? Thanks.

fernando-pierucci-alves avatar Oct 30 '24 22:10 fernando-pierucci-alves

It builds an index from coding sequences; that doesn’t mean you supply the coding sequences. kb ref extracts the coding sequences for you from the genomic FASTA. That’s why the paper says “For both the standard and nac index types, a user supplies a genome FASTA and GTF annotation, which kb-python uses to extract the relevant sequences”.

Yenaled avatar Oct 30 '24 22:10 Yenaled

That should solve the problem. Thanks again,

FPA

fernando-pierucci-alves avatar Oct 30 '24 22:10 fernando-pierucci-alves

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Nov 30 '24 00:11 github-actions[bot]