`nac` workflow and BAM output in kb-python v0.30.0:
Hi, thanks for the great tool!
I have a few questions about the nac workflow and BAM output in kb-python v0.30.0:
Questions
- For RNA velocity analysis, should I use the `nac` workflow to build the reference index?
  I understand that RNA velocity requires separate quantification of spliced (mature) and unspliced (nascent) transcripts. Is the `nac` workflow the recommended approach for this?
- Does the `nac` workflow also support standard gene expression quantification?
  In other words, does `nac` encompass all the functionality of the `standard` workflow? Or do I need to build two separate indices if I want both standard counts and velocity-compatible counts?
- Can the BAM file generated by the latest kb-python be used for RNA velocity analysis?
  I noticed that recent versions support BAM output. Is this BAM file compatible with RNA velocity tools like scVelo or velocyto? Does it contain the necessary information to distinguish spliced/unspliced reads?
- Could you provide an example command for building a `nac` index and running count for RNA velocity analysis?
  For example, something like:
# Build nac index
kb ref --workflow nac \
-i index.idx \
-g t2g.txt \
-f1 cdna.fa \
-f2 nascent.fa \
-c1 cdna_t2c.txt \
-c2 nascent_t2c.txt \
genome.fa genes.gtf
# Count with nac workflow
kb count --workflow nac \
-i index.idx \
-g t2g.txt \
-c1 cdna_t2c.txt \
-c2 nascent_t2c.txt \
-x 10xv3 \
-o output \
R1.fastq.gz R2.fastq.gz
Is this correct? What additional parameters are needed for BAM output?
Environment
- kb-python version: 0.30.0
- OS: Linux (ubuntu)
Thanks in advance for your help!
See our manual (being actively updated, so still a work-in-progress) here: https://github.com/Yenaled/kallisto-website
RNA velocity requires the `nac` workflow, and `nac` can also be used for standard gene expression analysis. BAM files are never used for RNA velocity; the count matrices are (see the manual above). To output BAM files, use `--genomebam`.
@Yenaled Thanks for sharing the manual! I have a few questions:
- Output format timing: Since both loom and h5ad (with spliced/unspliced counts) can be used as velocity input, which one takes longer to generate? Is there much overhead if generating both simultaneously?
- Layer naming in h5ad: I noticed the counts_unfiltered/adata.h5ad contains layers: 'ambiguous', 'mature', 'nascent'. Does 'mature' correspond to spliced and 'nascent' correspond to unspliced?
- Filtering: What's the recommended way to filter the counts_unfiltered output to get results comparable to 10x CellRanger's filtered output?
Thanks!
Another question: are there any docs on using kb-python with DNBelab C/C4 or BD single-cell data? Thanks a lot.
h5ad is faster to generate. To do what cellranger does, you need to add mature + ambiguous together (to make “spliced”), and nascent is the “unspliced”. See the tutorial on that. For filtering, I recommend making a knee plot as done in the tutorials; if you need automatic filtering, kb-python’s default filtering works well. As for docs: no. While kb-python can handle those technologies, there are no docs for BD.
@Yenaled Thanks a lot.
I encountered an error when using the `--report` flag with `kb count` for the `nac` workflow. Additionally, I would like to clarify the expected behavior when combining the `--cellranger` and `--sum` parameters.
Environment
- kb-python version: (v0.30.0)
- Python version: 3.11
- OS: Linux (Ubuntu)
Bug Report: --report KeyError
Command
kb count -i index.idx \
-g t2g.txt \
-c1 cdna_t2c.txt \
-c2 nascent_t2c.txt \
-x 10xv3 \
-o output_dir \
-t 96 \
--workflow nac \
--h5ad \
--cellranger \
--report \
sample_R1.fastq.gz sample_R2.fastq.gz
Error
The count matrices were generated successfully, but the report generation failed:
[2025-12-02 13:57:29,214] INFO [count_nac] Writing report Jupyter notebook at ./report.ipynb and rendering it to ./report.html
[2025-12-02 13:57:29,215] INFO [count_nac] Writing report Jupyter notebook at ./report.processed.ipynb and rendering it to ./report.processed.html
[2025-12-02 13:57:29,215] ERROR [main] An exception occurred
Traceback (most recent call last):
File ".../kb_python/main.py", line 1977, in main
COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
File ".../kb_python/main.py", line 600, in parse_count
count_nac(
File ".../kb_python/count.py", line 2472, in count_nac
unfiltered_results[prefix][f'inspect{suffix}'],
~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'inspect'
Expected behavior
The --report flag should generate HTML reports without errors after successful matrix generation.
Clarification Questions: --cellranger and --sum behavior
Question 1: --cellranger output
When using --cellranger with --workflow nac, the output includes:
counts_unfiltered/
├── adata.h5ad
├── spliced/
│ ├── matrix.mtx.gz
│ ├── barcodes.tsv.gz
│ └── genes.tsv.gz
├── unspliced/
│ └── ...
└── cellranger_ambiguous/
└── ...
Question: Are the spliced/ and unspliced/ directories equivalent to mature and nascent layers in adata.h5ad? Or does --cellranger automatically perform the summation (i.e., spliced = mature + ambiguous)?
Question 2: --sum parameter
The documentation states:
--sum TYPE: Use `cell` to add ambiguous and processed transcript matrices.
Use `nucleus` to add ambiguous and unprocessed transcript matrices.
Question: When using --sum cell:
- Does this modify the `spliced` layer in `adata.h5ad` to be `mature + ambiguous`?
- Does this also affect the `spliced/` directory when combined with `--cellranger`?
Question 3: Recommended workflow for RNA velocity
For scRNA-seq (single-cell RNA-seq)
Which approach is recommended?
Option A: No flags, manual processing
kb count ... --workflow nac --h5ad
adata.layers['spliced'] = adata.layers['mature'] + adata.layers['ambiguous']
adata.layers['unspliced'] = adata.layers['nascent']
Option B: Use --sum cell
kb count ... --workflow nac --h5ad --sum cell
Are Options A and B expected to produce identical results?
For snRNA-seq (single-nucleus RNA-seq)
Which approach is recommended?
Option C: No flags, manual processing
kb count ... --workflow nac --h5ad
adata.layers['spliced'] = adata.layers['mature']
adata.layers['unspliced'] = adata.layers['nascent'] + adata.layers['ambiguous']
Option D: Use --sum nucleus
kb count ... --workflow nac --h5ad --sum nucleus
Are Options C and D expected to produce identical results?
Summary of Questions
- Is the `--report` KeyError a known bug?
- Does `--cellranger` alone perform any summation, or just reorganizes the output format?
- For scRNA-seq velocity analysis: Is manual processing (`mature + ambiguous`) equivalent to using `--sum cell`?
- For snRNA-seq velocity analysis: Is manual processing (`nascent + ambiguous`) equivalent to using `--sum nucleus`?
- When using both `--cellranger` and `--sum` together, does the `spliced/` directory contain the summed values?
Thank you for your help!
@Yenaled do you have time to have a look? Thanks a lot.
The `--report` error is a bug. Manual mature + ambiguous processing is equivalent to `--sum=cell` (and likewise `--sum nucleus` is nascent + ambiguous). For `--cellranger`, no.
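As a minimal sketch of the manual processing discussed in this thread, assuming the layer names `mature`, `nascent`, and `ambiguous` reported for `adata.h5ad`: the helper below (`add_velocity_layers` is a hypothetical name, not part of kb-python) works on any mapping of equally shaped count matrices. The toy NumPy arrays are made up for illustration.

```python
import numpy as np


def add_velocity_layers(layers, sample_type="cell"):
    """Add 'spliced' and 'unspliced' entries to a layers mapping in place.

    sample_type="cell":    spliced = mature + ambiguous, unspliced = nascent
    sample_type="nucleus": spliced = mature, unspliced = nascent + ambiguous
    """
    if sample_type == "cell":
        layers["spliced"] = layers["mature"] + layers["ambiguous"]
        layers["unspliced"] = layers["nascent"].copy()
    elif sample_type == "nucleus":
        layers["spliced"] = layers["mature"].copy()
        layers["unspliced"] = layers["nascent"] + layers["ambiguous"]
    else:
        raise ValueError("sample_type must be 'cell' or 'nucleus'")
    return layers


def toy_layers():
    # Made-up 2 cells x 3 genes count matrices, for illustration only.
    return {
        "mature": np.array([[1, 0, 2], [0, 3, 0]]),
        "nascent": np.array([[0, 1, 0], [2, 0, 0]]),
        "ambiguous": np.array([[1, 1, 0], [0, 0, 4]]),
    }


cell_layers = add_velocity_layers(toy_layers(), "cell")
nucleus_layers = add_velocity_layers(toy_layers(), "nucleus")
print(cell_layers["spliced"])
print(nucleus_layers["unspliced"])
```

With real output, the same function could presumably be applied to `adata.layers` after `adata = anndata.read_h5ad("counts_unfiltered/adata.h5ad")`, since AnnData layers behave like a mapping; the path is taken from the directory listing earlier in this thread.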
@Yenaled Thanks a lot. I found that kb outputs many more cells than cellranger, maybe 10 times as many. If I generate a loom file with kb, will it be much different from cellranger_loompy?
You'll have to filter the cells for the ones that have sufficient total UMI counts. Then the results should be similar.
Have you conducted similar tests? If so, do you have any empirical parameter recommendations? Thank you.
You can either inspect the knee plot and look for the inflection point, or you can try running kb-python with `--filter=bustools`.
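One way to locate the inflection point without eyeballing the plot is sketched below. This is an assumed heuristic (the point of the log-log barcode rank curve farthest above the chord joining its two ends), not the algorithm that `--filter=bustools` or Cell Ranger uses. The function name `knee_threshold` and the synthetic counts are hypothetical.

```python
import math


def knee_threshold(umi_totals):
    """Return the UMI-count cutoff at the knee of a barcode rank plot.

    Barcodes are ranked by total UMI count (descending). In log10-log10
    space, the knee is taken as the point lying farthest above the straight
    line that joins the first and last points of the curve.
    """
    counts = sorted((c for c in umi_totals if c > 0), reverse=True)
    if len(counts) < 3:
        raise ValueError("need at least three barcodes with non-zero counts")

    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(c) for c in counts]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]

    def height_above_chord(i):
        # Cross product; proportional to the signed distance from the chord.
        return dx * (ys[i] - ys[0]) - dy * (xs[i] - xs[0])

    knee_index = max(range(len(counts)), key=height_above_chord)
    return counts[knee_index]


# Synthetic example: 50 real cells and 950 near-empty barcodes.
totals = [5000] * 50 + [5] * 950
cutoff = knee_threshold(totals)
kept = [t for t in totals if t >= cutoff]
print(cutoff, len(kept))
```

On real data the per-barcode totals would come from summing each row of the unfiltered count matrix, and it is still worth plotting the curve to check that the chosen cutoff sits at the visible knee.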
Thank you