
`nac` workflow and BAM output in kb-python v0.30.0

Open asmlgkj opened this issue 3 months ago • 5 comments

Hi, thanks for the great tool!

I have a few questions about the nac workflow and BAM output in kb-python v0.30.0:

Questions

  1. For RNA velocity analysis, should I use the nac workflow to build the reference index?

    I understand that RNA velocity requires separate quantification of spliced (mature) and unspliced (nascent) transcripts. Is the nac workflow the recommended approach for this?

  2. Does the nac workflow also support standard gene expression quantification?

    In other words, does nac encompass all the functionality of the standard workflow? Or do I need to build two separate indices if I want both standard counts and velocity-compatible counts?

  3. Can the BAM file generated by the latest kb-python be used for RNA velocity analysis?

    I noticed that recent versions support BAM output. Is this BAM file compatible with RNA velocity tools like scVelo or velocyto? Does it contain the necessary information to distinguish spliced/unspliced reads?

  4. Could you provide an example command for building a nac index and running count for RNA velocity analysis?

    For example, something like:

   # Build nac index
   kb ref --workflow nac \
       -i index.idx \
       -g t2g.txt \
       -f1 cdna.fa \
       -f2 nascent.fa \
       -c1 cdna_t2c.txt \
       -c2 nascent_t2c.txt \
       genome.fa genes.gtf

   # Count with nac workflow
   kb count --workflow nac \
       -i index.idx \
       -g t2g.txt \
       -c1 cdna_t2c.txt \
       -c2 nascent_t2c.txt \
       -x 10xv3 \
       -o output \
       R1.fastq.gz R2.fastq.gz

Is this correct? What additional parameters are needed for BAM output?

Environment

  • kb-python version: 0.30.0
  • OS: Linux (ubuntu)

Thanks in advance for your help!

asmlgkj avatar Dec 01 '25 08:12 asmlgkj

See our manual (being actively updated, so still a work-in-progress) here: https://github.com/Yenaled/kallisto-website

RNA velocity requires nac. nac can also be used for standard gene expression analysis. BAM is never used for RNA velocity (count matrices are); see the manual above. --genomebam is used for outputting BAM files.

Yenaled avatar Dec 01 '25 09:12 Yenaled

@Yenaled Thanks for sharing the manual! I have a few questions:

  1. Output format timing: Since both loom and h5ad (with spliced/unspliced counts) can be used as velocity input, which one takes longer to generate? Is there much overhead if generating both simultaneously?

  2. Layer naming in h5ad: I noticed the counts_unfiltered/adata.h5ad contains layers: 'ambiguous', 'mature', 'nascent'. Does 'mature' correspond to spliced and 'nascent' correspond to unspliced?

  3. Filtering: What's the recommended way to filter the counts_unfiltered output to get results comparable to 10x CellRanger's filtered output?

Thanks!

asmlgkj avatar Dec 01 '25 12:12 asmlgkj

Another question: are there any docs about using kb-python for DNBelab C/C4 or BD single-cell data? Thanks a lot.

asmlgkj avatar Dec 01 '25 22:12 asmlgkj

h5ad is faster to generate. To do what cellranger does, you need to add mature + ambiguous together (to make “spliced”), and the nascent is the “unspliced”. See the tutorial on that. For filtering, I recommend making a knee plot as done in the tutorials; if you need some automatic filtering, kb-python’s default filtering works well. No: kb-python can handle those technologies, but there are no docs for BD.
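A minimal sketch of the data behind a knee plot, assuming you already have the total UMI count of each barcode as a 1-D array. The function name `knee_curve` and the toy numbers are hypothetical and are not part of kb-python.

```python
import numpy as np


def knee_curve(total_umis):
    """Return (ranks, sorted_counts) for a knee plot.

    total_umis: 1-D array-like with the total UMI count of each barcode.
    Barcodes are ranked from the highest to the lowest total.
    """
    counts = np.sort(np.asarray(total_umis, dtype=float))[::-1]  # descending
    ranks = np.arange(1, counts.size + 1)  # rank 1 = largest total
    return ranks, counts


# Toy totals (hypothetical): three cell-like barcodes and four near-empty ones.
totals = [5, 1200, 3, 950, 1, 1100, 2]
ranks, counts = knee_curve(totals)
for rank, count in zip(ranks.tolist(), counts.tolist()):
    print(rank, count)
```

With an AnnData object, the totals could be obtained with something like `np.asarray(adata.X.sum(axis=1)).ravel()`, assuming rows are barcodes; plotting `ranks` against `counts` on log-log axes then gives the knee plot.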

Yenaled avatar Dec 01 '25 23:12 Yenaled

@Yenaled Thanks a lot.

I encountered an error when using the --report flag with kb count for the nac workflow. Additionally, I would like to clarify the expected behavior when combining the --cellranger and --sum parameters.

Environment

  • kb-python version: (v0.30.0)
  • Python version: 3.11
  • OS: Linux (Ubuntu)

Bug Report: --report KeyError

Command

kb count -i index.idx \
    -g t2g.txt \
    -c1 cdna_t2c.txt \
    -c2 nascent_t2c.txt \
    -x 10xv3 \
    -o output_dir \
    -t 96 \
    --workflow nac \
    --h5ad \
    --cellranger \
    --report \
    sample_R1.fastq.gz sample_R2.fastq.gz

Error

The count matrices were generated successfully, but the report generation failed:

[2025-12-02 13:57:29,214]    INFO [count_nac] Writing report Jupyter notebook at ./report.ipynb and rendering it to ./report.html
[2025-12-02 13:57:29,215]    INFO [count_nac] Writing report Jupyter notebook at ./report.processed.ipynb and rendering it to ./report.processed.html
[2025-12-02 13:57:29,215]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File ".../kb_python/main.py", line 1977, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File ".../kb_python/main.py", line 600, in parse_count
    count_nac(
  File ".../kb_python/count.py", line 2472, in count_nac
    unfiltered_results[prefix][f'inspect{suffix}'],
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'inspect'

Expected behavior

The --report flag should generate HTML reports without errors after successful matrix generation.


Clarification Questions: --cellranger and --sum behavior

Question 1: --cellranger output

When using --cellranger with --workflow nac, the output includes:

counts_unfiltered/
├── adata.h5ad
├── spliced/
│   ├── matrix.mtx.gz
│   ├── barcodes.tsv.gz
│   └── genes.tsv.gz
├── unspliced/
│   └── ...
└── cellranger_ambiguous/
    └── ...

Question: Are the spliced/ and unspliced/ directories equivalent to mature and nascent layers in adata.h5ad? Or does --cellranger automatically perform the summation (i.e., spliced = mature + ambiguous)?

Question 2: --sum parameter

The documentation states:

--sum TYPE: Use `cell` to add ambiguous and processed transcript matrices.
            Use `nucleus` to add ambiguous and unprocessed transcript matrices.

Question: When using --sum cell:

  1. Does this modify the spliced layer in adata.h5ad to be mature + ambiguous?
  2. Does this also affect the spliced/ directory when combined with --cellranger?

Question 3: Recommended workflow for RNA velocity

For scRNA-seq (single-cell RNA-seq)

Which approach is recommended?

Option A: No flags, manual processing

kb count ... --workflow nac --h5ad
adata.layers['spliced'] = adata.layers['mature'] + adata.layers['ambiguous']
adata.layers['unspliced'] = adata.layers['nascent']

Option B: Use --sum cell

kb count ... --workflow nac --h5ad --sum cell

Are Options A and B expected to produce identical results?

For snRNA-seq (single-nucleus RNA-seq)

Which approach is recommended?

Option C: No flags, manual processing

kb count ... --workflow nac --h5ad
adata.layers['spliced'] = adata.layers['mature']
adata.layers['unspliced'] = adata.layers['nascent'] + adata.layers['ambiguous']

Option D: Use --sum nucleus

kb count ... --workflow nac --h5ad --sum nucleus

Are Options C and D expected to produce identical results?


Summary of Questions

  1. Is the --report KeyError a known bug?
  2. Does --cellranger alone perform any summation, or does it just reorganize the output format?
  3. For scRNA-seq velocity analysis: Is manual processing (mature + ambiguous) equivalent to using --sum cell?
  4. For snRNA-seq velocity analysis: Is manual processing (nascent + ambiguous) equivalent to using --sum nucleus?
  5. When using both --cellranger and --sum together, does the spliced/ directory contain the summed values?

Thank you for your help!

asmlgkj avatar Dec 02 '25 06:12 asmlgkj

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Jan 02 '26 00:01 github-actions[bot]

@Yenaled do you have time to have a look? Thanks a lot.

asmlgkj avatar Jan 02 '26 02:01 asmlgkj

The --report failure is a bug. Manual mature + ambiguous processing is equivalent to --sum=cell (and likewise --sum=nucleus is equivalent to nascent + ambiguous). For --cellranger: no.
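The manual processing mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `velocity_layers` is a hypothetical helper (not a kb-python function), and it expects a mapping with the 'mature', 'nascent' and 'ambiguous' layers reported earlier in this thread, such as `adata.layers`.

```python
import numpy as np


def velocity_layers(layers, sample_type="cell"):
    """Build 'spliced' and 'unspliced' matrices from nac layers.

    layers: mapping with 'mature', 'nascent' and 'ambiguous' matrices
            of identical shape (barcodes x genes).
    sample_type: 'cell' adds ambiguous to mature,
                 'nucleus' adds ambiguous to nascent.
    """
    mature = layers["mature"]
    nascent = layers["nascent"]
    ambiguous = layers["ambiguous"]
    if sample_type == "cell":
        return {"spliced": mature + ambiguous, "unspliced": nascent}
    if sample_type == "nucleus":
        return {"spliced": mature, "unspliced": nascent + ambiguous}
    raise ValueError("sample_type must be 'cell' or 'nucleus'")


# Toy 2 x 2 layers (hypothetical counts).
toy = {
    "mature": np.array([[4, 0], [1, 2]]),
    "nascent": np.array([[1, 1], [0, 3]]),
    "ambiguous": np.array([[2, 0], [0, 1]]),
}
cell = velocity_layers(toy, "cell")
nucleus = velocity_layers(toy, "nucleus")
print(cell["spliced"])
print(nucleus["unspliced"])
```

The results can then be assigned back, for example `adata.layers['spliced'] = cell['spliced']`, as in Options A and C above.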

Yenaled avatar Jan 02 '26 05:01 Yenaled

@Yenaled Thanks a lot. I found that kb outputs many more cells than cellranger, maybe 10 times as many. If I generate a loom file with kb, will it be much different from cellranger_loompy?

asmlgkj avatar Jan 03 '26 00:01 asmlgkj

You'll have to filter the cells for the ones that have sufficient total UMI counts. Then the results should be similar.
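A minimal sketch of that kind of filter, assuming a dense barcodes x genes count matrix. The helper name `filter_by_total_umis` and the threshold of 100 are hypothetical; a suitable threshold depends on the dataset.

```python
import numpy as np


def filter_by_total_umis(matrix, barcodes, min_umis):
    """Keep only barcodes whose total UMI count is at least min_umis.

    matrix: 2-D array-like, rows are barcodes and columns are genes.
    barcodes: sequence of barcode strings, one per row of matrix.
    """
    matrix = np.asarray(matrix)
    totals = matrix.sum(axis=1)      # total UMIs per barcode
    keep = totals >= min_umis        # boolean mask over rows
    kept_barcodes = [b for b, k in zip(barcodes, keep) if k]
    return matrix[keep], kept_barcodes


# Toy data (hypothetical): two cell-like barcodes and two near-empty ones.
counts = [[300, 200, 100], [1, 0, 2], [150, 50, 25], [0, 1, 0]]
names = ["AAAC", "AAAG", "AACA", "AACC"]
filtered, kept = filter_by_total_umis(counts, names, min_umis=100)
print(kept)
print(filtered.shape)
```

The same boolean mask can be used to subset an AnnData object (`adata[keep]`), so that all layers stay aligned.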

Yenaled avatar Jan 03 '26 00:01 Yenaled

Have you conducted similar tests? If so, do you have any empirical parameter recommendations? Thank you.

asmlgkj avatar Jan 03 '26 00:01 asmlgkj

You can either inspect the knee plot and look for the inflection point, or you can try running kb-python with --filter=bustools.
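If an automatic starting point for reading the knee plot is wanted, one common heuristic is to take the point on the log-log curve that lies farthest above the straight line joining its two ends. The sketch below assumes that heuristic; it is not how kb-python or bustools filter, and `estimate_knee_threshold` is a hypothetical name. The result should still be checked against the plot.

```python
import numpy as np


def estimate_knee_threshold(total_umis):
    """Estimate a UMI threshold at the knee of the rank/count curve.

    Heuristic: on log10(rank) vs log10(count), both rescaled to [0, 1],
    pick the point farthest above the chord between the curve's ends.
    """
    counts = np.sort(np.asarray(total_umis, dtype=float))[::-1]
    counts = counts[counts > 0]  # log scale needs positive counts
    if counts.size < 3 or counts[0] == counts[-1]:
        raise ValueError("need at least 3 barcodes with differing counts")
    x = np.log10(np.arange(1, counts.size + 1))
    y = np.log10(counts)
    x = (x - x[0]) / (x[-1] - x[0])      # 0 at rank 1, 1 at last rank
    y = (y - y[-1]) / (y[0] - y[-1])     # 1 at largest count, 0 at smallest
    # The chord runs from (0, 1) to (1, 0); x + y - 1 is proportional
    # to the distance of each point above it.
    knee_index = int(np.argmax(x + y - 1))
    return counts[knee_index]


# Toy totals (hypothetical): four cell-like barcodes, six near-empty ones.
totals = [1000, 900, 800, 700, 3, 2, 2, 1, 1, 1]
threshold = estimate_knee_threshold(totals)
print(threshold)
```

Barcodes with a total at or above the returned value would be kept; on real data the curve is noisier, so treat the estimate as a hint, not a definitive cutoff.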

Yenaled avatar Jan 03 '26 00:01 Yenaled

Thank you

asmlgkj avatar Jan 03 '26 00:01 asmlgkj