
remora 3.0: error when training on different canonical bases

Open Mathias-Boulanger opened this issue 7 months ago • 9 comments

Hi,

I got an error (see below) when running remora dataset prepare with multiple focus bases. I have already trained models in the same spirit using remora 2.0, which is why I don't know whether this is expected behavior...

If this is expected, how can I train a model on a specific mod base while taking into account that other bases/contexts can also be methylated?

Also, a more general question: what is the best practice for training remora models? Should I hold out 10-15% of my training data for validation (and train on the rest), or should I train on everything and run inference on the same dataset?
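
To make the first option concrete, here is a minimal sketch of holding out roughly 15% of reads by read ID. It is plain Python and does not use any Remora API; the function names and the read IDs are hypothetical, and whether Remora already handles a validation split on its own is not something this sketch assumes either way.

```python
import hashlib


def is_validation_read(read_id: str, val_percent: int = 15) -> bool:
    """Assign a read to the validation set based on a hash of its ID.

    Hashing the ID (rather than drawing random numbers) keeps the
    assignment stable across runs and across files.
    """
    digest = hashlib.sha256(read_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < val_percent


def split_read_ids(read_ids, val_percent: int = 15):
    """Split read IDs into (train, validation) lists."""
    train, val = [], []
    for read_id in read_ids:
        if is_validation_read(read_id, val_percent):
            val.append(read_id)
        else:
            train.append(read_id)
    return train, val


if __name__ == "__main__":
    # Hypothetical read IDs, for illustration only.
    example_ids = [f"read_{i}" for i in range(1000)]
    train_ids, val_ids = split_read_ids(example_ids)
    print(len(train_ids), len(val_ids))
```

The two ID lists could then be used to subset the POD5 and BAM files before preparing separate training and validation datasets.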

Thank you for your help

Remora command:

remora dataset prepare \
	--output-path ${wd}data/0_unmeth/prepData/mock_5_CpG_6mA \
	--refine-kmer-level-table ${wd}data/ONT/9mer_levels_v1.txt \
	--refine-rough-rescale \
	--motif CG 0 --motif A 0 \
	--mod-base-control \
	--max-chunks-per-read 20 \
	--num-extract-alignment-workers 24 \
	--num-extract-chunks-workers 24 \
	${wd}data/0_unmeth/0_unmeth.pod5 \
	${wd}data/0_unmeth/0_unmeth.pass.bam

Error log:

[14:37:43.988] Extracting read IDs from POD5
[14:37:49.204] Found 1,242,986 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[14:37:49.302] Making reference-anchored training data
[14:37:49.302] Opening dataset for output
Traceback (most recent call last):
  File "/miniconda3/envs/remora_train/bin/remora", line 8, in <module>
    sys.exit(run())
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/main.py", line 71, in run
    cmd_func(args)
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/parsers.py", line 302, in run_dataset_prepare
    extract_chunk_dataset(
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/prepare_train_data.py", line 165, in extract_chunk_dataset
    metadata=DatasetMetadata(
  File "<string>", line 23, in __init__
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/data_chunks.py", line 847, in __post_init__
    self.check_motifs()
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/data_chunks.py", line 824, in check_motifs
    raise RemoraError(
remora.RemoraError: Cannot create dataset with multiple motif focus bases: {'A', 'C'}
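
For reference, the kind of check the error message describes can be sketched as follows. This is an illustration based only on the message above, not Remora's actual implementation; the function names are hypothetical, and each motif is represented as a (sequence, focus position) pair like the --motif arguments in the command.

```python
def focus_base(motif: str, focus_pos: int) -> str:
    """Return the base at the focus position of a motif, e.g. ("CG", 0) -> "C"."""
    if not 0 <= focus_pos < len(motif):
        raise ValueError(f"Focus position {focus_pos} is outside motif {motif!r}")
    return motif[focus_pos]


def check_single_focus_base(motifs) -> str:
    """Raise if the motifs do not all share one focus base.

    `motifs` is an iterable of (motif, focus_pos) pairs.
    """
    bases = {focus_base(motif, pos) for motif, pos in motifs}
    if len(bases) != 1:
        raise ValueError(
            f"Cannot create dataset with multiple motif focus bases: {sorted(bases)}"
        )
    return bases.pop()


if __name__ == "__main__":
    # Motifs sharing the focus base C pass the check.
    print(check_single_focus_base([("CG", 0), ("CHG", 0)]))
    # The combination from the command above (CG and A) fails it.
    try:
        check_single_focus_base([("CG", 0), ("A", 0)])
    except ValueError as err:
        print(err)
```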

Remora version:

> remora -v
Remora version: 3.0.0

Mathias-Boulanger, Nov 24 '23 13:11