remora 3.0: error when training on different canonical bases
Hi,
I got an error (see below) when running remora dataset prepare with multiple focus bases. I have already trained models this way with remora 2.0, which is why I don't know whether this is expected behavior.
If it is expected, how can I train a model for a specific modified base while accounting for the fact that other bases/contexts can also be methylated?
Also, a more general question: what is the best practice for training and evaluating remora models? Should I hold out 10-15% of my training data for validation (and train on the rest), or should I train on everything and run inference on the same dataset?
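To make the hold-out option concrete, here is a minimal sketch of what I have in mind: assign each read to train or validation by hashing its read ID, so the split is reproducible across runs. The function names and the 10% default are my own assumptions, not part of remora, and filtering the POD5/BAM files by the resulting ID lists is not shown.

```python
import hashlib


def is_validation_read(read_id, val_fraction=0.1):
    """Deterministically decide whether a read belongs to the validation split."""
    digest = hashlib.sha256(read_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < val_fraction


def split_read_ids(read_ids, val_fraction=0.1):
    """Split read IDs into (train, validation) lists without overlap."""
    train, val = [], []
    for read_id in read_ids:
        if is_validation_read(read_id, val_fraction):
            val.append(read_id)
        else:
            train.append(read_id)
    return train, val
```

Splitting by read ID rather than by chunk should keep all chunks from one read on the same side of the split, which I assume is needed to avoid leakage between training and validation.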
Thank you for your help
Remora command:
remora dataset prepare \
--output-path ${wd}data/0_unmeth/prepData/mock_5_CpG_6mA \
--refine-kmer-level-table ${wd}data/ONT/9mer_levels_v1.txt \
--refine-rough-rescale \
--motif CG 0 --motif A 0 \
--mod-base-control \
--max-chunks-per-read 20 \
--num-extract-alignment-workers 24 \
--num-extract-chunks-workers 24 \
${wd}data/0_unmeth/0_unmeth.pod5 \
${wd}data/0_unmeth/0_unmeth.pass.bam
Error log:
[14:37:43.988] Extracting read IDs from POD5
[14:37:49.204] Found 1,242,986 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[14:37:49.302] Making reference-anchored training data
[14:37:49.302] Opening dataset for output
Traceback (most recent call last):
  File "/miniconda3/envs/remora_train/bin/remora", line 8, in <module>
    sys.exit(run())
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/main.py", line 71, in run
    cmd_func(args)
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/parsers.py", line 302, in run_dataset_prepare
    extract_chunk_dataset(
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/prepare_train_data.py", line 165, in extract_chunk_dataset
    metadata=DatasetMetadata(
  File "<string>", line 23, in __init__
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/data_chunks.py", line 847, in __post_init__
    self.check_motifs()
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/data_chunks.py", line 824, in check_motifs
    raise RemoraError(
remora.RemoraError: Cannot create dataset with multiple motif focus bases: {'A', 'C'}
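As far as I can tell from the message, the check collects the base at each motif's focus position and refuses to continue when there is more than one. The sketch below is my own reconstruction of that logic from the error text, not remora's actual code; `MotifError`, `focus_bases` and `check_single_focus_base` are hypothetical names.

```python
class MotifError(Exception):
    """Stand-in for remora.RemoraError in this sketch."""


def focus_bases(motifs):
    """Return the set of canonical bases found at each motif's focus position."""
    return {seq[pos] for seq, pos in motifs}


def check_single_focus_base(motifs):
    """Return the shared focus base, or raise if the motifs disagree."""
    bases = focus_bases(motifs)
    if len(bases) != 1:
        raise MotifError(
            "Cannot create dataset with multiple motif focus bases: "
            f"{sorted(bases)}"
        )
    return next(iter(bases))


# The two motifs from my command: "--motif CG 0 --motif A 0"
motifs_from_command = [("CG", 0), ("A", 0)]
```

With `motifs_from_command`, the focus bases are C (position 0 of CG) and A, so a check of this kind would reject the dataset, which matches the `{'A', 'C'}` in the error above.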
Remora version:
> remora -v
Remora version: 3.0.0