remora
100% training accuracy - unable to navigate mistake
Hi Marcus,
I managed to make a dataset that has separate 10+ nucleotide handles on either side of the unmodified-G and modified-G randomers, with the structures LeftHandle1-NNNNGNNNN-RightHandle1 and LeftHandle2-NNNNGNNNN-RightHandle2.
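For concreteness, here is roughly how the constructs are laid out (the handle sequences below are made-up placeholders, not my actual handles):

```python
import random

random.seed(0)

# Placeholder handles -- stand-ins for the real LeftHandle1/2 and RightHandle1/2.
# The control handles flank the randomer so the focus site matches TNNNNGNNNNG,
# the mod handles so it matches CNNNNGNNNNT.
HANDLES = {
    "control": ("ACGTACGTACT", "GTCAGTCAGTC"),  # left ends in T, right starts with G
    "mod": ("TGCATGCATGC", "TACGTACGTAC"),      # left ends in C, right starts with T
}


def make_construct(kind):
    """Build LeftHandle-NNNNGNNNN-RightHandle with random N bases."""
    left, right = HANDLES[kind]
    rand = lambda n: "".join(random.choice("ACGT") for _ in range(n))
    return left + rand(4) + "G" + rand(4) + right


seq = make_construct("control")
```

With 11 nt handles, the focus G always lands at 0-based offset 15 of the construct.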
And I'm trying to extract chunks from this dataset but I don't think I'm doing it right. Here is my code for extracting chunks:
remora \
dataset prepare \
pod5Dir/merged.pod5 \
bamDir/v02.2_trimmed_aligned_moves.bam \
--output-path $outDir/controlchunks \
--refine-kmer-level-table $outDir/9mer_levels_v2.txt \
--refine-rough-rescale \
--motif TNNNNGNNNNG 5 \
--mod-base-control \
--num-extract-chunks-workers 2
remora \
dataset prepare \
pod5Dir/merged.pod5 \
bamDir/v02.2_trimmed_aligned_moves.bam \
--output-path $outDir/modchunks \
--refine-kmer-level-table $outDir/9mer_levels_v2.txt \
--refine-rough-rescale \
--motif CNNNNGNNNNT 5 \
--mod-base o 8oxoG \
--num-extract-chunks-workers 2
Then I made a config and trained the model:
remora \
dataset make_config \
train_dataset.jsn \
controlchunks \
modchunks \
--dataset-weights 1 1 \
--log-filename train_dataset.log
remora \
model train \
train_dataset.jsn \
--model models/ConvLSTM_w_ref.py \
--device 0 \
--chunk-context 50 50 \
--output-path train_results
And this is where things start feeling fishy: the training output reports the central position as 7, but it shouldn't be. The central/focus base should be the G in the middle, which is the 6th position (index 5 in 0-based Python terms).
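For what it's worth, I double-checked that the focus offset I passed to --motif is the 0-based index of the central G in both motifs:

```python
# --motif takes a motif string plus a 0-based offset of the focus base within it.
control_motif = "TNNNNGNNNNG"
mod_motif = "CNNNNGNNNNT"
focus = 5  # the value passed on the command line

# The 6th character (index 5) is the central G in both cases,
# not the trailing G of the control motif.
assert control_motif[focus] == "G"
assert mod_motif[focus] == "G"
```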
Output:
(base) [moa42@cayuga-login err]$ cat remora_train_295err
[56] Seed selected is 92422
[5242] Loading dataset from Remora dataset config
[5] Dataset summary
size 26,59
modified_base_labels True
mod_bases ['o']
mod_long_names ['8oxoG']
kmer_context_bases (4, 4)
chunk_context (50, 50)
motifs [('CNNNNGNNNNT', 5), ('TNNNNGNNNNG', 5)]
reverse_signal False
chunk_extract_base_start False
chunk_extract_offset 0
sig_map_refiner Loaded 9-mer table with central position. Rough re-scaling will be executed
[5] Loading model
[646] Model structure
network(
(sig_conv1): Conv1d(1, 4, kernel_size=(5,), stride=(1,))
(sig_bn1): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(sig_conv2): Conv1d(4, 16, kernel_size=(5,), stride=(1,))
(sig_bn2): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(sig_conv3): Conv1d(16, 64, kernel_size=(9,), stride=(3,))
(sig_bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(seq_conv1): Conv1d(36, 16, kernel_size=(5,), stride=(1,))
(seq_bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(seq_conv2): Conv1d(16, 64, kernel_size=(13,), stride=(3,))
(seq_bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(merge_conv1): Conv1d(128, 64, kernel_size=(5,), stride=(1,))
(merge_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(lstm1): LSTM(64, 64)
(lstm2): LSTM(64, 64)
(fc): Linear(in_features=64, out_features=2, bias=True)
(dropout): Dropout(p=0.3, inplace=False)
)
[442] Params (k) 4 | MACs (M) 245
[442] Preparing training settings
[54] Dataset loaded with labels control9,; 8oxoG42,24
[549] Train labels control4,; 8oxoG,24
[549] Held-out validation labels control5,; 8oxoG5,
[549] Training set validation labels control5,; 8oxoG5,
[549] Running initial validation
Batches 5it [, 2s/it]
Batches 5it [, 45it/s]
[9645] Start training
Epochs 2%|█▏ | acc_train=1.0000, acc_val=0.9996
Epoch Progress 100%|██████████| 4/4
[4962] No validation accuracy improvement after epochs. Training stopped early
[4962] Saving final model checkpoint
[49522] Done
There are indeed a number of issues with this setup. To start, Remora is not intended to process randomer strands directly. Randomer datasets are processed into Remora datasets using the Betta program, so I would urge you to join the Betta program if the randomer approach is essential to your project.
To address the issues with the processing directly: the --motif TNNNNGNNNNG 5 argument sets the motif for the resulting dataset. Without other arguments, this means that only locations (in both the basecalls and the reference) matching this motif are included in the resulting dataset, and that the resulting model will only make calls within TNNNNGNNNNG sequences of basecalls. It also means that every site where the reference and basecalls match this motif will be included in the resulting dataset, not only the sites in your designed constructs. Overall it does not sound as though this is the intended target for this dataset.
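To illustrate the point with a plain regex (this is only a sketch of motif semantics, not Remora's actual matcher): any stretch of the form T····G····G satisfies the motif, regardless of whether it falls at one of your handle junctions.

```python
import re

# The motif TNNNNGNNNNG as a regex, with N matching any base.
motif_re = re.compile("T[ACGT]{4}G[ACGT]{4}G")

# A made-up reference containing a chance occurrence of the pattern,
# nowhere near a designed handle junction.
reference = "GGTAAAAGCCCCGTT"

sites = [m.start() for m in motif_re.finditer(reference)]
focus_positions = [s + 5 for s in sites]  # focus offset 5 within each match

print(sites, focus_positions)  # the chance match is still selected
```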
The primary target for Remora-only data preparation and models (without access to Betta) is annotating reference locations as canonical or modified bases. The motif argument is intended to limit the model to particular motifs, not to act as the selection criterion for training chunks.
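If Betta is not an option and the reference sequences for your constructs are known, a more typical route is to mark the known modified reference positions explicitly, e.g. with a BED file of focus positions (assuming your Remora version supports the --focus-reference-positions option of dataset prepare), rather than relying on the motif as the chunk-selection criterion. A hypothetical sketch of writing such a BED, assuming the focus G sits at a fixed offset in every construct:

```python
# Hypothetical layout: Handle (11 nt) + NNNNGNNNN + Handle (11 nt),
# so the focus G is at 0-based offset 15 on every reference sequence.
reference_names = ["construct_0001", "construct_0002"]  # placeholder names
focus_offset = 11 + 4

with open("focus_positions.bed", "w") as bed:
    for name in reference_names:
        # BED intervals are 0-based, half-open: [start, end)
        bed.write(f"{name}\t{focus_offset}\t{focus_offset + 1}\t.\t.\t+\n")
```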
It also looks like you are acquiring very few training chunks. There appear to be some copying errors in the pasted log, but the label counts look low. We generally recommend at least 1 million chunks for training; training with fewer examples will likely lead to overtraining on the examples provided.
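As a back-of-the-envelope check (the per-read yield of one chunk is an assumption based on your short, single-focus-site constructs):

```python
# Rough estimate of the sequencing depth needed for the recommended chunk count.
recommended_chunks = 1_000_000
chunks_per_read = 1  # one focus G per construct, so roughly one chunk per read

reads_needed = recommended_chunks // chunks_per_read
```

So on constructs like these, reaching the recommendation takes on the order of a million reads per label class.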
I hope this helps get you on track to processing your samples. Please post here if any further assistance is needed.