
100% training accuracy - unable to navigate mistake

Open · moa4020 opened this issue 3 months ago · 1 comment

Hi Marcus,

I managed to make a dataset that has separate 10+ nucleotide handles on either side of the unmod G and mod G randomers with the structure: LeftHandle1-NNNNGNNNN-RightHandle1 and LeftHandle2-NNNNGNNNN-RightHandle2.

And I'm trying to extract chunks from this dataset but I don't think I'm doing it right. Here is my code for extracting chunks:

remora \
  dataset prepare \
  pod5Dir/merged.pod5 \
  bamDir/v02.2_trimmed_aligned_moves.bam \
  --output-path $outDir/controlchunks \
  --refine-kmer-level-table $outDir/9mer_levels_v2.txt \
  --refine-rough-rescale \
  --motif TNNNNGNNNNG 5 \
  --mod-base-control \
  --num-extract-chunks-workers 2

remora \
  dataset prepare \
  pod5Dir/merged.pod5 \
  bamDir/v02.2_trimmed_aligned_moves.bam \
  --output-path $outDir/modchunks \
  --refine-kmer-level-table $outDir/9mer_levels_v2.txt \
  --refine-rough-rescale \
  --motif CNNNNGNNNNT 5 \
  --mod-base o 8oxoG \
  --num-extract-chunks-workers 2

Then I made a config and trained the model:

remora \
  dataset make_config \
  train_dataset.jsn \
  controlchunks \
  modchunks \
  --dataset-weights 1 1 \
  --log-filename train_dataset.log
remora \
  model train \
  train_dataset.jsn \
  --model models/ConvLSTM_w_ref.py \
  --device 0 \
  --chunk-context 50 50 \
  --output-path train_results
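
A quick sanity check before training (assuming the .jsn file is plain JSON, which is what make_config produces) is to pretty-print the config to confirm the dataset paths and weights it recorded:

python -m json.tool train_dataset.jsn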

And this is where things start to feel fishy: the training output reports the central position as 7, but it shouldn't be. The central/focus base should be the G in the middle, which is the 6th position (index 5 in Python).
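
For reference, the focus offset is 0-based, so index 5 of the motif is the middle G; a one-liner check:

python -c 'print("TNNNNGNNNNG"[5])'  # prints G, the intended focus base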

Output:

(base) [moa4020@cayuga-login err]$ cat remora_train_295.err
[56] Seed selected is 92422
[5242] Loading dataset from Remora dataset config
[5] Dataset summary
                     size  26,59
     modified_base_labels  True
                mod_bases  ['o']
           mod_long_names  ['8oxoG']
       kmer_context_bases  (4, 4)
            chunk_context  (50, 50)
                   motifs  [('CNNNNGNNNNT', 5), ('TNNNNGNNNNG', 5)]
           reverse_signal  False
 chunk_extract_base_start  False
     chunk_extract_offset  0
          sig_map_refiner  Loaded 9-mer table with 7 central position. Rough re-scaling will be executed.

[5] Loading model
[646] Model structure
network(
  (sig_conv1): Conv1d(1, 4, kernel_size=(5,), stride=(1,))
  (sig_bn1): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (sig_conv2): Conv1d(4, 16, kernel_size=(5,), stride=(1,))
  (sig_bn2): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (sig_conv3): Conv1d(16, 64, kernel_size=(9,), stride=(3,))
  (sig_bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (seq_conv1): Conv1d(36, 16, kernel_size=(5,), stride=(1,))
  (seq_bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (seq_conv2): Conv1d(16, 64, kernel_size=(13,), stride=(3,))
  (seq_bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (merge_conv1): Conv1d(128, 64, kernel_size=(5,), stride=(1,))
  (merge_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (lstm1): LSTM(64, 64)
  (lstm2): LSTM(64, 64)
  (fc): Linear(in_features=64, out_features=2, bias=True)
  (dropout): Dropout(p=, inplace=False)
)
[442] Params (k) 4 | MACs (M) 245
[442] Preparing training settings
[54] Dataset loaded with labels: control 9,; 8oxoG 42,24
[549] Train labels: control 4,; 8oxoG ,24
[549] Held-out validation labels: control 5,; 8oxoG 5,
[549] Training set validation labels: control 5,; 8oxoG 5,
[549] Running initial validation
Batches: 5it [, 2s/it]
Batches: 5it [, 45it/s]
[9645] Start training
Epochs:  2%|█▏        | 2/ [acc_train=0.9999, acc_val=0.9996, loss_train=, loss_val=]
Epoch Progress: 100%|██████████| 4/4
[4962] No validation accuracy improvement after  epochs. Training stopped early
[4962] Saving final model checkpoint
[49522] Done

moa4020 · Mar 11 '24

There are indeed a number of issues with this setup. I'll start with the fact that Remora is not intended to process randomer strands directly: randomer datasets are processed into Remora datasets using the Betta program. I would urge you to join the Betta program if the randomer approach is essential to your project.

To address the issues with the processing directly: the --motif TNNNNGNNNNG 5 argument sets the motif for the resulting dataset. Without other arguments, this means that only locations (in both basecalls and reference) matching this motif are included in the resulting dataset, but it also means that the resulting model will only make calls within TNNNNGNNNNG basecall sequences, and that every site where the reference and basecalls match this motif will be included, whether or not it is one of your intended randomer positions. Overall it does not sound as though this is the intended target for this dataset.

The primary target for Remora-only data preparation and models (i.e., without access to Betta) is annotating reference locations with canonical or modified bases. The --motif argument is intended to limit the model to particular motifs, not necessarily to act as the selection criterion for the training chunks.
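
For instance, if the intent is to call 8oxoG at any reference G, a more typical Remora-only preparation would use a minimal single-base focus motif. A sketch reusing your paths (the modchunks_anyG output name is just illustrative):

remora \
  dataset prepare \
  pod5Dir/merged.pod5 \
  bamDir/v02.2_trimmed_aligned_moves.bam \
  --output-path $outDir/modchunks_anyG \
  --refine-kmer-level-table $outDir/9mer_levels_v2.txt \
  --refine-rough-rescale \
  --motif G 0 \
  --mod-base o 8oxoG \
  --num-extract-chunks-workers 2

Here --motif G 0 marks every G (focus position 0 of the one-base motif), so the model is trained and applied at all Gs rather than only within the handle context.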

It also looks like you are acquiring very few training chunks. There appear to be some copying errors in the pasted log, but the label counts suggest the numbers are small. We generally recommend at least 1 million chunks for training; training with far fewer examples will likely lead to overfitting to the examples provided.
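
As a rough scale check (assuming samtools is available; this is just a suggestion, not a Remora requirement): since each strand in your design carries a single focus G, the primary aligned read count approximates the maximum number of chunks available per motif:

samtools view -c -F 0x904 bamDir/v02.2_trimmed_aligned_moves.bam  # count primary, mapped reads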

I hope this helps you along the track to processing your samples. Please post here if any further assistance is needed.

marcus1487 · Mar 12 '24