
Workflow to extract chunks for a randomer dataset

Open moa4020 opened this issue 2 months ago • 1 comments

Hi Marcus,

We have finally managed to get a dataset that has good coverage of randomers with and without 8-oxoG at the center surrounded by 4 random bases on either side. I would like to train a model on this dataset by extracting 5-mer chunks and would like your help with extracting these chunks from my dataset.

Do I start off by trimming the reads to isolate the randomer and then segment/extract 5-mer chunks out of each 9-mer? Or is there a better way to use remora to do this?

Thanks, Mohith

moa4020 avatar May 01 '24 04:05 moa4020

Remora does not directly support randomer processing. Randomer processing is quite a bit more involved and thus has been stored in the Betta repository. I would recommend contacting technical/customer support in order to apply for access to Betta.

At a high level though, 5-mers are not likely to be a large enough random context to train a robust model. Remora does not extract chunks of fixed sequence length, but instead extracts fixed signal length chunks. These contain variable widths of sequence, so the constant sequence outside of your randomer would be included in many chunks. Applying this model to a new chunk of data without the same context may have unexpected results. We would recommend at least 20, and ideally >40, random bases around the focus base of the randomer.
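To illustrate why a fixed signal length chunk covers a variable number of bases, here is a minimal Python sketch. It is not Remora's API: the function `bases_in_chunk`, the per-base dwell times (in signal samples), and the chunk length of 50 samples are all hypothetical values chosen for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def bases_in_chunk(dwells, focus_idx, chunk_len):
    """Return (first, last) base indices overlapped by a fixed-length
    signal chunk centered on the focus base.

    dwells: hypothetical number of signal samples assigned to each base.
    """
    # starts[i] is the first signal sample of base i; starts[-1] is the total length.
    starts = [0] + list(accumulate(dwells))
    center = (starts[focus_idx] + starts[focus_idx + 1]) // 2
    # Clip the chunk to the available signal.
    lo = max(0, center - chunk_len // 2)
    hi = min(starts[-1], lo + chunk_len)
    # Map the first and last signal samples of the chunk back to base indices.
    first = bisect_right(starts, lo) - 1
    last = bisect_right(starts, hi - 1) - 1
    return first, last


# A 9-mer read slowly (10 samples per base): the chunk spans 5 bases.
print(bases_in_chunk([10] * 9, focus_idx=4, chunk_len=50))
# The same 9-mer read faster (5 samples per base): the chunk spans all 9 bases.
print(bases_in_chunk([5] * 9, focus_idx=4, chunk_len=50))
```

With the same chunk length, the faster read pulls in every base of the 9-mer, and with real signal it would extend into the constant flanking sequence as well. This is the reason a short random context tends to leak fixed sequence into the training chunks.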

I hope this helps a bit, and would be happy to help further if you are able to gain access to Betta.

marcus1487 avatar May 01 '24 04:05 marcus1487