How to transform our HiC matrix data or processed reads to the training .bedpe file.

Open Yulong663 opened this issue 5 years ago • 7 comments

Hi Xiaotao! Introducing a machine learning framework into loop detection is a brilliant idea.
Recently I have been working with Hi-C data and trying to use Peakachu to identify loops. However, I'm confused by the format of the training data: I'm not sure what the seventh column of the .bedpe file is. Could you tell me what it contains, or add an explanation to the corresponding part of the README?
Thanks a lot, and I look forward to your reply.

Yulong663 avatar Jul 22 '20 13:07 Yulong663

Hi, the first 6 columns of the bedpe file are just interaction coordinates (https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format). The 7th column is optional and will not be used in the training. Let me know if it's not clear.
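To make the column layout concrete, here is a minimal Python sketch that reads BEDPE text and keeps only the six coordinate columns, ignoring anything after them. The function name `read_bedpe_coordinates` and the example rows are made up for illustration; this is not part of Peakachu itself.

```python
import csv
import io


def read_bedpe_coordinates(handle):
    """Yield (chrom1, start1, end1, chrom2, start2, end2) from tab-separated BEDPE text.

    Hypothetical helper for illustration only. Columns after the sixth
    (such as an optional seventh score column) are ignored.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(row)}: {row}")
        chrom1, start1, end1, chrom2, start2, end2 = row[:6]
        yield chrom1, int(start1), int(end1), chrom2, int(start2), int(end2)


# Made-up example: the first row carries a 7th column, the second does not.
example = (
    "chr1\t1000000\t1010000\tchr1\t1500000\t1510000\t0.98\n"
    "chr2\t200000\t210000\tchr2\t900000\t910000\n"
)
records = list(read_bedpe_coordinates(io.StringIO(example)))
```

Both rows produce the same kind of six-field record, which mirrors the point that the seventh column is optional and unused in training.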

XiaoTaoWang avatar Jul 22 '20 16:07 XiaoTaoWang

Thanks for the clarification. Another question: how do I turn my data into positive and negative training data? Is the positive training data derived from the locations of loops previously identified by other loop callers? Thanks.

Yulong663 avatar Jul 23 '20 10:07 Yulong663

You only provide a positive set. Any set of coordinates will be accepted, but ideally it should come from either an experiment different from the one you're training for (it makes less sense to use HiC loops to train a HiC caller, since this probably introduces bias) or a set of high-confidence, manually selected interactions. The negative set is generated automatically from random coordinates that follow a distance distribution based on the properties of the training set. Specifically, the negative set contains a set of distance-matched random coordinates and another set of random longer-distance interactions. The former helps to train a model that produces loops similar to those in the positive set source, and the latter helps to distinguish loops from the general background (fewer false positives at longer range).

hope that explains how the training set is used.
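The two-part negative set described above can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not Peakachu's actual sampling code: coordinates are represented as bin-index pairs on a single chromosome, and the function name `sample_negative_pairs` and the parameters `n_bins`, `min_long_distance`, and `n_long` are all hypothetical.

```python
import random


def sample_negative_pairs(positives, n_bins, min_long_distance, n_long, seed=0):
    """Toy sketch of a two-part negative set (not Peakachu's real implementation).

    positives: list of (bin1, bin2) pairs with bin1 < bin2 < n_bins.
    Requires min_long_distance < n_bins.
    Returns (distance_matched, long_range) lists of bin pairs.
    """
    rng = random.Random(seed)
    taken = set(positives)  # never reuse a positive or an earlier negative
    distance_matched, long_range = [], []

    # Part 1: for each positive, draw a random pair with the same separation.
    for bin1, bin2 in positives:
        distance = bin2 - bin1
        for _ in range(100):  # bounded retries to avoid an endless loop
            start = rng.randrange(n_bins - distance)
            pair = (start, start + distance)
            if pair not in taken:
                taken.add(pair)
                distance_matched.append(pair)
                break

    # Part 2: draw random pairs at separations of at least min_long_distance.
    for _ in range(100 * n_long):
        if len(long_range) == n_long:
            break
        distance = rng.randrange(min_long_distance, n_bins)
        start = rng.randrange(n_bins - distance)
        pair = (start, start + distance)
        if pair not in taken:
            taken.add(pair)
            long_range.append(pair)

    return distance_matched, long_range


# Made-up positives on a chromosome divided into 1000 bins.
positives = [(10, 25), (40, 52), (100, 130)]
matched, distant = sample_negative_pairs(
    positives, n_bins=1000, min_long_distance=200, n_long=3
)
```

Here `matched` shares the separation distribution of the positives, while `distant` covers the longer-range background.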

tariks avatar Jul 23 '20 11:07 tariks

Thanks for clarifying how the positive and negative training sets work. From the methods section of the published paper, my understanding is that the positive training set is a set of coordinates around loops. What I'm really asking is: how should I choose a set of positive training coordinates? And what does "a set of high-confidence manually selected interactions" mean? High confidence in what (that they are loops)? Thanks a lot for your prompt reply 👍

Yulong663 avatar Jul 23 '20 12:07 Yulong663

By the way, is it a good idea to use the training set provided in the repository to train my own model?

Yulong663 avatar Jul 23 '20 13:07 Yulong663

The positive set follows the garbage-in, garbage-out principle. Peakachu is intended to receive coordinates determined by another technology (ChIA-PET, etc.) in the same cell type. When we tried manually selecting 200 obvious loops from the same HiC map, we got comparable results. Ideally, Peakachu tries to answer the question "can I find loops in X experiment that are similar to Y experiment?" The training sets included in the repo are cell-line specific for the training step, but the resulting model can be applied to any cell type at a similar read depth.

tariks avatar Jul 23 '20 14:07 tariks

Thanks tariks :)

Yulong663 avatar Jul 24 '20 02:07 Yulong663