peakachu: How to transform our Hi-C matrix data or processed reads into the training .bedpe file

Hi Xiaotao! Introducing a machine learning framework into loop detection is a brilliant idea.
I have recently been working with Hi-C data and am trying to use Peakachu to identify loops. However, I'm confused by the format of the training data: I'm not sure what the seventh column of the .bedpe file is. Could you tell me what it is, or add some explanation to the corresponding part of the README?
Thanks a lot; I look forward to your reply.

Jul 22 '20 13:07 Yulong663

Hi, the first 6 columns of the bedpe file are just interaction coordinates (https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format). The 7th column is optional and will not be used in the training. Let me know if it's not clear.
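
For illustration, here is a minimal sketch of reading such a file in Python while keeping only the six coordinate columns (chrom1, start1, end1, chrom2, start2, end2, per the bedtools BEDPE description linked above). The function name `read_bedpe_coords` and the sample records are hypothetical and are not part of peakachu.

```python
from typing import Iterable, List, Tuple

# (chrom1, start1, end1, chrom2, start2, end2)
Interaction = Tuple[str, int, int, str, int, int]


def read_bedpe_coords(lines: Iterable[str]) -> List[Interaction]:
    """Return the six coordinate columns of each record, ignoring any extras."""
    records: List[Interaction] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(fields)}: {line!r}")
        c1, s1, e1, c2, s2, e2 = fields[:6]
        records.append((c1, int(s1), int(e1), c2, int(s2), int(e2)))
    return records


# Made-up records: the first carries an optional 7th column, the second does not.
example = [
    "chr1\t1000000\t1010000\tchr1\t1500000\t1510000\t0.98",
    "chr1\t2000000\t2010000\tchr1\t2300000\t2310000",
]
coords = read_bedpe_coords(example)
```

Both records parse to the same six-field shape, which is all that matters if the 7th column is ignored.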

Jul 22 '20 16:07 XiaoTaoWang

Thanks for the clarification. Another question: how do I transform the data into positive and negative training sets? Was the positive training data derived from the locations of pre-existing loops identified by other loop callers? Thanks.

Jul 23 '20 10:07 Yulong663

You only provide a positive set. Any set of coordinates will be accepted, but ideally it should come from either an experiment different from the one you're training for (it makes less sense to use Hi-C loops to train a Hi-C caller, since this probably introduces bias) or a set of high-confidence, manually selected interactions. The negative set is generated automatically from random coordinates that follow a distance distribution based on the properties of the training set. Specifically, the negative set contains a set of distance-matched random coordinates and another set of random longer-distance interactions. The former helps train a model that produces loops similar to those in the positive set's source, and the latter helps distinguish loops from the general background (fewer false positives at longer range).
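
To make the idea concrete, here is a rough sketch of the two kinds of negatives described above. It is not peakachu's actual code: the function name `sample_negatives`, the bin-index representation, and the `long_factor` cutoff for "longer distance" are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[int, int]  # (bin_i, bin_j) with bin_i < bin_j, on one chromosome


def sample_negatives(
    positives: List[Pair],
    n_bins: int,
    n_matched: int,
    n_long: int,
    long_factor: int = 2,  # assumed: "longer" means up to 2x the largest positive distance
    seed: int = 0,
) -> Tuple[List[Pair], List[Pair]]:
    """Draw distance-matched and longer-distance random pairs that avoid the positives."""
    rng = random.Random(seed)
    pos_set = set(positives)
    distances = [j - i for i, j in positives]
    max_d = max(distances)
    long_hi = min(long_factor * max_d, n_bins - 1)
    if long_hi <= max_d:
        raise ValueError("chromosome too short for longer-distance negatives")

    def draw(pick_distance: Callable[[], int], count: int) -> List[Pair]:
        out = set()
        attempts = 0
        # Cap the attempts so the loop ends even if few valid pairs exist;
        # in that case fewer than `count` pairs are returned.
        while len(out) < count and attempts < 100 * count:
            attempts += 1
            d = pick_distance()
            i = rng.randrange(0, n_bins - d)
            pair = (i, i + d)
            if pair not in pos_set:
                out.add(pair)
        return sorted(out)

    # Distances resampled from the positive set give distance-matched negatives.
    matched = draw(lambda: rng.choice(distances), n_matched)
    # Distances beyond the largest positive distance give long-range background.
    longer = draw(lambda: rng.randint(max_d + 1, long_hi), n_long)
    return matched, longer


positives = [(10, 20), (30, 45), (100, 130)]
matched, longer = sample_negatives(positives, n_bins=1000, n_matched=5, n_long=5)
```

The matched pairs share the positive set's distance distribution, so a classifier cannot separate the classes by distance alone, while the longer pairs sample the background beyond the range of the positives.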

Hope that explains how the training set is used.

Jul 23 '20 11:07 tariks

Thanks for clarifying how the positive and negative training sets work. From the Methods section of the published paper, my understanding is that the positive training set is a set of coordinates around loops. What I'm really asking is: how should I choose a set of positive training coordinates? And what does "a set of high-confidence manually selected interactions" mean? High confidence in what (that they are loops)? Thanks a lot for your prompt reply 👍

Jul 23 '20 12:07 Yulong663

By the way, is it a good idea to use the training set provided in the repository to train my own model?

Jul 23 '20 13:07 Yulong663

The positive set follows the garbage-in, garbage-out principle. Peakachu is intended to receive coordinates determined by another technology (ChIA-PET, etc.) in the same cell type. When we tried manually selecting 200 obvious loops from the same Hi-C map, we got comparable results. Ideally, Peakachu tries to answer the question "can I find loops in X experiment that are similar to Y experiment?" The training sets included in the repo are cell-line specific for the training step, but the resulting model can be applied to any cell type at a similar read depth.
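
As a rough illustration of turning a hand-picked or externally derived loop list into the six coordinate columns of a .bedpe file, here is a small sketch. The helper name `loops_to_bedpe`, the 10 kb default resolution, and the choice to snap each anchor to its containing bin are assumptions for this example, not something confirmed in this thread.

```python
from typing import Iterable, List, Tuple

Loop = Tuple[str, int, int]  # (chrom, anchor1_position, anchor2_position)


def loops_to_bedpe(loops: Iterable[Loop], resolution: int = 10_000) -> List[str]:
    """Format intra-chromosomal loops as tab-separated six-column BEDPE lines."""
    lines = []
    for chrom, pos1, pos2 in loops:
        upstream, downstream = sorted((pos1, pos2))
        # Snap each anchor to the start of the bin that contains it.
        start1 = (upstream // resolution) * resolution
        start2 = (downstream // resolution) * resolution
        fields = (chrom, start1, start1 + resolution, chrom, start2, start2 + resolution)
        lines.append("\t".join(str(f) for f in fields))
    return lines


# Made-up coordinates, deliberately given in reverse order.
bedpe_lines = loops_to_bedpe([("chr1", 1_234_567, 1_004_321)])
```

Writing `"\n".join(bedpe_lines)` to a file yields a positive set in the coordinate layout discussed above.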

Jul 23 '20 14:07 tariks

Thanks tariks :)

Jul 24 '20 02:07 Yulong663