TOBIAS
Different locus numbers across conditions and Strategies for Multi-Tissue TF Footprint Score Comparison
Hello, I appreciate your continued assistance. It has been very useful to me!
I am working with data from multiple conditions, and I want to identify which TFs are important in each condition within certain BED regions, similar to what you did in your research.
I've observed significant variations in the locus numbers within BINDetect results across different conditions. For instance, in <condition1_bindetect>/<TF1>/TF1_overview.txt, there are over 20,000 rows, whereas in <condition2_bindetect>/<TF1>/TF1_overview.txt, there are over 30,000 rows.
Based on this description, I initially presumed that the locus numbers would be consistent across conditions.
Based on this, could F[i, i+Wf] < 0 be the reason that some TFBS have no output? I noticed that some sites in the output have TFBS_footprints_condition_score=0.
Here are my questions:
- I want to compare the importance of TF1 in my region of interest across conditions, but the number of binding sites differs between conditions. For example, BINDetect reports 7 sites in condition1, and the same 7 sites plus 3 more in condition2. Should I use the maximum TF_condition score or the mean TF_condition score? I lean towards the mean from a biological perspective, but I'm unsure whether it is fair to divide condition1's total by 7 when condition1 may have no binding at the three other sites, unlike condition2. On the other hand, dividing by 10 would treat the footprint scores at those three sites as 0, which is not necessarily true: as I mentioned earlier, sites with TFBS_footprints_condition_score=0 do appear in the output.
- After obtaining the mean TF_condition score for each condition, you mentioned that
So I think I may not need to perform additional normalization, right? However, I've noticed a clear bias in certain situations, and biologically it seems improbable that all TFs would exhibit this pattern.
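To make the mean-versus-max question concrete, here is a minimal sketch of the options I am weighing. The numbers are made-up toy scores, not my real data, and the function name `summarize` and the site IDs are hypothetical. It assumes that a site missing from a condition can be treated as score 0 in the "union" variant, which is exactly the assumption I am unsure about.

```python
from statistics import mean


def summarize(scores_by_condition):
    """Summarize footprint scores per condition.

    scores_by_condition: {condition: {site_id: footprint_score}}
    (hypothetical layout, built from each condition's TF overview table).
    """
    # Union of all sites reported in at least one condition.
    union = set()
    for sites in scores_by_condition.values():
        union.update(sites)

    summary = {}
    for cond, sites in scores_by_condition.items():
        values = list(sites.values())
        if not values:
            continue  # no sites reported for this condition
        summary[cond] = {
            "n_sites": len(values),
            "max": max(values),
            # Option A: divide by the condition's own number of sites.
            "mean_own_sites": mean(values),
            # Option B: divide by the union size, i.e. ASSUME missing sites score 0.
            "mean_union_zero_filled": sum(values) / len(union),
        }
    return summary


# Toy example: 7 sites in condition1, the same 7 plus 3 more in condition2.
cond1 = {f"site{i}": 1.0 for i in range(1, 8)}
cond2 = {f"site{i}": 1.0 for i in range(1, 11)}
result = summarize({"condition1": cond1, "condition2": cond2})

for cond, stats in result.items():
    print(cond, stats)
```

With identical per-site scores, option A gives both conditions the same mean, while option B penalizes condition1 for the three missing sites. Which of the two is closer to the intended interpretation is what I would like to understand.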
Could I be overlooking something? To clarify, I use ATAC peaks from the entire genome as input. I then employ bedtools intersect with the BINDetect results and the regions of interest to obtain the footprint scores within those specific regions. This approach differs from directly using the peaks within my regions of interest as input.
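For clarity, this is roughly what my intersect step does, written as a small pure-Python sketch rather than my actual command. The tuples, coordinates, and scores are made-up examples, and the helper names `overlaps` and `sites_in_regions` are hypothetical. It assumes BED-style half-open intervals and keeps a site if it overlaps any region of interest.

```python
def overlaps(site, region):
    """True if a site and a region share at least one base (half-open BED intervals)."""
    chrom_a, start_a, end_a = site[:3]
    chrom_b, start_b, end_b = region
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def sites_in_regions(sites, regions):
    """Keep each site once if it overlaps any region of interest."""
    return [s for s in sites if any(overlaps(s, r) for r in regions)]


# Toy genome-wide sites: (chrom, start, end, footprint_score).
sites = [
    ("chr1", 100, 120, 0.8),
    ("chr1", 500, 520, 0.0),
    ("chr2", 100, 120, 1.2),
]
# Toy regions of interest: (chrom, start, end).
regions = [("chr1", 90, 200), ("chr2", 300, 400)]

kept = sites_in_regions(sites, regions)
print(kept)
```

The point is that the scores come from a genome-wide BINDetect run and are only subset afterwards, so any normalization happened over all peaks rather than over my regions of interest.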
I apologize for the barrage of questions, and I hope you have a wonderful Halloween!