Non-human species

Open FarmOmics opened this issue 1 year ago • 3 comments

Can this software be used in non-human species, e.g. cattle? If yes, how can I build a pre-trained model? Can the human model be extended to other species?

FarmOmics avatar Oct 07 '22 20:10 FarmOmics

Hi @FarmOmics.

Yes, Avocado can be applied to any compendium of bulk genomic experiments. However, you need many experiments across tissues and assays for Avocado to be accurate. I don't know whether cattle have that many genomic experiments performed in them.

Yes, the human model can be extended to other species (see https://www.biorxiv.org/content/10.1101/801183v3) if you have an alignment between species genomes or can remap the reads from the experiments performed in human to the cattle genome. The first is less computationally intensive, because you don't need to remap several thousand experiments. However, you still need to have many experiments performed in cattle.

Let me know if you have any other questions.

jmschrei avatar Oct 08 '22 16:10 jmschrei

I have cattle ChIP-seq data for five marks and ~20 tissues. Is this set of data enough to train a model? Your training input is in npz format, e.g. E117.H3K9me3.pilot.arcsinh.npz. Could you describe the detailed steps for preparing this kind of data? I do have -log10 p-values for the ChIP-seq signals, by the way. If I want to integrate human ChIP-seq data to train the model, can I liftover the human ChIP-seq signals to cattle coordinates?

FarmOmics avatar Oct 10 '22 16:10 FarmOmics

The way that Avocado is set up is that it can make predictions, even across species, for any assay that is measured at least once and any cell type that is assayed at least once. However, the predictions will be higher quality the more assays are available and the more related they are to the activity you're trying to predict. If you're trying to predict the binding of a very cell type-specific TF and only have a few histone modifications, you probably won't get great accuracy. But, if you're just trying to predict transcription from those histone modifications, you'll likely do pretty well because many histone modifications are correlated with expression.

The way you get your data into the model is just by extracting the -log10 p-values from your bigWig, probably using pyBigWig, and binning those values at 25bp resolution, taking the average across the positions. You can drop the last bin if your genome isn't divisible by 25.
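The binning step described above can be sketched roughly as follows. This is a minimal illustration, not Avocado's own preprocessing code. The function names `bin_signal` and `bigwig_chrom_to_npz` are hypothetical, treating missing positions as 0 is an assumption, the arcsinh transform is inferred only from the `.arcsinh.npz` filename mentioned in the question, and storing one array per chromosome under NumPy's default `arr_0` key is also an assumption about the expected layout.

```python
import numpy as np

BIN_SIZE = 25  # 25 bp resolution, as described above


def bin_signal(values, bin_size=BIN_SIZE):
    """Average a per-base signal into fixed-width bins.

    A trailing partial bin (when the length is not divisible by
    bin_size) is dropped.
    """
    # Assumption: positions with no signal (NaN) are treated as 0.
    values = np.nan_to_num(np.asarray(values, dtype=np.float64), nan=0.0)
    n_bins = len(values) // bin_size
    trimmed = values[:n_bins * bin_size]
    return trimmed.reshape(n_bins, bin_size).mean(axis=1)


def bigwig_chrom_to_npz(bigwig_path, chrom, out_path):
    """Hypothetical helper: bin one chromosome of a bigWig and save it."""
    import pyBigWig  # imported here so bin_signal works without pyBigWig

    bw = pyBigWig.open(bigwig_path)
    try:
        length = bw.chroms()[chrom]             # chromosome length in bp
        per_base = bw.values(chrom, 0, length)  # per-base values, NaN where missing
    finally:
        bw.close()

    binned = bin_signal(per_base)
    # Assumption: arcsinh transform, suggested only by the ".arcsinh" filename.
    transformed = np.arcsinh(binned)
    # A positional array is stored under NumPy's default key "arr_0".
    np.savez_compressed(out_path, transformed)
    return transformed
```

For example, `bigwig_chrom_to_npz("liver.H3K27ac.pval.bw", "chr1", "liver.H3K27ac.chr1.arcsinh.npz")` would write a single binned track for one chromosome; the file names here are placeholders.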

Lifting over across species is more challenging because I didn't write clean code for that part. If you have some compute available, I'd actually recommend that you remap the human experiments you think are relevant to the cattle genome. The mapper will automatically take care of all the issues you might run into when using an alignment chain file (which is what I did). LiftOver would probably work as well.

Let me know if you have any other questions.

jmschrei avatar Oct 10 '22 18:10 jmschrei