After adding new cell types, the size of the new model file is smaller than that of the corresponding ENCODE model file

Open zhangyongzlm opened this issue 5 years ago • 1 comments

I want to add my own NPC cell types (e.g., C15, C17, C17C ... X2117) to the existing models. After training, I found that the new model file is smaller than the corresponding ENCODE model file. [image] [image] I also loaded the newly generated model file and confirmed that the NPC cell types were indeed added to the model. [image]

The following is my code for training the model.

import os, sys

os.environ["THEANO_FLAGS"] = "device=cuda0"
import matplotlib.pyplot as plt
import seaborn

seaborn.set_style("whitegrid")
import itertools
import numpy

numpy.random.seed(0)
from avocado import Avocado

import pandas as pd
import argparse
import math


parser = argparse.ArgumentParser(description="Train a new model")
parser.add_argument(
    "chrom", type=str, help="Specify the chromosome that training is performed in"
)
parser.add_argument(
    "--chromSize",
    action="store",
    dest="chromSize",
    type=str,
    default="./hg38.chrom.sizes",
    help="The file storing the chrom size information",
)
parser.add_argument(
    "--batchsize",
    action="store",
    dest="batchsize",
    type=int,
    default=40000,
    help="Batch size for neural network predictions.",
)
args = parser.parse_args()

chrom_size = pd.read_table(args.chromSize, sep="\t", names=["chr", "size"])
chrom_size.set_index(["chr"], inplace=True)

celltypes = [
    "C15",
    "C17",
    "C17C",
    "C666-1",
    "NP460",
    "NP460_EBV",
    "NP69",
    "NP69_EBV",
    "NPC23",
    "NPC32",
    "NPC43",
    "NPC43noEBV",
    "NPC53",
    "NPC76",
    "X2117",
]
assays = [
    "ChIP-seq_H3K27ac_signal_p-value",
    "ChIP-seq_H3K4me1_signal_p-value",
    "ChIP-seq_H3K4me3_signal_p-value",
]

data = {}
for celltype, assay in itertools.product(celltypes, assays):
    mark = assay.split("_")[1]  # histone mark, e.g. "H3K27ac"
    filename = "./signals/{0}/{1}/{0}.{1}.pval.signal.bw.{2}.npz".format(
        celltype, mark, args.chrom
    )
    print(filename)
    data[(celltype, assay)] = numpy.load(filename)[args.chrom]

model = Avocado.load("./avocado/.encode2018core-model/avocado-" + args.chrom)
size = chrom_size.loc[args.chrom]["size"]
model.fit_celltypes(data, epoch_size=math.ceil(size / args.batchsize), n_epochs=200)

model.save("./model/NPC_" + args.chrom)

zhangyongzlm · Sep 11 '20 02:09

That's weird, but I'm not sure it means there's a problem. You may have a higher compression level set for HDF5 files than I did. Can you still make predictions, and does everything else work fine?

jmschrei · Sep 21 '20 17:09
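
One way to check the compression hypothesis from the reply above is to list every dataset in the two model files and compare their shapes, compression settings, and on-disk sizes. The following is a minimal sketch, not part of Avocado: it assumes the saved models are ordinary HDF5 files readable with `h5py`, and the helper names `summarize_hdf5` and `compare_hdf5` are hypothetical.

```python
import h5py


def summarize_hdf5(path):
    """Return {dataset_name: info} for every dataset stored in an HDF5 file."""
    summary = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            summary[name] = {
                "shape": obj.shape,
                "dtype": str(obj.dtype),
                "compression": obj.compression,  # None, "gzip", "lzf", ...
                "compression_opts": obj.compression_opts,  # e.g. gzip level
                "stored_bytes": obj.id.get_storage_size(),  # bytes on disk
            }

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return summary


def compare_hdf5(path_a, path_b):
    """Print and return the names of datasets that differ between two files.

    A dataset counts as different if it is missing from one file or if its
    shape, dtype, compression settings, or stored size do not match.
    """
    a, b = summarize_hdf5(path_a), summarize_hdf5(path_b)
    differing = []
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            differing.append(name)
            print(name)
            print("  A:", a.get(name))
            print("  B:", b.get(name))
    return differing


# Usage (paths are placeholders for the ENCODE model file and the new one):
# compare_hdf5("<encode-model>.h5", "<new-model>.h5")
```

If the same datasets appear in both files with matching shapes but a smaller `stored_bytes` in the new file, the difference comes from compression. If some datasets are missing from one file entirely, the size gap comes from what was saved rather than how it was compressed.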