
Mem optimization of sumstat standardization

Open hsun3163 opened this issue 2 years ago • 4 comments

This ticket is dedicated to problem 8 in #412, and records potential optimization options.


  1. Reduce the passing of unneeded data. At the moment, full rows of the query table are passed into the compare_snps function, although that function does not actually use most of that information. So perhaps changing
def snps_match_dup(query,subject,keep_ambiguous=True):
    pm = compare_snps(query,subject)
    if not keep_ambiguous:
        pm = pm[~pm.ambiguous]
    new_subject = subject.loc[pm.sidx]
    #update beta and snp info
    new_query = pd.concat([new_subject.iloc[:,:5],query.loc[pm.qidx].iloc[:,5:]],axis=1)
    new_query.loc[list(pm.flip) , "STAT"] = -new_query.STAT[list(pm.flip)]
    return new_query, new_subject

into

def snps_match_dup(query,subject,keep_ambiguous=True):
    pm = compare_snps(query.iloc[:,0:5],subject)
    if not keep_ambiguous:
        pm = pm[~pm.ambiguous]
    new_subject = subject.loc[pm.sidx]
    #update beta and snp info
    new_query = pd.concat([new_subject.iloc[:,:5],query.loc[pm.qidx].iloc[:,5:]],axis=1)
    new_query.loc[list(pm.flip) , "STAT"] = -new_query.STAT[list(pm.flip)]
    return new_query, new_subject

can save us some memory.
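To get a rough sense of how much data the change would keep out of the comparison step, the footprint of the full query table can be compared with that of its first five columns. This is only a sketch: the table below is synthetic, the column names other than STAT are made up for illustration, and the real saving depends on whether compare_snps copies its input internally, which this ticket does not show.

```python
import numpy as np
import pandas as pd


def frame_bytes(df: pd.DataFrame) -> int:
    """Total bytes held by a DataFrame, counting string contents too."""
    return int(df.memory_usage(index=True, deep=True).sum())


# Synthetic stand-in for a sumstat table: five SNP-info columns first,
# then the statistics. All names except STAT are hypothetical.
n = 10_000
rng = np.random.default_rng(0)
query = pd.DataFrame({
    "CHR": np.ones(n, dtype=np.int64),
    "POS": np.arange(n, dtype=np.int64),
    "A0": rng.choice(list("ACGT"), n),
    "A1": rng.choice(list("ACGT"), n),
    "SNP": [f"rs{i}" for i in range(n)],
    "STAT": rng.normal(size=n),
    "SE": rng.random(n),
    "P": rng.random(n),
})

# What the proposed call would hand over: only the SNP-info columns.
snp_info = query.iloc[:, 0:5]

full_bytes = frame_bytes(query)
info_bytes = frame_bytes(snp_info)
print(f"full table: {full_bytes} bytes, first five columns: {info_bytes} bytes")
```

On a real table with many statistic columns the gap would be wider, so running the same measurement on an actual input file is probably the quickest way to judge whether this option is worth it.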

hsun3163, Oct 03 '22 20:10