[BUG] ctd (conjoint triad descriptors) features are not normalised and vary on each run.

Open MattElt opened this issue 1 year ago • 0 comments

Hello.

I've found this library very useful, but I recently noticed a bug. Calculating the ctd features for the same list of sequences gives different results on each run. At first I thought the column names (ctd_desc) were simply being assigned in a random order, but even when I ignore the column headers and try to match up columns by their contents, the data (ctd_arr) is not identical between runs.

Also, according to the referenced paper describing the ctd calculation, the output is supposed to be normalised, i.e. between 0 and 1. The output from ctd is instead given as integers.

import pandas as pd
import protlearn.features as ftr

seqs = list(df[protein_sequence_column_name])
ctd_arr, ctd_desc = ftr.ctd(seqs)
df = pd.DataFrame(data=ctd_arr, columns=ctd_desc)
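To make the run-to-run difference easier to demonstrate, here is a minimal sketch of a repeatability check. The helper names `check_reproducible` and `_sorted_columns` are hypothetical and not part of protlearn; the only assumption about the function under test is that it returns an `(array, descriptors)` pair, as `ftr.ctd` does in the snippet above.

```python
import numpy as np


def _sorted_columns(arr):
    """Return the columns of a 2-D array as a sorted list of tuples.

    Sorting removes any dependence on column order, so two arrays that hold
    the same columns in a different order compare equal.
    """
    return sorted(tuple(col) for col in arr.T.tolist())


def check_reproducible(featurizer, seqs, n_runs=3):
    """Call featurizer(seqs) n_runs times and compare each run to the first.

    `featurizer` is any callable returning (array, descriptors); this helper
    is hypothetical and only illustrates the comparison.
    """
    ref_arr, ref_desc = featurizer(seqs)
    ref_arr = np.asarray(ref_arr)
    ref_desc = list(ref_desc)

    report = {
        "same_columns": True,           # descriptor order identical
        "same_values": True,            # array identical element by element
        "same_values_any_order": True,  # same columns, ignoring their order
    }
    for _ in range(n_runs - 1):
        arr, desc = featurizer(seqs)
        arr = np.asarray(arr)
        if list(desc) != ref_desc:
            report["same_columns"] = False
        if arr.shape != ref_arr.shape or not np.array_equal(arr, ref_arr):
            report["same_values"] = False
        if _sorted_columns(arr) != _sorted_columns(ref_arr):
            report["same_values_any_order"] = False
    return report


# Intended usage with the snippet above (requires protlearn):
# print(check_reproducible(ftr.ctd, seqs))
```

If only the column order were shuffled, `same_values_any_order` would stay `True` while the other two flags turn `False`. All three being `False` would match the behaviour described in this issue.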

It looks like there is some error in the implementation of the ctd function.
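For comparison, below is a minimal sketch of a deterministic, normalised conjoint triad calculation. It is not protlearn's implementation. It assumes the seven-class amino acid grouping and the scaling d_i = (f_i - min(f)) / max(f) that are commonly attributed to the conjoint triad method; both should be checked against the paper protlearn references. The function name `conjoint_triad` and the choice to skip windows containing non-standard residues are also assumptions made for illustration.

```python
from itertools import product

import numpy as np

# Assumed seven-class grouping of the 20 standard amino acids.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {
    aa: str(i) for i, group in enumerate(CLASSES, start=1) for aa in group
}

# Fixed, lexicographically ordered list of the 7**3 = 343 triads, so the
# column order is the same on every run.
TRIADS = ["".join(t) for t in product("1234567", repeat=3)]
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def conjoint_triad(seqs):
    """Return (array, descriptors) of normalised triad frequencies.

    Hypothetical reference sketch: each row is scaled with
    (f - min(f)) / max(f), so all values lie between 0 and 1.
    """
    arr = np.zeros((len(seqs), len(TRIADS)), dtype=float)
    for row, seq in enumerate(seqs):
        encoded = [AA_TO_CLASS.get(aa) for aa in seq.upper()]
        for i in range(len(encoded) - 2):
            window = encoded[i:i + 3]
            if None in window:
                continue  # assumption: skip windows with non-standard residues
            arr[row, TRIAD_INDEX["".join(window)]] += 1
        f_max = arr[row].max()
        if f_max > 0:  # leave all-zero rows untouched to avoid dividing by zero
            arr[row] = (arr[row] - arr[row].min()) / f_max
    return arr, list(TRIADS)


if __name__ == "__main__":
    values, names = conjoint_triad(["AGVRK", "AAAA"])
    print(values.shape, names[:3])
```

Because the triad list is built once in a fixed order and nothing depends on set or dictionary iteration over the input, repeated calls on the same sequences return identical arrays.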

Versions: python 3.11.3, protlearn 0.0.3, pandas 2.0.3

Jan 30 '24 11:01 MattElt