keras-molecules

The unnatural encoding of current implementation

Open hsiaoyi0504 opened this issue 8 years ago • 4 comments

After testing, I found that the procedure for building the list of unique characters used in the dataset (the "charset") is weird. The current encoding makes the resulting output much more fragile, because it doesn't prevent Cl from being interpreted as "C" and "l". We should treat 'Cl' as a single token rather than as separate 'C' and 'l' characters. It is chemically unreasonable to see 'l' alone.
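A minimal sketch of what token-level parsing could look like, assuming a regex-based tokenizer. The names `SMILES_TOKEN`, `tokenize`, and `build_charset` are hypothetical and not part of keras-molecules. It relies on the fact that, outside square brackets, `Cl` and `Br` are the only two-letter atom symbols in SMILES:

```python
import re

# Hypothetical token pattern (an assumption, not the repo's code), tried left to right:
#   \[[^\]]+\]  a whole bracket atom such as [NH4+]
#   Br|Cl       the two-letter halogens, kept as single tokens
#   %\d{2}      two-digit ring-closure labels such as %10
#   .           any other single character (atoms, bonds, digits, parentheses)
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens, keeping 'Cl' and 'Br' whole."""
    return SMILES_TOKEN.findall(smiles)


def build_charset(smiles_list):
    """Collect the sorted set of tokens (not raw characters) seen in the data."""
    return sorted({tok for s in smiles_list for tok in tokenize(s)})


print(tokenize("CCl"))                    # ['C', 'Cl']
print(build_charset(["ClCCl", "CCO"]))    # ['C', 'Cl', 'O']
```

With this kind of tokenizer a lone 'l' can never enter the charset, so the decoder cannot emit it on its own.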

hsiaoyi0504 avatar Jan 04 '17 13:01 hsiaoyi0504

Has this problem ever been addressed? Apart from this, the charsets are not stable between different training datasets, which yields incompatible models.
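One way to keep the encoding stable across datasets is to fix the vocabulary up front instead of deriving it from each dataset. The sketch below is only an illustration: `FIXED_CHARSET`, `TOKEN_TO_INDEX`, and `encode` are hypothetical names, the vocabulary is deliberately tiny, and padding with a space token is an assumption.

```python
# Hypothetical fixed vocabulary, defined once and shared by every dataset and model.
# A real one would need to cover every token the models are expected to see.
FIXED_CHARSET = (" ", "#", "(", ")", "1", "2", "=", "Br", "C", "Cl", "N", "O", "c", "n")
TOKEN_TO_INDEX = {tok: i for i, tok in enumerate(FIXED_CHARSET)}


def encode(tokens, max_len=10):
    """Map a token list to fixed-length integer indices, padding with ' '."""
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")
    unknown = [t for t in tokens if t not in TOKEN_TO_INDEX]
    if unknown:
        # Fail loudly instead of silently growing the charset.
        raise ValueError("tokens outside the fixed charset: %r" % unknown)
    padded = list(tokens) + [" "] * (max_len - len(tokens))
    return [TOKEN_TO_INDEX[t] for t in padded]


print(encode(["C", "Cl"], max_len=4))  # [8, 9, 0, 0]
```

Because the index of each token no longer depends on which molecules happen to be in the training set, models trained on different datasets share the same input and output layout.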

grayfall avatar Mar 28 '17 13:03 grayfall

I suggest checking out the paper and repo I cite in #62. It also has pretrained models if you need that.

pechersky avatar Mar 28 '17 14:03 pechersky

@pechersky do you accept pull requests? I've made some improvements to your preprocessing routine and the CLI. Most importantly, I changed the parsing scheme to address the issues mentioned here.

grayfall avatar Mar 30 '17 12:03 grayfall

Yeah, go ahead and make a PR.

pechersky avatar Mar 30 '17 13:03 pechersky