deepsmiles
Consider compressing parentheses
It has been suggested that replacing a run of multiple close parentheses with a number plus a single parenthesis would be a good compression strategy. It certainly shortens the string. What I don't know is whether it would make the string easier for an ML method to use, learn, or generate. But I guess I can add an option to control this.
In the meantime, maybe I can provide a piece of Python code that does the transformation for anyone interested.
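A minimal sketch of such a transformation, not part of deepsmiles itself. It assumes a hypothetical notation `{n})` for a run of n close parentheses. The braces are an illustrative choice so that the count cannot be confused with a ring-size digit. It also assumes that ring sizes above 99 are written as `%(N)`, so those tokens are skipped. The names `compress_parens` and `expand_parens` are made up for this example.

```python
import re

# Matches either a "%(N)" ring-size token (assumed syntax, left untouched)
# or a run of two or more close parentheses.
_RUN = re.compile(r"%\(\d+\)|\){2,}")
# Matches the hypothetical compressed form, e.g. "{6})".
_PACKED = re.compile(r"\{(\d+)\}\)")


def compress_parens(s: str) -> str:
    """Replace each run of n >= 2 ')' with the hypothetical token '{n})'."""
    def repl(m):
        tok = m.group(0)
        if tok.startswith("%"):
            return tok  # ring-size token such as "%(100)": keep as-is
        return "{%d})" % len(tok)
    return _RUN.sub(repl, s)


def expand_parens(s: str) -> str:
    """Inverse of compress_parens: turn '{n})' back into n ')' characters."""
    return _PACKED.sub(lambda m: ")" * int(m.group(1)), s)
```

For example, `compress_parens("Bcccccc6))))))O)O")` gives `Bcccccc6{6})O)O`, and `expand_parens` restores the original string. Single close parentheses are left alone, since replacing them would only make the string longer.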
Or, going the other way, use "%%%%" instead of "%3". "CCCC%" would be "CC1CC1", "CCCC%%" would be "C1CCC1", etc.
As Noel writes, we just need some way to evaluate which is more effective.
I agree that the runs of multiple consecutive parentheses are the only weird thing in the proposed syntax. If ML can "understand" ring size, I guess it could also understand "branch length".