ChEMBL_Structure_Pipeline
Add command line interface
Hi everyone,
as mentioned in #14, I've added a command line interface for standardization from SMILES strings (namely, from input files containing SMILES as their first column). I also added an option to filter compounds using the PAINS filters in RDKit. It might be better to switch that filter off by default, if you think that's more appropriate for this package.
The interface is as follows:
usage: chembl_std [-h] [-s] [-p] [-A] [-B] [-C] [--strict] [--header] [--verbose] [--stderr] INPUT

Sanitize smiles using chembl_structure_pipeline and RDKit PAINS filters

positional arguments:
  INPUT              Input file (with SMILES as first column)

optional arguments:
  -h, --help         show this help message and exit
  -s, --standartize  Whether to perform standardization of input SMILES (default: True)
  -p                 Filter molecules using all PAINS filters together (default: True)
  -A                 Filter molecules using all PAINS_A filter separately (default: False)
  -B                 Filter molecules using all PAINS_B filter separately (default: False)
  -C                 Filter molecules using all PAINS_C filter separately (default: False)
  --strict           Whether to raise an exception on first error (default: False)
  --header           Indicate that the input file contains header (default: False)
  --verbose          Whether to print all RDKit warnings to stdout (default: False)
  --stderr           Whether to print filtered molecules to stderr (default: False)
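
For reviewers who want to see how such an option set maps onto argparse, here is a minimal sketch that rebuilds the interface from the help text. It is not the code in this PR. In particular, it assumes that the two options shown with a default of True (-s and -p) switch their step off when passed, which the help text does not confirm; the function name build_parser is made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the chembl_std option set from its help text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="chembl_std",
        description="Sanitize smiles using chembl_structure_pipeline and RDKit PAINS filters",
        # Appends "(default: ...)" to the help string of each option
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("input", metavar="INPUT",
                        help="Input file (with SMILES as first column)")
    # Assumption: options that default to True are turned off when passed
    parser.add_argument("-s", "--standartize", action="store_false",
                        help="Whether to perform standardization of input SMILES")
    parser.add_argument("-p", action="store_false",
                        help="Filter molecules using all PAINS filters together")
    # One opt-in switch per PAINS family; each defaults to False
    for family in "ABC":
        parser.add_argument(f"-{family}", action="store_true",
                            help=f"Filter molecules using the PAINS_{family} filter separately")
    parser.add_argument("--strict", action="store_true",
                        help="Whether to raise an exception on first error")
    parser.add_argument("--header", action="store_true",
                        help="Indicate that the input file contains header")
    parser.add_argument("--verbose", action="store_true",
                        help="Whether to print all RDKit warnings to stdout")
    parser.add_argument("--stderr", action="store_true",
                        help="Whether to print filtered molecules to stderr")
    return parser
```

With this sketch, build_parser().parse_args(["--header", "test.smi"]) yields a namespace in which header is True and both standartize and p keep their default of True.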
So, in order to filter test.smi, one should do the following:
$ cat test.smi
smiles
c1ccccc1N=Nc1ccccc1
c1ccccc1N
CCO
$ chembl_std --header test.smi
smiles
c1ccccc1N
CCO
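
The read-and-filter loop behind this example can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: the function name split_by_filter is hypothetical, columns are assumed to be whitespace-separated, and the keep predicate stands in for the real RDKit PAINS check. The substring test used in the usage note is a toy stand-in that knows nothing about chemistry.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_filter(
    lines: Iterable[str],
    keep: Callable[[str], bool],
    has_header: bool = False,
) -> Tuple[List[str], List[str]]:
    """Return (passed, rejected) input lines, judged on their first column."""
    passed: List[str] = []
    rejected: List[str] = []
    header_pending = has_header
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if header_pending:
            passed.append(line)  # the header line is echoed unchanged
            header_pending = False
            continue
        smiles = line.split()[0]  # assumption: whitespace-separated columns
        (passed if keep(smiles) else rejected).append(line)
    return passed, rejected
```

For the test.smi input above, calling split_by_filter(lines, lambda s: "N=N" not in s, has_header=True) puts the azobenzene line in the rejected list and keeps the header and the other two lines. The rejected list is what a --stderr style option would print.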
The downside is that it prints a lot of logging messages to stdout, and I could not completely disable them. For example, if I run chembl_std --header test.smi > out.smi, I get:
$ cat out.smi
smiles
c1ccccc1N
CCO
[01:33:17] Initializing Normalizer
The current workaround is to run chembl_std --header test.smi | grep -v Normalizer > out.smi. If anyone knows a better way to handle this, I'd appreciate it.
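
A slightly more general form of the grep workaround, as a minimal sketch: drop every line that starts with a timestamp prefix like the one in the output above. The pattern is an assumption based on that single example line, and the function name drop_log_lines is made up. Silencing RDKit's logger at the source (for instance through RDLogger.DisableLog) may well be cleaner, but I have not verified that it catches this message.

```python
import re
from typing import Iterable, Iterator

# Assumption: log lines begin with an "[HH:MM:SS] " prefix, as in
# "[01:33:17] Initializing Normalizer"
LOG_LINE = re.compile(r"^\[\d{2}:\d{2}:\d{2}\]\s")


def drop_log_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not look like timestamped log output."""
    for line in lines:
        if not LOG_LINE.match(line):
            yield line
```

SMILES that begin with a bracket atom, such as [Na+].[Cl-], do not match the pattern and pass through untouched.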
I think this could be merged.
An -o option to specify where the molecules that pass standardization should be written would be nice:
-o FILENAME
mol_std should be printed out (as SMILES), rather than the SMILES line from the input file that passed standardization. I guess people are interested in the molecules after standardization, rather than in which molecules from the input file passed it.
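
To make the two suggestions concrete, here is a rough sketch of what I have in mind. The function name write_standardized and the standardize callable are placeholders, not existing functions in this package. The callable is assumed to return the standardized SMILES, or None when standardization fails.

```python
import sys
from contextlib import nullcontext
from typing import Callable, Iterable, Optional


def write_standardized(
    smiles_iter: Iterable[str],
    standardize: Callable[[str], Optional[str]],
    out_path: Optional[str] = None,
) -> int:
    """Write the standardized form of each input SMILES; return how many were written."""
    # Use the -o file when one is given, otherwise fall back to stdout
    sink = open(out_path, "w") if out_path else nullcontext(sys.stdout)
    written = 0
    with sink as handle:
        for smiles in smiles_iter:
            result = standardize(smiles)
            if result is None:
                continue  # standardization failed: nothing to emit
            # Emit the standardized molecule, not the original input line
            handle.write(result + "\n")
            written += 1
    return written
```

Writing to a dedicated file would also keep the results apart from any logging that still ends up on stdout.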