ChEMBL_Structure_Pipeline

Add command line interface

Open marinegor opened this issue 3 years ago • 4 comments

Hi everyone,

as mentioned in #14, I've added a command line interface for standardizing SMILES strings (namely, from input files containing SMILES in their first column). I also added an option to filter compounds using the PAINS filters in RDKit, as here. It might be better to switch it off by default, if you think that is more appropriate for this package.

The interface is as follows:

usage: chembl_std [-h] [-s] [-p] [-A] [-B] [-C] [--strict] [--header] [--verbose] [--stderr] INPUT

Sanitize smiles using chembl_structure_pipeline and RDKit PAINS filters

positional arguments:
  INPUT              Input file (with SMILES as first column)

optional arguments:
  -h, --help         show this help message and exit
  -s, --standartize  Whether to perform standardization of input SMILES (default: True)
  -p                 Filter molecules using all PAINS filters together (default: True)
  -A                 Filter molecules using all PAINS_A filter separately (default: False)
  -B                 Filter molecules using all PAINS_B filter separately (default: False)
  -C                 Filter molecules using all PAINS_C filter separately (default: False)
  --strict           Whether to raise an exception on first error (default: False)
  --header           Indicate that the input file contains header (default: False)
  --verbose          Whether to print all RDKit warnings to stdout (default: False)
  --stderr           Whether to print filtered molecules to stderr (default: False)

So in order to filter test.smi, one should do the following:

$ cat test.smi
smiles
c1ccccc1N=Nc1ccccc1
c1ccccc1N
CCO
$ chembl_std --header test.smi
smiles
c1ccccc1N
CCO

The downside is that it prints a lot of logging messages to stdout, and I could not completely disable them. For example, if I do chembl_std --header test.smi > out.smi, I'd get:

$ cat out.smi
smiles
c1ccccc1N
CCO
[01:33:17] Initializing Normalizer

The current workaround is to run chembl_std --header test.smi | grep -v Normalizer > out.smi. If someone knows a better way to handle this, I'd appreciate it.
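
A slightly less fragile variant of the grep workaround would be to drop every line that looks like a timestamped log message, rather than matching on the word Normalizer. This is only a minimal sketch, assuming log lines always start with a bracketed HH:MM:SS timestamp as in the output above; the name `drop_log_lines` is hypothetical and not part of chembl_std.

```python
import re

# Assumption: log lines start with "[HH:MM:SS] ", as in the example output.
# A SMILES such as "[Na+].[Cl-]" also starts with "[", but does not match
# the digits-and-colons pattern, so it is kept.
LOG_LINE = re.compile(r"^\[\d{2}:\d{2}:\d{2}\] ")


def drop_log_lines(lines):
    """Yield only the lines that do not look like timestamped log messages."""
    for line in lines:
        if not LOG_LINE.match(line):
            yield line


# In-memory example using the contents of out.smi shown above.
raw = [
    "smiles\n",
    "c1ccccc1N\n",
    "CCO\n",
    "[01:33:17] Initializing Normalizer\n",
]
cleaned = list(drop_log_lines(raw))
print("".join(cleaned), end="")
```

The same function could be applied to `sys.stdin` in a small filter script placed after the pipe, in place of `grep -v Normalizer`.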

marinegor, Sep 04 '21 22:09

I think this could be merged.

UnixJunkie, Jan 17 '22 03:01

An -o option to specify where the molecules that pass standardization should be written would be nice.

UnixJunkie, Jan 17 '22 03:01

-o FILENAME
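
Such an option could be sketched with argparse roughly as below. This is an illustration only: the parser is a hypothetical subset of chembl_std (just INPUT, --header and the proposed -o), and `build_parser` and `write_results` are made-up names, not code from the PR.

```python
import argparse
import sys


def build_parser():
    # Hypothetical subset of the chembl_std parser, for illustration only.
    parser = argparse.ArgumentParser(prog="chembl_std")
    parser.add_argument("input", metavar="INPUT",
                        help="Input file (with SMILES as first column)")
    parser.add_argument("--header", action="store_true",
                        help="Indicate that the input file contains header")
    # Proposed option: where to write the molecules that pass.
    parser.add_argument("-o", dest="output", metavar="FILENAME", default=None,
                        help="Output file (default: write to stdout)")
    return parser


def write_results(smiles_lines, output=None):
    """Write one SMILES per line to `output`, or to stdout if it is None."""
    text = "".join(f"{smi}\n" for smi in smiles_lines)
    if output is None:
        sys.stdout.write(text)
    else:
        with open(output, "w") as handle:
            handle.write(text)


args = build_parser().parse_args(["--header", "-o", "out.smi", "test.smi"])
print(args.input, args.header, args.output)
```

If the results go to a file given by -o, they would also no longer be mixed with whatever else the tool prints to stdout.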

UnixJunkie, Jan 17 '22 03:01

mol_std should be printed out (in SMILES), rather than the SMILES line from the input file that passed standardization. I guess people are interested in the molecules after standardization, rather than in which molecules from the input file passed standardization.
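
To make the suggestion concrete, here is a minimal sketch of a loop that emits the standardized SMILES instead of echoing the input line. The function name `standardized_smiles` is hypothetical, and `standardize` is a stand-in for whatever the CLI actually calls: any callable that maps a SMILES string to its standardized SMILES, or returns None when the molecule should be dropped.

```python
def standardized_smiles(lines, standardize):
    """Yield the standardized SMILES for every input line that passes.

    `standardize` is a placeholder callable: SMILES in, standardized
    SMILES out, or None if the molecule fails or is filtered out.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        result = standardize(fields[0])  # SMILES is the first column
        if result is not None:
            # Emit the standardized form, not the original input line.
            yield result


# Toy stand-in for a real standardizer, only to show the data flow.
def toy_standardize(smiles):
    return None if smiles == "bad" else smiles.strip()


for smi in standardized_smiles(["CCO name1\n", "bad name2\n"], toy_standardize):
    print(smi)
```

In the real tool the callable would presumably wrap RDKit parsing (`Chem.MolFromSmiles`), `standardizer.standardize_mol` from chembl_structure_pipeline, and `Chem.MolToSmiles`, but how the PR wires these together is not shown here.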

UnixJunkie, Jan 17 '22 03:01