e3fp HDF5 writing errors due to previously generated conformers

When conformer/generate.py is run on a folder containing previously generated conformers, the following error appears when writing values to the file specified by --values_file:

2017-05-19 13:05:22,864|ERROR|Problem writing values to /netapp/home/ali/projects/e3fp/confgen/hdf5/test.hdf5.
Traceback (most recent call last):
  File "/netapp/home/ali/e3fp/e3fp/conformer/generate.py", line 166, in values_to_hdf5
    target_conformers, indices, energies, rmsd) = values
TypeError: 'bool' object is not iterable

(This error does not occur if the -O overwrite option is specified.)

generate.py should probably load in the values for the previously generated conformers so that they can be written to the value file.
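
For reference, a minimal sketch of the failure mode and a possible guard. It assumes the generation step hands back a plain boolean instead of a values tuple when the conformers already exist; that is inferred from the traceback, not checked against the e3fp source, and `unpack_values` is a hypothetical helper name.

```python
import logging


def unpack_values(values):
    """Return the values as a tuple, or None when there is nothing to write.

    Assumption: a bool stands in for "conformers already existed, so no
    values were computed". This is a guess based on the traceback.
    """
    if isinstance(values, bool):
        # Not a failure, so log at INFO rather than ERROR.
        logging.info("No new values to write; conformers already exist.")
        return None
    return tuple(values)


# Unpacking a bool directly raises the same kind of TypeError as in the log.
try:
    first, second = False
    raised = False
except TypeError:
    raised = True
```

With a guard like this, the caller can skip the write step when `None` comes back instead of failing inside the tuple unpacking.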

May 19 '17 20:05 8li

I agree, we should have a more informative error, but loading the previously generated conformers probably isn't a good idea. The main purpose of the values file is to write energies, but I've noticed that even when using the same forcefield in RDKit to recompute energies as was used to compute them in the first place, the values aren't the same. I'm not certain why this would be; perhaps it is an RDKit implementation detail. Combining originally computed energies with recomputed energies could thus produce misleading results.

May 20 '17 07:05 sethaxen

We discussed this today & suggest introducing a flag where:

1. there is no recomputing of energies (agreed, that's odd);
2. wherever conformer + originally computed energies exist, keep them & load that all in;
3. overwrite/ignore existing conformers that have no original energies;
4. write a final appended values file of original energies + resumed molecules.
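
A rough sketch of the bookkeeping this would need, assuming molecules are identified by name and the originally computed energies have already been read into a dict. `plan_resume` and its arguments are hypothetical and are not part of generate.py.

```python
def plan_resume(conformer_names, stored_energies):
    """Split existing conformers into those to keep and those to regenerate.

    conformer_names: names of molecules that already have conformer files.
    stored_energies: dict mapping molecule name to its originally computed
        energies (hypothetical layout, assumed for illustration only).
    """
    # Keep molecules whose original energies are available; nothing is recomputed.
    keep = {
        name: stored_energies[name]
        for name in conformer_names
        if name in stored_energies
    }
    # Conformers without original energies are regenerated (overwritten).
    regenerate = [name for name in conformer_names if name not in stored_energies]
    return keep, regenerate
```

The final values file would then hold the `keep` entries plus whatever is produced for the `regenerate` list and for any molecules not yet processed.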

May 23 '17 20:05 mjke

So, if the values_file exists and the -O overwrite option is not specified, then generate.py will write to the file in append mode.

The error in the first comment ('bool' object is not iterable) still comes up every time the conformers already exist. This behavior in itself isn't too problematic, but it is also not an error, so the logging messages should reflect that.

Since values_file is written to in chunks, an issue comes up when a run ends after conformers for 1175 molecules have been generated but values for only 1100 molecules have been written to the values_file. If I restart the run to process additional molecules, generate.py should identify the missing molecules using the values_file (which is in line with suggestion (2) from @mjke). I don't think a separate flag is necessary, though; specifying values_file and/or -O should be enough to specify the desired behavior.
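
To make the gap concrete, here is a small sketch. It assumes values are only flushed in full chunks and uses a chunk size of 100, which would be consistent with the 1175 vs. 1100 numbers above but is only a guess. The function names and the `mol%d` naming are placeholders.

```python
def written_count(n_processed, chunk_size):
    """Molecules flushed to the values file if only full chunks are written."""
    return (n_processed // chunk_size) * chunk_size


def missing_from_values(conformer_names, values_names):
    """Names that have conformers on disk but no entry in the values file."""
    have = set(values_names)
    return [name for name in conformer_names if name not in have]


# Placeholder molecule names; 100 is an assumed chunk size.
names = ["mol%d" % i for i in range(1175)]
flushed = written_count(len(names), chunk_size=100)
missing = missing_from_values(names, names[:flushed])
```

On restart, the `missing` list is what generate.py would need to handle before moving on to molecules it has not seen yet.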

May 25 '17 00:05 8li

Somewhat related to this issue, conformer energies are now stored in SDF files (see d9c3b1a). HDF5 file capabilities are still present for when one wants all energies stored in a single place, along with RMSDs, etc.

Jun 01 '17 01:06 sethaxen