Yakov Pechersky

60 comments by Yakov Pechersky

I am able to load the weights without issue, after freshly downloading both the smiles_500 and model_500 h5 files. Can you do me a favor, and run the following: ```...

As far as I can tell, the model_500k.h5 that is in the data is older than the current preprocess code. I'd suggest trying `sample_gen.py` directly from the smiles datafiles. I'd...

Would we be alright with switching to generator-based training? That gets rid of the need to preprocess and the need to compress as well. On Mon, Nov 14, 2016...

@dakoner can you provide a link to the 50M GDB-17 dataset you're using?

The following branch should be able to train using a stream-based approach, requiring way less RAM. It also provides a solution for issue #39. Please test it out -- you'll...

You might have gotten the epoch warning if your batch_size doesn't cleanly divide epoch_size. Thanks for your comments here and on the commit. Could you share the command that you...
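The divisibility point above can be checked directly. This is a minimal sketch with hypothetical numbers (the variable names and values are illustrative, not from the repo): Keras warns when the samples per epoch are not a whole number of batches, because the final batch comes up short.

```python
# Hypothetical values: check whether batch_size cleanly divides the epoch size.
samples_per_epoch = 28000
batch_size = 300

remainder = samples_per_epoch % batch_size
if remainder:
    print(f"epoch warning likely: last batch would hold only {remainder} samples")
else:
    print("batch_size cleanly divides samples_per_epoch; no warning expected")
```

Here 28000 % 300 == 100, so the last batch would hold 100 samples and the warning fires; picking a batch size such as 280 avoids it.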

@dakoner There was a bug in encoding, it wasn't properly encoding padded words. I've also fixed the bugs you've pointed out. Now `train_gen` quickly reaches >60% accuracy within the first...

The sampling is with replacement, so any epoch size can be used. I chose "with replacement" so that the generator carries as little state as possible. For some...
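A stateless sampling-with-replacement generator can be sketched as follows. This is an illustration only, not the repo's `SmilesDataGenerator` (which also vectorizes each SMILES string); the function name and data are made up.

```python
import random

def smiles_batch_generator(smiles_list, batch_size, seed=None):
    """Yield batches drawn with replacement.

    Because every draw is independent, the generator needs no shuffle
    order or epoch index, and any epoch size is valid.
    """
    rng = random.Random(seed)
    while True:
        yield [rng.choice(smiles_list) for _ in range(batch_size)]

# Usage: each next() call returns an independent batch of 4 strings.
gen = smiles_batch_generator(["CCO", "c1ccccc1", "CC(=O)O"], batch_size=4)
batch = next(gen)
```

The trade-off is that a given molecule may appear more than once per epoch while another is skipped, but in expectation the data is covered evenly.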

In the `molecules.vectorizer`, `SmilesDataGenerator` takes a `test_split` optional parameter that creates the "index point" you mentioned. By default, it is `0.20`, so 4/5 of the data is used for training,...

I should add that if you are training on 35K, and you assume the default `test_split=0.20`, then your "true" effective training set size is 28K. That's the epoch size you'll...
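The arithmetic above is just the train fraction applied to the dataset size; a quick sketch (the numbers mirror the comment, the variable names are mine):

```python
# Effective training-set size under the default test_split=0.20:
# 1/5 of the data is held out, so 4/5 remains for training.
n_total = 35_000
test_split = 0.20

n_train = round(n_total * (1 - test_split))
n_test = n_total - n_train
print(n_train, n_test)  # 28000 7000
```

So with 35K molecules the "true" epoch size for training is 28K, and that is the number the batch size should divide cleanly.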