machine-learning
When we run an analysis, what do we want to get back?
We need to design our results JSON so that we can later visualize the most important results via the results viewer from the UI team. Candidates so far (a sketch of a possible payload follows the list):
- F1 score
- Confusion matrix
- Y hat (predicted labels)
- Prediction scores
- Feature ranking: a list of selected features. For GLM, the F-stat/t-stat and p-values of predictors, plus model goodness of fit
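A minimal sketch of what such a results payload might look like, assuming hypothetical field names and made-up values (the actual schema is still up for discussion with the UI team):

```python
import json

# Hypothetical results payload -- field names and values are placeholders, not a final schema
results = {
    "f1_score": 0.87,
    "confusion_matrix": [[50, 4], [7, 39]],        # rows: true class, columns: predicted class
    "y_hat": [0, 1, 1, 0],                          # predicted labels per sample (truncated example)
    "prediction_scores": [0.12, 0.93, 0.78, 0.35],  # e.g. predicted probabilities per sample
    "feature_ranking": [                            # selected features with GLM statistics
        {"feature": "TP53", "t_stat": 5.2, "p_value": 1.3e-6},
        {"feature": "KRAS", "t_stat": 3.1, "p_value": 2.0e-3},
    ],
    "goodness_of_fit": {"deviance": 123.4},
}

results_json = json.dumps(results)  # text the results viewer could consume
```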
We should probably save the sklearn estimators representing any transformations and the classifier. The sklearn doc recommends pickle for estimator persistence. Pickle is a binary serialization format in Python. @dcgoss, @awm33, and others -- can we store binary files in our database?
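For concreteness, a minimal sketch of pickling a fitted estimator (the toy classifier and data here are stand-ins, not our actual pipeline):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Fit a stand-in classifier; our real pipeline would produce the fitted estimators
X, y = load_iris(return_X_y=True)
clf = LogisticRegression().fit(X, y)

# pickle.dumps returns bytes -- this is the binary blob we would need to store
clf_bytes = pickle.dumps(clf, protocol=4)
restored_clf = pickle.loads(clf_bytes)  # round-trip check
```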
@dhimmel relevant link: https://wiki.postgresql.org/wiki/BinaryFilesInDB#What_is_the_best_way_to_store_the_files_in_the_Database.3F
Python object serialization to base64 encoded text
@dcgoss cool. I think the following solution will work:
```python
import base64
import pickle

# Example payload; in practice this would be a fitted sklearn estimator
payload = ['a', 'list', 2, 'encode']

# Serialize to bytes, then encode as base64 text
byte_pickle = pickle.dumps(payload, protocol=4)
base64_text = base64.b64encode(byte_pickle).decode()
# Save `base64_text` using a text field in the database

# Later: decode the text back to bytes and unpickle
byte_pickle = base64.b64decode(base64_text.encode())
payload = pickle.loads(byte_pickle)
```
FYI `base64_text`, which would be saved to the database, is `gANdcQAoWAEAAABhcQFYBAAAAGxpc3RxAksCWAYAAABlbmNvZGVxA2Uu`.
@dhimmel base64 text is usually fine for small sizes. Can also be stored as text in JSON fields. How big are the binaries? Is `gANdcQAoWAEAAABhcQFYBAAAAGxpc3RxAksCWAYAAABlbmNvZGVxA2Uu` a typical example?
I pickle-->base64-->text converted `best_clf` from the example notebook. The resulting string had 219,788 characters. I assume different types of classifiers will have different sizes.
If I add an extra step to compress, the entire conversion becomes:
```python
import zlib

# Pickle, compress, then encode as base64 text
byte_pickle = pickle.dumps(best_clf, protocol=4)
byte_pickle = zlib.compress(byte_pickle)
base64_text = base64.b64encode(byte_pickle).decode()
```
Then `base64_text` is only 11,468 characters. @awm33, is that okay?
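For completeness, a sketch of the reverse path to restore the classifier from the compressed text (assuming the same `base64_text` produced above):

```python
import base64
import pickle
import zlib

# Reverse the encode path: base64 text -> compressed bytes -> pickle bytes -> object
byte_pickle = zlib.decompress(base64.b64decode(base64_text.encode()))
best_clf = pickle.loads(byte_pickle)
```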
@dhimmel Compressing is a good move. If we think this would go into the tens of megabytes or more, we may want to consider using blob storage such as S3 or GCS. Postgres can handle gigabytes of text, but it's not great for performance.
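If we did go the blob-storage route, a rough sketch with boto3 might look like the following (the bucket and key names are made up; GCS would be analogous):

```python
import pickle
import zlib

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key -- adjust to whatever storage layout we settle on
blob = zlib.compress(pickle.dumps(best_clf, protocol=4))
s3.put_object(Bucket="cognoma-results", Key="classifiers/best_clf.pkl.zlib", Body=blob)

# Later, fetch and restore the classifier
obj = s3.get_object(Bucket="cognoma-results", Key="classifiers/best_clf.pkl.zlib")
best_clf = pickle.loads(zlib.decompress(obj["Body"].read()))
```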