
When we run an analysis, what do we want to get back?

Open cgreene opened this issue 8 years ago • 11 comments

We need to design our results json so that we can later visualize the most important results via the results viewer from the UI team.

cgreene avatar Jul 19 '16 23:07 cgreene

F1 Score

autokad avatar Jul 19 '16 23:07 autokad

Confusion Matrix

autokad avatar Jul 19 '16 23:07 autokad

Y Hat

autokad avatar Jul 19 '16 23:07 autokad

prediction scores

cgreene avatar Jul 19 '16 23:07 cgreene

Feature ranking, a list of selected features. For GLM, F-stat/t-stat and p-values of predictors, model goodness of fit
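Taken together, the suggestions in this thread could be collected into a single results JSON along these lines. Every field name below is hypothetical, meant only to illustrate one possible shape, not a settled schema:

```python
import json

# Hypothetical results payload; all key names are illustrative
results = {
    'scores': {'f1': 0.82},                     # e.g. F1 score
    'confusion_matrix': [[50, 10], [5, 35]],    # rows: true class, cols: predicted
    'predictions': [0, 1, 1, 0],                # y-hat, one entry per sample
    'prediction_scores': [0.1, 0.9, 0.7, 0.3],  # per-sample probabilities
    'selected_features': ['feat_a', 'feat_b'],  # feature ranking / selection
}

# JSON round-trips cleanly, so the viewer can consume it directly
payload = json.dumps(results)
assert json.loads(payload) == results
```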

yl565 avatar Jul 20 '16 01:07 yl565

We should probably save the sklearn estimators representing any transformations and the classifier. The sklearn doc recommends pickle for estimator persistence. Pickle is a binary serialization format in Python. @dcgoss, @awm33, and others -- can we store binary files in our database?
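As a sketch of what estimator persistence would look like, assuming a small fitted classifier (the data and model below are placeholders, not the project's actual pipeline):

```python
import base64
import pickle

from sklearn.linear_model import LogisticRegression

# Placeholder data and estimator; any fitted sklearn estimator works the same way
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)

# Serialize the fitted estimator to bytes, then to text for storage
blob = pickle.dumps(clf, protocol=4)
text = base64.b64encode(blob).decode()

# Restore and reuse the estimator
restored = pickle.loads(base64.b64decode(text.encode()))
assert restored.classes_.tolist() == [0, 1]
```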

dhimmel avatar Jul 27 '16 17:07 dhimmel

@dhimmel relevant link: https://wiki.postgresql.org/wiki/BinaryFilesInDB#What_is_the_best_way_to_store_the_files_in_the_Database.3F

dcgoss avatar Jul 28 '16 01:07 dcgoss

Python object serialization to base64-encoded text

@dcgoss cool. I think the following solution will work:

import base64
import pickle

payload = ['a', 'list', 2, 'encode']
byte_pickle = pickle.dumps(payload, protocol=4)
base64_text = base64.b64encode(byte_pickle).decode()
# Save `base64_text` using a text field in the database

# Later, restore the object from the stored text
byte_pickle = base64.b64decode(base64_text.encode())
restored = pickle.loads(byte_pickle)
assert restored == payload

FYI base64_text, which would be saved in the database, is gANdcQAoWAEAAABhcQFYBAAAAGxpc3RxAksCWAYAAABlbmNvZGVxA2Uu.

dhimmel avatar Jul 28 '16 01:07 dhimmel

@dhimmel base64 text is usually fine for small sizes. Can also be stored as text in JSON fields. How big are the binaries? Is gANdcQAoWAEAAABhcQFYBAAAAGxpc3RxAksCWAYAAABlbmNvZGVxA2Uu a typical example?

awm33 avatar Jul 28 '16 02:07 awm33

I converted best_clf from the example notebook through the pickle --> base64 --> text pipeline. The resulting string had 219,788 characters. I assume different types of classifiers will have different sizes.

If I add an extra compression step, the entire pipeline becomes:

import base64
import pickle
import zlib

byte_pickle = pickle.dumps(best_clf, protocol=4)
byte_pickle = zlib.compress(byte_pickle)
base64_text = base64.b64encode(byte_pickle).decode()

Then base64_text is only 11,468 characters. @awm33, is that okay?
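For completeness, the reverse path would decode, decompress, then unpickle. A sketch using a stand-in object in place of best_clf:

```python
import base64
import pickle
import zlib

# Stand-in for a fitted classifier such as best_clf
obj = {'model': 'example', 'coefficients': [0.5, -1.2]}

# Store: pickle -> zlib compress -> base64 text
byte_pickle = zlib.compress(pickle.dumps(obj, protocol=4))
base64_text = base64.b64encode(byte_pickle).decode()

# Retrieve: base64 text -> zlib decompress -> unpickle
restored = pickle.loads(zlib.decompress(base64.b64decode(base64_text.encode())))
assert restored == obj
```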

dhimmel avatar Jul 28 '16 14:07 dhimmel

@dhimmel Compressing is a good move. If we think this would go into the tens of megabytes or more, we may want to consider using blob storage such as S3 or GCS. Postgres can handle gigabytes of text, but it's not great for performance.

awm33 avatar Jul 28 '16 14:07 awm33