machine-learning
When we run an analysis, what do we want to get back?
We need to design our results JSON so that we can later visualize the most important results via the results viewer from the UI team. Candidates so far (a sketch of a possible payload follows the list):
- F1 score
- Confusion matrix
- Y hat (predicted labels)
- Prediction scores
- Feature ranking: a list of selected features. For GLM, the F-stat/t-stat and p-values of predictors, plus model goodness of fit
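A minimal sketch of what such a results payload might look like, assuming hypothetical field names and made-up values (the actual schema is still up for discussion with the UI team):

```python
import json

# Hypothetical results payload -- field names and values are placeholders, not a final schema
results = {
    "f1_score": 0.87,
    "confusion_matrix": [[50, 4], [7, 39]],        # rows: true class, columns: predicted class
    "y_hat": [0, 1, 1, 0],                          # predicted labels per sample (truncated example)
    "prediction_scores": [0.12, 0.93, 0.78, 0.35],  # e.g. predicted probabilities per sample
    "feature_ranking": [                            # selected features with GLM statistics
        {"feature": "TP53", "t_stat": 5.2, "p_value": 1.3e-6},
        {"feature": "KRAS", "t_stat": 3.1, "p_value": 2.0e-3},
    ],
    "goodness_of_fit": {"deviance": 123.4},
}

results_json = json.dumps(results)  # text the results viewer could consume
```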
We should probably save the sklearn estimators representing any transformations and the classifier. The sklearn doc recommends pickle for estimator persistence. Pickle is a binary serialization format in Python. @dcgoss, @awm33, and others -- can we store binary files in our database?
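For concreteness, a minimal sketch of pickling a fitted estimator (the toy classifier and data here are stand-ins, not our actual pipeline):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Fit a stand-in classifier; our real pipeline would produce the fitted estimators
X, y = load_iris(return_X_y=True)
clf = LogisticRegression().fit(X, y)

# pickle.dumps returns bytes -- this is the binary blob we would need to store
clf_bytes = pickle.dumps(clf, protocol=4)
restored_clf = pickle.loads(clf_bytes)  # round-trip check
```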
@dhimmel relevant link: https://wiki.postgresql.org/wiki/BinaryFilesInDB#What_is_the_best_way_to_store_the_files_in_the_Database.3F
Python object serialization to base64 encoded text
@dcgoss cool. I think the following solution will work:
```python
import base64
import pickle

# Example payload; in practice this would be a fitted sklearn estimator
payload = ['a', 'list', 2, 'encode']

# Serialize to bytes, then encode as base64 text
byte_pickle = pickle.dumps(payload, protocol=4)
base64_text = base64.b64encode(byte_pickle).decode()
# Save `base64_text` using a text field in the database

# Later: decode the text back to bytes and unpickle
byte_pickle = base64.b64decode(base64_text.encode())
payload = pickle.loads(byte_pickle)
```
FYI `base64_text`, which would be saved to the database, is `gANdcQAoWAEAAABhcQFYBAAAAGxpc3RxAksCWAYAAABlbmNvZGVxA2Uu`.
@dhimmel base64 text is usually fine for small sizes. Can also be stored as text in JSON fields. How big are the binaries? Is `gANdcQAoWAEAAABhcQFYBAAAAGxpc3RxAksCWAYAAABlbmNvZGVxA2Uu` a typical example?
I pickle-->base64-->text converted `best_clf` from the example notebook. The resulting string had 219,788 characters. I assume different types of classifiers will have different sizes.
If I add an extra step to compress, the entire conversion becomes:
```python
import zlib

# Pickle, compress, then encode as base64 text
byte_pickle = pickle.dumps(best_clf, protocol=4)
byte_pickle = zlib.compress(byte_pickle)
base64_text = base64.b64encode(byte_pickle).decode()
```
Then `base64_text` is only 11,468 characters. @awm33, is that okay?
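For completeness, a sketch of the reverse path to restore the classifier from the compressed text (assuming the same `base64_text` produced above):

```python
import base64
import pickle
import zlib

# Reverse the encode path: base64 text -> compressed bytes -> pickle bytes -> object
byte_pickle = zlib.decompress(base64.b64decode(base64_text.encode()))
best_clf = pickle.loads(byte_pickle)
```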
@dhimmel Compressing is a good move. If we think this would go into the tens of megabytes or more, we may want to consider using blob storage such as S3 or GCS. Postgres can handle gigabytes of text, but it's not great for performance.
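If we did go the blob-storage route, a rough sketch with boto3 might look like the following (the bucket and key names are made up; GCS would be analogous):

```python
import pickle
import zlib

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key -- adjust to whatever storage layout we settle on
blob = zlib.compress(pickle.dumps(best_clf, protocol=4))
s3.put_object(Bucket="cognoma-results", Key="classifiers/best_clf.pkl.zlib", Body=blob)

# Later, fetch and restore the classifier
obj = s3.get_object(Bucket="cognoma-results", Key="classifiers/best_clf.pkl.zlib")
best_clf = pickle.loads(zlib.decompress(obj["Body"].read()))
```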