
Show users a readable version of what's in the pickle file when downloading from the hub

Open adrinjalali opened this issue 3 years ago • 4 comments

When downloading from the hub, we can show users a somewhat human readable version of what's in the file, and warn them that it can be malicious. One can do that by passing the pickle file to fickling: https://github.com/trailofbits/fickling

@BenjaminBossan is this something you'd like to tackle? I think it'd be a nice one.

Also cc @McPatate and @Narsil here.

This doesn't make things "safe" per se, but I'd say it does raise the bar for malicious actors.

adrinjalali, Jul 18 '22 10:07

I'll take a look. It is probably impossible to give any safety guarantees, so I assume we would just rely on fickling (btw, what a bad library name for a German).

> we can show users a somewhat human readable version of what's in the file

You mean the AST?

BenjaminBossan, Jul 18 '22 12:07

Not the AST, just the output of fickling blah.pkl really, to start with.

adrinjalali, Jul 18 '22 12:07

I tried it out just to see some results. It didn't work on PyTorch or skorch code, which is probably not too surprising. I also tested it on a simple sklearn pipeline consisting of a MinMaxScaler and a LogisticRegression. Here are the results:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing._data import MinMaxScaler
from numpy.core.multiarray import _reconstruct
from numpy import ndarray
_var0 = _reconstruct(ndarray, (0,), b'b')
from numpy import dtype
_var1 = dtype('f8', False, True)
_var2 = _var1
_var2.__setstate__((3, '<', None, None, None, -1, -1, 0))
_var3 = _var0
_var3.__setstate__((1, (20,), _var2, False, b'\xda\x97\xdc,\xdd\xc0\xce?\xc0S\xdf\xb7t\xd2\xca?\xe4\xfd\xdb\x9d!L\xca?q\x0bl\xed\xc78\xc3?V\t\xa9l60\xc9?\x98\x80\xa9)\xb7\x1d\xd9?\x81\x98{x\x80 \xc8?\xb8\x8c\xf3\xae\xbaI\xcb?`\xc3\x9f\t\xb2F\xc4?b_\xb3\t\xb6\xb1\xcd?\xcf\x1c\x88\x9c.\x12\xc3?4;\xd3\x85w\x10\xcc?\x9aGN5\xf0\xd3\xc6?(E\xad\x7f\xc6\xc3\xc4?um\xc0\x90\xf85\xc6?\x11\xe7r\xbd\xf3\xdf\xcc?Cu\x0e\x87|6\xcc?\xdda>\x80\x16I\xcc?^\xccD\xed\x0bD\xcd?\xe3(f\xd5\x18h\xc9?'))
_var4 = _reconstruct(ndarray, (0,), b'b')
_var5 = _var4
_var5.__setstate__((1, (20,), _var1, False, b"\xe6V\x0e\xeaX\xc0\xe1?\xe3\x07\x96\x10\xa36\xdd?K-!\xaf\xa1\xc1\xdc?\xa6I\x81$\x81\x18\xd9?9\xbe\xae\x02\x1e\n\xdc?K\x8e\n\x1a\xe9\x8e\xe3?\xe3\x92\xdaa2\x9c\xdd?j\xe9*\xb0%W\xde?{N\x1ef\x9a\xed\xd8?\xafTz2\x8c\x95\xdf?b\xbe\xe6\xe8$\xcf\xe1?\xed\xa5\xdf\xfck/\xe0?\x0b\xa6\xcd\xc3\x8e_\xe1?\xecG,*h\x1a\xdc?w\xe0m<\x08$\xd8?\t\xde\xd2N\xe6\x04\xdc?I\x9d\xb1\x07\xd9\xfe\xda?\x1dXq\xbciM\xe1?\xae\r\nw\xab'\xe2?\xd5\x03\xa6\x10\xf0k\xe2?"))
_var6 = _reconstruct(ndarray, (0,), b'b')
_var7 = _var6
_var7.__setstate__((1, (20,), _var1, False, b'\xb9*\xd5\xd3\x8ex\x02\xc0/W\xa0\x1f.m\x01\xc0\x9f\x0c\xc1\x04\x01\x7f\x01\xc0r(R\x83\xae\xe3\x04\xc0h]zn\xa5\xcf\x01\xc0E-\xb1\xfa9\xeb\xf8\xbf\x18\x8dAP\xdf\xa2\x03\xc0\xb1\xb2\xab\xdb,\xca\x01\xc0\xe9\xed\xc0\x15\xc0\xab\x03\xc0\x17@\xa2(\xb4\x04\x01\xc0eT.]\xf6\xe1\r\xc0\xed8\xcf[|t\x02\xc0\x02.\xa2\xba\x8fZ\x08\xc0.m\xa4\xf4\x8d\xa7\x05\xc0\x99\xdf\x87\x18\xe8c\x01\xc0e#\xd8-=\r\xff\xbf[\x0evD\x87\x9e\xfe\xbf2\x89oV\x17\x93\x03\xc0\xdb\x80\xbfF\xe2\xd9\x03\xc0\x847\x0e\xea\xc93\x07\xc0'))
_var8 = _reconstruct(ndarray, (0,), b'b')
_var9 = _var8
_var9.__setstate__((1, (20,), _var1, False, b'\x93\x95\xec\x83\x07\xa7\xfd?\x8e\xda\xd9[A\xc0\x04@N}\xbeCpq\x05@\x1a=-Q\x151\x10@\x13q%\x7f\xb6\xd7\x06@\xdd>\xab6%\xb4\xef?{FB>Q\xce\x06@r\x9a~\xdbc\xbc\x03@n(\x87&\xef\xd4\x0e@\x03.\x1f\x1flw\x01@T\xb7\x03,\xb5\xcf\x07@\x0b\x13EjW\x08\x02@\xf2\x11.3\xf0\x80\x04@h\x02M0\xeb\xa8\x0b@33\xac\x89\xa0\xb6\x0c@\x1d\x1bO\xf8\x04\xf0\x03@\xabo\xe3vd\xfc\x04@\x87\xefND\xb2\xa0\x00@\x13\x14HP\xedF\xfe?}\xa3\xb8\xfd%\x1a\x01@'))
_var10 = _reconstruct(ndarray, (0,), b'b')
_var11 = _var10
_var11.__setstate__((1, (20,), _var1, False, b'\xc1\xba\xe5J\t\xa6\x10@\xde\x18\xbd\xbd\xb7\x16\x13@\xf6\xc4?\xa48x\x13@SQ\xd6\x92\xec\xa2\x1a@>\xe7\xcf\xf6\xadS\x14@Zf\x03K\xa6b\x04@\xca\xe9AG\x988\x15@\x92&\x95[H\xc3\x12@,\x0b$\x9eW@\x19@\r\xb7\xe0#\x10>\x11@\xdc\x05\x99\xc4\xd5\xd8\x1a@\xfc%\n\xe3i>\x12@\xfa\x1f\xe8\xf6\xbfm\x16@\xcb\xb7x\x92<\xa8\x18@f\t\x1aQD\r\x17@h\x96\x9d\xc7Q\xbb\x11@l;\x8f\x0c\xd4%\x12@\\<_\xcd\xe4\x19\x12@r\xc5qw\xac~\x11@\x80m\xe3\xf3\xf7&\x14@'))
_var12 = MinMaxScaler()
_var12.__setstate__({'feature_range': (0, 1), 'copy': True, 'clip': False, 'n_features_in_': 20, 'n_samples_seen_': 100, 'scale_': _var3, 'min_': _var5, 'data_min_': _var7, 'data_max_': _var9, 'data_range_': _var11, '_sklearn_version': '1.1.1'})
from sklearn.linear_model._logistic import LogisticRegression
_var13 = _reconstruct(ndarray, (0,), b'b')
_var14 = dtype('i8', False, True)
_var15 = _var14
_var15.__setstate__((3, '<', None, None, None, -1, -1, 0))
_var16 = _var13
_var16.__setstate__((1, (2,), _var15, False, b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00'))
_var17 = _reconstruct(ndarray, (0,), b'b')
_var18 = dtype('i4', False, True)
_var19 = _var18
_var19.__setstate__((3, '<', None, None, None, -1, -1, 0))
_var20 = _var17
_var20.__setstate__((1, (1,), _var19, False, b'\x15\x00\x00\x00'))
_var21 = _reconstruct(ndarray, (0,), b'b')
_var22 = _var21
_var22.__setstate__((1, (1, 20), _var1, False, b'\xd6\xa3\x1f\xb9\xf3\\\xd2?\xaa\x88\xa7NnK\xc9\xbfk\xff\x13R!\xa3\xcd?\x1eZ\xc7N\x96U\xdf\xbfC\xe3\xc9\xa2\x81\xc9U?\xdd\x1e.(\x96\xff\xd0?3\xd4|\x1bd\xb9\xe0\xbfx^.\r$\n\xb9\xbf\xc0\xc6\xf3\xe2\x08r\xc7?e\xf9\t\xb2h\xb7\xbd?\xa5v\xe5\x85\xbe\xc6\x0e@+AUP\xb4?\xa8?\xf6\xccy\xf2\xd7\xbd\xed\xbf\xb6\xdf\x11O\xcav\x9b\xbf\x82\t\xb52\x93\x12\xf3?\x8a\x8a\xaf\x86r\x95\xc8?\x0e,f\x12+\xce\xeb?\x1f\x9bp\x0c\xa52\xda\xbf\xb7W\xa9L\x99\xb7\x97\xbf\xc9\r\xc4\x1d\x85\x89\xc2?'))
_var23 = _reconstruct(ndarray, (0,), b'b')
_var24 = _var23
_var24.__setstate__((1, (1,), _var1, False, b'}\x89\xb7;\xe6\x99\x03\xc0'))
_var25 = LogisticRegression()
_var25.__setstate__({'penalty': 'l2', 'dual': False, 'tol': 0.0001, 'C': 1.0, 'fit_intercept': True, 'intercept_scaling': 1, 'class_weight': None, 'random_state': 123, 'solver': 'lbfgs', 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None, 'n_features_in_': 20, 'classes_': _var16, 'n_iter_': _var20, 'coef_': _var22, 'intercept_': _var24, '_sklearn_version': '1.1.1'})
_var26 = Pipeline()
_var26.__setstate__({'steps': [('0', _var12), ('clf', _var25)], 'memory': None, 'verbose': False, '_sklearn_version': '1.1.1'})
result = _var26

Do you think that would be useful for a user? I guess the first few lines containing the imports are interesting, but when I tried it on a function with a local import, that import was not listed, so it is limited.
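To make the "imports are the interesting part" idea concrete, here is a minimal sketch that lists only the imports a pickle would trigger, without unpickling it. It uses the standard library's pickletools rather than fickling, so it is a stand-in and not what fickling does internally; the function name list_pickle_imports is made up for this example. It also hints at why a local import inside a function does not show up: plain pickle stores functions by reference (module plus qualified name), so the function body, including its imports, is never in the file.

```python
import pickle
import pickletools
from collections import Counter, OrderedDict


def list_pickle_imports(data):
    """Return (module, name) pairs referenced by GLOBAL/STACK_GLOBAL opcodes.

    Hypothetical helper: it only scans opcodes and never executes the pickle.
    The stack tracking is a heuristic that is good enough for the strings
    pushed right before STACK_GLOBAL, not a full pickle VM.
    """
    imports = []
    memo = {}    # memo index -> string (or None for non-string objects)
    pushed = []  # recently pushed values: the string, or None otherwise
    for opcode, arg, _pos in pickletools.genops(data):
        name = opcode.name
        if name == "GLOBAL":
            # Protocols 0-3 carry "module name" as one space-separated argument.
            module, attr = arg.split(" ", 1)
            imports.append((module, attr))
            pushed.append(None)
        elif name == "STACK_GLOBAL":
            # Protocol 4+ takes module and name from the two topmost strings.
            if len(pushed) >= 2 and isinstance(pushed[-2], str) and isinstance(pushed[-1], str):
                imports.append((pushed[-2], pushed[-1]))
            pushed.append(None)
        elif name == "MEMOIZE":
            memo[len(memo)] = pushed[-1] if pushed else None
        elif name in ("PUT", "BINPUT", "LONG_BINPUT"):
            memo[arg] = pushed[-1] if pushed else None
        elif name in ("GET", "BINGET", "LONG_BINGET"):
            pushed.append(memo.get(arg))
        elif isinstance(arg, str):
            pushed.append(arg)
        else:
            pushed.append(None)
    return imports


# Small stand-in objects instead of a real sklearn pipeline.
data = pickle.dumps([OrderedDict(a=1), Counter("ab")], protocol=4)
for module, attr in list_pickle_imports(data):
    print(f"from {module} import {attr}")
```

A listing like this could be shown above the full output, since it is the part a user can actually check against what they expect the model to contain.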

BenjaminBossan, Jul 18 '22 14:07

I'd say to start, we can mask the large array values and only show their size. For the purpose of them seeing what's happening, the values in the matrix don't really matter. The rest is somewhat readable, for instance one knows that the output is a Pipeline. WDYT?

Of course this can be improved, but it should be a good start.
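A rough sketch of the masking idea, assuming the fickling output is valid Python source (as the dump above appears to be): parse it with the standard ast module and replace every long bytes literal with a placeholder that states only its size. The names mask_large_bytes and max_len are made up for this example, and ast.unparse needs Python 3.9 or later.

```python
import ast


class _BytesMasker(ast.NodeTransformer):
    """Replace bytes literals longer than max_len with a size placeholder."""

    def __init__(self, max_len):
        self.max_len = max_len

    def visit_Constant(self, node):
        if isinstance(node.value, bytes) and len(node.value) > self.max_len:
            # A string placeholder keeps the result parseable as Python.
            placeholder = ast.Constant(value=f"<{len(node.value)} bytes masked>")
            return ast.copy_location(placeholder, node)
        return node


def mask_large_bytes(source, max_len=16):
    """Return source with large bytes literals masked (hypothetical helper)."""
    tree = _BytesMasker(max_len).visit(ast.parse(source))
    return ast.unparse(tree)


# Shortened stand-in for two lines of the output above.
example = (
    "_var0 = _reconstruct(ndarray, (0,), b'b')\n"
    "_var3.__setstate__((1, (20,), _var2, False, b'" + "\\x00" * 160 + "'))"
)
print(mask_large_bytes(example))
```

Short literals such as b'b' are left alone, so the structure of the reconstruction stays visible while the array payloads shrink to their byte counts. The array shape is already present in the __setstate__ tuple, e.g. (20,).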

adrinjalali, Jul 19 '22 08:07

Now that we have an alternative persistence model, I don't think this makes much sense anymore. We can focus our efforts on our own format.

adrinjalali, Jan 24 '23 16:01