Get rid of `pickle`
Pickle is a security issue.
Hi KOLANICH, thanks for pointing this out.
If using pickle is unacceptable to you due to the related security concerns, it would be helpful if you explained in a bit more details your situation - for example, is it a general problem regardless of circumstances or just in certain situations?
While I understand and agree with you that pickle, in general, is not safe by design, please bear in mind that this is a research project and is not intended to be used in production (as a research lab we do not have enough manpower to maintain production-grade software). Consequently, we assume things are run in a trusted and self-contained environment - I believe under this assumption using pickle should not be a threat, but I'm happy to re-consider this if there is a genuine case where it is a problem.
If this is about our dataset, we certainly can release it in whatever alternative format would suit you best - just let me know what your preference is.
Hi, thank you for responding.
is it a general problem regardless of circumstances or just in certain situations?
Usually it is a general problem. If one can avoid them, one probably should.
- If something executes arbitrary code, that code should be explicitly visible and auditable, not concealed within pickle blobs.
- In a lot of cases choice of a code-executing format over the one which is safe to deserialize is unjustified and puts additional burden on the ones who gonna open it: they have to use the means to reverse engineer pickles instead of just loading the data.
To be honest I myself has used pickles, but only for locally-created cache in the circumstances when every other serialization method turned out to be much slower. But distributing premade pickles is surely not good and should be avoided.
Consequently, we assume things are run in a trusted and self-contained environment
If I understand right, "contained environment" is an impossible thing, unless a separate real disposible air-gapped machine is used. The code is Turing complete, it is impossible to create a program returning 1 if any code does unauthorized access, like exploiting flaws in OS, sandboxes, VMs and hardware and 0 if it does authorized things.
If this is about our dataset, we certainly can release it in whatever alternative format would suit you best - just let me know what your preference is.
I guess, TSV. Or maybe a custom bihary format made of stream of records like
seq:
- id: header
type: header
- id: records
type: record
repeat: expr
repeat-expr: header.count
types:
header:
seq:
- id: version
type: version
- id: name
type: strz
- id: count
type: u4
types:
version:
seq:
- id: major
type: u1
- id: minor
type: u1
record:
seq:
- id: latency
type: f8
- id: len
type: u1
- id: ops
type: u1
enum: op
repeat: expr
repeat-expr: len
enums:
op:
0: none
1: skip_connect
2: nor_conv_1x1
3: nor_conv_3x3
4: avg_pool_3x3
5: input
6: output
7: global
in Kaitai Struct syntax (can be compiled into parsing code).
Hi, apologies for a delayed response - I was traveling and only came back a couple of days ago.
So, as far as I understand your main concern is related to the dataset that is being redistributed - here I completely agree that pickle might not be the best choice so be assured I will soon upload a safer alternative. (Hopefully within the next couple of days but worst case scenario might happen early August.)
When it comes to my comment about execution environment I was referring to the default workflow of our tool, which allows you to produce your own measurements and then use them to train some predictors - we use pickle to save and then load results in different parts of our codebase and in my opinion this particular use case does not introduce significant issues as the user is both the producer and the consumer of pickle data. Under this scenario any security issues related to using pickle in our codebase would need to assume that your system had already been compromised so I am leaning towards saying it is irrelevant.
Hi, apologies for a delayed response - I was traveling and only came back a couple of days ago.
No problem, I also was on vacation ;)
So, as far as I understand your main concern is related to the dataset that is being redistributed - here I completely agree that pickle might not be the best choice so be assured I will soon upload a safer alternative.
Yeah. Thank you.