eagle icon indicating copy to clipboard operation
eagle copied to clipboard

Get rid of `pickle`

Open KOLANICH opened this issue 3 years ago • 5 comments

Pickle is a security issue.

KOLANICH avatar Jul 03 '22 09:07 KOLANICH

Hi KOLANICH, thanks for pointing this out.

If using pickle is unacceptable to you due to the related security concerns, it would be helpful if you explained in a bit more details your situation - for example, is it a general problem regardless of circumstances or just in certain situations?

While I understand and agree with you that pickle, in general, is not safe by design, please bear in mind that this is a research project and is not intended to be used in production (as a research lab we do not have enough manpower to maintain production-grade software). Consequently, we assume things are run in a trusted and self-contained environment - I believe under this assumption using pickle should not be a threat, but I'm happy to re-consider this if there is a genuine case where it is a problem.

vaenyr avatar Jul 04 '22 05:07 vaenyr

If this is about our dataset, we certainly can release it in whatever alternative format would suit you best - just let me know what your preference is.

vaenyr avatar Jul 04 '22 05:07 vaenyr

Hi, thank you for responding.

is it a general problem regardless of circumstances or just in certain situations?

Usually it is a general problem. If one can avoid them, one probably should.

  1. If something executes arbitrary code, that code should be explicitly visible and auditable, not concealed within pickle blobs.
  2. In a lot of cases choice of a code-executing format over the one which is safe to deserialize is unjustified and puts additional burden on the ones who gonna open it: they have to use the means to reverse engineer pickles instead of just loading the data.

To be honest I myself has used pickles, but only for locally-created cache in the circumstances when every other serialization method turned out to be much slower. But distributing premade pickles is surely not good and should be avoided.

Consequently, we assume things are run in a trusted and self-contained environment

If I understand right, "contained environment" is an impossible thing, unless a separate real disposible air-gapped machine is used. The code is Turing complete, it is impossible to create a program returning 1 if any code does unauthorized access, like exploiting flaws in OS, sandboxes, VMs and hardware and 0 if it does authorized things.

If this is about our dataset, we certainly can release it in whatever alternative format would suit you best - just let me know what your preference is.

I guess, TSV. Or maybe a custom bihary format made of stream of records like

seq:
  - id: header
    type: header
  - id: records
    type: record
    repeat: expr
    repeat-expr: header.count
types:
  header:
    seq:
      - id: version
        type: version
      - id: name
        type: strz
      - id: count
        type: u4
    types:
      version:
        seq:
          - id: major
            type: u1
          - id: minor
            type: u1
  record:
    seq:
      - id: latency
        type: f8
      - id: len
        type: u1
      - id: ops
        type: u1
        enum: op
        repeat: expr
        repeat-expr: len
    enums:
      op:
        0: none
        1: skip_connect
        2: nor_conv_1x1
        3: nor_conv_3x3
        4: avg_pool_3x3
        5: input
        6: output
        7: global

in Kaitai Struct syntax (can be compiled into parsing code).

KOLANICH avatar Jul 04 '22 08:07 KOLANICH

Hi, apologies for a delayed response - I was traveling and only came back a couple of days ago.

So, as far as I understand your main concern is related to the dataset that is being redistributed - here I completely agree that pickle might not be the best choice so be assured I will soon upload a safer alternative. (Hopefully within the next couple of days but worst case scenario might happen early August.)

When it comes to my comment about execution environment I was referring to the default workflow of our tool, which allows you to produce your own measurements and then use them to train some predictors - we use pickle to save and then load results in different parts of our codebase and in my opinion this particular use case does not introduce significant issues as the user is both the producer and the consumer of pickle data. Under this scenario any security issues related to using pickle in our codebase would need to assume that your system had already been compromised so I am leaning towards saying it is irrelevant.

vaenyr avatar Jul 15 '22 10:07 vaenyr

Hi, apologies for a delayed response - I was traveling and only came back a couple of days ago.

No problem, I also was on vacation ;)

So, as far as I understand your main concern is related to the dataset that is being redistributed - here I completely agree that pickle might not be the best choice so be assured I will soon upload a safer alternative.

Yeah. Thank you.

KOLANICH avatar Jul 24 '22 19:07 KOLANICH