goodtables.io

Forbid `duplicate-row` check for large files

Open roll opened this issue 8 years ago • 5 comments

Overview

goodtables-py uses O(1) memory when the duplicate-row check is disabled. So we can't allow validating a big file without limits while this check is enabled.
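To illustrate why a duplicate-row check cannot run in constant memory, here is a minimal sketch. It is not goodtables-py's actual implementation; the function name `find_duplicate_rows` and the sample data are made up for illustration. The point is only that the check has to remember every distinct row it has seen so far.

```python
import csv
import io


def find_duplicate_rows(rows):
    """Yield (row_number, first_seen_row_number) for each repeated row.

    Hypothetical helper, not part of goodtables-py.
    """
    # Maps each distinct row to the row number where it first appeared.
    # This mapping grows with the number of distinct rows, which is why
    # the check needs O(n) memory while a streaming check needs only O(1).
    seen = {}
    for row_number, row in enumerate(rows, start=1):
        key = tuple(row)
        if key in seen:
            yield row_number, seen[key]
        else:
            seen[key] = row_number


# Made-up sample: row 4 repeats row 2 (row 1 is the header).
sample = io.StringIO("id,name\n1,a\n2,b\n1,a\n")
duplicates = list(find_duplicate_rows(csv.reader(sample)))
print(duplicates)
```

Storing a hash of each row instead of the row itself would shrink the constant factor, but the memory would still grow linearly with the number of rows.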

roll avatar Dec 18 '17 16:12 roll

I don’t agree. The user needs to decide what memory constraints apply for her use case, and what is considered a ‘big’ file. We shouldn’t forbid anything, but we should indicate which checks incur greater than O(1) memory usage, so the user can make an informed decision.

pwalsh avatar Dec 18 '17 20:12 pwalsh

@pwalsh @amercader I'm not sure we're in sync here. Here I mean only our web service goodtables.io. I'm pretty sure we can't allow users to run our servers out of memory. That's just a security hole.

What we could do is let users choose:

  • validate super big files without duplicate-row check OR
  • validate with duplicate-row check but using reasonable row limit
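The two options above could be enforced by the service with a small policy function. This is only a sketch under assumptions: the function name `resolve_validation_options`, the option keys `checks` and `row_limit`, and the cap of 1000 rows are illustrative and not taken from the goodtables.io codebase.

```python
DUPLICATE_ROW = "duplicate-row"
MAX_ROW_LIMIT_WITH_DUPLICATE_CHECK = 1000  # assumed value, for illustration only


def resolve_validation_options(checks, row_limit=None,
                               max_row_limit=MAX_ROW_LIMIT_WITH_DUPLICATE_CHECK):
    """Return the options the service would actually run with.

    Hypothetical policy:
    - without duplicate-row, the requested row limit is kept as-is
      (None meaning "no limit"), because memory use stays constant;
    - with duplicate-row, the row limit is capped at max_row_limit.
    """
    checks = list(checks)
    if DUPLICATE_ROW not in checks:
        return {"checks": checks, "row_limit": row_limit}
    if row_limit is None:
        effective_limit = max_row_limit
    else:
        effective_limit = min(row_limit, max_row_limit)
    return {"checks": checks, "row_limit": effective_limit}


# A huge file without duplicate-row stays unlimited.
print(resolve_validation_options(["blank-row"]))
# The same request with duplicate-row gets a capped row limit.
print(resolve_validation_options(["blank-row", "duplicate-row"]))
```

Capping the limit rather than rejecting the request keeps the check available to users while bounding the memory one job can consume.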

roll avatar Dec 20 '17 13:12 roll

I'm not sure we allow super big files for now anyway (@amercader?). But this issue is something to take into account. I completely missed it in the first place, but Adria's question yesterday helped me remember that we have checks whose memory consumption grows linearly.

roll avatar Dec 20 '17 13:12 roll

@pwalsh Yes, this is the service, not the library. @roll let's disable this check for now

amercader avatar Dec 20 '17 21:12 amercader

This is an important check, used by consumers of the service, including try.goodtables.io and OpenSpending.

In order to decide when and how this check should be disabled, can you two please define:

  1. What is a big file (what is "big" in these circumstances).
  2. What our expected/supported concurrency loads are (as this is surely as important to memory consumption as file size).
  3. What type of configuration of the app can be used to control this.

Then, the user of this app (OKI, in the case of the goodtables.io deployment) can configure to her needs.
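As a rough way to connect the three questions, the row limit for a linear-memory check can be derived from a memory budget and the expected concurrency. The sketch below is back-of-the-envelope arithmetic only: the function name `max_rows_per_job` and all the numbers (512 MiB budget, 8 concurrent jobs, 256 bytes retained per row) are assumptions, not measurements from goodtables.io.

```python
def max_rows_per_job(memory_budget_bytes, concurrent_jobs, bytes_per_row):
    """Estimate how many rows one job may retain in memory.

    Hypothetical helper: assumes every concurrent job runs a check that
    retains roughly bytes_per_row for each row it has seen.
    """
    if concurrent_jobs <= 0 or bytes_per_row <= 0:
        raise ValueError("concurrent_jobs and bytes_per_row must be positive")
    return memory_budget_bytes // (concurrent_jobs * bytes_per_row)


# Assumed deployment figures, for illustration only.
limit = max_rows_per_job(
    memory_budget_bytes=512 * 1024**2,  # 512 MiB reserved for validation
    concurrent_jobs=8,
    bytes_per_row=256,
)
print(limit)  # 262144
```

A deployment could then expose the budget and concurrency as configuration and derive the row limit from them, instead of hard-coding a definition of a "big" file.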

pwalsh avatar Dec 21 '17 06:12 pwalsh