Update Audit Log format (and maybe file names)

Open HEdingfield opened this issue 7 years ago • 10 comments

After doing a single big run (e.g. config_minneapolis_2013_scale.json), the audit log is too big for even Notepad++ to open (700 MB!). We need to figure out a better way to generate these; maybe splitting at 50 MB or some other more reasonable number. Recommend looking up limits for basic text editors and complying with the lowest one.

HEdingfield avatar Aug 22 '18 23:08 HEdingfield

@CalebKleppner do you have a take on this? I just ran the scale test (by disabling the duplicate CVR check) and the new cleaner audit log format clocks in at ~252MB. Is it worth breaking this up into multiple output files?

moldover avatar Sep 17 '18 14:09 moldover

Yes, we need to break them up into smaller chunks.  And maybe create 3 separate log files:

  1. full ballot info (rankings, precincts) and how ballot was counted in round 1,
  2. a file containing all transfers (round 3, ballot id 12345, transferred from XX to YY, weight .37), and
  3. a file showing how each ballot counted in each round  (ballot ID, counted in round 1, counted in round 2, etc).

Or something like that so we don't have such huge files.  Files 1 and 3 would probably be pretty big, but should be easy to import into a database program.

CalebKleppner avatar Sep 17 '18 20:09 CalebKleppner
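
To make the proposed transfers file (item 2 above) concrete, here is a minimal sketch of what one row per transfer could look like as CSV, which would import easily into a database program. The column names and the `Transfer` type are hypothetical, not something the project has settled on; the sample values come from the example in the list.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass
class Transfer:
    """One ballot transfer. All field names here are hypothetical."""
    round_number: int
    ballot_id: str
    from_candidate: str
    to_candidate: str
    weight: str  # kept as text so the audit trail shows the exact value


def write_transfers(transfers, out):
    """Write a header row followed by one CSV row per transfer."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([f.name for f in fields(Transfer)])
    for transfer in transfers:
        writer.writerow(astuple(transfer))


# Sample row taken from the example in the list above.
buf = io.StringIO()
write_transfers([Transfer(3, "12345", "XX", "YY", "0.37")], buf)
transfer_csv = buf.getvalue()
```

A flat layout like this keeps each line independent, so the file can be split at any line boundary without losing context.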

@CalebKleppner @tarheel

Our understanding was that we would replace the existing audit format with what you are describing as (2) here. I spent several hours on this Saturday: https://github.com/BrightSpots/rcv/issues/146 It seems like now you're saying you want BOTH formats in two (or three?) log files, which was not something we discussed. Please be really specific about what you're looking for so we can build the right thing.

moldover avatar Sep 18 '18 05:09 moldover

And we were happy not to have to support (3) anymore, because having to keep track of that info dramatically increases the memory footprint of the software and slows it down.

tarheel avatar Sep 18 '18 05:09 tarheel

Sorry, really was just thinking out loud about multiple audit files.  I wasn't trying to give specific direction.

So I'd say that any file that exceeds 50MB should probably be broken into chunks.

Jon, how 'bout if you give me a link to your 252MB audit log from your scale test and I check it out for usability, etc.  Then we can think more about the information and format we really want for the log files.

PS I've attached a config file for the scale test that references 13 unique CVR files (with names 2013-minneapolis-mayor-cvr1.xlsx, 2013-minneapolis-mayor-cvr2.xlsx, etc.)  So you just need to copy 2013-minneapolis-mayor-cvr.xlsx 13 times with the right name.

If we do 1 and 2 below in one or two files, then we certainly don't need a separate file for 3.

Possible audit log files for contemplation

  1. full ballot info (rankings, precincts) and how ballot was counted in round 1,
  2. a file containing all transfers (round 3, ballot id 12345, transferred from XX to YY, weight .37), and
  3. a file showing how each ballot counted in each round  (ballot ID, counted in round 1, counted in round 2, etc).

CalebKleppner avatar Sep 18 '18 14:09 CalebKleppner

Ok. We can make sure any log file bigger than 50MB gets split up.

Here's the latest output from the fake Mayor race. I think this will be easier to work with than the scale output.

The lines which include "FINE" are specific to the CVRs. (FINE is a logging level, a mechanism for routing log data in the program.) We are using the CVR input filename(s) to generate CVR IDs (since none are provided in this file), which is why you have a lot of "cvr_2015_portland_mayor.xlsx(1)" and so on.

2018-09-17_21-51-43_audit.log

moldover avatar Sep 18 '18 14:09 moldover
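
For readers unfamiliar with the generated IDs mentioned above, the idea can be sketched roughly as follows. This is only an illustration: the function name is hypothetical, and it assumes the number in parentheses is a 1-based position within the file, which the thread does not state explicitly.

```python
from pathlib import Path


def generate_cvr_ids(cvr_path, record_count):
    """Build fallback IDs of the form '<filename>(<n>)' for a CVR file.

    Assumption: n is a 1-based position within the file.
    """
    name = Path(cvr_path).name
    return [f"{name}({n})" for n in range(1, record_count + 1)]


# Hypothetical path; only the file name appears in the thread.
ids = generate_cvr_ids("data/cvr_2015_portland_mayor.xlsx", 3)
```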

Fixed.

moldover avatar Sep 25 '18 16:09 moldover

I'm going to re-purpose this issue to keep the history since it's more relevant to updates to audit log format.

The other interesting thing Hylton "discovered" is that for multi-file audits (bigger than 50MB) the generation numbers in the log file names run in the reverse of the order you'd expect:

2018-09-25_08-16-39_audit_2.log <--- audit begins here
2018-09-25_08-16-39_audit_1.log
2018-09-25_08-16-39_audit_0.log <--- ends here

Fixing this through the logging module doesn't seem like a good option to me. To get the desired level of control I think we'd need to write to a single file and split it up afterwards. That's a chunk of work, and I really think we need a definitive design for the overall audit format from @CalebKleppner first.

moldover avatar Sep 25 '18 16:09 moldover
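
The "write to a single file and split it up afterwards" idea could look something like the sketch below. It is a rough illustration, not the project's implementation: the function name and the ascending `_1`, `_2`, ... suffix scheme are assumptions, and the demo uses a tiny 20-byte limit where the real threshold discussed here would be 50MB (`50 * 1024 * 1024`).

```python
import os
import tempfile


def split_log(path, max_bytes):
    """Split a log into numbered parts, breaking only at line ends.

    Parts are numbered in ascending order, so part 1 holds the start of
    the audit. A single line longer than max_bytes gets its own part.
    """
    root, ext = os.path.splitext(path)
    parts = []
    out = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            # Start a new part when the next line would exceed the limit.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part_path = f"{root}_{len(parts) + 1}{ext}"
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts


# Demo on a throwaway file: ten 7-byte lines with a 20-byte limit.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "audit.log")
    with open(log_path, "wb") as f:
        f.writelines(b"line %d\n" % i for i in range(10))
    part_paths = split_log(log_path, max_bytes=20)
    part_names = [os.path.basename(p) for p in part_paths]
    with open(part_paths[0], "rb") as f:
        first_part = f.read()
```

Splitting after the fact means the file names are fully under the program's control, at the cost of temporarily needing disk space for both the full log and its parts.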

George and I discussed audit log formats and want to play around with some actual files to see if we can come up with log formats that are best suited to how election administrators and election auditors will use them. Let's not play around with the actual file format any more until George and I get back to you. As for the method of breaking up the audit logs into 50MB chunks and the file names that result, we'll either live with decreasing numbers or you'll figure out how to fix that. But it's not pressing right now.

CalebKleppner avatar Sep 26 '18 20:09 CalebKleppner

SLI has flagged an issue where any validation error results in no audit log. There is a reason for this: we need to parse the audit log location from the config before we can create the audit log. However, we should explore adding validation errors to the audit log or finding a better way to communicate them to the user.

moldover avatar Apr 25 '21 07:04 moldover
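
One possible approach to the validation problem above, sketched under stated assumptions: collect validation errors in memory first, then write them to the configured output location if it could be parsed, or to a fallback directory if it could not. The config keys (`output_dir`, `cvr_files`), the log file name, and the fallback location are all hypothetical and do not reflect the actual config schema.

```python
import os
import tempfile


def validate_config(config):
    """Return a list of readable validation errors (empty if valid)."""
    errors = []
    if not config.get("output_dir"):  # hypothetical key
        errors.append("output_dir is missing")
    if not config.get("cvr_files"):  # hypothetical key
        errors.append("no CVR files listed")
    return errors


def write_validation_log(errors, config, fallback_dir):
    """Write errors next to the audit log if possible, else to a fallback."""
    log_dir = config.get("output_dir") or fallback_dir
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, "validation_errors.log")
    with open(path, "w", encoding="utf-8") as f:
        for error in errors:
            f.write(f"ERROR {error}\n")
    return path


# Demo: a config with no usable output location still produces a log.
with tempfile.TemporaryDirectory() as tmp:
    bad_config = {"cvr_files": []}
    errors = validate_config(bad_config)
    log_path = write_validation_log(errors, bad_config, fallback_dir=tmp)
    with open(log_path, encoding="utf-8") as f:
        logged = f.read()
```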