Review how binary submission attachments are exported

Open ggalmazor opened this issue 6 years ago • 0 comments

Related to this forum post and to issue #665.

The problem

  • We start the audit file sequence at 2 and check whether an audit-2.csv file already exists. If it does, we increment the sequence and check again. This means that the 200th submission does 200 filesystem checks, the 2000th does 2000 checks, and so on.
  • Since the export is parallelized, a check can happen while another thread is still writing the file, which leads to collisions.
  • I think we should review how all this works because the behavior is kind of confusing.
    • If the file doesn’t exist, we copy it. There is a concurrency issue here because another thread could be writing the file at that moment, which always happens with audit files.
    • If the file exists and it has the same hash, we don’t do anything. Hashing is not cheap!
    • If the file exists and hashes are different, we add a suffix, regardless of the "overwrite files" setting. I’m assuming this is done to resolve name collisions between files from different forms.

Sequencing algorithm

This is a rough step-by-step description of how the binary mapper works:

  1. Get the filename from the field's value

  2. Create the output media directory if it doesn't already exist

  3. If the source file (in the storage directory) doesn't exist, do nothing, and end.

  4. If the target file (in the export media directory) doesn't exist, copy the source file to the output media directory, and end.

  5. If the target file exists and it has the same hash as the source file, do nothing, and end.

  6. If the target file exists and it doesn't have the same hash as the source file, copy the file adding a suffix with the next free sequence number to its filename at the target output media directory, and end.

    We find the next free sequence number by, starting with 2, checking if the file that would result already exists or not.

    For example, if audit-2.csv and audit-3.csv exist, we would make 3 checks before we can determine that 4 is the next free sequence number, because audit-4.csv doesn't exist.
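The steps above can be sketched roughly as follows. This is only an illustration, not Briefcase's actual code: the class name BinaryMapperSketch and both method names are hypothetical, the filename is taken from the source path instead of the field's value, and a byte-by-byte comparison stands in for the hash comparison.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Optional;

// Hypothetical sketch of the steps described above; names are illustrative only.
final class BinaryMapperSketch {

  // Probes name-2.ext, name-3.ext, ... until a filename is free.
  // The number of filesystem checks grows with the number of files already exported.
  static int nextFreeSequence(Path mediaDir, String name, String ext) {
    int sequence = 2;
    while (Files.exists(mediaDir.resolve(name + "-" + sequence + ext)))
      sequence++;
    return sequence;
  }

  // Returns the path the attachment ends up at, or empty when the source is missing.
  static Optional<Path> export(Path source, Path mediaDir) throws IOException {
    // Step 1 (assumption: the source filename matches the field's value)
    String filename = source.getFileName().toString();

    // Step 2: create the output media directory if needed
    Files.createDirectories(mediaDir);

    // Step 3: missing source, nothing to do
    if (!Files.exists(source))
      return Optional.empty();

    // Step 4: check-then-copy; not atomic, so parallel exports can race here
    Path target = mediaDir.resolve(filename);
    if (!Files.exists(target))
      return Optional.of(Files.copy(source, target));

    // Step 5: same contents, nothing to do (byte comparison stands in for hashing)
    if (Arrays.equals(Files.readAllBytes(source), Files.readAllBytes(target)))
      return Optional.of(target);

    // Step 6: different contents, copy under the next free sequence number
    int dot = filename.lastIndexOf('.');
    String name = dot < 0 ? filename : filename.substring(0, dot);
    String ext = dot < 0 ? "" : filename.substring(dot);
    int sequence = nextFreeSequence(mediaDir, name, ext);
    return Optional.of(Files.copy(source, mediaDir.resolve(name + "-" + sequence + ext)));
  }
}
```

In this sketch, exporting three submissions that each carry a different audit.csv produces audit.csv, audit-2.csv and audit-3.csv. Exporting the second submission again compares it only against audit.csv, finds a difference, and writes a new audit-4.csv.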

This algorithm has issues when the same filename with different contents appears in more than one submission of the same form: repeated exports will produce duplicated output files. The following screenshot shows how, after three exports of a form, the second image gets repeated, adding one extra image per export:

[Screenshot: output media directory after three exports of a form, with the second image duplicated]

Options we could explore:

  • Synchronize the binary mapper. Tradeoff: bad for performance
  • Use UUID (instance ID) suffixes (which can be safely used in parallel processes). Tradeoff: long filenames
  • Use an AtomicInteger and enforce overwriting files. Tradeoff: since we always start at 2, using the same export location twice, or selecting more than one form with audit data, will overwrite files.
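
To make the last two options more concrete, here is a minimal sketch of how the filenames could be built. Everything in it is hypothetical (the class SuffixOptionsSketch, its method names, and the rule for sanitizing the instance ID); it only shows that neither approach needs to look at the filesystem to pick a name.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; class and method names are illustrative only.
final class SuffixOptionsSketch {
  // Starts at 1 so that the first call to incrementAndGet() returns 2,
  // matching the current convention of starting the sequence at 2.
  private final AtomicInteger lastSequence = new AtomicInteger(1);

  // Inserts "-suffix" before the extension, or appends it when there is none.
  static String addSuffix(String filename, String suffix) {
    int dot = filename.lastIndexOf('.');
    return dot < 0
        ? filename + "-" + suffix
        : filename.substring(0, dot) + "-" + suffix + filename.substring(dot);
  }

  // Instance ID option: the name depends only on the submission, so no shared
  // state is needed. Assumption: characters outside [A-Za-z0-9-] are replaced
  // with '_' to keep the filename portable.
  static String withInstanceId(String filename, String instanceId) {
    return addSuffix(filename, instanceId.replaceAll("[^A-Za-z0-9-]", "_"));
  }

  // AtomicInteger option: one shared counter, safe to call from parallel threads,
  // but it restarts at 2 for every new instance of this class.
  String withNextSequence(String filename) {
    return addSuffix(filename, String.valueOf(lastSequence.incrementAndGet()));
  }
}
```

With one SuffixOptionsSketch instance per export, withNextSequence never hands out the same name twice within that export, but a second export into the same location starts again at audit-2.csv, which is the overwrite tradeoff mentioned above. withInstanceId gives the same name for the same submission on every export, at the cost of longer filenames.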

ggalmazor, Oct 17 '18 09:10