Review how binary submission attachments are exported
Related to this forum post. Related to issue #665.
The problem
- We start the audit file sequence at 2 and check whether there's already an audit-2.csv file. If it exists, we increment the sequence and repeat. This means that the 200th submission does 200 fs checks, the 2000th does 2000 checks, etc.
- Since the export is parallelized, a check could happen while another thread is still writing the file, so collisions happen.
- I think we should review how all this works because the behavior is kind of confusing:
  - If the file doesn't exist, we copy it. There's a concurrency issue here because the file could be in the middle of being written by another thread, which always happens with audit files.
  - If the file exists and it has the same hash, we don't do anything. Hashing is not cheap!
  - If the file exists and the hashes are different, we add a suffix, regardless of the "overwrite files" setting. I'm assuming this is done to resolve name collisions between files from different forms.
Sequencing algorithm
This is a rough step-by-step description of how the binary mapper works:
- Get the filename from the field's value.
- Create the output `media` directory if it doesn't already exist.
- If the source file (in the storage directory) doesn't exist, do nothing, and end.
- If the target file (in the export `media` directory) doesn't exist, copy the source file to the output `media` directory, and end.
- If the target file exists and it has the same hash as the source file, do nothing, and end.
- If the target file exists and it doesn't have the same hash as the source file, copy the file to the output `media` directory, adding a suffix with the next free sequence number to its filename, and end.

We find the next free sequence number by starting at 2 and checking whether the file that would result already exists.

For example, if `audit-2.csv` and `audit-3.csv` exist, we would make 3 checks until we can determine that `4` is the next free sequence number, because `audit-4.csv` doesn't exist.
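The steps above can be sketched roughly as follows. This is not the actual Briefcase code: the class name `BinaryMapperSketch`, the methods `export`, `hash`, `withSequence` and `demo`, and the choice of SHA-256 as the hash are all assumptions made for illustration. The filename is taken from the source path instead of from a field's value.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the sequencing steps; not the real binary mapper.
final class BinaryMapperSketch {

  // Assumption: SHA-256 stands in for whatever hash the real code uses.
  static byte[] hash(Path file) throws IOException {
    try {
      return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  // Builds e.g. media/audit-2.csv from "audit.csv" and 2.
  static Path withSequence(Path mediaDir, String filename, int seq) {
    int dot = filename.lastIndexOf('.');
    String name = dot < 0 ? filename : filename.substring(0, dot);
    String ext = dot < 0 ? "" : filename.substring(dot);
    return mediaDir.resolve(name + "-" + seq + ext);
  }

  // Returns the file that was written, or empty when nothing was copied.
  static Optional<Path> export(Path source, Path mediaDir) throws IOException {
    Files.createDirectories(mediaDir);
    if (!Files.exists(source))
      return Optional.empty();
    String filename = source.getFileName().toString();
    Path target = mediaDir.resolve(filename);
    // Check-then-copy: another thread could create the target in between.
    if (!Files.exists(target))
      return Optional.of(Files.copy(source, target));
    if (Arrays.equals(hash(source), hash(target)))
      return Optional.empty();
    // Linear probe for the next free sequence number, starting at 2.
    int seq = 2;
    while (Files.exists(withSequence(mediaDir, filename, seq)))
      seq++;
    return Optional.of(Files.copy(source, withSequence(mediaDir, filename, seq)));
  }

  // Exports one audit.csv, then a different audit.csv twice.
  static List<String> demo() {
    try {
      Path root = Files.createTempDirectory("export-sketch");
      Path media = root.resolve("media");
      Path first = Files.createDirectories(root.resolve("s1")).resolve("audit.csv");
      Path second = Files.createDirectories(root.resolve("s2")).resolve("audit.csv");
      Files.write(first, "a".getBytes(StandardCharsets.UTF_8));
      Files.write(second, "b".getBytes(StandardCharsets.UTF_8));
      List<String> written = new ArrayList<>();
      for (Path source : Arrays.asList(first, second, second))
        export(source, media).ifPresent(p -> written.add(p.getFileName().toString()));
      return written;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [audit.csv, audit-2.csv, audit-3.csv]
  }
}
```

In this sketch the hash comparison only looks at the unsuffixed target, so exporting the same second file again writes yet another suffixed copy.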
This algorithm has issues when the same filename with different contents appears in more than one submission of the same form: repeated exports will produce duplicated output files. The following screenshot shows how, after three exports of a form, the second image gets repeated, adding one extra image per export:
Options we could explore:
- Synchronize the binary mapper. Tradeoff: bad for performance
- Use UUID (instance ID) suffixes (which can be safely used in parallel processes). Tradeoff: long filenames
- Use an AtomicInteger and enforce overwrite files. Tradeoff: since we always start at 2, using the same export location twice, or selecting more than one form with audit data, will overwrite files.
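
To make the second and third options more concrete, here is a minimal sketch of how the output names could be built without any filesystem checks. Everything in it is hypothetical: the class `SuffixOptionsSketch` and its methods are invented for illustration, and stripping a leading `uuid:` from the instance ID is an assumption (made because a colon is not a valid filename character on some platforms).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of two naming options; not existing Briefcase code.
final class SuffixOptionsSketch {

  // Inserts "-<suffix>" before the file extension, if there is one.
  static String withSuffix(String filename, String suffix) {
    int dot = filename.lastIndexOf('.');
    String name = dot < 0 ? filename : filename.substring(0, dot);
    String ext = dot < 0 ? "" : filename.substring(dot);
    return name + "-" + suffix + ext;
  }

  // UUID option: the name depends only on the submission, so parallel
  // exports need no coordination. Assumption: a "uuid:" prefix is dropped.
  static String withInstanceId(String filename, String instanceId) {
    String prefix = "uuid:";
    String id = instanceId.startsWith(prefix)
        ? instanceId.substring(prefix.length())
        : instanceId;
    return withSuffix(filename, id);
  }

  // AtomicInteger option: one shared counter per export run hands out
  // sequence numbers without checking the filesystem.
  static String withNextSequence(String filename, AtomicInteger counter) {
    return withSuffix(filename, String.valueOf(counter.getAndIncrement()));
  }

  public static void main(String[] args) {
    System.out.println(withInstanceId("audit.csv", "uuid:123e4567-e89b-12d3-a456-426614174000"));
    AtomicInteger counter = new AtomicInteger(2); // the sequence starts at 2
    System.out.println(withNextSequence("audit.csv", counter)); // prints audit-2.csv
    System.out.println(withNextSequence("audit.csv", counter)); // prints audit-3.csv
  }
}
```

The counter version shows where the tradeoff comes from: a fresh counter starts at 2 again on every run, so a second export into the same location would produce the same names.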