Tangerine
CSV outputs show separate/duplicate columns if a variable moves from one section to another due to a form change
Issue
If a change is made to a form that moves a variable to a different section, then the CSV output will show that variable in two columns. This is not necessarily a bug, but Tangerine should handle this scenario.
CSV outputs are designed to show variables by section so they follow the data dictionary. Changes to the structure of the sections and variables in different versions of the form will change the order of the headers in the CSV outputs.
Example
Form version one has the variable `held` in the section `Sought`
{
"_id": "abc",
"formId": "form-1",
"formVersion": "v1",
"form-06b414cc-4971-46da-b121-fd3362e8d1f6.item_Sought.held": "0"
}
Form version two has the variable `held` in the section `Crime`
{
"_id": "def",
"formId": "form-1",
"formVersion": "v1",
"form-06b414cc-4971-46da-b121-fd3362e8d1f6.item_Crime.held": "0"
}
The CSV output for this form will be:
_id | formId | formVersion | item_Sought_disabled | held | item_Crime_disabled | held |
---|---|---|---|---|---|---|
abc | form-1 | v1 | FALSE | 0 | FALSE | UNDEFINED |
def | form-1 | v2 | FALSE | UNDEFINED | FALSE | 0 |
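To illustrate the mechanism, here is a minimal Python sketch. It is not Tangerine's actual csv-reporting code: it assumes headers are collected as the ordered union of section-qualified keys and then labeled by their short variable name. `build_headers` and `short_name` are hypothetical helpers, the `*_disabled` columns are left out for brevity, and the second response is labeled `v2` as in the table above.

```python
# Hypothetical sketch; not Tangerine's actual csv-reporting code.
FORM_KEY = "form-06b414cc-4971-46da-b121-fd3362e8d1f6"
META = ("_id", "formId", "formVersion")

docs = [
    {"_id": "abc", "formId": "form-1", "formVersion": "v1",
     f"{FORM_KEY}.item_Sought.held": "0"},
    {"_id": "def", "formId": "form-1", "formVersion": "v2",
     f"{FORM_KEY}.item_Crime.held": "0"},
]

def build_headers(docs):
    """Ordered union of every key, in first-seen order."""
    headers = []
    for doc in docs:
        for key in doc:
            if key not in headers:
                headers.append(key)
    return headers

def short_name(key):
    """Drop the form and section prefix; metadata keys pass through."""
    return key if key in META else key.rsplit(".", 1)[-1]

headers = build_headers(docs)
labels = [short_name(h) for h in headers]
rows = [[doc.get(h, "UNDEFINED") for h in headers] for doc in docs]

print(" | ".join(labels))
for row in rows:
    print(" | ".join(row))
```

Because the two keys differ only in their section segment, they are distinct headers that share the same label, which is how `held` ends up appearing twice.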
Considerations
- Solutions to the issue will need to consider how to implicitly infer a form version from the csv-reporting metadata
  - The form versioning feature of Tangerine is usually implemented since there is no UI. A solution
  - The form version could be assumed using git history of the form file
- Solutions will also need to consider the impact on the ordering of sections and variables in the outputs
  - Simply combining the variable into one column breaks the current order of the variables in the CSVs
- MySQL outputs do not have this issue since duplicate variables are not allowed
Possible Solutions
- Add a UI option to output CSVs by version. One CSV file per Form Version
- Add a UI option to output CSVs as a distinct set of variables (instead of in data dictionary order)
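As a rough sketch of what the two options could look like, assuming each response carries a reliable `formVersion` (which may not hold in practice), something along these lines would work. `split_by_version` and `flatten_distinct` are hypothetical names, not existing Tangerine functions.

```python
from collections import defaultdict

# Hypothetical sketch; names and data are illustrative only.
FORM_KEY = "form-06b414cc-4971-46da-b121-fd3362e8d1f6"
META = ("_id", "formId", "formVersion")

docs = [
    {"_id": "abc", "formId": "form-1", "formVersion": "v1",
     f"{FORM_KEY}.item_Sought.held": "0"},
    {"_id": "def", "formId": "form-1", "formVersion": "v2",
     f"{FORM_KEY}.item_Crime.held": "0"},
]

def split_by_version(docs):
    """Option 1: group responses so each form version gets its own CSV."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.get("formVersion", "unknown")].append(doc)
    return dict(groups)

def flatten_distinct(doc):
    """Option 2: key each value by bare variable name, ignoring the section."""
    flat = {}
    for key, value in doc.items():
        name = key if key in META else key.rsplit(".", 1)[-1]
        flat.setdefault(name, value)  # on a name collision, keep the first value
    return flat

by_version = split_by_version(docs)

flat_docs = [flatten_distinct(doc) for doc in docs]
distinct_headers = []
for flat in flat_docs:
    for name in flat:
        if name not in distinct_headers:
            distinct_headers.append(name)
distinct_rows = [[flat.get(h, "UNDEFINED") for h in distinct_headers]
                 for flat in flat_docs]
```

With the second option, `held` collapses into a single column, but the header order no longer follows the sections of the data dictionary, which is the trade-off noted under Considerations.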
@esurface - are you certain this is in fact the current behavior? I don't think my experience has reflected this. (The same varname showing up in multiple columns.)
Also wanted to confirm - the illustrative JSON for the second block still says `"formVersion": "v1"`, while the illustrative CSV output says `formVersion` is `v2`. Is that a typo, or are you suggesting that there be some manner of auto-incrementing happening? (I'm assuming the former - I don't think an 'automagical' auto-increment would be the ideal way to go.)
In re: "breaking the order of variables into CSVs" - this is already somewhat broken, in that late-added variables get appended to the end of the CSV column list rather than actually being inserted alongside their neighbors in the instrument proper.
For instance, if I've generated data for an instrument having `SectionA.item1-SectionA.item10`, `SectionB.item1-SectionB.item10`, and `SectionC.item1-SectionC.item10` in that order, and then I add the variable `SectionA.item11`, that new variable will wind up as the 31st item in the column list rather than the 11th. (Ignoring all the metadata columns for the purposes of this example.)
If you want to fix that, that would be cool. But the current reality doesn't seem to match what you're describing under bullet 2.
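The append-at-the-end behavior described above is what you would get if headers were accumulated in first-seen order across responses. That is only a guess at the cause, not something confirmed from the Tangerine code. A minimal Python sketch under that assumption, with a hypothetical `first_seen_headers` helper:

```python
# Hypothetical sketch: assumes CSV headers accumulate in first-seen order.
def first_seen_headers(responses):
    """Return every variable name in the order it was first encountered."""
    headers = []
    seen = set()
    for response in responses:
        for key in response:
            if key not in seen:
                seen.add(key)
                headers.append(key)
    return headers

sections = ["SectionA", "SectionB", "SectionC"]

# Response collected before the change: 10 items per section.
old_response = {f"{s}.item{i}": "x" for s in sections for i in range(1, 11)}

# Response collected after SectionA.item11 was added, in instrument order.
new_response = {}
for s in sections:
    last_item = 11 if s == "SectionA" else 10
    for i in range(1, last_item + 1):
        new_response[f"{s}.item{i}"] = "x"

headers = first_seen_headers([old_response, new_response])
position = headers.index("SectionA.item11") + 1
print(position)
```

Here `position` is 31 rather than 11, because the first 30 headers are fixed by the older response before the new variable is ever seen.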