
Investigate performance and alternatives to clean one-off issues in upstream data

Open · Bisaloo opened this issue 1 year ago · 2 comments

The relevant function is here:

https://github.com/epiverse-trace/sivirep/blob/e91eea2f25facf540be4f1f92c4dd68af5a70853/R/cleaning_data.R#L413-L438

Can we avoid the eval(parse()) call and use a design that would allow users to plug in their own data file of issues (Excel or another format)?
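For illustration, a minimal sketch of what a file-driven design could look like, assuming a hypothetical issues file with columns event, column, and bad_value; these names, and the nombre_evento event column, are placeholders rather than sivirep's actual schema:

```r
apply_issue_file <- function(data, issues_path) {
  # Each row of the issues file names an event, the affected column, and the
  # offending value to drop; readxl::read_excel() could replace read.csv()
  # for Excel input
  issues <- utils::read.csv(issues_path, stringsAsFactors = FALSE)
  for (i in seq_len(nrow(issues))) {
    rule <- issues[i, ]
    # %in% handles NAs gracefully and needs no eval(parse())
    drop <- data[["nombre_evento"]] %in% rule$event &
      data[[rule$column]] %in% rule$bad_value
    data <- data[!drop, , drop = FALSE]
  }
  data
}
```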

Bisaloo · Feb 21, 2024

Do we know for sure whether datasets are stable/frozen once they are uploaded to SIVIGILA, @GeraldineGomez? We could store a list of fingerprints for each dataset in sivirep to verify that this is always the case.
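As a hedged sketch of the fingerprint idea, assuming the downloaded datasets are cached as local files; the file names and reference hashes below are placeholders:

```r
# Placeholder reference fingerprints; real values would be recorded once per
# dataset when it is first verified
reference_md5 <- c(
  "dengue_2020.xls"  = "d41d8cd98f00b204e9800998ecf8427e",
  "malaria_2020.xls" = "0cc175b9c0f1b6a831c399e269772661"
)

check_fingerprints <- function(paths, reference = reference_md5) {
  current <- tools::md5sum(paths)  # base R, no extra dependency
  names(current) <- basename(paths)
  changed <- names(current)[which(current != reference[names(current)])]
  if (length(changed) > 0) {
    warning("Upstream files changed since fingerprinting: ", toString(changed))
  }
  invisible(changed)
}
```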

If so, the simplest option may be to hardcode the specific row numbers we want to exclude for each event.
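A tiny sketch of that hardcoding option, with purely illustrative event codes and row numbers:

```r
# Purely illustrative event codes and row numbers
rows_to_exclude <- list(
  "210" = c(1532L, 8071L),
  "465" = 204L
)

drop_known_bad_rows <- function(data, event_code) {
  bad <- rows_to_exclude[[as.character(event_code)]]
  if (is.null(bad)) {
    return(data)
  }
  data[-bad, , drop = FALSE]
}
```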

What do you think?

Bisaloo · Feb 28, 2024

Hi @Bisaloo,

The datasets aren't frozen; the structure has been updated in some cases. Last year, they added three new columns, and the structure depends on both the event itself and the year. I've attempted to create unified files with the rules, validations, and exceptions that sivirep needs to consider when cleaning the data. These files act as a map and integrate the conditions from the NIH data dictionary.

I prioritized the columns that sivirep uses to generate the analysis and included the key columns from the 'Codification of Events in SIVIGILA' document to simplify the validations. Currently, I'm not taking the year into account as a variable, but it is important, especially because the codification differs in some years, particularly 2012 and 2016.

I'm not sure hardcoding these conditions is the best option; we would need to add N conditions for each year and maintain them as the NIH changes or expands them.

Perhaps an option is to generate the conditions from those files and improve the performance of the expression/condition evaluation?
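As a rough illustration of that idea, the file-driven conditions could avoid eval(parse()) by restricting rules to a small whitelist of operators looked up by name; the rule fields column, operator, and value below are hypothetical, not sivirep's schema:

```r
# Whitelisted operators; anything outside this list is rejected
operators <- list(
  equals     = `==`,
  not_equals = `!=`,
  one_of     = `%in%`
)

rows_matching_rule <- function(data, rule) {
  op <- operators[[rule$operator]]
  if (is.null(op)) {
    stop("Unknown operator in rule file: ", rule$operator)
  }
  op(data[[rule$column]], rule$value)
}

# e.g. rule <- list(column = "edad", operator = "equals", value = -1)
#      bad_rows <- rows_matching_rule(datos_evento, rule)
```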

GeraldineGomez · Mar 5, 2024

Hi @Bisaloo,

The strategy developed with the NIH was to create a "dictionary" in the configuration file with some of the most relevant validations and requirements for each disease. Any errors we encounter are reported to the NIH to help improve their data-cleaning routine.
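As a hedged sketch of that approach, assuming a YAML file read with the yaml package; the file name, keys, and structure here are placeholders and not sivirep's actual configuration layout:

```r
# Hypothetical layout of the dictionary, e.g.:
# dengue:
#   required_columns: [cod_eve, fec_not, edad]
#   invalid_values:
#     edad: [-1, 999]
rules <- yaml::read_yaml("validaciones.yml")

check_required_columns <- function(data, event, rules) {
  missing_cols <- setdiff(rules[[event]]$required_columns, names(data))
  if (length(missing_cols) > 0) {
    warning("Missing columns for ", event, ": ", toString(missing_cols))
  }
  invisible(missing_cols)
}
```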

I will proceed to close this issue.

Many thanks for the discussion and for highlighting some key aspects.

GeraldineGomez · Feb 26, 2025