openmrs-module-initializer icon indicating copy to clipboard operation
openmrs-module-initializer copied to clipboard

Support checksums and conditional loading for a given CSV row

Open mseaton opened this issue 4 years ago • 5 comments

Currently, Iniz is able to determine whether a domain resource has changes based on computing checksums at the file level. This works well to reduce the startup time and ensure metadata isn't being re-saved unnecessarily each time Initializer runs.

However, once a file is determined to have changed, there is no further granular checking done to determine what entities defined in the file have actually changed. Given most of these files are CSV files, where one row within the CSV file represents one entity, it is likely quite possible to introduce a finer-level of granularity, and to maintain information to determine which specific rows in a domain CSV file should be processed. By introducing such capabilities at the level of the Base CSV Processor, we may be able to introduce such a feature for the majority of domains without much effort.

The idea would be to support having a checksum generated for each row in a given csv file that could be used to determine if that row has been changed. An additional file under configuration_checksums/domain that contains a properties file of = would suffice. This would only generate and verify a checksum if a Uuid has been specified on a given row. During processing of a given csv line, the uuid would be retrieved, and if it is not null, a checksum of the entire row would be generated and compared against the last saved checksum for that row, and if it is different then the row would be processed and the new checksum would be saved. Any changes to the header row would also need to be accounted for.

In particular, this would likely speed up processing of domains when there is a large amount of metadata (eg. it could better enable maintaining a single concepts.csv for a dictionary), and would allow better logging of exactly what metadata has been updated in a given update.

Interested in others thoughts about the usefulness and viability of this proposal @mks-d / @Ruhanga / @rbuisson / @brandones / @mogoodrich

mseaton avatar Oct 04 '21 12:10 mseaton

I think that the usual pattern for addressing this problem is to split into multiple CSVs. This seems like it would be more maintainable, anyway.

brandones avatar Oct 04 '21 17:10 brandones

My gut would be to agree with @brandones ... not sure how valuable it would be to individually verify each line vs the overhead. Do we have a case where a single file is very slow to load (and wouldn't be better to be broken up into multiple files like Brandon suggests)?

mogoodrich avatar Oct 04 '21 19:10 mogoodrich

As far as performance is concerned, Concepts is the obvious area, and this is specifically looking towards improving our concept import loading times. But it is also targeting something that I have always wanted to improve - namely, that we should not execute an API save/update on metadata if it has not changed. The reasons for this are many, but in particular there can be lots of downstream implications to saving metadata, particularly in systems like ours that use a lot of AOP and Hibernate Interceptors to kick off events, sync data, etc if certain methods are executed, and "false updates" could happen with things like dateChanged, changedBy, etc.

I guess what I'd throw back to you @mogoodrich and @brandones is - why specifically are you opposed to this? Do you have specific concerns that this would be adversely impact Iniz or your experience with it in some way? What is the basis for your opposition?

As far as what is "maintainable", people can still choose to have one CSV or multiple CSVs, this doesn't force either. But in my long experience with MDS packages, splitting concepts across them is done for performance reasons only, and has the adverse effect that one never knows what package a given concept is in, or if a concept is in multiple packages, etc. So, I'd like to hear some more convincing arguments against this feature.

mseaton avatar Oct 04 '21 19:10 mseaton

I'm not strongly opposed to it... risks would be a little more complexity and potential bugs introduced when adding.

mogoodrich avatar Oct 04 '21 20:10 mogoodrich

Yeah, I'm not really opposed, the question is whether the value add merits the complexity. The biggest difference with MDS is that you can grep CSVs, so it's trivial to find out where concepts are. The point about there being downstream implications to saving sounds like it would be compelling, but I'm not familiar with that stuff, so I don't know of any examples. Interested to hear what the Mekom team says.

brandones avatar Oct 05 '21 00:10 brandones

@mseaton is this still current or we can close this issue?

mks-d avatar Jun 28 '24 13:06 mks-d

It is still current, in that it has not been implemented. I leave it to you @mks-d to decide whether to close this as Won't Fix. I made whatever case I need to make in the comments above, and others have made reasonable arguments that this adds unnecessary complexity. So we could leave it open as a "maybe someday" or close as won't fix. I don't plan on tackling this at PIH, probably ever.

mseaton avatar Jun 28 '24 14:06 mseaton

@mseaton I will give it a label, which means that it's been considered to be a valid feature and I will also close it as "unplanned".

mks-d avatar Jun 28 '24 14:06 mks-d