fhir-data-pipes
Added FHIR Sink Config Properties to the Pipeline Controller
Description of what I changed
Fixes https://github.com/google/fhir-data-pipes/issues/825
Added FHIR sink config properties to the Pipeline Controller to enable syncing data from a FHIR server to another FHIR server:
```yaml
fhirSinkPath: ""
sinkUserName: "hapi"
sinkPassword: "hapi123"
```
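For context, here is a sketch of where these properties would sit in the controller's configuration file. Only the three sink properties are from this PR; the surrounding layout, the `fhirdata` prefix, and all URLs/values are illustrative assumptions:

```yaml
# Hypothetical excerpt of the Pipeline Controller's application.yaml.
fhirdata:
  # Source FHIR server the pipelines fetch from (value is an assumption).
  fhirServerUrl: "http://localhost:8091/fhir"
  # Properties added in this PR: when fhirSinkPath is non-empty, fetched
  # resources are also pushed to this second FHIR server.
  fhirSinkPath: "http://localhost:8098/fhir"
  sinkUserName: "hapi"
  sinkPassword: "hapi123"
```

Leaving `fhirSinkPath` empty (as in the defaults above) disables the FHIR-to-FHIR sync.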
E2E test
Added e2e tests for syncing from one HAPI FHIR server to another using the Pipeline Controller, for both FULL and INCREMENTAL modes.
TESTED:
Successfully fetched data from OpenMRS and a HAPI FHIR server into another HAPI FHIR server for both batch and incremental pipelines.
Checklist: I completed these to help reviewers :)
- [x] I have read and will follow the review process.
- [x] I am familiar with Google Style Guides for the language I have coded in.
  No? Please take some time and review Java and Python style guides.
- [x] My IDE is configured to follow the Google code styles.
  No? Unsure? -> configure your IDE.
- [x] I have added tests to cover my changes. (If you refactored existing code that was well tested you do not have to add tests)
- [x] I ran `mvn clean package` right before creating this pull request and added all formatting changes to my commit.
- [x] All new and existing tests passed.
- [x] My pull request is based on the latest changes of the master branch.
  No? Unsure? -> execute command `git pull --rebase upstream master`
@bashir2, requesting your review here. Fixed the e2e tests.
Hello @chandrashekar-s, after pulling in your changes from https://github.com/google/fhir-data-pipes/pull/935 I am now getting an error when running the e2e tests:

"Run E2E Test for Dockerized Controller and Spark Thriftserver": java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/workspace/e2e-tests/controller-spark/dwh/controller_DWH_TIMESTAMP_2024_02_13T19_50_30_899562839Z/Encounter/ConvertResourceFn_Encounter_output-parquet-th-517-ts-1707853844285-r-760691.parquet is not a Parquet file (too small length: 4)

All I do in this PR is sync data to both a FHIR sink and a Parquet DWH at the same time. The e2e tests initially ran fine before pulling in your changes for fixing the high memory usage.
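As an aside on what the "too small length: 4" message means: a valid Parquet file ends with a footer, a 4-byte footer-length field, and the 4-byte magic `PAR1`, so a 4-byte file can only contain the leading magic written before the writer died mid-flush. The sketch below is a hypothetical illustration of that check, not code from this repo or from the Parquet library:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating why Parquet rejects truncated files:
// the minimum plausible file is leading magic (4) + footer length (4) +
// trailing magic (4) = 12 bytes, and the file must end with "PAR1".
public class ParquetFooterCheck {
  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

  static boolean hasFooterMagic(String path) throws IOException {
    try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
      if (f.length() < 12) {
        // Mirrors the "is not a Parquet file (too small length: N)" case.
        return false;
      }
      byte[] tail = new byte[4];
      f.seek(f.length() - 4);
      f.readFully(tail);
      return java.util.Arrays.equals(tail, MAGIC);
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulate a write abandoned mid-flush: only the 4-byte leading magic.
    java.io.File tmp = java.io.File.createTempFile("truncated", ".parquet");
    tmp.deleteOnExit();
    try (java.io.FileOutputStream out = new java.io.FileOutputStream(tmp)) {
      out.write(MAGIC);
    }
    System.out.println(hasFooterMagic(tmp.getAbsolutePath())); // false
  }
}
```

This matches the symptom above: the pipeline crashed (or was killed) after opening the Encounter output file but before the footer was flushed.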
Hi @mozzy11, usually any errors that occur in the pipeline run are not printed in the logs since the Docker container runs in detached mode. You can use this code, which was recently added by @bashir2, to debug the errors; I haven't tried it myself yet though.
@chandrashekar-s, increasing rowGroupSizeForParquetFiles to 100 MB seems to have solved my error.
@mozzy11 Thanks for the update. Currently, the build sometimes fails due to issues mentioned in this ticket; retriggering usually works.
Regarding increasing the rowGroupSizeForParquetFiles parameter to 100 MB: there is a follow-up PR which needs to be merged for it to fully take effect, so this parameter should not affect the build. Do you mind undoing the changes for this parameter and retriggering a build?
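For intuition on why this parameter matters for memory at all: a Parquet writer typically buffers roughly one row group in memory before flushing, so peak buffering scales with the number of concurrent writers times the row-group size. The back-of-envelope sketch below is purely illustrative; the writer count and sizes are assumptions, not measurements from this pipeline:

```java
// Illustrative only: estimates peak Parquet write-buffer memory under the
// simplifying assumption that each concurrent writer holds one full row
// group in memory before it is flushed to disk.
public class RowGroupMemoryEstimate {

  static long estimatePeakBufferBytes(int concurrentWriters, long rowGroupSizeBytes) {
    return concurrentWriters * rowGroupSizeBytes;
  }

  public static void main(String[] args) {
    final long MB = 1024L * 1024L;
    // Hypothetical scenario: 20 resource-type writers open at once.
    long withSmallGroups = estimatePeakBufferBytes(20, 16 * MB);  // smaller row groups
    long withLargeGroups = estimatePeakBufferBytes(20, 100 * MB); // 100 MB row groups
    System.out.println(withSmallGroups / MB + " MB vs " + withLargeGroups / MB + " MB");
    // prints "320 MB vs 2000 MB"
  }
}
```

So a jump to 100 MB row groups can plausibly multiply buffer pressure several-fold in a constrained CI runner, which is consistent with (though not proof of) the resource-starvation hypothesis discussed below.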
@chandrashekar-s, you're right. The build passes even after reverting the rowGroupSizeForParquetFiles property.
@bashir2 do you have any comments for this PR ??
Yeah I was just reviewing this (after the e2e-tests passed); sorry that debugging our e2e-test failures is not trivial.
Codecov Report
Attention: Patch coverage is 42.85714% with 4 lines in your changes missing coverage. Please review.
Project coverage is 50.49%. Comparing base (1b7199b) to head (68fb9ec).
Files | Patch % | Lines
---|---|---
...java/com/google/fhir/analytics/DataProperties.java | 42.85% | 3 Missing and 1 partial :warning:
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##             master     #947      +/-  ##
============================================
- Coverage     50.56%   50.49%   -0.07%
+ Complexity      674      673       -1
============================================
  Files            91       91
  Lines          5512     5519       +7
  Branches        708      709       +1
============================================
  Hits           2787     2787
- Misses         2464     2470       +6
- Partials        261      262       +1
```
:umbrella: View full report in Codecov by Sentry.
Hello @bashir2, generally I've noticed the tests consistently fail when I run controller pipelines to sync data to a Parquet DWH and a FHIR sink simultaneously, basically because it consumes too many resources and the CI build fails.
I've just added empty params for the FHIR sink so that data is not synced to a FHIR sink by default. I'll work on another PR to enable syncing data to a FHIR sink independently, without depending on the Parquet DWH.
> Generally I've noticed the tests consistently fail when I run controller pipelines to sync data to a Parquet DWH and a FHIR sink simultaneously, basically because it consumes too many resources and the CI build fails.
Thanks @mozzy11 for working on this. What evidence do you have that the root cause of the controller failures is resource starvation? Asking because I took a look at some of your failed e2e runs and it was not obvious to me that that is the reason (for example, this one from last week). Yes, it seems that the controller fails, but I am not sure it is due to resource issues. Can you please reproduce that scenario and then follow the instructions in this comment to see the logs from the controller?
> I've just added empty params for the FHIR sink so that data is not synced to a FHIR sink by default. I'll work on another PR to enable syncing data to a FHIR sink independently, without depending on the Parquet DWH.
I prefer if we have e2e tests that cover this feature, especially because it is a little bit different than usual scenarios we test (for analytics queries); otherwise we may break this in the future without noticing.
BTW, please also respond to the comments from the last review and resolve those that are addressed.
@bashir2 I investigated the cause of the failure and fixed it.
CC @bashir2
Thanks @mozzy11 for debugging and the changes; I'll take a look soon. In the dev. call tomorrow, do you mind discussing this and the FHIR-server to FHIR-server sync scenario in general? I just want to make sure we are on the same page regarding the shortcomings of the current approach and your plans for how to use this in production.
Thanks @bashir2. I will discuss.