data-lake-as-code
OpenTargets dataset update in the S3 buckets
Hi Guys,
I'm from OpenTargets. One of our users reported a problem with OT data fetched from S3: the data appears to contain unexplained duplicates. We believe the problem might be due to how the data is synced from the EBI FTP. The datasets our pipelines generate via Spark are partitioned into smaller chunks, with filenames containing a release-specific hash. Because the hash differs from release to release, the line below probably does not overwrite the content of the S3 buckets; instead, the chunks from each release keep accumulating.
https://github.com/aws-samples/data-lake-as-code/blob/50f57f5b4b81773dfd0a67ab393fe10285899277/scripts/ssmdoc.import.opentargets.latest.json#L30
For more details, please see the issue in our tracker.
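To illustrate what we think is happening, here is a minimal Python sketch that models the bucket as a set of file names. The file name pattern, the hashes, and the `sync` and `release_files` helpers are all made up for illustration; this is not the actual import script, only a simulation of a one-way copy that never deletes anything at the destination.

```python
def release_files(release_hash, n_parts=3):
    """Hypothetical chunk names for one release; the hash is part of each name."""
    return {f"part-{i:05d}-{release_hash}.parquet" for i in range(n_parts)}


def sync(source, dest, delete=False):
    """Simulate a one-way sync of file names from source into dest.

    Without delete, files that exist only in dest are left untouched.
    With delete, dest ends up mirroring source exactly.
    """
    merged = set(dest) | set(source)
    if delete:
        merged &= set(source)
    return merged


# Two consecutive releases with different (made-up) hashes.
old_release = release_files("aaa111")
new_release = release_files("bbb222")

# Plain sync: no name collides, so nothing is overwritten and chunks pile up.
accumulated = sync(new_release, sync(old_release, set()))

# Sync with delete: stale chunks from the old release are removed.
mirrored = sync(new_release, sync(old_release, set()), delete=True)

print(len(accumulated), len(mirrored))
```

Anything that reads the whole prefix after the plain sync would see the chunks of both releases, which would explain the duplication. If the import relies on `aws s3 sync`, its `--delete` flag removes destination files that are absent from the source; clearing the prefix before each import would be another option. We have not checked which of these fits your setup.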