OpenTargets dataset update in the S3 buckets

Open DSuveges opened this issue 2 years ago • 0 comments

Hi Guys,

I'm from OpenTargets. One of our users reported a problem with the OT data fetched from S3: the data appears to contain unexplained duplicates. We believe the problem might be due to how the data is synced from the EBI FTP. The datasets our pipelines generate via Spark are partitioned into smaller chunks whose filenames contain a release-specific hash. As the hash differs from release to release, the line below probably does not overwrite the content of the S3 buckets; instead, these chunks keep accumulating.

https://github.com/aws-samples/data-lake-as-code/blob/50f57f5b4b81773dfd0a67ab393fe10285899277/scripts/ssmdoc.import.opentargets.latest.json#L30

For more details, please see the issue in our tracker.
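
To illustrate the suspected mechanism, here is a minimal Python sketch. It does not reproduce the actual import script; the `sync` and `release` helpers, the `targets/part-...` filenames and the hash values are all hypothetical, and the buckets are modelled as plain dictionaries. It only shows that an additive copy keeps the chunks of every release, while a mirroring copy that deletes files missing from the source keeps only the latest ones.

```python
def sync(source: dict, dest: dict, delete: bool = False) -> None:
    """Copy every source object into dest.

    With delete=True, also remove dest keys that no longer exist in source
    (mirror behaviour). With delete=False, old keys are left in place.
    """
    dest.update(source)
    if delete:
        for key in list(dest):
            if key not in source:
                del dest[key]


def release(hash_id: str, n_parts: int = 2) -> dict:
    """Build a fake release: chunk filenames embed a release-specific hash."""
    return {
        f"targets/part-{i:05d}-{hash_id}.parquet": f"{hash_id}-{i}".encode()
        for i in range(n_parts)
    }


bucket_additive = {}  # synced without deleting stale files
bucket_mirrored = {}  # synced with deletion of stale files

# Two consecutive releases with different (made-up) hashes.
for hash_id in ("aaa111", "bbb222"):
    files = release(hash_id)
    sync(files, bucket_additive)
    sync(files, bucket_mirrored, delete=True)

print(len(bucket_additive))  # 4: chunks from both releases accumulate
print(len(bucket_mirrored))  # 2: only the latest release remains
```

If this is indeed what happens, anything reading the whole prefix would see each record once per accumulated release, which would match the duplication our user reported.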

DSuveges, Jul 11 '22 20:07