
Support for storing parquet files in AWS S3 is not available

Open · Charantl opened this issue 1 year ago

[screenshot of the error]

Charantl avatar Apr 03 '24 12:04 Charantl

I think Beam supports S3 natively; we should probably just add a dependency and maybe set some filesystem options to fix this. I will take a look at this soon.
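For anyone following along, a minimal sketch of what this usually involves on the Beam side (assuming the Beam AWS module, e.g. org.apache.beam:beam-sdks-java-io-amazon-web-services2, is on the classpath; the class and option names below are Beam's, not fhir-data-pipes settings):

```java
// Minimal sketch, not fhir-data-pipes code: with a Beam AWS module on the classpath,
// s3:// paths resolve through Beam's FileSystems API once the AWS options (e.g. region)
// are set on the pipeline options.
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class S3FileSystemCheck {
  public static void main(String[] args) {
    // "--awsRegion" is registered by the Beam AWS module; credentials come from the
    // default AWS provider chain (environment variables, instance profile, etc.).
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs("--awsRegion=us-east-1").create();
    FileSystems.setDefaultPipelineOptions(options);
    // After this, file-based IO (e.g. Parquet sinks) can target s3://bucket/path
    // the same way it targets local paths.
  }
}
```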

bashir2 avatar May 10 '24 16:05 bashir2

S3 support issue

@bashir2, we are still facing the same issue when setting an S3 location in dwhRootPrefix in the docker/config/application.yaml file.
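For reference, the kind of configuration change being described is sketched below (the bucket and path are placeholders, not our real values):

```yaml
# Illustrative fragment (nesting elided) of docker/config/application.yaml: the only
# change from a local deployment is pointing dwhRootPrefix at an S3 URI instead of a
# local path.
dwhRootPrefix: "s3://new-bucket/childDir/prefix"   # example bucket/path
```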

Charantl avatar Jun 20 '24 12:06 Charantl

Thanks @Charantl for the note; when I fixed this last month, I only tested the base pipeline and made sure S3 output locations work. This error is coming from the controller, which I missed; I am going to fix it next week.

bashir2 avatar Jun 21 '24 20:06 bashir2

@Charantl can you please build from HEAD and confirm that the controller/pipelines work with S3 files in your environment?

bashir2 avatar Jul 09 '24 18:07 bashir2

@bashir2 The S3 path location is functioning correctly, and we were able to push the data to S3. However, it appears that the timestamp_start.txt file must be present in the directory specified by dwhRootPrefix. If the timestamp_start.txt file is not located in the specified directory, it results in a NoSuchKeyException error, and the final data is stored in the previous directory. For example, with dwhRootPrefix set to s3://bucket_name/baseDir/childDir/prefix, the application expects timestamp_start.txt in the prefix folder, but the data is being stored in the childDir instead of the prefix directory.

[screenshot of the NoSuchKeyException stack trace]

Charantl avatar Jul 17 '24 06:07 Charantl

Thanks @Charantl for the report; I am unable to reproduce the problem you are facing. Are you sure that there are no files under s3://bucket_name/baseDir/childDir when you start using the controller? The cases I tried all worked fine:

  • Creating a fresh S3 bucket, say new-bucket and setting dwhRootPrefix: "s3://new-bucket/prefix"
  • Doing the same but setting dwhRootPrefix: "s3://new-bucket/childDir/prefix"
  • Creating a random file under s3://new-bucket/childDir and setting dwhRootPrefix: "s3://new-bucket/childDir/prefix"

Also, are you sure you are using the latest version of the code? I ask because the line numbers do not align with the code at HEAD; for example, initialiseLastRunDetails is called on line 241 of PipelineManager.java, but your logs say 243.

Also, the reason I am guessing s3://bucket_name/baseDir/childDir/ is not empty in your case is that the above code branch is triggered (i.e., the stack-trace you sent us). For an empty (or non-existent) "directory", that code branch should not be triggered. (Just to emphasize: the directory being empty is not a requirement and the code should work for the non-empty case too; I am just trying to see how to reproduce the problem you have and potentially offer a solution for your case.)

bashir2 avatar Jul 18 '24 21:07 bashir2

@bashir2 Currently, we are able to push data to S3 when the folders are created automatically as specified by dwhRootPrefix: "s3://bucket-name/baseDir/prefix". Please find our observations below:

  1. Instead of allowing the folders to be created automatically in S3 by setting the path dwhRootPrefix: "s3://bucket-name/baseDir/prefix", we manually created the folders baseDir and prefix, which led to the previously mentioned issue.

  2. Regarding "are you sure you are using the latest version of the code?": yes, we cloned the latest code and added some logs to verify.
  3. When creating tables in Hive from the Parquet files generated in S3 (with createHiveResourceTables set to true), the resource tables were not created. A screenshot of the log is included below: [screenshot of the Hive table-creation log]

  4. Incremental pipeline: it works as expected when using a local directory path. However, after changing the dwh path to S3, the incremental pipeline creates Parquet files on the first incremental run when there are new changes in the source database; in other cases we encounter a FileAlreadyExistsException. A screenshot of the log is included below: [screenshot of the FileAlreadyExistsException]

Charantl avatar Jul 31 '24 06:07 Charantl

An update on this following our conversations: from the list above, 3 and 4 seem to be causing issues. I took a look and here are some notes:

Re. 4) I could not reproduce any incremental-run issues due to S3. I ran the controller with dwhRootPrefix set to an S3 location and continuously updated the FHIR server for a few hours; the controller kept running the incremental pipeline on its schedule without any issues.

That said, from the logs you shared, it seems that for whatever reason (which needs further investigation on your side) one of the incremental runs failed, and after that the controller failed to recover. That particular edge case was indeed a bug for S3 (or any other bucket-based cloud storage, e.g., GCS) and should be fixed by #1185. If the incremental run failure happens again, please share the error.log with us (it is available both in the controller UI and under the data-warehouse incremental-run path).

Re. 3) I am guessing you are using one of the compose-controller*.yaml configs under docker, correct? If that's the case, note that all of those are provided as examples for a quick, local deployment, i.e., everything, including the pipelines and the Spark components, runs locally. If you want a distributed or cloud file-system based deployment, you need to tweak those configurations.

For example, if all you need is to use S3 for storage but run everything locally on a single machine, then I think it is much easier to mount the S3 bucket into the Docker containers. It seems there are plugins to do so (I have not used them myself).
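A rough sketch of that approach, in case it helps (this is not a shipped compose file; the s3fs-fuse mount point, service name, and container path are all placeholders):

```yaml
# Rough sketch, not a shipped config: if the bucket is first mounted on the host
# (for example with s3fs-fuse at /mnt/fhir-dwh), the container sees it as a local
# directory and dwhRootPrefix can stay a plain local path.
services:
  pipeline-controller:            # placeholder; use the service name in your compose file
    volumes:
      - /mnt/fhir-dwh:/dwh        # host directory backed by the mounted S3 bucket
```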

If, on the other hand, you want to test horizontal scalability of Spark on your generated Parquet files, then my suggestion is to set createHiveResourceTables to false for now and manually create the required tables in the Thrift server of your distributed Spark deployment. Once you confirm that this is indeed the approach you want to take, we can work together to extend createHiveResourceTables to support your particular deployment as well.
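To make the manual-table option concrete, here is a hedged sketch of registering one resource type's Parquet output on the Thrift server over JDBC (the endpoint, credentials, table name, and S3 path are all examples, and reading s3a:// paths also needs the usual Hadoop S3A configuration on the Spark side):

```java
// Hedged sketch (all names, URLs, and paths are examples, not project defaults):
// registering one resource type's Parquet directory as a table on the Spark Thrift
// server, as a manual alternative to createHiveResourceTables.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePatientTable {
  public static void main(String[] args) throws Exception {
    // Requires the Hive JDBC driver on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn =
            DriverManager.getConnection("jdbc:hive2://localhost:10000", "hive", "");
        Statement stmt = conn.createStatement()) {
      // Spark SQL's CREATE TABLE ... USING PARQUET over an external location; point
      // LOCATION at one resource type's subdirectory in your data warehouse.
      stmt.execute(
          "CREATE TABLE IF NOT EXISTS patient USING PARQUET "
              + "LOCATION 's3a://bucket-name/baseDir/prefix/Patient'");
    }
  }
}
```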

bashir2 avatar Sep 13 '24 01:09 bashir2