Pathling on Spark cluster hangs

Open liquid36 opened this issue 10 months ago • 3 comments

Hi! I'm trying to use a Spark cluster, but I could not make it work. This is my docker-compose.yaml:

version: "3.8"

services:
  pathling:
    image: aehrc/pathling
    container_name: pathling_server
    ports:
      - "9090:9090"
    environment:
      spark.master: spark://spark-master:7077
      server.port: 9090
      pathling.terminology.enabled: true
      spark.executor.memory: 1g
      pathling.storage.databaseName: test
      # pathling.terminology.verboseLogging: true
      pathling.terminology.cache.defaultExpiry: 3600
      # pathling.terminology.serverUrl: http://localhost:9191/fhir
      pathling.terminology.acceptLanguage: es
      JAVA_TOOL_OPTIONS: >
        -Xmx8g -XX:MaxMetaspaceSize=400m -XX:ReservedCodeCacheSize=240m -Xss1m 
        -Duser.timezone=UTC --add-exports=java.base/sun.nio.ch=ALL-UNNAMED 
        --add-opens=java.base/java.net=ALL-UNNAMED
    volumes:
      - ./.data:/usr/share/staging
      - ./.warehouse:/usr/share/warehouse

  spark-master:
    image: bitnami/spark:3.4.1
    container_name: spark-master
    hostname: spark-master
    ports:
      - "8280:8080" # Spark Master UI
      - "7077:7077" # Connection port for workers
    environment:
      - SPARK_MODE=master

  spark-worker-1:
    image: bitnami/spark:3.4.1
    container_name: spark-worker-1
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=8g
    ports:
      - "8081:8081" # Worker 1 UI

  spark-worker-2:
    image: bitnami/spark:3.4.1
    container_name: spark-worker-2
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=8g
    ports:
      - "8082:8081" # Worker 2 UI

Pathling logs

15:37:57.148 [main] [] INFO  au.csiro.pathling.PathlingServer - Starting PathlingServer using Java 17.0.11 with PID 1 (/app/classes started by root in /)
15:37:57.151 [main] [] INFO  au.csiro.pathling.PathlingServer - The following 2 profiles are active: "core", "server"
15:38:02.514 [main] [] WARN  o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15:38:05.682 [main] [] WARN  o.a.s.s.c.a.SimpleFunctionRegistry - The function date_diff replaced a previously registered function.
15:38:05.892 [main] [] INFO  a.c.p.i.CacheableFileSystemPersistence - Querying latest snapshot from database: file:///usr/share/warehouse/test
15:38:06.085 [main] [] INFO  au.csiro.pathling.PathlingVersion - Pathling build version: 7.0.1+39bc091
15:38:11.911 [main] [] INFO  au.csiro.pathling.fhir.FhirServer - FHIR server initialized
15:38:12.516 [main] [] INFO  au.csiro.pathling.PathlingServer - Started PathlingServer in 16.109 seconds (process running for 17.101)
15:38:31.229 [qtp1558397083-95] [rssyWRJu1ppJAYmx] INFO  a.c.pathling.update.ImportExecutor - Received $import request
15:38:35.475 [qtp1558397083-95] [rssyWRJu1ppJAYmx] INFO  a.c.pathling.update.ImportExecutor - Importing Claim resources (mode: overwrite)
15:38:39.752 [dag-scheduler-event-loop] [] WARN  o.a.spark.scheduler.DAGScheduler - Broadcasting large task binary with size 1009.9 KiB

Spark Logs

25/02/03 15:37:33 WARN Master: Got status update for unknown executor app-20250203153727-0001/0
25/02/03 15:37:33 WARN Master: Got status update for unknown executor app-20250203153727-0001/1
25/02/03 15:38:03 INFO Master: Registering app pathling
25/02/03 15:38:03 INFO Master: Registered app pathling with ID app-20250203153803-0002
25/02/03 15:38:03 INFO Master: Start scheduling for app app-20250203153803-0002 with rpId: 0
25/02/03 15:38:03 INFO Master: Launching executor app-20250203153803-0002/0 on worker worker-20250203150828-172.22.0.4-41013
25/02/03 15:38:03 INFO Master: Launching executor app-20250203153803-0002/1 on worker worker-20250203153632-172.22.0.5-37863
25/02/03 15:38:03 INFO Master: Start scheduling for app app-20250203153803-0002 with rpId: 0
25/02/03 15:38:03 INFO Master: Start scheduling for app app-20250203153803-0002 with rpId: 0
25/02/03 15:38:23 WARN Master: Got status update for unknown executor app-20250203153231-0001/0

Nothing happens after I make an $import request. How can I get more logs? Or is there any Pathling/Spark configuration that I'm missing?

liquid36 avatar Feb 03 '25 15:02 liquid36

Hi @liquid36,

So I think what you are trying to achieve is a Pathling server API that farms out the work to a cluster behind the scenes?

The server image can be used as the server, master and worker - it contains Spark plus the Pathling dependencies. So you probably want a server container (using the aehrc/pathling image), and also a number of workers using the same image but configured to launch Spark in worker mode and point to the server/master container.

It looks like you are trying to do this in Docker Compose. This is totally possible, but if you are open to using Kubernetes (which can be just as easy to set up locally using Minikube or Docker Desktop), there is a pre-made solution for you in the Pathling Helm chart.

The way this works is that you spin up a Pathling server container, give it permission to create its own workers, and tell it how to do so. You can configure how many workers you would like, how many resources to give them, etc. The Pathling server will manage the resources within the Kubernetes cluster to make this happen.

Here is an example clustering configuration for use with the Pathling Helm chart.

johngrimes avatar Feb 04 '25 04:02 johngrimes

Thanks. I made some attempts with Kubernetes, but I could not properly configure a custom S3 bucket.

    fs.s3a.endpoint: https://us-mia-1.linodeobjects.com
    fs.s3a.access.key: 123456789
    fs.s3a.secret.key: 123456789
    pathling.storage.warehouseUrl: "s3a://"
    pathling.storage.databaseName: pathling

I'm getting this error: Caused by: java.lang.IllegalArgumentException: bucket is null/empty

What am I missing?

liquid36 avatar Feb 04 '25 13:02 liquid36

Do you need the bucket name in the pathling.storage.warehouseUrl variable?

For example, pathling.storage.warehouseUrl: "s3a://mybucket"?
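As a rough sketch of what that might look like with the settings from the previous comment: `mybucket` is a placeholder bucket name and the key values are dummies, so substitute your own. Judging by the Pathling log line earlier in this thread (`file:///usr/share/warehouse/test`), the database name appears to be appended to the warehouse URL, so this would presumably resolve to `s3a://mybucket/pathling`.

```yaml
# Sketch only: "mybucket" is a hypothetical bucket name, and the
# credentials below are placeholders rather than real values.
fs.s3a.endpoint: https://us-mia-1.linodeobjects.com
fs.s3a.access.key: "<your-access-key>"
fs.s3a.secret.key: "<your-secret-key>"
# The URL needs a bucket after the scheme; a bare "s3a://" has none,
# which is consistent with the "bucket is null/empty" error.
pathling.storage.warehouseUrl: "s3a://mybucket"
pathling.storage.databaseName: pathling
```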

johngrimes avatar Feb 18 '25 05:02 johngrimes

Let us know if you have any more problems, and feel free to re-open the ticket.

johngrimes avatar Jun 04 '25 01:06 johngrimes