
Model not running on GPU

Open BassieWitkin opened this issue 3 years ago • 1 comments

I am training a model on a VM from a cloud-based service. It is an A100x4 machine with 120 vCPUs and 800 GB of RAM. I have a Python script, which I call MF, where I initialize a Spark session as follows:

def start(gpu=False, spark23=False):
    current_version="3.4.0"
    maven_spark24 = "com.johnsnowlabs.nlp:spark-nlp_2.12:{}".format(current_version)
    maven_gpu_spark24 = "com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:{}".format(current_version)
    maven_spark23 = "com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:{}".format(current_version)
    maven_gpu_spark23 = "com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:{}".format(current_version)
    builder = SparkSession.builder \
        .appName("sparknlp") \
        .config("spark.pyspark.python","python3") \
        .config('spark.driver.memory', "700g") \
        .config("spark.executor.memory", "700g") \
        .config('spark.shuffle.file.buffer', "1MB")\
        .config('spark.memory.offHeap.enabled', True) \
        .config('spark.memory.offHeap.size', "64g")\
        .config('spark.ui.showConsoleProgress', True) \
        .config('spark.dynamicAllocation.enabled', True)\
        .config('spark.python.worker.memory', "64g")\
        .config('spark.rapids.sql.enabled', True)\
        .config('spark.network.timeout', 300000)\
        .config('spark.executor.heartbeatInterval',100000)\
        .config('spark.rdd.compress', True)\
        .config("spark.driver.maxResultSize", "6g")\
        .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'")\
        .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'")\
        .config("spark.cleaner.referenceTracking.cleanCheckpoints", True)
    if gpu and spark23:
        builder.config("spark.jars.packages", maven_gpu_spark23)
    elif spark23:
        builder.config("spark.jars.packages",maven_spark23)
    elif gpu:
        builder.config("spark.jars.packages" ,maven_gpu_spark24)
    else:
        builder.config("spark.jars.packages", maven_spark24)
    return builder.getOrCreate()
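As an aside, the four-way branch at the end of `start()` can be collapsed into a small helper that builds the Maven coordinate directly, which also makes it easy to verify that `gpu=True` really selects the `-gpu` artifact. This is just a sketch mirroring the coordinates above (the version string and artifact names are taken from the snippet, not from Spark NLP's current releases):

```python
def select_package(gpu=False, spark23=False, version="3.4.0"):
    """Build the Spark NLP Maven coordinate for spark.jars.packages."""
    # Spark 2.3.x builds use Scala 2.11; Spark 2.4+/3.x builds use Scala 2.12.
    scala = "2.11" if spark23 else "2.12"
    suffix = ""
    if gpu:
        suffix += "-gpu"
    if spark23:
        suffix += "-spark23"
    return "com.johnsnowlabs.nlp:spark-nlp{}_{}:{}".format(suffix, scala, version)
```

The returned string can then be passed to a single `builder.config("spark.jars.packages", ...)` call instead of four separate branches.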

I call it in the script and start the Spark session with the following code:

    cuda.select_device(1)
    spark = MF.start(gpu=True)

When I run htop, I see that the CPUs are at full use (the machine has 120 of them) and the RAM is close to maxing out, but when I run gpustat or nvidia-smi, the GPU appears not to be utilized at all. How can I get my model to run on the GPU?

BassieWitkin avatar Jul 24 '22 11:07 BassieWitkin

Could you please fill in the template for bug reports, provide code snippets, and any other info that can help us debug this? (CUDA version, driver version, operating system, Spark and Spark NLP versions, etc.)

maziyarpanahi avatar Aug 09 '22 11:08 maziyarpanahi

This issue is stale because it has been open for 120 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

github-actions[bot] avatar Dec 08 '22 00:12 github-actions[bot]