
Add configuration of loggers for PySpark

j-adamczyk opened this issue 2 years ago · 1 comment

**Describe the feature you'd like**

Currently, everything for PySpark (Processor, training, etc.) is logged at the INFO level, including the basic cluster setup, for example:

```xml
<!-- Site specific YARN configuration properties -->
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>10.0.215.164</value>
        <description>The hostname of the RM.</description>
    </property>
    <property>
        <name>yarn.nodemanager.hostname</name>
        <value>algo-1</value>
        <description>The hostname of the NM.</description>
    </property>
    <property>
        <name>yarn.nodemanager.webapp.address</name>
        <value>algo-1:8042</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>5</value>
        <description>Ratio between virtual memory to physical memory.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.am.max-attempts</name>
        <value>1</value>
        <description>The maximum number of application attempts.</description>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,YARN_HOME,AWS_CONTAINER_CREDENTIALS_RELATIVE_URI,AWS_REGION</value>
        <description>Environment variable whitelist</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>32768</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>8</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>32768</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>8</value>
    </property>
</configuration>
```

This results in a lot of CloudWatch logs, with major downsides:

  • most of the logs are completely useless, printing internal infrastructure details
  • searching the logs for anything actually important is very hard, which is a serious problem when training and monitoring mission-critical models
  • for short jobs it is quite costly, with CloudWatch costs high compared to the actual compute costs

Adding an option to configure the log4j logger, or at least some way to limit this (e.g. a minimum logging level), would get rid of this. It should also be very simple to implement.

**How would this feature be used? Please describe.**

Additional argument(s) passed to e.g. `PySparkProcessor`.

**Describe alternatives you've considered**

Full customizability is not necessarily required, but being able to set a minimum log level is very important.
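For illustration, a minimal sketch of how such an argument might look; the `log_level` parameter below is hypothetical and does not exist in the current SDK:

```python
from sagemaker.spark.processing import PySparkProcessor

# Hypothetical: `log_level` is the proposed argument, not a real SDK parameter.
processor = PySparkProcessor(
    base_job_name="example-spark-job",
    framework_version="3.1",
    role="arn:aws:iam::123456789012:role/ExampleRole",  # placeholder role ARN
    instance_count=2,
    instance_type="ml.m5.xlarge",
    log_level="WARN",  # proposed: minimum level forwarded to the log4j config
)
```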

j-adamczyk · Aug 23 '23

I've had some success passing a configuration similar to the one below. You might need to adjust it, but after a bit of experimentation it does what you want. It would be good if this were in the docs, or possibly the default.

```python
configuration = [
    {
        "Classification": "spark-log4j",
        "Properties": {
            "log4j.rootCategory": "WARN",
            "log4j.logger.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver": "WARN",
            "log4j.logger.org.sparkproject.jetty": "WARN",
            "log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle": "ERROR",
            "log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper": "INFO",
            "log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter": "INFO",
            "log4j.logger.org.apache.parquet": "ERROR",
            "log4j.logger.parquet": "ERROR",
        },
    },
    {
        "Classification": "hadoop-log4j",
        "Properties": {
            "log4j.rootCategory": "WARN",
            "log4j.threshold": "WARN",
            "hadoop.root.logger": "WARN,console",
            "hadoop.security.logger": "WARN,NullAppender",
            "hdfs.audit.logger": "WARN,NullAppender",
            "namenode.metrics.logger": "WARN,NullAppender",
            "datanode.metrics.logger": "WARN,NullAppender",
            "rm.audit.logger": "WARN,NullAppender",
        },
    },
    {
        "Classification": "hadoop-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "HADOOP_ROOT_LOGGER": "WARN,console",
                },
                "Configurations": [],
            }
        ],
    },
]
```
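For reference, a list like this can be passed through the `configuration` argument of `PySparkProcessor.run()`; the role ARN and S3 path below are placeholders:

```python
from sagemaker.spark.processing import PySparkProcessor

processor = PySparkProcessor(
    base_job_name="quiet-spark-job",
    framework_version="3.1",
    role="arn:aws:iam::123456789012:role/ExampleRole",  # placeholder role ARN
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

# `configuration` accepts the EMR-style classification list defined above.
processor.run(
    submit_app="s3://example-bucket/code/preprocess.py",  # placeholder script
    configuration=configuration,
)
```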

jmahlik · Aug 23 '23