
Profiling Metaflow using Scalene

Open yudhiesh opened this issue 1 year ago • 2 comments

I use Metaflow for machine learning pipelines at my organisation and would like to integrate Scalene into it to profile CPU, memory, and GPU usage, but I can't see how that would currently work for jobs that run on AWS Batch.

I did manage to get it to work on local compute; the generated HTML report is attached as a screenshot: Screenshot 2022-10-06 at 10 49 36 PM

Metaflow makes it easy to scale up Python code defined as a DAG using resources such as AWS Batch and Kubernetes; an example can be found here. The DAG is containerised for you and then run with some Metaflow-specific commands. Here is an example of the command:

"command": [
    "bash",
    "-c",
    "true && mkdir -p $PWD/.logs && export PYTHONUNBUFFERED=x MF_PATHSPEC=TestFlow/sfn-$METAFLOW_RUN_ID/start/$AWS_BATCH_JOB_ID MF_DATASTORE=s3 MF_ATTEMPT=$((AWS_BATCH_JOB_ATTEMPT-1)) MFLOG_STDOUT=$PWD/.logs/mflog_stdout MFLOG_STDERR=$PWD/.logs/mflog_stderr && mflog(){ T=$(date -u -Ins|tr , .); echo \"[MFLOG|0|${T:0:26}Z|task|$T]$1\" >> $MFLOG_STDOUT; echo $1;  } && mflog 'Setting up task environment.' && python -m pip install requests -qqq && python -m pip install awscli boto3 -qqq && mkdir metaflow && cd metaflow && mkdir .metaflow && i=0; while [ $i -le 5 ]; do mflog 'Downloading code package...'; python -m awscli ${METAFLOW_S3_ENDPOINT_URL:+--endpoint-url=\"${METAFLOW_S3_ENDPOINT_URL}\"} s3 cp s3://metaflow-alerting-test-metaflows3bucket-175pkvpc4sejr/metaflow/TestFlow/data/1b/1b571429149ef342ba3410df8c942f907ff52336 job.tar >/dev/null && mflog 'Code package downloaded.' && break; sleep 10; i=$((i+1)); done && if [ $i -gt 5 ]; then mflog 'Failed to download code package from s3://metaflow-alerting-test-metaflows3bucket-175pkvpc4sejr/metaflow/TestFlow/data/1b/1b571429149ef342ba3410df8c942f907ff52336 after 6 tries. Exiting...' && exit 1; fi && TAR_OPTIONS='--warning=no-timestamp' tar xf job.tar && mflog 'Task is starting.' && (if ! python sample_flow.py dump --max-value-size=0 sfn-${METAFLOW_RUN_ID}/_parameters/${AWS_BATCH_JOB_ID}-params >/dev/null 2>/dev/null; then python -m metaflow.plugins.aws.step_functions.set_batch_environment parameters bibigynuqf && . 
`pwd`/bibigynuqf && python sample_flow.py --with batch:cpu=1,gpu=0,memory=4096,image=python:3.8,queue=arn:aws:batch:us-east-1:999999999999:job-queue/job-queue-metaflow-alerting-test,iam_role=arn:aws:iam::999999999999:role/metaflow-alerting-test-BatchS3TaskRole-1UZBOYBHGVOR8 --quiet --metadata=service --environment=local --datastore=s3 --datastore-root=s3://metaflow-alerting-test-metaflows3bucket-175pkvpc4sejr/metaflow --event-logger=nullSidecarLogger --monitor=nullSidecarMonitor --no-pylint --with=step_functions_internal init --run-id sfn-$METAFLOW_RUN_ID --task-id ${AWS_BATCH_JOB_ID}-params; fi && python sample_flow.py --with batch:cpu=1,gpu=0,memory=4096,image=python:3.8,queue=arn:aws:batch:us-east-1:999999999999:job-queue/job-queue-metaflow-alerting-test,iam_role=arn:aws:iam::role/metaflow-alerting-test-BatchS3TaskRole-1UZBOYBHGVOR8 --quiet --metadata=service --environment=local --datastore=s3 --datastore-root=s3://metaflow-alerting-test-metaflows3bucket-175pkvpc4sejr/metaflow --event-logger=nullSidecarLogger --monitor=nullSidecarMonitor --no-pylint --with=step_functions_internal step start --run-id sfn-$METAFLOW_RUN_ID --task-id ${AWS_BATCH_JOB_ID} --retry-count $((AWS_BATCH_JOB_ATTEMPT-1)) --max-user-code-retries 0 --input-paths sfn-${METAFLOW_RUN_ID}/_parameters/${AWS_BATCH_JOB_ID}-params) 1>> >(python -m metaflow.mflog.tee task $MFLOG_STDOUT) 2>> >(python -m metaflow.mflog.tee task $MFLOG_STDERR >&2); c=$?; python -m metaflow.mflog.save_logs; exit $c"
]

Currently, with the way Scalene works, you can only run the profiler via the Scalene CLI, like so: python3 -m scalene --cli --json --outfile sample.json sample.py.

This means that in order to profile the AWS Batch job, the AWS Batch job definition has to be changed, i.e., python sample_flow.py --with batch:cpu=1,gpu=0,memory=4096 becomes python3 -m scalene sample_flow.py --with batch:cpu=1,gpu=0,memory=4096 inside the long bash command above. Is there any way to use Scalene without changing how the container is run? If not, I can probably see whether I can inject scalene and its arguments into the command when a run is being profiled, but that is a big if.
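If the job definition does have to be modified, the substitution could at least be done programmatically. Below is a minimal, untested sketch of the textual rewrite only: the place to hook this into Metaflow's job-definition generation is an open question, the output path is an assumption, and depending on the Scalene version you may need Scalene's `---` separator so the step's own flags are passed to the script rather than interpreted by Scalene.

```python
# Hypothetical sketch: rewrite the Batch entrypoint so the step runs under
# Scalene. Only the string substitution is shown; where this would be
# injected in Metaflow is not part of any existing Metaflow API.
def inject_scalene(command: str, outfile: str = "profile.json") -> str:
    """Replace the step invocation with a Scalene-wrapped one."""
    target = "python sample_flow.py"
    replacement = (
        f"python3 -m scalene --cli --json --outfile {outfile} sample_flow.py"
    )
    return command.replace(target, replacement)
```

Applied to the long bash command above, this would turn each `python sample_flow.py ...` invocation into a Scalene run that writes its JSON report next to the task's working directory.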

Is it possible to run Scalene as a subprocess to profile the process ID of a Metaflow job run on AWS Batch? Metaflow logs the process ID, and Scalene prints the PID of the process it is profiling to stdout:

Metaflow 2.7.11 executing TestFlow for user:yravindranath
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-10-06 22:43:02.151 Workflow starting (run-id 98):
2022-10-06 22:43:08.532 [98/start/219 (pid 1820)] Task is starting.
2022-10-06 22:43:08.534 [98/start/219 (pid 1820)] 1820 <- HERE
2022-10-06 22:43:18.559 [98/start/219 (pid 1820)] Start Hello World
2022-10-06 22:43:28.250 [98/start/219 (pid 1820)] Task finished successfully.
2022-10-06 22:43:32.495 [98/end/220 (pid 1843)] Task is starting.
2022-10-06 22:43:32.496 [98/end/220 (pid 1843)] 1843 <- HERE
2022-10-06 22:43:41.712 [98/end/220 (pid 1843)] End Bye World
2022-10-06 22:43:50.818 [98/end/220 (pid 1843)] Task finished successfully.
2022-10-06 22:43:52.357 Done!

Based on the output from scalene --help, this does seem somewhat possible, although it's for suspending and resuming profiling:

When running Scalene in the background, you can suspend/resume profiling
for the process ID that Scalene reports. For example:

   % python3 -m scalene  yourprogram.py &
 Scalene now profiling process 12345
   to suspend profiling: python3 -m scalene.profile --off --pid 12345
   to resume profiling:  python3 -m scalene.profile --on  --pid 12345
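If the task were already launched under Scalene, those suspend/resume commands could be driven from a sidecar using the PID that Metaflow logs (e.g. "pid 1820" above). A small untested sketch; the function names are mine, not Metaflow's or Scalene's, and note that scalene.profile only controls a process that is already running under Scalene, it does not attach to arbitrary PIDs:

```python
# Sketch: toggle Scalene profiling for a task by PID, mirroring the
# scalene.profile --on/--off commands shown in the help text.
import subprocess
import sys

def profile_command(pid: int, enabled: bool) -> list:
    """Build the scalene.profile command line for a given PID."""
    flag = "--on" if enabled else "--off"
    return [sys.executable, "-m", "scalene.profile", flag, "--pid", str(pid)]

def set_profiling(pid: int, enabled: bool) -> None:
    """Run the command; requires the target to already be under Scalene."""
    subprocess.run(profile_command(pid, enabled), check=True)
```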

My end goal here would be to run profiles of each step in the DAG in Metaflow and aggregate the final HTML to be included in the Metaflow UI as a Metaflow Card, this way runs can be easily profiled and their reports can be versioned and tracked together with the run. I would need to save the produced HTML document within the container, within each step and load them like in this basic example.

yudhiesh avatar Oct 06 '22 15:10 yudhiesh

I came here looking for a way to profile memory usage in my Metaflow pipeline. Since this approach didn't work for me, I adopted another way to profile memory for my pipeline, using a Gist posted by one of the Metaflow engineers. For those interested, here is the link: profileflow.py

pai-sameen avatar Nov 10 '22 11:11 pai-sameen

I came here looking for a way to profile memory usage in my Metaflow pipeline. Since this approach didn't work for me, I adopted another way to profile memory for my pipeline, using a Gist posted by one of the Metaflow engineers. For those interested, here is the link: profileflow.py

I've used memory_profiler before and it's painfully slow for the pipeline I'm profiling, so it's not a viable option. Scalene has much lower overhead by comparison, so I would prefer to use it over memory_profiler. I'll spend some time working on this soon.

yudhiesh avatar Nov 10 '22 11:11 yudhiesh