
Unexpected high overhead in threading.run when profiling with Scalene (even for simple scripts)

Open MotingLiuu opened this issue 10 months ago • 4 comments

While analyzing a package I wrote myself using --profile-all, I noticed that the threading module showed a significant amount of overhead. (I'm still not very familiar with threads, so please forgive me if my question is naive.)

I wrote a simple loop script and found that as long as I enabled --profile-all, it showed heavy usage under threading.run. Without --profile-all, that portion of the overhead is presumably attributed to the Python part instead.

for i in range(100000000):
    pass
PYTHONPATH=. scalene --cli --profile-all --reduced-profile --outfile "$REPORT_PATH" "$SCRIPT_PATH"
                                                     /home/bigorange/miniconda3/envs/huggingface/lib/python3.11/threading.py: % of time =  66.67% (3.023s) out of 4.534s.                                                      
       ╷       ╷       ╷      ╷       ╷       ╷       ╷               ╷       ╷                                                                                                                                                
       │Time   │–––––– │––––… │–––––– │Memory │–––––– │–––––––––––    │Copy   │                                                                                                                                                
  Line │Python │native │syst… │GPU    │Python │peak   │timeline/%     │(MB/s) │/home/bigorange/miniconda3/envs/huggingface/lib/python3.11/threading.py                                                                         
╺━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸
   ... │       │       │      │       │       │       │               │       │                                                                                                                                                
   982 │       │   64% │   2% │       │       │       │               │       │                self._target(*self._args, **self._kwargs)                                                                                       
   ... │       │       │      │       │       │       │               │       │                                                                                                                                                
       │       │       │      │       │       │       │               │       │                                                                                                                                                
╶──────┼───────┼───────┼──────┼───────┼───────┼───────┼───────────────┼───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╴
       │       │       │      │       │       │       │               │       │function summary for /home/bigorange/miniconda3/envs/huggingface/lib/python3.11/threading.py                                                    
   971 │       │   64% │   2% │       │       │       │               │       │Thread.run                                                                                                                                      
       ╵       ╵       ╵      ╵       ╵       ╵       ╵               ╵       ╵                                                                                                                                                
                                                          /home/bigorange/projects/CS336/experiments/scripts/scalene1.py: % of time =  33.33% (1.511s) out of 4.534s.                                                          
       ╷       ╷       ╷      ╷       ╷       ╷       ╷               ╷       ╷                                                                                                                                                
       │Time   │–––––– │––––… │–––––– │Memory │–––––– │–––––––––––    │Copy   │                                                                                                                                                
  Line │Python │native │syst… │GPU    │Python │peak   │timeline/%     │(MB/s) │/home/bigorange/projects/CS336/experiments/scripts/scalene1.py                                                                                  
╺━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸
   ... │       │       │      │       │       │       │               │       │                                                                                                                                                
     3 │   31% │    1% │   1% │       │       │       │               │       │    pass                                                                                                                                        
   ... │       │       │      │       │       │       │               │       │                                                                                                                                                
       ╵       ╵       ╵      ╵       ╵       ╵       ╵               ╵       ╵                                                                                                                                                

I'm really confused: does this overhead come from Scalene's profiling itself, or from my own program (since my package uses multiprocessing)?

The total runtime with --profile-all is about 4.699 seconds. According to the report, the Python part takes up about 30% of the time (around 1.55 seconds). However, when I run the script by itself, without profiling, it completes in about 2 seconds. I'm confused by this difference in execution time.

OS: Ubuntu 22.04.5 LTS
Python: 3.11.11
Scalene: 1.5.51

MotingLiuu avatar Jun 02 '25 07:06 MotingLiuu

  1. If you use --profile-all, it will profile everything, including Python libraries. To avoid profiling those, you can add --profile-exclude lib/python3 to your command line.

    • This leads me to wonder whether it would be better if Scalene's --profile-all ignored libraries by default; we could add a separate flag --profile-libs for those. I do wonder what your goal is in invoking --profile-all here.
  2. There is overhead associated with memory profiling (something we plan to reduce in the future, but only for Python 3.12+). If you only care about CPU time, add --cpu to your command line (you can also add --gpu).
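Putting both suggestions together, the original invocation from this issue could be adjusted roughly as follows (a sketch, not an endorsed command; $REPORT_PATH and $SCRIPT_PATH are the same placeholders used above):

```shell
# Sketch: --profile-exclude lib/python3 keeps stdlib/library files such as
# threading.py out of the report, and --cpu (optionally --gpu) restricts
# profiling to CPU/GPU time, avoiding memory-profiling overhead.
# Flags are taken from the comments above; paths are placeholders.
PYTHONPATH=. scalene --cli --profile-all --reduced-profile \
    --profile-exclude lib/python3 \
    --cpu --gpu \
    --outfile "$REPORT_PATH" "$SCRIPT_PATH"
```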

emeryberger avatar Jun 11 '25 00:06 emeryberger

Thanks for your response.

I'm using --profile-all because my program integrates and calls multiple packages that I've developed. My objective is to get a unified profile that shows the resource consumption across all of these packages and their internal components. This allows me to see the "big picture" and identify the single slowest part of my program. Without profiling everything together, it would be difficult to get a comparison of resource usage across different parts of my complete program.

MotingLiuu avatar Jun 27 '25 09:06 MotingLiuu

Did you try using --profile-exclude as suggested above?

emeryberger avatar Jul 12 '25 18:07 emeryberger

I gave it a try, and it worked out. Thanks a lot.

MotingLiuu avatar Jul 21 '25 09:07 MotingLiuu