Investigate Memory Usage of Scoring

Open singjc opened this issue 6 months ago • 3 comments

Proteomics Dataset

16 runs, ~32K precursors (target + decoy), ~196K transitions, ~1.4M precursor features (peak-groups)

Command
/usr/bin/time pyprophet score --in merged_osw.parquet --level ms1ms2 --classifier SVM --xeval_num_iter 3 --ss_num_iter 3 --threads 3 --profile

Peak RAM usage is ~17.34 GiB (maximum resident set size of 18182704 KB, from the /usr/bin/time output below)

1902.14user 1397.48system 23:12.49elapsed 236%CPU (0avgtext+0avgdata 18182704maxresident)k
320392inputs+1639776outputs (285major+10407902minor)pagefaults 0swaps
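
As a quick check on the peak-RAM figure, the maxresident value printed by /usr/bin/time can be converted by hand. This is a minimal sketch that assumes GNU time on Linux, where maxresident is reported in units of 1024 bytes; the function names are illustrative and not part of pyprophet.

```python
def maxresident_kb_to_gib(kb: int) -> float:
    """Convert a maxresident value (units of 1024 bytes) to GiB (2**30 bytes)."""
    return kb / 1024**2


def maxresident_kb_to_gb(kb: int) -> float:
    """Convert the same value to decimal GB (10**9 bytes) for comparison."""
    return kb * 1024 / 1e9


maxresident_kb = 18182704  # taken from the /usr/bin/time line above

print(f"{maxresident_kb_to_gib(maxresident_kb):.2f} GiB")  # 17.34 GiB
print(f"{maxresident_kb_to_gb(maxresident_kb):.2f} GB")    # 18.62 GB
```

The ~17.34 figure therefore corresponds to binary gigabytes (GiB); in decimal gigabytes the same peak is about 18.62 GB.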

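Note: The total memory allocated reported by memray is the cumulative (virtual) memory requested over the whole run (e.g. by pandas, numpy, duckdb), not the physical memory actually resident at any one time, so it is not directly comparable to the peak RAM above.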

$ memray stats memray_pyp_score.bin
📏 Total allocations:
	4923580

📦 Total memory allocated:
	13.453GB

📊 Histogram of allocation size:
	min: 1.000B
	----------------------------------------------
	< 7.000B   :   79403 ▇
	< 49.000B  : 2911664 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 345.000B : 1472040 ▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 2.370KB  :  391343 ▇▇▇▇
	< 16.643KB :   55240 ▇
	< 116.825KB:    6330 ▇
	< 820.058KB:    6666 ▇
	< 5.621MB  :     443 ▇
	< 39.460MB :     396 ▇
	<=276.990MB:      55 ▇
	----------------------------------------------
	max: 276.990MB

📂 Allocator type distribution:
	 MALLOC: 4916019
	 MMAP: 6732
	 REALLOC: 702
	 CALLOC: 127

🥇 Top 15 largest allocating locations (by size):
	- <stack trace unavailable> -> 6.132GB     <- This is mostly duckdb
	- __array__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/series.py:1031 -> 1.719GB
	- copy:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/blocks.py:796 -> 1.017GB
	- _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:130 -> 978.292MB
	- _take_nd_ndarray:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/array_algos/take.py:157 -> 790.236MB
	- _merge_ms1ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:218 -> 401.331MB
	- _merge_blocks:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/managers.py:2301 -> 331.308MB
	- vstack:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/numpy/_core/shape_base.py:287 -> 331.302MB
	- _stack_arrays:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/managers.py:2252 -> 316.366MB
	- maybe_convert_platform:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:138 -> 222.684MB
	- collect:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/polars/lazyframe/frame.py:2207 -> 135.000MB
	- get_join_indexers_non_unique:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/reshape/merge.py:1795 -> 130.348MB
	- maybe_infer_to_datetimelike:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1189 -> 111.343MB
	- _isna_array:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/dtypes/missing.py:300 -> 107.266MB
	- <listcomp>:/home/singjc/Documents/github/pyprophet/pyprophet/scoring/data_handling.py:239 -> 103.221MB

🥇 Top 15 largest allocating locations (by number of allocations):
	- <stack trace unavailable> -> 3708032
	- _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:130 -> 897673
	- __init__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/parquet/core.py:317 -> 97523
	- _merge_ms1ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:218 -> 89539
	- read:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:46 -> 64332
	- _init_duckdb_views:/home/singjc/Documents/github/pyprophet/pyprophet/io/_base.py:982 -> 46096
	- open_binary:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/psutil/_common.py:711 -> 7398
	- __array__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/series.py:1031 -> 5711
	- read_schema:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/parquet/core.py:2348 -> 2208
	- _build_nested_paths:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/parquet/core.py:337 -> 1797
	- _to_pandas_without_object_columns:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/polars/dataframe/frame.py:2483 -> 613
	- table_to_dataframe:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/pandas_compat.py:808 -> 285
	- _subst_vars:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/sysconfig.py:156 -> 180
	- _extend_dict:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/sysconfig.py:168 -> 168
	- _to_pandas_without_object_columns:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/polars/dataframe/frame.py:2484 -> 145
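
To see at a glance which package owns most of the allocated bytes, the "by size" lines above can be grouped by owner. The following is a rough sketch, not a memray feature: the regular expression, the `owner` heuristic, and the assumption that memray's KB/MB/GB are powers of 1024 (suggested by the histogram bin edits such as 2.370KB) are all mine. The sample lines are copied from the report above.

```python
import re
from collections import defaultdict

# Assumption: memray prints binary units (1 KB = 1024 B).
UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

# Matches "- <location> -> <number><unit>", ignoring any trailing annotation.
LINE_RE = re.compile(r"^\s*-\s+(?P<loc>.+?)\s+->\s+(?P<num>[\d.]+)(?P<unit>[KMG]?B)")


def owner(location: str) -> str:
    """Heuristically map an allocation location to the package that owns it."""
    if location.startswith("<stack trace unavailable>"):
        return "unavailable"
    match = re.search(r"site-packages/([^/]+)/", location)
    if match:
        return match.group(1)
    if "/pyprophet/pyprophet/" in location:
        return "pyprophet"
    return "other"


def bytes_by_owner(report_lines):
    """Sum the reported sizes (in bytes) per owning package."""
    totals = defaultdict(float)
    for line in report_lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip headers and blank lines
        totals[owner(m["loc"])] += float(m["num"]) * UNIT_BYTES[m["unit"]]
    return dict(totals)


SAMPLE = [
    "- <stack trace unavailable> -> 6.132GB     <- This is mostly duckdb",
    "- __array__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/series.py:1031 -> 1.719GB",
    "- copy:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/blocks.py:796 -> 1.017GB",
    "- _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:130 -> 978.292MB",
    "- _take_nd_ndarray:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/array_algos/take.py:157 -> 790.236MB",
    "- _merge_ms1ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:218 -> 401.331MB",
]

totals = bytes_by_owner(SAMPLE)
for name, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {size / 1024**3:6.2f} GB")
```

Grouping this way only attributes bytes to the frame that made the allocation, so pandas totals include copies triggered by pyprophet code higher up the stack.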

Phosphoproteomics Dataset

20 runs, ~45K precursors (target + decoy), ~5.7M transitions, ~1.8M precursor features (peak-groups)

Command
/usr/bin/time pyprophet score --in merged.oswpq --level ms1ms2 --ss_num_iter 3 --xeval_num_iter 3 --profile

Peak RAM usage is ~9.67 GiB (maximum resident set size of 10141896 KB, from the /usr/bin/time output below)

1271.60user 615.97system 18:56.20elapsed 166%CPU (0avgtext+0avgdata 10141896maxresident)k
168inputs+1204336outputs (95major+6227858minor)pagefaults 0swaps

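Note: The total memory allocated reported by memray is the cumulative (virtual) memory requested over the whole run (e.g. by pandas, numpy, duckdb), not the physical memory actually resident at any one time, so it is not directly comparable to the peak RAM above.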
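
To illustrate why a total-allocated figure can be far larger than peak RAM, here is a small self-contained sketch. It uses Python's standard-library tracemalloc rather than memray, and the `churn` function is a made-up stand-in for code that repeatedly creates and discards temporary buffers (as repeated DataFrame copies might).

```python
import tracemalloc


def churn(n_rounds: int, chunk_bytes: int):
    """Allocate and immediately release a buffer n_rounds times.

    Returns (cumulative_bytes_requested, peak_traced_bytes).
    """
    cumulative = 0
    tracemalloc.start()
    for _ in range(n_rounds):
        buf = bytearray(chunk_bytes)  # temporary buffer
        cumulative += len(buf)        # what a "total allocated" counter adds up
        del buf                       # released before the next round
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
    tracemalloc.stop()
    return cumulative, peak


cumulative, peak = churn(n_rounds=50, chunk_bytes=10_000_000)
print(f"cumulative requested: {cumulative / 1e6:.0f} MB")
print(f"peak traced:          {peak / 1e6:.0f} MB")
```

Because each buffer is freed before the next is created, the cumulative total grows with the number of rounds while the peak stays near the size of a single buffer.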

$ memray stats memray_score.bin

📏 Total allocations:
	8573096

📦 Total memory allocated:
	179.028GB

📊 Histogram of allocation size:
	min: 1.000B
	----------------------------------------------
	< 7.000B   :  195538 ▇
	< 60.000B  : 5567520 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 473.000B : 1390693 ▇▇▇▇▇▇▇
	< 3.604KB  :  627854 ▇▇▇
	< 28.088KB :  384064 ▇▇
	< 218.924KB:  303930 ▇▇
	< 1.666MB  :   79467 ▇
	< 12.988MB :   23061 ▇
	< 101.226MB:     882 ▇
	<=788.964MB:      87 ▇
	----------------------------------------------
	max: 788.964MB

📂 Allocator type distribution:
	 MALLOC: 8375834
	 REALLOC: 136973
	 CALLOC: 42899
	 MMAP: 17390

🥇 Top 15 largest allocating locations (by size):
	- _take_nd_ndarray:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/array_algos/take.py:157 -> 88.081GB
	- <stack trace unavailable> -> 23.864GB
	- <listcomp>:/home/singjc/Documents/github/pyprophet/pyprophet/report.py:173 -> 13.667GB
	- copy:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/blocks.py:796 -> 8.521GB
	- __array__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/series.py:1031 -> 4.738GB
	- unique_with_mask:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:438 -> 4.728GB
	- plot_identification_consistency:/home/singjc/Documents/github/pyprophet/pyprophet/report.py:176 -> 4.695GB
	- unique_with_mask:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:440 -> 3.522GB
	- vstack:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/numpy/_core/shape_base.py:287 -> 3.334GB
	- _merge_blocks:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/managers.py:2301 -> 2.884GB
	- _evaluate_standard:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/computation/expressions.py:73 -> 2.780GB
	- _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:124 -> 1.554GB
	- _stack_arrays:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/managers.py:2252 -> 1.499GB
	- take:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:1239 -> 1.236GB
	- _getitem_bool_array:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/frame.py:4154 -> 1.234GB

🥇 Top 15 largest allocating locations (by number of allocations):
	- <stack trace unavailable> -> 5635464
	- _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:124 -> 948497
	- _init_duckdb_views:/home/singjc/Documents/github/pyprophet/pyprophet/io/_base.py:1246 -> 216931
	- _take_nd_ndarray:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/array_algos/take.py:157 -> 147627
	- __init__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/parquet/core.py:317 -> 119820
	- unique_with_mask:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:440 -> 107224
	- <listcomp>:/home/singjc/Documents/github/pyprophet/pyprophet/report.py:173 -> 105035
	- _any:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/numpy/_core/_methods.py:64 -> 86112
	- unique_with_mask:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:438 -> 84076
	- plot_identification_consistency:/home/singjc/Documents/github/pyprophet/pyprophet/report.py:176 -> 79352
	- read:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:49 -> 64332
	- _write_parquet_with_scores:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:345 -> 64260
	- maybe_convert_indices:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/indexers/utils.py:280 -> 63383
	- transform_affine:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/matplotlib/transforms.py:1865 -> 55828
	- _write_parquet_with_scores:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:351 -> 51864

singjc, Jun 04 '25 16:06

Which branch is this on? It does not seem that I am getting a --profile option

jcharkow, Jun 04 '25 19:06

> Which branch is this on? It does not seem that I am getting a --profile option

Should be available in the master branch: https://github.com/PyProphet/pyprophet/blob/master/pyprophet%2Fcli%2Fscore.py#L241

singjc, Jun 04 '25 19:06

Thanks! I figured it out, I was on the wrong branch.

jcharkow, Jun 04 '25 19:06