vidur
Some confusing points in profiling
Hi Vidur Team!
I am interested in the profiling phase, but I found a couple of things that confuse me.
In vidur/execution_time_predictor/sklearn_execution_time_predictor.py:494
if self._replica_config.num_pipeline_stages > 1:
    send_recv_df = self._load_send_recv_df(self._send_recv_input_file)
    send_recv_df = self._get_send_recv_df_with_derived_features(send_recv_df)
    models["send_recv"] = self._train_model(
        model_name="send_recv",
        df=send_recv_df,
        feature_cols=["num_tokens"],
        target_col="time_stats.send_recv.median",
    )
if self._replica_config.tensor_parallel_size > 1:
    all_reduce_df = self._load_all_reduce_df(self._all_reduce_input_file)
    all_reduce_df = self._get_all_reduce_df_with_derived_features(all_reduce_df)
    models["all_reduce"] = self._train_model(
        model_name="all_reduce",
        df=all_reduce_df,
        feature_cols=["num_tokens"],
        target_col="time_stats.all_reduce.median",
    )
This code trains the models for communication time, but feature_cols is ["num_tokens"], and a num_tokens column does not exist in the communication profiling files.
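To make the question concrete, here is a minimal sketch of what I suspect might be happening. I have not verified this against the Vidur source. My guess is that the `_get_*_df_with_derived_features` helpers add a num_tokens column computed from the message size before training. Everything in the sketch is an assumption for illustration only: the sample column names and values, the helper `add_num_tokens`, and the formula (one token occupying embedding_dim elements of 2 bytes each).

```python
import csv
import io

# Hypothetical stand-in for a communication profiling CSV. The column names
# and values are made up for illustration, not copied from the repository.
SAMPLE_CSV = """size,time_stats.all_reduce.median
4096,0.010
8192,0.015
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def add_num_tokens(rows, embedding_dim, bytes_per_element=2):
    """Return copies of the rows with a derived num_tokens column.

    Assumed derivation (not confirmed): a token occupies
    embedding_dim * bytes_per_element bytes of the transferred message.
    """
    derived = []
    for row in rows:
        new_row = dict(row)
        new_row["num_tokens"] = float(row["size"]) / (embedding_dim * bytes_per_element)
        derived.append(new_row)
    return derived


rows = load_rows(SAMPLE_CSV)
# The raw file has no num_tokens column.
print("num_tokens" in rows[0])

derived_rows = add_num_tokens(rows, embedding_dim=1024)
# After the derivation step the column exists and can serve as a feature.
print("num_tokens" in derived_rows[0])
print([r["num_tokens"] for r in derived_rows])
```

If the real helpers do something like this, it would explain why num_tokens can be used as a feature even though it is missing from the raw files. Could you confirm whether that is the case?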
My other question is about data/profiling/network/a40_pairwise_nvlink/attention.csv. I do not understand why there is an attention data file in the network profiling directory.
Thanks!