Some confusing points in profiling

Open PinappleUnderTheSea opened this issue 8 months ago • 0 comments

Hi Vidur Team! I am interested in the profiling phase, but I found something that confuses me. In vidur/execution_time_predictor/sklearn_execution_time_predictor.py:494

        if self._replica_config.num_pipeline_stages > 1:
            send_recv_df = self._load_send_recv_df(self._send_recv_input_file)
            send_recv_df = self._get_send_recv_df_with_derived_features(send_recv_df)

            models["send_recv"] = self._train_model(
                model_name="send_recv",
                df=send_recv_df,
                feature_cols=["num_tokens"],
                target_col="time_stats.send_recv.median",
            )

        if self._replica_config.tensor_parallel_size > 1:
            all_reduce_df = self._load_all_reduce_df(self._all_reduce_input_file)
            all_reduce_df = self._get_all_reduce_df_with_derived_features(all_reduce_df)

            models["all_reduce"] = self._train_model(
                model_name="all_reduce",
                df=all_reduce_df,
                feature_cols=["num_tokens"],
                target_col="time_stats.all_reduce.median",
            )

The code here trains the models for communication time (send_recv and all_reduce), but feature_cols is ["num_tokens"] in both cases, and a num_tokens column does not exist in the communication profiling files.
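For reference, this is roughly how I checked the file headers. It is only a minimal sketch: the helper name missing_columns and the sample CSV contents are hypothetical stand-ins I made up for illustration, not the real profiling data. It also looks only at the raw header, so a column added later by a derived-features step (such as the _get_*_df_with_derived_features calls above) would not show up in this check.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the names in `required` that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical stand-in for a communication profiling file; the real
# column names in the repository may differ.
sample = io.StringIO("size,time_stats.all_reduce.median\n1024,0.05\n")
print(missing_columns(sample, ["num_tokens"]))  # prints ['num_tokens']
```

For a real file, the same helper can be called on an open file handle, e.g. with open(path, newline="") as f: missing_columns(f, ["num_tokens"]).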

My other question is about data/profiling/network/a40_pairwise_nvlink/attention.csv. I do not understand why there is an attention data file in the network profiling directory.

Thanks!

Apr 23 '25 13:04 PinappleUnderTheSea