[data] [feature] provide tools for non-IID data analysis

Open chaoyanghe opened this issue 3 years ago • 1 comments

May 02 '22 00:05 chaoyanghe

Thanks for creating the issue. Below are some thoughts on the data analysis. I will continue to update the post if new ideas come up.

Class non-IID. Provide a simple API that reports the distribution of samples per class and the number of classes held by each client.
Attribute distributions. This is critical if different social groups are included in FL. Providing attribute distributions can help assess the risk of social issues such as fairness and privacy.
Client similarity in the feature space. With this tool, we can find distributionally similar clients and identify outliers that may hurt global convergence. A 2-dimensional visualization of feature distributions (computed in a privacy-preserving manner) and a connectivity graph of clients would be nice.
Gradient/update analysis. The motivation is the same as for client feature similarity. Gradient analysis is widely used in robust training. For example, an outlier gradient could indicate a malicious client poisoning the training.
Beyond data non-IID, analyzing the distribution of device capabilities would help locate performance bottlenecks.
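
To make the first item concrete, here is a minimal sketch of what a per-client class statistics helper could look like. The function names `class_distribution` and `class_counts` and the dict-of-labels input format are assumptions made for illustration, not an existing API; a real version would read labels from the client datasets.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence

# Hypothetical input format: client id -> list of class labels held by that client.
ClientLabels = Dict[str, Sequence[Hashable]]


def class_distribution(client_labels: ClientLabels) -> Dict[str, Dict[Hashable, int]]:
    """Return, for each client, a mapping from class label to sample count."""
    return {client: dict(Counter(labels)) for client, labels in client_labels.items()}


def class_counts(client_labels: ClientLabels) -> Dict[str, int]:
    """Return the number of distinct classes held by each client."""
    return {client: len(set(labels)) for client, labels in client_labels.items()}


# Toy data (made up for this example): two clients with skewed label sets.
labels = {
    "client_0": [0, 0, 0, 1],
    "client_1": [1, 2, 2, 2, 2],
}
dist = class_distribution(labels)
num_classes = class_counts(labels)
```

Returning plain dictionaries keeps the output easy to feed into a plotting or table utility later; the same shape could probably be reused for attribute distributions by passing attribute values instead of class labels.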

May 02 '22 00:05 jyhong836