
[data] [feature] provide tools for non-IID data analysis

Open chaoyanghe opened this issue 3 years ago • 1 comment

chaoyanghe avatar May 02 '22 00:05 chaoyanghe

Thanks for creating the issue. Below are some thoughts on the data analysis. I will continue to update the post if new ideas come up.

  • For class non-IID, provide a simple API that reports the distribution of samples per class and the number of classes present.
  • Attribute distributions. This is critical when different social groups participate in FL. Reporting attribute distributions helps assess the risk of social issues, such as those related to fairness and privacy.
  • Client similarity in the feature space. With this tool, we can find distributionally similar clients and identify outliers that may hurt global convergence. A 2-dimensional visualization of feature distributions (computed in a privacy-preserving manner) and a connectivity graph of clients would be nice.
  • Gradient/update analysis. The motivation is similar to that for client feature similarity. Gradient analysis is widely used for robust training. For example, an outlier gradient could come from a malicious client poisoning the training.
  • Beyond data non-IID, analyzing the distribution of device capabilities will help find performance bottlenecks.
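To make the class non-IID item concrete, here is a minimal sketch of what such a helper could look like. It is not an existing FedML API: the function names (`class_distribution`, `classes_per_client`, `label_skew`) and the input format (a dict mapping client id to a list of labels) are assumptions for illustration only. The skew score is the total variation distance between a client's label distribution and the pooled one, which is one possible choice among several.

```python
from collections import Counter


def class_distribution(client_labels):
    """Return {client_id: {class_label: sample_count}}."""
    return {cid: dict(Counter(labels)) for cid, labels in client_labels.items()}


def classes_per_client(client_labels):
    """Return {client_id: number of distinct classes held by that client}."""
    return {cid: len(set(labels)) for cid, labels in client_labels.items()}


def label_skew(client_labels):
    """Total variation distance between each client's label distribution
    and the pooled (global) label distribution. 0 means identical, 1 means
    disjoint. Clients with no samples get None."""
    global_counts = Counter()
    for labels in client_labels.values():
        global_counts.update(labels)
    total = sum(global_counts.values())

    skew = {}
    for cid, labels in client_labels.items():
        n = len(labels)
        if n == 0 or total == 0:
            skew[cid] = None
            continue
        local = Counter(labels)
        skew[cid] = 0.5 * sum(
            abs(local[c] / n - global_counts[c] / total) for c in global_counts
        )
    return skew


# Hypothetical toy partition: client "a" is mixed, client "b" holds one class.
toy = {"a": [0, 0, 1], "b": [1, 1, 1]}
dist = class_distribution(toy)
n_classes = classes_per_client(toy)
skew = label_skew(toy)
```

The same counting approach would apply to attribute distributions if attribute values are passed in place of class labels.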

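The client-similarity and gradient/update items could share one building block: a pairwise similarity matrix over per-client vectors. The sketch below is an assumption-laden illustration, not a FedML function and not a vetted robust-aggregation defense. It takes flattened updates (or feature summaries) as plain lists, uses cosine similarity, and flags a client whose median similarity to the others falls below a threshold; the names and the default threshold of 0.0 are arbitrary choices.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Return (sorted client ids, matrix of pairwise cosine similarities)."""
    ids = sorted(vectors)
    matrix = [[cosine(vectors[i], vectors[j]) for j in ids] for i in ids]
    return ids, matrix


def flag_outliers(vectors, threshold=0.0):
    """Flag clients whose median similarity to all other clients is below
    `threshold` (an assumed, untuned cutoff)."""
    ids, sim = similarity_matrix(vectors)
    flagged = []
    for row, cid in enumerate(ids):
        others = [sim[row][col] for col in range(len(ids)) if col != row]
        if others and median(others) < threshold:
            flagged.append(cid)
    return flagged


# Hypothetical flattened updates: "d" points opposite to the other three.
updates = {"a": [1.0, 0.1], "b": [1.0, -0.1], "c": [0.9, 0.0], "d": [-1.0, 0.0]}
ids, sim = similarity_matrix(updates)
suspects = flag_outliers(updates)
```

Thresholding the same matrix would also give an adjacency structure for a client connectivity graph. Sharing raw per-client vectors may itself leak information, so the privacy concern raised above still applies.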
jyhong836 avatar May 02 '22 00:05 jyhong836