FedML
[data] [feature] provide tools for non-IID data analysis
Thanks for creating the issue. Below are some thoughts on the data analysis. I will continue to update the post if new ideas come up.
- For class non-IID, provide a simple API that reports the number of samples per class on each client and the number of classes each client holds.
- Attribute distributions. These are critical when different social groups participate in FL. Reporting attribute distributions helps assess the risk of social issues such as fairness and privacy.
- Client similarity in the feature space. With this tool, we can find distributionally similar clients and identify outliers that may hurt global convergence. A 2-dimensional visualization of feature distributions (computed in a privacy-preserving manner) and a connectivity graph of clients would be nice.
- Gradient/update analysis. The motivation is similar to that of client feature similarity. Gradient analysis is widely used for robust training. For example, an outlier gradient could indicate a malicious client poisoning the training.
- Beyond data non-IID, analyzing the distribution of device capabilities will help find performance bottlenecks.
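
To make the class-distribution and update-analysis items more concrete, here is a rough sketch of what such helpers might look like. This is only an illustration under assumptions: the function names (`class_counts_per_client`, `classes_present_per_client`, `update_cosine_to_median`) are hypothetical and not part of FedML, labels are assumed to be integers in `[0, num_classes)`, and each client update is assumed to be already flattened into a 1-D vector. Comparing against the coordinate-wise median is just one simple choice of reference, not an established FedML method.

```python
import numpy as np


def class_counts_per_client(client_labels, num_classes):
    """Return a (num_clients, num_classes) matrix of per-class sample counts.

    Hypothetical helper: `client_labels` is a list with one sequence of
    integer labels per client, each label assumed to lie in [0, num_classes).
    """
    rows = []
    for labels in client_labels:
        labels = np.asarray(labels, dtype=np.int64)
        if labels.size and (labels.min() < 0 or labels.max() >= num_classes):
            raise ValueError("labels must lie in [0, num_classes)")
        # bincount pads with zeros so every client row has the same width
        rows.append(np.bincount(labels, minlength=num_classes))
    return np.stack(rows)


def classes_present_per_client(counts):
    """Number of distinct classes with at least one sample, per client."""
    return (counts > 0).sum(axis=1)


def update_cosine_to_median(updates):
    """Cosine similarity of each client update to the coordinate-wise median.

    Hypothetical helper: `updates` is a (num_clients, dim) array of flattened
    updates. Low or negative values mark candidates for closer inspection;
    they do not prove that a client is malicious.
    """
    updates = np.asarray(updates, dtype=np.float64)
    reference = np.median(updates, axis=0)
    norms = np.linalg.norm(updates, axis=1) * np.linalg.norm(reference)
    # Guard against division by zero for all-zero updates or reference
    return (updates @ reference) / np.maximum(norms, 1e-12)


# Toy example: three clients, three classes, one client with no data
counts = class_counts_per_client([[0, 0, 1], [2, 2, 2, 2], []], num_classes=3)
print(counts)
print(classes_present_per_client(counts))

# Toy example: the last update points in the opposite direction
sims = update_cosine_to_median([[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [-1.0, 0.0]])
print(sims)
```

The count matrix can be normalized row-wise to get per-client class proportions, which is usually the more useful view when clients differ a lot in size.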