ipex-llm
[Discussion] Operations needed to be supported in shards
To improve the user experience of Orca shards, this issue was created to discuss which operations Orca shards need to support.
- [ ] Scaler
  - [x] minmaxscaler
  - [x] standardscaler https://github.com/intel-analytics/BigDL/pull/5716
- [ ] Encode categorical variables
  - [x] label encoder
  - [ ] onehot encoding (get_dummies in pandas)
- [ ] Merge (join). Has a task: https://github.com/orgs/analytics-zoo/projects/14/views/4
- [ ] Not (`~` operation in pandas)
- [ ] statistics
  - [ ] missing values
    - [ ] count missing values for each column
    - [ ] delete null values
    - [ ] fill in null values (maybe various imputations)
  - [ ] groupby
  - [ ] agg
  - [ ] mean
  - [ ] max
  - [ ] sum
  - [ ] sort_values (nice to have)
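
For reference, here is a minimal sketch of what the unchecked items look like in plain pandas. It is not Orca shards code and does not use any shards API; the toy data and column names (`city`, `price`, `sold`, `region`) are made up for illustration only.

```python
import pandas as pd

# Toy data; all column names here are hypothetical.
df = pd.DataFrame({
    "city": ["a", "b", "a", "c"],
    "price": [10.0, None, 30.0, 50.0],
    "sold": [True, False, True, False],
})

# Missing values: count per column, delete rows with nulls, or fill them.
missing_per_column = df.isnull().sum()
dropped = df.dropna()
filled = df.fillna({"price": df["price"].mean()})  # mean imputation as one option

# Not: `~` inverts a boolean mask.
unsold = df[~df["sold"]]

# One-hot encoding of a categorical column.
onehot = pd.get_dummies(df, columns=["city"])

# Merge (join) with a second table on a shared key.
regions = pd.DataFrame({"city": ["a", "b"], "region": ["north", "south"]})
merged = df.merge(regions, on="city", how="left")

# groupby + agg (mean, max, sum), followed by sort_values.
stats = (
    filled.groupby("city")["price"]
    .agg(["mean", "max", "sum"])
    .sort_values("mean", ascending=False)
)
```

Each of these would presumably need an equivalent that works across partitions; operations such as `groupby` and `sort_values` cannot be computed independently per shard, so they are likely the harder ones to support.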
The operations above are motivated by the following notebooks:

https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python
operations used:
1. isnull, sum, sort_values, standard_scaler, get_dummies
2. nice to have: describe (summary of dataframe), correlation, arg_sort

https://www.kaggle.com/code/isaienkov/riiid-answer-correctness-prediction-eda-modeling
operations used:
1. isnull, sum, groupby, agg, merge, fillna, not
2. nice to have: sklearn.feature_selection.rfe

https://www.kaggle.com/code/ammar111/youtube-trending-videos-analysis
operations used:
1. fillna, isna, value_counts, count, filter, groupby
2. nice to have: describe, most_common, corr, sort_values

https://www.kaggle.com/code/jiashenliu/introduction-to-financial-concepts-and-data
operations used:
1. filter, converting a pd series to an np array and using numpy operations to create a new column
For each example, please summarize which additional operations are needed.
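
As an illustration of the pattern from the financial-concepts notebook (Series to NumPy array, NumPy operation, new column, then filter), here is a rough plain-pandas sketch. The `close` and `return` columns and the simple-return formula are assumptions chosen for the example, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical price data.
df = pd.DataFrame({"close": [100.0, 110.0, 99.0, 108.9]})

# Pull the Series out as a NumPy array and process it with NumPy.
prices = df["close"].to_numpy()

# Simple return: (p_t - p_{t-1}) / p_{t-1}; the first row has no previous price.
returns = np.empty_like(prices)
returns[0] = np.nan
returns[1:] = np.diff(prices) / prices[:-1]

# Write the result back as a new column.
df["return"] = returns

# Filter rows with a boolean mask (NaN compares as False).
gains = df[df["return"] > 0]
```

Note that the return at row t depends on row t-1, so a per-shard version of this would need to handle the row at each partition boundary.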