dbn-based-nids
How to make the balanced dataset?
First of all, thank you so much for sharing your work; it has been very helpful. However, I still have a small problem and hope you can help. The balanced data file is not generated after running the code (cicids2017.py). How can I get the balanced data? Looking forward to your reply, thank you again!
Thank you for the author's work. I am reproducing this paper and encountered the same problem. May I ask how to solve it? Thank you!
Well, I have encountered the same problem, but I think it is easy to solve: you just need to resample the training set. Here is what I did. I rewrote the scale() function in ./preprocessing/cicids2017.py:
# Needs this import at the top of cicids2017.py:
from sklearn.utils import resample

def scale(self, training_set, validation_set, testing_set):
    """Scale the features, encode the labels, and resample the training set to a fixed size per class."""
    (X_train, y_train), (X_val, y_val), (X_test, y_test) = training_set, validation_set, testing_set

    categorical_features = self.features.select_dtypes(exclude=["number"]).columns
    numeric_features = self.features.select_dtypes(exclude=[object]).columns

    # Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
    preprocessor = ColumnTransformer(transformers=[
        ('categoricals', OneHotEncoder(drop='first', sparse=False, handle_unknown='error'), categorical_features),
        ('numericals', QuantileTransformer(), numeric_features)
    ])

    # Preprocess the features
    columns = numeric_features.tolist()
    X_train = pd.DataFrame(preprocessor.fit_transform(X_train), columns=columns)
    X_val = pd.DataFrame(preprocessor.transform(X_val), columns=columns)
    X_test = pd.DataFrame(preprocessor.transform(X_test), columns=columns)

    # Preprocess the labels
    le = LabelEncoder()
    y_train = pd.DataFrame(le.fit_transform(y_train), columns=["label"])
    y_val = pd.DataFrame(le.transform(y_val), columns=["label"])
    y_test = pd.DataFrame(le.transform(y_test), columns=["label"])

    # Resample the training data to address class imbalance
    train_data = pd.concat([X_train, y_train], axis=1)  # Combine features and labels
    resampled_data = []  # One resampled DataFrame per class
    min_samples = 20000  # Target number of training rows per class

    # Iterate over each class label
    for label_value in y_train["label"].unique():
        # Select the rows of the current class
        class_data = train_data[train_data["label"] == label_value]
        # resampled_class_data = resample(class_data, n_samples=20000, random_state=123, replace=True)
        if len(class_data) < min_samples:
            # Too few rows: oversample with replacement up to min_samples
            resampled_class_data = resample(class_data, n_samples=min_samples, random_state=123, replace=True)
        else:
            # Enough rows: undersample without replacement down to min_samples
            resampled_class_data = resample(class_data, n_samples=min_samples, random_state=123, replace=False)
        resampled_data.append(resampled_class_data)

    # Combine the resampled data for all classes and drop the duplicated index values
    resampled_data_cat = pd.concat(resampled_data).reset_index(drop=True)
    X_train_resampled = resampled_data_cat.drop("label", axis=1)
    # Keep the labels as a one-column DataFrame, like y_val and y_test
    y_train_resampled = resampled_data_cat[["label"]]

    return (X_train_resampled, y_train_resampled), (X_val, y_val), (X_test, y_test)
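If you want to check the resampling logic on its own before running the full preprocessing, here is a minimal sketch on toy data. The helper name `balance_classes`, the `feature` column, and the tiny target of 5 rows per class are my own assumptions for illustration and are not part of the repository; only `sklearn.utils.resample` and pandas are used.

```python
import pandas as pd
from sklearn.utils import resample


def balance_classes(frame, label_col="label", n_per_class=5, seed=123):
    """Resample every class in `frame` to exactly `n_per_class` rows (hypothetical helper)."""
    parts = []
    for _, class_rows in frame.groupby(label_col):
        # Sample with replacement only when the class has too few rows.
        parts.append(
            resample(
                class_rows,
                n_samples=n_per_class,
                replace=len(class_rows) < n_per_class,
                random_state=seed,
            )
        )
    # Drop the duplicated index values that oversampling produces.
    return pd.concat(parts).reset_index(drop=True)


# Toy, imbalanced data: 12 rows of class 0 and 3 rows of class 1.
toy = pd.DataFrame({
    "feature": range(15),
    "label": [0] * 12 + [1] * 3,
})

balanced = balance_classes(toy)
counts = balanced["label"].value_counts().to_dict()
print(counts)
```

After this, every class should have the same number of rows. Note that the minority class is filled with repeated rows, so it is safest to apply this only to the training split, as in the scale() function above.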