Avazu_x4 unexpectedly requires an extremely large amount of GPU memory.
I downloaded the preprocessed Avazu_x4 dataset directly, discarded the id feature, and treated all other features as string categorical features. Strangely, loading it takes a huge amount of GPU memory (about 31G!), and I cannot use this dataset even on an NVIDIA V100 32G because of OOM. Is this normal? Is there any way to fix it?
We run the Avazu_x4 benchmark experiments on a 16G GPU. Can you provide more details? Which experiment steps did you follow?
I use the DeepFM implementation in FuxiCTR (2.0+). The dataset was downloaded from the link in this section, with no extra preprocessing (neither the x4_001 nor the x4_002 preprocessing below). My settings are as follows.
dataset_config.yaml:
Avazu:
data_format: csv
data_root: ../../../data/
feature_cols:
[
{active: False, dtype: str, name: ["id"], type: categorical},
{
active: True,
dtype: str,
name:
[
"C1",
"hour",
"banner_pos",
"site_id",
"site_domain",
"site_category",
"app_id",
"app_domain",
"app_category",
"device_id",
"device_ip",
"device_model",
"device_type",
"device_conn_type",
"C14",
"C15",
"C16",
"C17",
"C18",
"C19",
"C20",
"C21",
],
type: categorical,
},
]
label_col: { dtype: int, name: "click" }
min_categr_count: 1
test_data: ../../../data/Avazu/test.csv
train_data: ../../../data/Avazu/train.csv
valid_data: ../../../data/Avazu/valid.csv
model_config.yaml:
DeepFM_Avazu:
batch_norm: True
batch_size: 4096
dataset_id: Avazu
early_stop_patience: 2
embedding_dim: 32
embedding_regularizer: 0.01
epochs: 100
hidden_activations: relu
hidden_units: [400, 400, 400]
learning_rate: 1.e-3
loss: "binary_crossentropy"
metrics: ["logloss", "AUC"]
model: DeepFM
monitor: AUC
monitor_mode: max
net_dropout: 0.3
net_regularizer: 0
optimizer: adam
seed: 2023
shuffle: True
task: binary_classification
verbose: 1
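For reference, the experiment is launched through FuxiCTR's standard entry point; the expid below matches the model_config.yaml key above, and the config path assumes the two YAML files sit under ./config (adjust to your layout):

python run_expid.py --config ./config --expid DeepFM_Avazu --gpu 0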
Another piece of information that may be useful: GPU memory usage is not significantly influenced by batch_size. I tried training with batch_size = 2, but OOM errors still occurred.
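This is consistent with the footprint being dominated by the model parameters, i.e. the embedding table plus Adam's two moment buffers, none of which scales with batch size. A back-of-the-envelope sketch (the vocabulary size below is a placeholder assumption, not the real Avazu_x4 figure; note that with min_categr_count: 1 every unique value, e.g. each device_ip, gets its own embedding row):

# Rough parameter-memory estimate for an embedding-dominated CTR model.
# vocab_size is a PLACEHOLDER assumption, not the actual Avazu_x4 vocabulary.
vocab_size = 10_000_000
embedding_dim = 32
bytes_per_float = 4                 # fp32

table = vocab_size * embedding_dim * bytes_per_float
adam_states = 2 * table             # Adam keeps exp_avg and exp_avg_sq per parameter
print(f"embedding table:  {table / 2**30:.1f} GiB")
print(f"with Adam states: {(table + adam_states) / 2**30:.1f} GiB")
# ~1.2 GiB for the table, ~3.6 GiB in total -- none of it depends on batch_size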
I figured out a way to avoid this problem. After some experimenting, I found that the OOM is caused by setting dtype to str for every feature, so I changed dataset_config.yaml to use int for the numeric columns:
Avazu:
data_format: csv
data_root: ../../data/
feature_cols:
[
{
active: False,
dtype: str,
name: ["id"],
type: categorical,
},
{
active: True,
dtype: str,
name:
[
"site_id",
"site_domain",
"site_category",
"app_id",
"app_domain",
"app_category",
"device_id",
"device_ip",
"device_model",
],
type: categorical,
},
{
active: True,
dtype: int,
name: [
"hour",
"C1",
"banner_pos",
"device_type",
"device_conn_type",
"C14",
"C15",
"C16",
"C17",
"C18",
"C19",
"C20",
"C21",
],
type: categorical,
},
]
label_col: { dtype: int, name: "click" }
min_categr_count: 1
test_data: ../../data/Avazu/test.csv
train_data: ../../data/Avazu/train.csv
valid_data: ../../data/Avazu/valid.csv
But under this setting, FuxiCTR seems to have a bug; it reports the following error:
Traceback (most recent call last):
File "xxx/FuxiCTR/model_zoo/MY_MODEL/run_expid.py", line 65, in <module>
params["train_data"], params["valid_data"], params["test_data"] = build_dataset(
File "xxx/FuxiCTR/fuxictr/preprocess/build_dataset.py", line 104, in build_dataset
feature_encoder.fit(train_ddf, **kwargs)
File "xxx/FuxiCTR/fuxictr/preprocess/feature_processor.py", line 139, in fit
self.save_vocab(self.vocab_file)
File "xxx/FuxiCTR/fuxictr/preprocess/feature_processor.py", line 334, in save_vocab
fd.write(json.dumps(vocab, indent=4))
File "xxx/anaconda3/envs/py39/lib/python3.9/json/__init__.py", line 234, in dumps
return cls(
File "xxx/anaconda3/envs/py39/lib/python3.9/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "xxx/anaconda3/envs/py39/lib/python3.9/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "xxx/anaconda3/envs/py39/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "xxx/anaconda3/envs/py39/lib/python3.9/json/encoder.py", line 376, in _iterencode_dict
raise TypeError(f'keys must be str, int, float, bool or None, '
TypeError: keys must be str, int, float, bool or None, not int64
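The root cause is easy to reproduce in isolation, independent of FuxiCTR:

import json
import numpy as np

# Minimal reproduction: json.dumps rejects numpy integer dict keys,
# which is what the tokenizer vocab contains for int-dtype features.
json.dumps({np.int64(1): 0})
# TypeError: keys must be str, int, float, bool or None, not int64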
It looks like NumPy's integer types are not converted to Python's built-in int before the vocabulary is serialized to JSON. I fixed the problem by modifying the source code:
# in file FuxiCTR/fuxictr/preprocess/feature_processor.py
def save_vocab(self, vocab_file):
    logging.info("Save feature_vocab to json: " + vocab_file)
    vocab = dict()
    for feature, spec in self.feature_map.features.items():
        if spec["type"] in ["categorical", "sequence"]:
            vocab[feature] = OrderedDict(
                sorted(self.processor_dict[feature + "::tokenizer"].vocab.items(),
                       key=lambda x: x[1]))
    # json.dumps only accepts str/int/float/bool/None as keys, so cast the
    # numpy integer keys produced by int-dtype features to built-in int
    vocab = {
        feature: OrderedDict((int(k) if isinstance(k, np.integer) else k, v)
                             for k, v in sub_dict.items())
        for feature, sub_dict in vocab.items()
    }
    with open(vocab_file, "w") as fd:
        fd.write(json.dumps(vocab, indent=4))
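Note that the keys must be rebuilt into a new dict rather than converted in place: int(k) and np.int64(k) compare equal and hash identically, so an in-place sub_dict[int(k)] = sub_dict[k] followed by del sub_dict[k] would simply delete the entry. json.dumps cannot be taught to handle this either, because its default= hook is only applied to values, never to dictionary keys.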
Please consider fixing this officially in the next update.
Now, GPU memory usage is satisfactory:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:32:00.0 Off | 0 |
| N/A 46C P0 68W / 300W | 5188MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 55632 C python 5159MiB |
+-----------------------------------------------------------------------------+
Closing, as this is now fixed.