autogluon
Segmentation Fault when Importing AutoGluon
We have noticed that importing AutoGluon modules in a specific order results in a seg fault. For example,
import autogluon.text # same applies to autogluon.multimodal
from autogluon.vision import ImagePredictor
and
import autogluon.text # same applies to autogluon.multimodal
import autogluon.timeseries
will result in
free(): invalid size
Aborted (core dumped)
This is caused by an underlying issue in mxnet when it is used alongside pytorch. There is not much the AutoGluon team can do about it, but we are actively working toward removing the mxnet dependency.
To work around it, simply reverse the order of the imports. For example,
from autogluon.vision import ImagePredictor
import autogluon.text
and
import autogluon.timeseries
import autogluon.text
Another option is to run import mxnet before any other imports.
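The second workaround can be sketched as a small guard placed at the very top of a script. This is only an illustration: the helper name preload_first is hypothetical and not part of AutoGluon, and it assumes that loading mxnet before any torch-based module is enough to avoid the crash, as described above.

```python
import importlib


def preload_first(module_name: str) -> bool:
    """Try to import module_name now; return True on success, False if it is unavailable."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Load mxnet before anything that pulls in torch (e.g. autogluon.text).
# If mxnet is not installed, nothing is loaded and later imports proceed as usual.
mxnet_loaded = preload_first("mxnet")
```

The AutoGluon imports would then follow this guard in whatever order the script needs.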
Will we just remove the dependency on MXNet in 0.6?
I was able to reproduce this issue before but cannot anymore, with torch==1.12.1 and mxnet==1.9.0. Could you perhaps check again @yinweisu?
@sxjscience for timeseries we hope to make it optional and disabled by default.
Created a new env and wasn't able to reproduce it either.
(temp) ubuntu@ip-172-31-11-12:~/yinweisu/autogluon$ pip3 freeze | grep mxnet
mxnet-cu110==1.9.1
(temp) ubuntu@ip-172-31-11-12:~/yinweisu/autogluon$ pip3 freeze | grep torch
pytorch-lightning==1.7.7
pytorch-metric-learning==1.3.2
torch==1.12.1
torchmetrics==0.8.2
torchtext==0.13.1
torchvision==0.13.1
(temp) ubuntu@ip-172-31-11-12:~/yinweisu/autogluon$ python3
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import autogluon.text
>>> from autogluon.vision import ImagePredictor
/home/ubuntu/anaconda3/envs/temp/lib/python3.8/site-packages/gluoncv/__init__.py:40: UserWarning: Both `mxnet==1.9.1` and `torch==1.12.1+cu102` are installed. You might encounter increased GPU memory footprint if both framework are used at the same time.
warnings.warn(f'Both `mxnet=={mx.__version__}` and `torch=={torch.__version__}` are installed. '
>>>
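When reporting whether the crash reproduces, it may help to collect the relevant versions in one step rather than grepping pip freeze twice. A minimal sketch using the standard library follows; the helper names are hypothetical, and the package list is only an assumption based on the packages shown in the session above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def framework_versions(packages=("mxnet", "mxnet-cu110", "torch", "torchvision")):
    """Map each package name to its installed version (None when absent)."""
    return {name: installed_version(name) for name in packages}


# Print the versions so they can be pasted into a bug report.
print(framework_versions())
```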
However, according to @gradientsky, there are still reports of this issue. Let's keep this thread open as a reference for anyone who encounters it, and close it once we completely remove mxnet in 0.7. @canerturkmen
Still reproducible:
download_dir = './ag_petfinder_tutorial'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_kaggle.zip'
from autogluon.core.utils.loaders import load_zip
load_zip.unzip(zip_file, unzip_dir=download_dir)
import os
dataset_path = download_dir + '/petfinder_processed'
import pandas as pd
train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
test_data = pd.read_csv(f'{dataset_path}/dev.csv', index_col=0)
label = 'AdoptionSpeed'
image_col = 'Images'
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])
def path_expander(path, base_folder):
    path_l = path.split(';')
    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])
train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
train_data = train_data.sample(500, random_state=0)
from autogluon.tabular import FeatureMetadata
feature_metadata = FeatureMetadata.from_df(train_data)
feature_metadata = feature_metadata.add_special_types({image_col: ['image_path']})
print(feature_metadata)
from autogluon.tabular.configs.hyperparameter_configs import get_hyperparameter_config
hyperparameters = get_hyperparameter_config('multimodal')
hyperparameters['AG_IMAGE_NN'] = {'model': 'resnet18_v1b'}
hyperparameters['AG_TEXT_NN'] = ['lower_quality_fast_train']
hyperparameters
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label=label).fit(
    train_data=train_data,
    hyperparameters=hyperparameters,
    feature_metadata=feature_metadata,
    time_limit=900,
)
leaderboard = predictor.leaderboard(test_data)
Fails at the start of ImagePredictor fitting:
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5.10it/s]
Configuration saved in AutogluonModels/ag-20221115_030417/models/TextPredictor/text_nn/hf_text/config.json
tokenizer config file saved in AutogluonModels/ag-20221115_030417/models/TextPredictor/text_nn/hf_text/tokenizer_config.json
Special tokens file saved in AutogluonModels/ag-20221115_030417/models/TextPredictor/text_nn/hf_text/special_tokens_map.json
0.36 = Validation score (accuracy)
446.17s = Training runtime
2.96s = Validation runtime
Fitting model: ImagePredictor ... Training model for up to 437.96s of the 437.96s of remaining time.
free(): invalid size
Aborted (core dumped)
Workaround: run import mxnet before running the script above.
This is probably because ImagePredictor hasn't yet switched its default backend away from MXNet.
Because we have removed the vision and text modules in v0.7, this should be less common even if the seg fault still exists.
The only situation where it could still happen is importing timeseries after tabular or multimodal, which is unlikely to be a common user scenario.
The issue is therefore only relevant to timeseries, since it is the only module still using MXNet. Resolving this issue. If this error still occurs in a v0.7+ release, please create a new issue describing the scenario in which the error happens.
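For anyone who wants to detect the risky import order before it bites, a rough heuristic is to inspect sys.modules right before importing an MXNet-based module. This is a sketch under the assumption, taken from the reports above, that the crash occurs when torch has been loaded before mxnet; the function name is hypothetical.

```python
import sys


def risky_import_order(loaded=None) -> bool:
    """Return True if torch is already loaded but mxnet is not.

    Heuristic only: it mirrors the order reported in this thread,
    not a guaranteed predictor of the crash.
    """
    loaded = sys.modules if loaded is None else loaded
    return "torch" in loaded and "mxnet" not in loaded


if risky_import_order():
    print("torch was imported before mxnet; consider importing mxnet first.")
```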