Segmentation Fault when Importing AutoGluon

Open yinweisu opened this issue 2 years ago • 3 comments

We have noticed that importing AutoGluon modules in certain orders results in a seg fault. For example,

import autogluon.text  # same applies to autogluon.multimodal
from autogluon.vision import ImagePredictor

and

import autogluon.text  # same applies to autogluon.multimodal
import autogluon.timeseries

will result in

free(): invalid size
Aborted (core dumped)

This is caused by an underlying issue in mxnet when it is used alongside pytorch. There is not much the AutoGluon team can do about this, but we are actively working toward removing mxnet.

To work around it, simply reverse the import order. For example,

from autogluon.vision import ImagePredictor
import autogluon.text

and

import autogluon.timeseries
import autogluon.text

yinweisu avatar Aug 16 '22 18:08 yinweisu
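
Because the crash described above aborts the whole interpreter, one way to test an import order without losing the current session is to run it in a child process. The following is a minimal sketch using only the standard library; the helper names `run_imports_in_subprocess` and `describe_exit` are hypothetical and not part of AutoGluon.

```python
import subprocess
import sys


def run_imports_in_subprocess(statements, timeout=600):
    """Run the given statements in a fresh interpreter and return its exit code.

    A crash such as `free(): invalid size` kills the process that triggers it,
    so the check runs in a child process to keep the caller alive.
    """
    code = "\n".join(statements)
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode


def describe_exit(returncode):
    """Translate an exit code into a short human-readable verdict."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by a signal
        # (for example 6 is SIGABRT, which matches the "Aborted" message).
        return f"killed by signal {-returncode}"
    return f"exited with error code {returncode}"


# Example with one of the import orders from this issue (needs AutoGluon installed):
# code = run_imports_in_subprocess(["import autogluon.text", "import autogluon.timeseries"])
# print(describe_exit(code))
```

Comparing the result for both orders of the same two imports gives a quick indication of whether a given environment is affected.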

Another option is to import mxnet before any other imports.

gradientsky avatar Aug 16 '22 19:08 gradientsky
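
The import-mxnet-first suggestion can be sketched as a guarded preload, so the same script still runs in environments where mxnet is not installed. This is only an illustration; `preload_if_installed` is a hypothetical helper, not an AutoGluon or MXNet API.

```python
import importlib
import importlib.util


def preload_if_installed(module_name="mxnet"):
    """Import `module_name` early if it is installed.

    Returns True if the module was imported, False if it is not installed.
    """
    if importlib.util.find_spec(module_name) is None:
        return False
    importlib.import_module(module_name)
    return True


# Intended use: call this at the very top of the script, before any
# autogluon import, so that mxnet is loaded before pytorch.
# preload_if_installed("mxnet")
# import autogluon.text
# import autogluon.timeseries
```

The guard only checks whether the module can be found; it does not confirm that the preload actually avoids the crash in a given environment.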

Will we just remove the dependency on MXNet in 0.6?

sxjscience avatar Aug 23 '22 01:08 sxjscience

I was able to reproduce this issue before but cannot anymore, with torch==1.12.1 and mxnet==1.9.0. Could you perhaps check again @yinweisu ?

@sxjscience for timeseries we hope to make it optional and disabled by default.

canerturkmen avatar Aug 25 '22 10:08 canerturkmen

Created a new env and wasn't able to reproduce it either.

(temp) ubuntu@ip-172-31-11-12:~/yinweisu/autogluon$ pip3 freeze | grep mxnet
mxnet-cu110==1.9.1
(temp) ubuntu@ip-172-31-11-12:~/yinweisu/autogluon$ pip3 freeze | grep torch
pytorch-lightning==1.7.7
pytorch-metric-learning==1.3.2
torch==1.12.1
torchmetrics==0.8.2
torchtext==0.13.1
torchvision==0.13.1
(temp) ubuntu@ip-172-31-11-12:~/yinweisu/autogluon$ python3
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import autogluon.text
>>> from autogluon.vision import ImagePredictor
/home/ubuntu/anaconda3/envs/temp/lib/python3.8/site-packages/gluoncv/__init__.py:40: UserWarning: Both `mxnet==1.9.1` and `torch==1.12.1+cu102` are installed. You might encounter increased GPU memory footprint if both framework are used at the same time.
  warnings.warn(f'Both `mxnet=={mx.__version__}` and `torch=={torch.__version__}` are installed. '
>>> 

However, according to @gradientsky, there are still reports of this issue. Let's keep this thread open as a reference for anyone who encounters it, and close it once we completely remove mxnet in 0.7. @canerturkmen

yinweisu avatar Nov 08 '22 19:11 yinweisu
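
Since reproducibility here appears to depend on the installed mxnet and torch versions, it can help to collect them programmatically when reporting. The sketch below uses the standard library's `importlib.metadata` as an alternative to `pip3 freeze | grep`; the function name and the default package list are assumptions. Note that it looks up exact distribution names, so a CUDA build such as `mxnet-cu110` has to be listed explicitly.

```python
from importlib import metadata


def installed_versions(packages=("mxnet", "mxnet-cu110", "torch", "autogluon")):
    """Return a dict mapping each distribution name to its version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example: print one line per package, similar to `pip3 freeze` output.
# for name, version in installed_versions().items():
#     print(f"{name}=={version}" if version else f"{name}: not installed")
```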

Still reproducible:

download_dir = './ag_petfinder_tutorial'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_kaggle.zip'

from autogluon.core.utils.loaders import load_zip
load_zip.unzip(zip_file, unzip_dir=download_dir)

import os
dataset_path = download_dir + '/petfinder_processed'

import pandas as pd

train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
test_data = pd.read_csv(f'{dataset_path}/dev.csv', index_col=0)

label = 'AdoptionSpeed'
image_col = 'Images'

train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])

def path_expander(path, base_folder):
    path_l = path.split(';')
    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])

train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))

train_data = train_data.sample(500, random_state=0)

from autogluon.tabular import FeatureMetadata
feature_metadata = FeatureMetadata.from_df(train_data)
feature_metadata = feature_metadata.add_special_types({image_col: ['image_path']})
print(feature_metadata)


from autogluon.tabular.configs.hyperparameter_configs import get_hyperparameter_config
hyperparameters = get_hyperparameter_config('multimodal')
hyperparameters['AG_IMAGE_NN'] = {'model': 'resnet18_v1b'}
hyperparameters['AG_TEXT_NN'] = ['lower_quality_fast_train']
hyperparameters

from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label=label).fit(
    train_data=train_data,
    hyperparameters=hyperparameters,
    feature_metadata=feature_metadata,
    time_limit=900,
)

leaderboard = predictor.leaderboard(test_data)

Fails on ImagePredictor start:

Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.10it/s]
Configuration saved in AutogluonModels/ag-20221115_030417/models/TextPredictor/text_nn/hf_text/config.json
tokenizer config file saved in AutogluonModels/ag-20221115_030417/models/TextPredictor/text_nn/hf_text/tokenizer_config.json
Special tokens file saved in AutogluonModels/ag-20221115_030417/models/TextPredictor/text_nn/hf_text/special_tokens_map.json
	0.36	 = Validation score   (accuracy)
	446.17s	 = Training   runtime
	2.96s	 = Validation runtime
Fitting model: ImagePredictor ... Training model for up to 437.96s of the 437.96s of remaining time.
free(): invalid size
Aborted (core dumped)

Workaround

Import mxnet before running the script above.

gradientsky avatar Nov 15 '22 03:11 gradientsky

This is probably because ImagePredictor hasn't yet switched its default backend away from MXNet.

sxjscience avatar Nov 15 '22 03:11 sxjscience

Because we have removed vision and text modules in v0.7, this should be less common even if the seg fault still exists.

The only situation where it could happen is by importing timeseries after tabular or multimodal, which is unlikely to be a common user scenario.

Because of this, the issue is only relevant to timeseries, since it is the only module still using MXNet. Resolving this issue. If this error still occurs in a v0.7+ release, please create a new issue describing the scenario in which the error happens.

Innixma avatar Feb 03 '23 22:02 Innixma