ffcv
Error in */ffcv/pipeline/graph.py: "AttributeError: 'str' object has no attribute 'type'"
I am trying to run code from the following repo, which uses ffcv: https://github.com/MadryLab/datamodels. I have tried to run the example in this repo both on my university's own Slurm cluster and on Google Colab, and I keep ending up with the same error:
AttributeError: 'str' object has no attribute 'type'
I was able to get rid of this error by editing one of the ffcv package source files directly (/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py). Specifically, I changed line 333 from
if next_state.device.type != 'cuda'
to
if next_state.device != 'cuda:0'
With this edit the code runs on Google Colab, but on my lab's Slurm cluster it then predictably fails with a different error: RuntimeError: No HIP GPUs are available
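A minimal sketch of the distinction between the two checks, assuming the original line expects a device object with a .type attribute (as torch.device has) while the value actually reaching it is a plain string such as 'cuda:0'. The helper name device_kind and the SimpleNamespace stand-in are hypothetical and only serve to make the sketch runnable without PyTorch; this is not ffcv's code.

```python
from types import SimpleNamespace


def device_kind(device):
    """Return the device kind ('cuda', 'cpu', ...) for a string or a device-like object."""
    if isinstance(device, str):
        # A plain string such as 'cuda:0' has no .type; take the part before ':'.
        return device.split(":", 1)[0]
    # Device objects (for example torch.device) expose the kind as .type.
    return device.type


# Hypothetical stand-in for a device object, so the sketch runs without PyTorch.
device_obj = SimpleNamespace(type="cuda", index=0)

print(device_kind("cuda:0"))
print(device_kind(device_obj))
print(device_kind("cpu"))

# Accessing .type on a string reproduces the reported error message.
try:
    "cuda:0".type
except AttributeError as err:
    print(err)  # 'str' object has no attribute 'type'
```

If this assumption holds, the string comparison in the edited line only works when the device is spelled exactly 'cuda:0', so it is a workaround rather than an equivalent check.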
Is anyone else experiencing this error with ffcv when trying to run training code?
Steps to reproduce (the file system shown here is what I used for colab, but you can replace the paths with whatever download directory you use on whatever system you have):
# install dependencies
git clone https://github.com/MadryLab/datamodels.git
cd datamodels
pip install fastargs
pip install terminaltables
wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
tar xjf parallel-latest.tar.bz2
cd /content/datamodels/parallel-20240622
./configure && make
make install
parallel --version
cd /content/datamodels
apt update && apt install -y --no-install-recommends libopencv-dev libturbojpeg-dev
cp -f /usr/lib/x86_64-linux-gnu/pkgconfig/opencv.pc /usr/lib/x86_64-linux-gnu/pkgconfig/opencv4.pc
pip install mosaicml ffcv numba opencv-python
pip install cupy-cuda12x

# download dataset
from typing import List

import torch
import torch as ch
import torchvision
from ffcv.fields import IntField, RGBImageField
from ffcv.fields.decoders import IntDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.pipeline.operation import Operation
from ffcv.transforms import (RandomHorizontalFlip, Cutout,
    RandomTranslate, Convert, ToDevice, ToTensor, ToTorchImage)
from ffcv.transforms.common import Squeeze
from ffcv.writer import DatasetWriter
datasets = {
    'train': torchvision.datasets.CIFAR10('/content', train=True, download=True),
    'test': torchvision.datasets.CIFAR10('/content', train=False, download=True)
}

for (name, ds) in datasets.items():
    writer = DatasetWriter(f'/content/cifar_{name}.beton', {
        'image': RGBImageField(),
        'label': IntField()
    })
    writer.from_indexed_dataset(ds)

# run the example (shell)
bash examples/cifar10/example.sh
I have tried many different conda environments, including Python 3.8 (as the repo suggests) and 3.9, CUDA 12.1 and 12.2, and ROCm 6 and 5.4. All of them give me one of the two errors above.
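Because the second error mentions HIP, it may help to check which PyTorch build each environment actually loads. The following is a rough diagnostic sketch, assuming that a HIP message points to a ROCm build of PyTorch; the function name describe_torch_build is hypothetical, and the script falls back gracefully if PyTorch is not installed.

```python
def describe_torch_build():
    """Collect basic facts about the installed PyTorch build, if any."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # Set to a version string on CUDA builds, otherwise None.
        "cuda_build": getattr(torch.version, "cuda", None),
        # Set to a version string on ROCm (HIP) builds, otherwise None.
        "hip_build": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


info = describe_torch_build()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running this inside each conda environment and on the compute node itself (not only the login node) shows whether the build and the visible hardware match.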
Any idea how I can get around this? Full stack trace:
+--------------------------+-------------------------------------+
| Parameter                | Value                               |
+--------------------------+-------------------------------------+
| worker.index             | 1                                   |
| worker.main_import       | examples.cifar10.train_cifar        |
| worker.logdir            | /tmp/10921                          |
| worker.do_if_complete    | False                               |
| worker.job_timeout       | 99999999                            |
| training.lr              | 0.5                                 |
| training.epochs          | 24                                  |
| training.lr_peak_epoch   | 5                                   |
| training.batch_size      | 512                                 |
| training.momentum        | 0.9                                 |
| training.weight_decay    | 0.0005                              |
| training.label_smoothing | 0.1                                 |
| training.num_workers     | 1                                   |
| training.lr_tta          | True                                |
| data.train_dataset       | /content/cifar_train.beton          |
| data.val_dataset         | /content/cifar-ffcv/cifar_val.beton |
+--------------------------+-------------------------------------+
logging in /tmp/10921
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/content/datamodels/datamodels/training/worker.py", line 109, in
Can anyone help me understand why ffcv is throwing this error? Why are the semantics for accessing the device different here (i.e., why is there no type property on the device on the systems I'm using, as the ffcv library expects)? And what is the correct way to handle this?
-Paul