Polygonization-by-Frame-Field-Learning

Process hangs when evaluating the trained model on the Inria dataset

Open XiaoyuSun-hub opened this issue 3 years ago • 4 comments

Hi, I installed the environment on Ubuntu 18.04. I first ran the training command:

    python main.py --config configs/config.inria_dataset_osm_aligned.unet_resnet101_pretrained

After training finished, I ran:

    python main.py --config configs/config.inria_dataset_osm_aligned.unet_resnet101_pretrained --mode eval

The program then hangs with the following output:

    INFO: Loading defaults from configs/config.defaults.inria_dataset_osm_aligned.json
    INFO: Loading defaults from configs/config.defaults.json
    INFO: Loading defaults from configs/loss_params.json
    INFO: Loading defaults from configs/optim_params.json
    INFO: Loading defaults from configs/polygonize_params.json
    INFO: Loading defaults from configs/dataset_params.inria_dataset_osm_aligned.json
    INFO: Loading defaults from configs/eval_params.inria_dataset.json
    INFO: Loading defaults from configs/eval_params.defaults.json
    INFO: Loading defaults from configs/backbone_params.unet_resnet101.json
    GPU 0 -> Using data from /gimastorage/Xiaoyu/data/AerialImageDataset
    INFO: annotations will be loaded from disk
    # --- Start evaluating ---#
    Saving eval outputs to /gimastorage/Xiaoyu/data/AerialImageDataset/eval_runs/inria_dataset_osm_aligned.unet_resnet101_pretrained | 2020-12-05 09:55:09
    Loading best val checkpoint: /home/sunx/Polygonization-by-Frame-Field-Learning/frame_field_learning/runs/inria_dataset_osm_aligned.unet_resnet101_pretrained | 2020-12-05 09:55:09/checkpoints/checkpoint.best_val.epoch_000001.tar
    Eval test:   0%|          | 0/34 [00:00<?, ?it/s]Traceback (most recent call last):

It stays stuck there. If I stop the process, it gives the following errors (output from the main and spawned processes is interleaved):

    Process SpawnProcess-2:
    Traceback (most recent call last):
      File "/home/sunx/Polygonization-by-Frame-Field-Learning/main.py", line 387, in <module>
    Traceback (most recent call last):
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/home/sunx/Polygonization-by-Frame-Field-Learning/child_processes.py", line 75, in eval_process
        evaluate(gpu, config, shared_dict, barrier, eval_ds, backbone)
      File "/home/sunx/Polygonization-by-Frame-Field-Learning/frame_field_learning/evaluate.py", line 62, in evaluate
        evaluator.evaluate(split_name, eval_ds)
      File "/home/sunx/Polygonization-by-Frame-Field-Learning/frame_field_learning/evaluator.py", line 85, in evaluate
        inference.inference_with_patching(self.config, self.model, tile_data)
      File "/home/sunx/Polygonization-by-Frame-Field-Learning/frame_field_learning/inference.py", line 79, in inference_with_patching
        assert len(tile_data["image"].shape) == 4 and tile_data["image"].shape[0] == 1,
    AssertionError: When using inference with patching, tile_data should have a batch size of 1, with image's shape being (1, C, H, W), not torch.Size([6, 3, 725, 725])

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 26, in _wrap
        sys.exit(1)
    SystemExit: 1

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
        util._exit_function()
        main()
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/util.py", line 334, in _exit_function
        p.join()
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 149, in join
        res = self._popen.wait(timeout)
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
        return self.poll(os.WNOHANG if timeout == 0.0 else 0)
      File "/home/sunx/Polygonization-by-Frame-Field-Learning/main.py", line 381, in main
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
        pid, sts = os.waitpid(self.pid, flag)
    KeyboardInterrupt
    Traceback (most recent call last):
        launch_eval(args)
      File "/home/sunx/Polygonization-by-Frame-Field-Learning/main.py", line 321, in launch_eval
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/home/sunx/Polygonization-by-Frame-Field-Learning/lydorn_utils/lydorn_utils/async_utils.py", line 8, in async_func_wrapper
        if not out_queue.empty():
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/queues.py", line 123, in empty
        return not self._poll()
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/connection.py", line 257, in poll
        return self._poll(timeout)
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
        r = wait([self], timeout)
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/connection.py", line 924, in wait
        selector.register(obj, selectors.EVENT_READ)
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/selectors.py", line 352, in register
        key = super().register(fileobj, events, data)
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/selectors.py", line 244, in register
        self._fd_to_key[key.fd] = key
    KeyboardInterrupt
        torch.multiprocessing.spawn(eval_process, nprocs=args.gpus, args=(config, shared_dict, barrier))
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
        while not spawn_context.join():
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 75, in join
        ready = multiprocessing.connection.wait(
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/connection.py", line 930, in wait
        ready = selector.select(timeout)
      File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/selectors.py", line 415, in select
        fd_event_list = self._selector.poll(timeout)
    KeyboardInterrupt
    Eval test:   0%|          | 0/34 [13:02<?, ?it/s]

    Process finished with exit code 130

I looked at the relevant code in the inference file:

    def inference_with_patching(config, model, tile_data):
        assert len(tile_data["image"].shape) == 4 and tile_data["image"].shape[0] == 1, \
            f"When using inference with patching, tile_data should have a batch size of 1, " \
            f"with image's shape being (1, C, H, W), not {tile_data['image'].shape}"

Here the assert requires the data to have a batch size of 1 with shape (1, C, H, W), which does not match the batches of patches (torch.Size([6, 3, 725, 725])) that the evaluation loader produces.
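As a minimal standalone sketch (not the repository's code), the check boils down to this:

    import torch

    # Minimal sketch, assuming patched inference expects one tile per call,
    # i.e. tile_data["image"] of shape (1, C, H, W). check_tile_batch is a
    # hypothetical helper used only for this illustration.
    def check_tile_batch(image: torch.Tensor) -> None:
        assert image.dim() == 4 and image.shape[0] == 1, (
            f"Expected shape (1, C, H, W), got {tuple(image.shape)}"
        )

    check_tile_batch(torch.zeros(1, 3, 725, 725))  # passes: a single tile

    try:
        check_tile_batch(torch.zeros(6, 3, 725, 725))  # a batch of 6 tiles...
    except AssertionError as e:
        print(e)  # ...fails exactly like the eval run above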

I ran the eval command twice; the output above is from the second run, so there is no log about the patching process. The first time, it patches the test data before evaluating.

Another thing I did was reduce the dataset size by changing the code inside inria_aerial.py:

    CITY_METADATA_DICT = {
        "bellingham": {
            "fold": "test",
            "pixelsize": 0.3,
            "numbers": list([2, 3]),
            "mean": [0.3766195, 0.391402, 0.32659722],
            "std": [0.18134978, 0.16412577, 0.16369793],
        },
        "austin": {
            "fold": "train",
            "pixelsize": 0.3,
            "numbers": list(range(1, 2)),
            "mean": [0.39584444, 0.40599795, 0.38298687],
            "std": [0.17341954, 0.16856597, 0.16360443],
        },
    }

XiaoyuSun-hub (Dec 09 '20)

I have run into the same problem. Have you solved it?

Dingyuan-Chen (Feb 26 '22)

I also encountered this problem. Have you solved it?

Aria918 (Mar 18 '22)

I was able to overcome this issue by specifying the eval batch size when running main.py, like so:

    python main.py --config config_name --mode eval --eval_batch_size 1

But I immediately ran into another problem in inference.py, which says:

    RuntimeError: The size of tensor a (1024) must match the size of tensor b (299) at non-singleton dimension 3

I tried changing the patch_size in the config file to 299, but that led to another error. I would be glad if someone could shed some light on this if they have come across this issue.
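For context, this RuntimeError is PyTorch reporting that an elementwise operation received tensors whose last dimension disagrees (for example a 1024-wide patch grid combined with a 299-wide buffer). A minimal sketch with made-up shapes that reproduces the same class of error:

    import torch

    # Illustration only: the shapes below are invented and are not the
    # repository's tensors; they just trigger the same broadcasting failure.
    a = torch.zeros(1, 2, 1024, 1024)  # e.g. a stitched 1024-wide prediction
    b = torch.zeros(1, 2, 1024, 299)   # e.g. a 299-wide tile buffer

    try:
        _ = a + b
    except RuntimeError as e:
        print(e)
        # The size of tensor a (1024) must match the size of tensor b (299)
        # at non-singleton dimension 3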

Thank you.

kriti115 (Apr 22 '22)

I found a solution. Set "num_workers": 1 in 'config.defaults.json'.
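If it helps, a small sketch of applying that change programmatically (assuming the file lives at configs/config.defaults.json, as the eval log above suggests; editing the JSON by hand works just as well):

    import json
    from pathlib import Path

    # Hedged sketch: limit the data loaders to a single worker process by
    # rewriting the defaults config in place. The path is an assumption
    # based on the "Loading defaults" lines in the eval log above.
    config_path = Path("configs/config.defaults.json")
    config = json.loads(config_path.read_text())
    config["num_workers"] = 1
    config_path.write_text(json.dumps(config, indent=4))
    print("num_workers set to", config["num_workers"])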

Shizw695 (Nov 02 '23)