
GPU configuration

Open wangfeiyu-zerobug opened this issue 1 year ago • 14 comments

How do I configure the device for multiple GPUs on a single machine? I can't seem to find any documentation about it.

wangfeiyu-zerobug · Sep 23 '22 07:09

@wangfeiyu-zerobug

Please refer to the following configuration: https://github.com/huawei-noah/vega/blob/master/docs/cn/user/config_reference.md#2-%E5%85%AC%E5%85%B1%E9%85%8D%E7%BD%AE%E9%A1%B9

After setting the following configuration item to true, all GPUs on the local machine are used for the model search:

parallel_search: True

If you don't want to use all of the GPUs, then in addition to setting parallel_search: True you also need to set the environment variable CUDA_VISIBLE_DEVICES, e.g. export CUDA_VISIBLE_DEVICES=0,1,2 to use three GPUs.
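A minimal sketch of the same setup driven from Python (the config file name below is hypothetical, and the YAML is assumed to contain `general: parallel_search: True` as described in the config reference):

```python
# Sketch only: restrict the parallel search to three GPUs by setting
# CUDA_VISIBLE_DEVICES before Vega (or PyTorch) initializes any devices.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"   # expose GPUs 0, 1 and 2 only

import vega

# "./my_nas_config.yml" is a hypothetical config file assumed to set
# `general: parallel_search: True`; vega.run() is the entry point used
# in the repository's examples.
vega.run("./my_nas_config.yml")
```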

zhangjiajin · Sep 26 '22 07:09

It seems the parallel computing library fails to start?

```
INFO:root:------------------------------------------------
INFO:root:    Step: serial
INFO:root:------------------------------------------------
INFO:root:master ip and port: 127.0.0.1:28703
INFO:root:Initializing cluster. Please wait.
INFO:root:Dask-scheduler not start. Start dask-scheduler in master 127.0.0.1
ERROR:vega.core.pipeline.pipeline:Failed to run pipeline, message: [Errno 2] No such file or directory: 'dask-scheduler': 'dask-scheduler'
ERROR:vega.core.pipeline.pipeline:Traceback (most recent call last):
  File "/root/.local/lib/python3.7/site-packages/vega/core/pipeline/pipeline.py", line 84, in run
    pipestep = PipeStep(name=step_name)
  File "/root/.local/lib/python3.7/site-packages/vega/core/pipeline/search_pipe_step.py", line 45, in __init__
    self.master = create_master(update_func=self.generator.update)
  File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/master_ops.py", line 44, in create_master
    master_instance = Master(**kwargs)
  File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/master.py", line 65, in __init__
    status = self.dask_env.start()
  File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/dask_env.py", line 119, in start
    self._start_dask()
  File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/dask_env.py", line 155, in _start_dask
    scheduler_p = run_scheduler(ip=master_ip, port=master_port, tmp_file=scheduler_file)
  File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/run_dask.py", line 56, in run_scheduler
    env=os.environ
  File "/opt/conda/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/opt/conda/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'dask-scheduler': 'dask-scheduler'
```

wangfeiyu-zerobug · Sep 26 '22 10:09

```
pip install dask
Requirement already satisfied: dask in /root/.local/lib/python3.7/site-packages (2022.2.0)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.7/site-packages (from dask) (21.3)
Requirement already satisfied: partd>=0.3.10 in /root/.local/lib/python3.7/site-packages (from dask) (1.3.0)
Requirement already satisfied: toolz>=0.8.2 in /root/.local/lib/python3.7/site-packages (from dask) (0.12.0)
Requirement already satisfied: fsspec>=0.6.0 in /root/.local/lib/python3.7/site-packages (from dask) (2022.8.2)
Requirement already satisfied: cloudpickle>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from dask) (2.1.0)
Requirement already satisfied: pyyaml>=5.3.1 in /opt/conda/lib/python3.7/site-packages (from dask) (5.4.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging>=20.0->dask) (2.4.7)
Requirement already satisfied: locket in /root/.local/lib/python3.7/site-packages (from partd>=0.3.10->dask) (1.0.0)
```
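One thing worth checking here (a minimal diagnostic sketch, not from the original thread): the dask-scheduler console script is installed by the distributed package rather than by dask itself, and for a `pip install --user` it lands in ~/.local/bin, which may not be on PATH inside a container:

```python
# If this prints None, the directory containing dask-scheduler is not on PATH,
# and Vega's subprocess call (run_dask.py, env=os.environ) fails as above.
import shutil

print(shutil.which("dask-scheduler"))
```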

wangfeiyu-zerobug · Sep 26 '22 10:09

Solved! One suggestion: this path is picked up through the os.environ interface, and since I'm running inside a container started with Docker the fix may be a bit different there, so it might be worth adding a note about that case to the docs. Also, in sections 2/2.1 of https://github.com/huawei-noah/vega/blob/master/docs/cn/user/config_reference.md#2-%E5%85%AC%E5%85%B1%E9%85%8D%E7%BD%AE%E9%A1%B9, "pytorch" in the general settings is misspelled.

wangfeiyu-zerobug · Sep 27 '22 01:09

@wangfeiyu-zerobug

Thanks for the suggestion. How did you solve it when running inside the container?

zhangjiajin · Sep 27 '22 01:09

In the Jupyter command line: %env PATH=/root/.local/bin:
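The equivalent workaround outside Jupyter is a one-line PATH prepend before launching the pipeline (a minimal sketch, assuming dask-scheduler was installed by a `pip install --user` into /root/.local/bin, as the site-packages paths in the traceback suggest):

```python
# Make the user-level bin directory visible to the subprocess that launches
# dask-scheduler; Vega spawns it with env=os.environ, so PATH must be set
# in this process before the pipeline starts.
import os

os.environ["PATH"] = "/root/.local/bin:" + os.environ.get("PATH", "")
```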

wangfeiyu-zerobug · Sep 27 '22 01:09

@wangfeiyu-zerobug

Thanks, we will update the documentation promptly.

zhangjiajin · Sep 27 '22 02:09

If I want to use an already-searched network to test another batch of data, how should I do that? Do I still need to run fullytrain through the pipeline?

wangfeiyu-zerobug · Oct 13 '22 08:10

Yes, fullytrain is needed so you can check the accuracy.

zhangjiajin · Oct 14 '22 02:10

So the whole searched network needs to be retrained? Is there currently no YAML configuration option to load the model parameters and just run on test data?

wangfeiyu-zerobug · Oct 14 '22 03:10

If the pipeline used for the search already included fullytrain, there is no need to retrain. For test code you can refer to https://github.com/huawei-noah/vega/blob/39741b5ddd9623f0984599d7f52ea38ef6f253c1/vega/tools/inference.py

zhangjiajin · Oct 14 '22 03:10

File "testcode.py", line 147, in main() File "testcode.py", line 141, in main result = _infer(args, loader, model) File "testcode.py", line 50, in _infer return _infer_pytorch(args, model, loader) File "testcode.py", line 70, in _infer_pytorch infer_result = model(**batch) TypeError: FasterRCNN object argument after ** must be a mapping, not list 我之前采用SP-NAS搜索的网络 !python testcode.py -c '/workspace/wfyexp/vega/vega-master/examples/nas/sp_nas/tasks/0928.183347.496/output/fullytrain/desc_4.json' -m '/workspace/wfyexp/vega/vega-master/examples/nas/sp_nas/tasks/0928.183347.496/output/fullytrain/model_4.pth' -df "COCO" -dp '/workspace/data/upper/added_dataset_COCO_format' 数据集就是更换了test部分的数据,然后这里显示dataload的输出结果放入mode时出错 打印batch结果: [[tensor([[[0.5255, 0.5255, 0.5255, ..., 0.3216, 0.3216, 0.3216], [0.5255, 0.5255, 0.5255, ..., 0.3216, 0.3216, 0.3216], [0.5255, 0.5255, 0.5255, ..., 0.3216, 0.3216, 0.3216], ..., [0.8078, 0.8078, 0.8078, ..., 0.2824, 0.2824, 0.2824], [0.8078, 0.8078, 0.8078, ..., 0.2824, 0.2824, 0.2824], [0.8078, 0.8078, 0.8078, ..., 0.2824, 0.2824, 0.2824]],

    [[0.5647, 0.5647, 0.5647,  ..., 0.3373, 0.3373, 0.3373],
     [0.5647, 0.5647, 0.5647,  ..., 0.3373, 0.3373, 0.3373],
     [0.5647, 0.5647, 0.5647,  ..., 0.3373, 0.3373, 0.3373],
     ...,
     [0.8588, 0.8588, 0.8588,  ..., 0.2902, 0.2902, 0.2902],
     [0.8588, 0.8588, 0.8588,  ..., 0.2902, 0.2902, 0.2902],
     [0.8588, 0.8588, 0.8588,  ..., 0.2902, 0.2902, 0.2902]],

    [[0.6000, 0.6000, 0.6000,  ..., 0.3333, 0.3333, 0.3333],
     [0.6000, 0.6000, 0.6000,  ..., 0.3333, 0.3333, 0.3333],
     [0.6000, 0.6000, 0.6000,  ..., 0.3333, 0.3333, 0.3333],
     ...,
     [0.9216, 0.9216, 0.9216,  ..., 0.2784, 0.2784, 0.2784],
     [0.9216, 0.9216, 0.9216,  ..., 0.2784, 0.2784, 0.2784],
     [0.9216, 0.9216, 0.9216,  ..., 0.2784, 0.2784, 0.2784]]])], [{'boxes': tensor([[190.9997, 410.9996, 250.9997, 470.9997],
    [412.9995, 312.0001, 477.9995, 365.0001]]), 'labels': tensor([1, 1]), 'masks': tensor([[[0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     ...,
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0]],

    [[0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     ...,
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0]]], dtype=torch.uint8), 'image_id': tensor([1]), 'area': tensor([3600.0029, 3445.0051]), 'iscrowd': tensor([0, 0])}]]

Why is the dataloader output expected to be a mapping here?

wangfeiyu-zerobug · Oct 14 '22 11:10

Is the dataset in COCO format?

Also, for detection you need to refer to this code instead: https://github.com/huawei-noah/vega/blob/39741b5ddd9623f0984599d7f52ea38ef6f253c1/vega/tools/detection_inference.py
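For context on the TypeError above: a detection dataloader yields an (images, targets) pair, i.e. a list, rather than the dict that model(**batch) expects, so the batch has to be unpacked positionally. A minimal sketch of that pattern, assuming a torchvision-style FasterRCNN forward signature (this is an illustration, not the actual code in detection_inference.py):

```python
# Sketch only: run a detection model over a loader that yields (images, targets).
import torch

@torch.no_grad()
def infer_detection(model, loader, device="cuda"):
    model.eval()
    results = []
    for images, targets in loader:               # batch is a list, not a dict
        images = [img.to(device) for img in images]
        # In eval mode a torchvision-style detector takes just the image list
        # and returns one dict of boxes/labels/scores per image.
        results.extend(model(images))
    return results
```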

zhangjiajin · Oct 17 '22 10:10