
native launch issue

Shikamaru5 opened this issue on Mar 02 '23 · 5 comments

So the documentation specifies that colossalai.launch or launch_from_torch should be called like this:

  colossalai.launch(config='./config.py',
                    rank=args.rank,
                    world_size=args.world_size,
                    host=args.host,
                    port=args.port,
                    backend=args.backend)

However, when I run that, it tells me that rank is an unexpected keyword argument, and when I strip it down to just the config file, it tells me that I'm missing RANK in the torch environment. Looking into it, I suspect I'm supposed to run something like python3 train.py --rank 1, etc. That doesn't really make sense to me, though, and I can't find any examples of it being done that way. Hopefully I'm just making a simple mistake; any help would really be appreciated, thank you.

Shikamaru5 commented on Mar 02 '23

Never mind, it turns out the example provided on GitHub used launch_from_torch, and that was what produced the error. Now that I've changed it, though, it gives a new error: it states that rank must be an integer, and I assume it'll say the same for the other arguments. Can I set that in the parser.add_argument line, and how would I do that?

Is it something like '--rank': 1, '--world_size': 1, '--host': host1, etc.?
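To make the question concrete, here is a rough sketch of what I'm imagining; the types and defaults are my guesses for a single machine with one GPU, not something from the docs:

  import argparse

  import colossalai

  # Guesswork: expose the launch parameters as command-line flags and pass
  # them through to colossalai.launch. Defaults assume one process on one machine.
  parser = argparse.ArgumentParser()
  parser.add_argument('--rank', type=int, default=0)             # index of this process
  parser.add_argument('--world_size', type=int, default=1)       # total number of processes
  parser.add_argument('--host', type=str, default='localhost')   # address of the rank-0 machine
  parser.add_argument('--port', type=int, default=29500)         # a free TCP port for rendezvous
  parser.add_argument('--backend', type=str, default='nccl')     # communication backend
  args = parser.parse_args()

  colossalai.launch(config='./config.py',
                    rank=args.rank,
                    world_size=args.world_size,
                    host=args.host,
                    port=args.port,
                    backend=args.backend)

which would then be run as something like python3 train.py --rank 0 --world_size 1.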

I actually have no clue what port and backend are supposed to be; I've read the tutorial material and even initialize.py, but it makes very little sense to me. I don't think it needs backend, though, because when I pass just config and omit the rest, it only asks for rank, world_size, host, and port.

I'm sorry if these seem like simple questions; this is just new to me, and I've never had to deal with it in the work I've done. Thanks for taking the time to read this, and hopefully someone has some pointers.

Shikamaru5 commented on Mar 03 '23

Hi, can you provide your environment settings via colossalai -i?

JThh commented on Mar 04 '23

Hey, sorry for the wait; I haven't been able to get on for a while. This is what I get when I run that command:

  /mnt/f/genaitor/majel/imagen# colossalai -i
  /usr/local/lib/python3.10/dist-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
    operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
      registered at aten/src/ATen/RegisterSchema.cpp:6
    dispatch key: Meta
    previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
         new kernel: registered at /dev/null:241 (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
    self.m.impl(name, dispatch_key, fn)
  Usage: colossalai [OPTIONS] COMMAND [ARGS]...
  Try 'colossalai --help' for help.
  
  Error: No such option: -i

At least some of this pops up every time I run the program.

Shikamaru5 commented on Mar 07 '23

As far as I can tell from reading into it, I'm supposed to provide the host, rank, world_size, and port on the command line when I run the program. I haven't done that yet because I have zero clue how I'm supposed to know what those values are, or where to find them. It seems like they should be set up in the config, which would then just be called as launch(config='config.py') or launch_from_torch(config='config.py').
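If launch_from_torch does what the name suggests and reads the standard torch.distributed environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), then a launcher like torchrun should set them for you; a minimal sketch, assuming that's the case:

  # train.py -- minimal sketch, assuming launch_from_torch picks up the
  # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT environment variables
  # that torchrun sets for each process it spawns.
  # Run with: torchrun --nproc_per_node=1 train.py
  import colossalai

  colossalai.launch_from_torch(config='./config.py')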

In config you could have:

  from colossalai.amp import AMP_TYPE

  BATCH_SIZE = 32
  NUM_EPOCHS = 200_000
  HOST = 'localhost'   # or an IP? the docs never say what this should be
  RANK = 0             # "the rank of the process", per the vague definitions
  WORLD_SIZE = 1       # total number of processes, I think
  PORT = 29500

  fp16 = dict(
    mode=AMP_TYPE.TORCH
  )

The documentation gives no real context for what these values actually are; it points to the same vague distributed-training definitions I've echoed in my config example, which might make sense if they're entirely dependent on your machine and environment, but it isn't helpful. I'm not trying to sound rude or patronizing, but why give pseudo-examples if a person actually needs that information to run your program? It would arguably be better if the programmer didn't have to set these values manually at all and they were simply handled in the backend of the program.

Anyway, if someone has the solution for me, that would be great, thank you. And again, sorry if the tone of this comment isn't fantastic; I'm feeling a little burnt out with this project. It's frustrating that writing the program is simpler than setting up the programs and environments that let you get the project done.

Shikamaru5 commented on Mar 10 '23

After further study of the documentation, I believe the command you meant was colossalai check -i, which gave me the following information:

  root@Shikamaru:/mnt/f/genaitor/majel/imagen# colossalai check -i
  /usr/local/lib/python3.10/dist-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
    operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
      registered at aten/src/ATen/RegisterSchema.cpp:6
    dispatch key: Meta
    previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
         new kernel: registered at /dev/null:241 (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
    self.m.impl(name, dispatch_key, fn)
  #### Installation Report ####
  
  ------------ Environment ------------
  Colossal-AI version: 0.2.5
  PyTorch version: 1.13.1
  CUDA version: 11.7
  CUDA version required by PyTorch: 11.7
  
  Note:
  1. The table above checks the versions of the libraries/tools in the current environment
  2. If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
  
  ------------ CUDA Extensions AOT Compilation ------------
  Found AOT CUDA Extension: ✗
  PyTorch version used for AOT compilation: N/A
  CUDA version used for AOT compilation: N/A
  
  Note:
  1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
  2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
  
  ------------ Compatibility ------------
  PyTorch version match: N/A
  System and PyTorch CUDA version match: ✓
  System and Colossal-AI CUDA version match: N/A
  
  Note:
  1. The table above checks the version compatibility of the libraries/tools in the current environment
     - PyTorch version match: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
     - System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
     - System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation

Unfortunately, I still haven't found what the values of rank, world_size, host, and port are supposed to be; across all the examples, explanations, and tutorials, these values are somehow always missing. Also, I didn't expect it to work, but python train.py --host --rank --world_size <world_size> --port --backend does indeed fail when run.

I can imagine that rank=0 and world_size=1, since I only have 1 GPU, but beyond that I have no idea what host and port are supposed to be. If these are values meant for multi-GPU computing, that should be stated and given a better example, because just writing colossalai.launch(config='./config.py') still asks for 'RANK', and so does colossalai.launch_from_torch.
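For what it's worth, my best guess at concrete single-GPU values is below; these are assumptions on my part (host being the local machine, port being any free TCP port, backend being the communication library), not anything stated in the docs:

  import colossalai

  # All values below are guesses for one process on one machine:
  #   rank       - index of this process among all launched processes
  #   world_size - total number of processes (1 for a single GPU)
  #   host       - address of the rank-0 machine ('localhost' when local)
  #   port       - any free TCP port the processes use to rendezvous
  #   backend    - 'nccl' for NVIDIA GPUs ('gloo' would be the CPU fallback)
  colossalai.launch(config='./config.py',
                    rank=0,
                    world_size=1,
                    host='localhost',
                    port=29500,
                    backend='nccl')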

I would really appreciate the help, because this is pretty confusing for something that's supposedly done with minimal code, and I'm at a standstill with what I'm working on if I can't make colossalai or something similar work.

Shikamaru5 commented on Mar 13 '23

I used pip install colossalai in a conda environment and got the same problem. After troubleshooting, I found I was getting a similar error because colossalai was using the Linux system's default Python (Python 3.6.5 in my case). So I put the conda executable path at the top of ~/.bashrc and reinstalled from source, which solved my problem.
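Concretely, that meant a line like this at the top of ~/.bashrc (the path is specific to a miniconda install in the home directory, so adjust it for your setup):

  export PATH="$HOME/miniconda3/bin:$PATH"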

ZionDoki commented on Apr 07 '23

Glad to hear it was resolved. We have updated the codebase a lot; please check the latest code. This issue was closed due to inactivity. Thanks.

binmakeswell commented on Apr 27 '23