
Training dataset format: npy or jsonl?

Open · justlovebarbecue opened this issue 5 months ago · 1 comment

Hi,

I am trying to test the training code with a demo dataset, but the instructions are not clear enough. Before running the training code, I ran the "binarize_data" step. Which format should I use for this step: npy or jsonl? With jsonl, there appear to be no "input_ids" and "label" fields for the dataloader in the training step that follows. With npy, I hit an error saying a uint type cannot be converted, shown below:

self.input_ids = [torch.tensor(example["input_ids"], dtype=torch.long) for example in self.input_ids if len(example["input_ids"]) < args.model_max_length]
TypeError: can't convert np.ndarray of type numpy.uint32. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Any clue on this issue? Or do I just need to force-convert the data so that it is not a "uint" type?

Thanks!

justlovebarbecue, Jul 13 '25 21:07

Upgrading NumPy may solve your problem.

cyente, Jul 29 '25 13:07
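
For anyone who wants to try the explicit cast asked about in the question, here is a minimal sketch. It is not taken from the repository's training code: the helper name `to_int64_ids` and the sample token IDs are made up for illustration, and it assumes the npy data stores token IDs as `numpy.uint32`, as the error message suggests. It casts the array to `int64`, which is one of the types the error message lists as supported.

```python
import numpy as np


def to_int64_ids(input_ids):
    """Return token IDs as an int64 NumPy array.

    Hypothetical helper, not part of the training code. Every uint32
    value fits in int64, so this cast does not lose information.
    """
    return np.asarray(input_ids).astype(np.int64)


# Made-up example record; real records come from the binarize_data step.
example = {"input_ids": np.array([151644, 8948, 198], dtype=np.uint32)}

ids = to_int64_ids(example["input_ids"])
print(ids.dtype)
```

The resulting array could then be passed to `torch.tensor(ids, dtype=torch.long)` in place of the raw `example["input_ids"]`. Whether this is needed at all may depend on the installed NumPy and PyTorch versions.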