databricks-ml-examples

Finetuning Mistral with deepspeed

Open achangtv opened this issue 1 year ago • 2 comments

In fine_tune_deepspeed.py, the first part of the load_training_dataset function looks like this:

def load_training_dataset(
    tokenizer,
    path_or_dataset: str = DEFAULT_TRAINING_DATASET,
    max_seq_len: int = 256,
    seed: int = DEFAULT_SEED,
) -> Dataset:
    logger.info(f"Loading dataset from {path_or_dataset}")
    dataset = load_dataset(path_or_dataset)
    logger.info(f"Training: found {dataset['train'].num_rows} rows")
    logger.info(f"Eval: found {dataset['test'].num_rows} rows")

The way this function is written, it seems like I have to pass in a path to a Hugging Face dataset. Because this is in Databricks, I would like to pass in a Spark DataFrame, but load_dataset doesn't accept PySpark DataFrames, so I edited the line to read dataset = Dataset.from_spark(path_or_dataset). That gave me the error pyspark.errors.exceptions.base.PySparkRuntimeError: [MASTER_URL_NOT_SET] A master URL must be set in your configuration. You also cannot pass an already created Dataset object to load_dataset(). Should I just change the code to dataset = path_or_dataset? Or should I keep the code as-is and pass in a DBFS path to a dataset object?

achangtv · Jan 29 '24 18:01

If you would like to pass in a Spark dataframe, dataset = Dataset.from_spark(df) looks good to me.
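For reference, a minimal sketch of what that change could look like, with the function accepting either a dataset path or a Spark DataFrame. The parameter names and log messages mirror the snippet above; the placeholder defaults, the isinstance branching, and the train_test_split call are assumptions for illustration, not the repo's actual code:

import logging
from typing import Union

from datasets import Dataset, DatasetDict, load_dataset
from pyspark.sql import DataFrame

logger = logging.getLogger(__name__)

# Placeholder defaults; the real values come from fine_tune_deepspeed.py.
DEFAULT_TRAINING_DATASET = "<hf-dataset-name-or-path>"
DEFAULT_SEED = 42


def load_training_dataset(
    tokenizer,
    path_or_dataset: Union[str, DataFrame] = DEFAULT_TRAINING_DATASET,
    max_seq_len: int = 256,
    seed: int = DEFAULT_SEED,
) -> DatasetDict:
    if isinstance(path_or_dataset, DataFrame):
        # Dataset.from_spark requires datasets >= 2.11 and an active SparkSession
        # (the one Databricks configures on the driver).
        dataset = Dataset.from_spark(path_or_dataset)
        # from_spark returns a single Dataset, so build train/test splits here.
        dataset = dataset.train_test_split(test_size=0.1, seed=seed)
    else:
        dataset = load_dataset(path_or_dataset)
    logger.info(f"Training: found {dataset['train'].num_rows} rows")
    logger.info(f"Eval: found {dataset['test'].num_rows} rows")
    # ... tokenization with `tokenizer` and `max_seq_len` continues as in the original script
    return dataset

One caveat: Dataset.from_spark needs to run in the same process where the Spark DataFrame and its SparkSession exist, which is usually the notebook driver on Databricks.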

Regarding the PySparkRuntimeError, are you running the code in Databricks? It should set the Spark master for you.
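If it helps narrow this down, here is a quick debugging sketch you could run in the same process that calls Dataset.from_spark (just an assumption about where to look; in a Databricks notebook the driver already has a configured session available as `spark`):

from pyspark.sql import SparkSession

# In a Databricks notebook the driver already has a configured SparkSession;
# MASTER_URL_NOT_SET typically appears when code running in some other process
# tries to create a new SparkContext without a master being set.
active = SparkSession.getActiveSession()
if active is not None:
    print("Spark master:", active.sparkContext.master)
else:
    print("No active SparkSession in this process")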

es94129 · Jan 31 '24 00:01

I am running the code in Databricks, although I did clone the repo, so I am running it within Repos and not Workspace. Should I just copy the whole folder into Workspace? Or maybe the problem is the type of compute? I was using a multi-GPU compute with an ML runtime; I can try again with a single-GPU setup.

achangtv · Jan 31 '24 14:01