databricks-ml-examples
Finetuning Mistral with deepspeed
In fine_tune_deepspeed.py, the first part of the load_training_dataset function looks like this:
def load_training_dataset(
    tokenizer,
    path_or_dataset: str = DEFAULT_TRAINING_DATASET,
    max_seq_len: int = 256,
    seed: int = DEFAULT_SEED,
) -> Dataset:
    logger.info(f"Loading dataset from {path_or_dataset}")
    dataset = load_dataset(path_or_dataset)
    logger.info(f"Training: found {dataset['train'].num_rows} rows")
    logger.info(f"Eval: found {dataset['test'].num_rows} rows")
The way this function is written, it seems like I have to pass in a path to a Hugging Face dataset. Because this is in Databricks, I would like to pass in a Spark DataFrame, but load_dataset doesn't accept PySpark DataFrames, so I edited the line to read dataset = Dataset.from_spark(path_or_dataset), but this gave me the error pyspark.errors.exceptions.base.PySparkRuntimeError: [MASTER_URL_NOT_SET] A master URL must be set in your configuration.
You also cannot pass an already created Dataset object into load_dataset(). Should I just change the code to dataset = path_or_dataset? Or should I keep the code as-is and pass in a DBFS path to a dataset object?
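For reference, here is a rough sketch of the kind of change I have in mind. DEFAULT_TRAINING_DATASET and DEFAULT_SEED below are only placeholders for the constants already defined in fine_tune_deepspeed.py, and the rest of the function body would stay the same:

    from datasets import Dataset, DatasetDict, load_dataset

    DEFAULT_TRAINING_DATASET = "<hub-or-local-path>"  # placeholder for the constant in the script
    DEFAULT_SEED = 42  # placeholder for the constant in the script

    def load_training_dataset(
        tokenizer,
        path_or_dataset=DEFAULT_TRAINING_DATASET,
        max_seq_len: int = 256,
        seed: int = DEFAULT_SEED,
    ) -> Dataset:
        # Accept either a hub/local path or an already-built Dataset/DatasetDict,
        # so a DataFrame converted with Dataset.from_spark() can be passed in directly.
        if isinstance(path_or_dataset, (Dataset, DatasetDict)):
            dataset = path_or_dataset
        else:
            dataset = load_dataset(path_or_dataset)
        # ... the rest of the original function (logging, tokenization, etc.) unchanged
        return dataset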
If you would like to pass in a Spark DataFrame, dataset = Dataset.from_spark(df) looks good to me.
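For example, something along these lines should work in a Databricks notebook, where the spark session object is already available (the table name and split size below are only illustrative):

    from datasets import Dataset

    # Any Spark DataFrame works here; the table name is only an example.
    df = spark.table("main.default.my_training_data")

    # Convert the Spark DataFrame into a Hugging Face Dataset.
    hf_dataset = Dataset.from_spark(df)

    # The script expects "train" and "test" splits, so create them explicitly.
    dataset = hf_dataset.train_test_split(test_size=0.1, seed=42)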
Regarding the PySparkRuntimeError, are you running the code in Databricks? Databricks should set the Spark master for you.
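One quick way to check, from the environment where the error occurs, is whether there is an active Spark session with a master configured, e.g.:

    from pyspark.sql import SparkSession

    # In a Databricks notebook this should find an active session with the master already set;
    # if it prints that there is no session, the code is running in a process with no Spark attached.
    spark = SparkSession.getActiveSession()
    if spark is None:
        print("No active Spark session in this process")
    else:
        print("Spark master:", spark.sparkContext.master)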
I am running the code in Databricks, although I cloned the repo, so I am running it from Repos rather than from the Workspace. Should I just copy the whole folder into the Workspace? Or maybe the problem is the type of compute: I was using a multi-GPU cluster with an ML runtime, so I can try again with a single-GPU setup.