Question on validation dataset creation/agent pipeline.

Open mmrbulbul opened this issue 1 month ago • 1 comments

For the Kaggle agent, during dataset preparation, we create test and train data by splitting the original train data. If I'm not mistaken, this newly created test data is then used in step four of the pipeline:

Step 4 : Validation on Test Set or Kaggle 📉

Validate the newly developed model using the test set or Kaggle dataset.

Assess the model’s effectiveness and performance based on the validation results.

Given that creating the validation dataset is almost as important as creating the model, and requires an understanding of the data, shouldn't it also be part of the pipeline?

Nov 14 '25 14:11 mmrbulbul

Hi @mmrbulbul, thanks for the thoughtful question!

The kaggle scenario is actually a simplified subset of the broader data_science scenario, which is why dataset splitting happens as a lightweight preprocessing step there. This is also why we generally recommend using the data_science scenario in our main documentation.

You can enable the following setting in your .env file:

DS_SAMPLE_DATA_BY_LLM=True

With this enabled, the data_science pipeline will use an LLM to perform dataset splitting automatically as part of the full workflow.
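For reference, the kind of fixed hold-out split discussed in this thread can be sketched roughly as below. This is only an illustration in plain Python, not the code used by either scenario: `holdout_split`, `test_fraction`, and `seed` are hypothetical names, and the sample rows are made up. An LLM-driven split would instead choose the strategy after looking at the data (for example, by time or by group).

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=42):
    """Split rows into (train, test) using a seeded shuffle of the indices.

    Hypothetical helper for illustration only; not part of the project.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    # Preserve the original row order within each part.
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Made-up rows standing in for the original train data.
rows = [{"id": i, "label": i % 2} for i in range(10)]
train, test = holdout_split(rows)
print(len(train), len(test))  # 8 2
```

A random split like this ignores the structure of the data (time order, groups, class imbalance), which is exactly why the choice of split benefits from being a step in the pipeline.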

Thanks again for your suggestion. Please feel free to share more ideas!

Nov 17 '25 04:11 SunsetWolf