argos-train: data priority and incremental training?

Hi there!

I would like to use the data currently provided in data-index.json together with my own custom data. Can I tell the script to generate a model that treats my custom data as more relevant, i.e. gives it a higher priority?
Let's say I have one large dataset that I use all the time, plus several smaller datasets, and I would like to train a separate model for each of the smaller ones. Is something like an incremental build possible, so that I could reuse some previous output and just "append" my custom data to save training time and resources?

Thanks!

Oct 29 '22 06:10 JanCizmar

There's no direct support for this, but you can get the same result by modifying argostrain/train.py.

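I would add input("Downloaded Argos Data") right after the point where the data has been downloaded, so that the script pauses and waits for you to press Enter. While it is paused, append your custom data to run/source and run/target, then let it continue.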
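The append step could be scripted roughly as follows. This is only a sketch, not part of argos-train: it assumes run/source and run/target are plain-text files with one sentence per line, aligned line by line, and the function name append_parallel_data and its repeat parameter are made up for illustration. Writing the custom pairs several times (repeat > 1) is plain oversampling, one simple way to give custom data more weight; whether it helps depends on the data.

```python
from pathlib import Path


def _read_lines(path):
    # One sentence per line; strip only the trailing newline.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def _ends_with_newline(path):
    # Look at the last byte only, so large files are not read into memory.
    with open(path, "rb") as f:
        f.seek(0, 2)
        if f.tell() == 0:
            return True  # an empty file needs no separator
        f.seek(-1, 2)
        return f.read(1) == b"\n"


def append_parallel_data(custom_source, custom_target, run_dir="run", repeat=1):
    """Append aligned custom sentence pairs to <run_dir>/source and <run_dir>/target.

    repeat is a hypothetical knob: each custom pair is written `repeat` times
    (simple oversampling). Returns the number of pairs appended.
    """
    src_lines = _read_lines(custom_source)
    tgt_lines = _read_lines(custom_target)
    if len(src_lines) != len(tgt_lines):
        raise ValueError("custom source and target must have the same number of lines")

    for name, lines in (("source", src_lines), ("target", tgt_lines)):
        path = Path(run_dir) / name
        # Keep the two files aligned if the existing data lacks a final newline.
        needs_newline = path.exists() and not _ends_with_newline(path)
        with path.open("a", encoding="utf-8") as f:
            if needs_newline:
                f.write("\n")
            for _ in range(repeat):
                for line in lines:
                    f.write(line + "\n")
    return len(src_lines) * repeat
```

Run it from a second terminal while train.py is waiting at the input() prompt, for example append_parallel_data("my_data.en", "my_data.de", repeat=3), where the two file names are placeholders for your own files.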

You could also train one base model and then fine-tune it on your custom data. However, this would also require custom code.

I want to improve support for custom data and fine-tuning, so if anyone has suggestions or pull requests, they're appreciated.

Nov 04 '22 13:11 PJ-Finlay

Would incremental training also be possible with the suggestions collected by LibreTranslate? I think the available base models are already quite good, but incorporating the feedback from LibreTranslate might improve the corner cases. This probably depends on the actual use case (e.g. a medical use case might need different fine-tuning than a scuba-diving one, to pick random examples).

It would be great to have a way to quickly improve the base model without needing a high-powered machine to retrain the complete model on input data that is 99.9% the same!

Jul 27 '23 07:07 martin-leoorg