
Retraining or partially training the served learner?

Open samuraya opened this issue 5 years ago • 3 comments

Hi @andrewdalpino! I am exploring the RubixML Server currently and finding it really practical. I am curious about one aspect, though. Here is the example case:

Let us say I have a REST server that accepts requests from client applications (non-PHP) and returns predictions.

`$server = new RESTServer('localhost', 8080); $server->serve($estimator);`

Suppose I also have a Tester client (PHP) that might sit in the same environment (same machine as the server) or at a remote location. This Tester client's job is to periodically send samples to the REST server, get back predictions, check the accuracy, and, if the accuracy drops below a certain percentage, retrain the estimator instance. My question is how you would go about doing this with the current implementation, considering that you need to reach the estimator instance. I could extend the REST server class, add a new route to a new controller, let's say a TrainerController, and retrain the model there once the request comes in. But how do you deliver that request? I see three ways:

a) In the Tester client, create a completely new Guzzle client, package the data as JSON in the request body, set up the headers, and send it to the REST server (a rough sketch of this option follows below).

b) Add a middleware with a conditional check for whether the request is RPC or REST, etc.

c) Implement the existing Client interface with slightly customized logic and let this RESTClient handle all future requests to the REST server.
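For illustration, option (a) could look roughly like this. The `/retrain` route and the payload shape are purely hypothetical here; they would be whatever the custom TrainerController defines:

```php
use GuzzleHttp\Client;

// Hypothetical: POST new labeled samples to a custom /retrain route
// exposed by the extended REST server (TrainerController).
$client = new Client(['base_uri' => 'http://localhost:8080']);

$response = $client->request('POST', '/retrain', [
    'json' => [
        'samples' => $samples, // array of feature vectors
        'labels' => $labels,   // corresponding labels
    ],
]);

var_dump(json_decode((string) $response->getBody(), true));
```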

I think option (c) looks much cleaner and more reusable. The library would then have two clients, RPCClient and RESTClient. I am just curious whether you are considering adding anything like this in future updates, or whether you would rather leave it up to individual developers to figure this out on their own.

thanks

samuraya · Jul 19 '20

Great question @samuraya!

Servers are designed for inference only. Imagine having 10 servers running the same model behind a load balancer. It would be difficult, resource-intensive, and inconsistent to retrain or partially train those individual instances. The way to go about it is to retrain or partially train a single instance of the learner offline and then reload all the servers with the newly trained estimator. This ensures that inference is never disrupted in production and that you can scale the number of servers to increase inference throughput where needed. It may help to think of it as a one-way roundabout type of street. I should probably come up with a diagram to help visualize this.
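The offline side might look something like this (a minimal sketch; the KNearestNeighbors learner and the `model.rbx` path are just placeholders for whatever you actually use):

```php
use Rubix\ML\PersistentModel;
use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Classifiers\KNearestNeighbors;

// Retrain the learner offline on the latest labeled dataset,
// then persist it to disk for the servers to pick up.
$model = new PersistentModel(new KNearestNeighbors(3), new Filesystem('model.rbx'));

$model->train($dataset);

$model->save();
```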

In your system, you could come up with a way to automate this process. For example, you could have one background job that handles retraining or partially training the learner, and another background job that stops the server(s) and restarts them with the new estimator.
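The reload side could then be a small startup script that the process manager runs whenever the servers are restarted (again just a sketch; the `model.rbx` path is an assumption, and the host/port mirror the example above):

```php
use Rubix\ML\PersistentModel;
use Rubix\ML\Persisters\Filesystem;
use Rubix\Server\RESTServer;

// Load the most recently persisted model and serve it.
// Restarting this process (e.g. via your process manager) picks up
// whatever model.rbx the retraining job last saved.
$model = PersistentModel::load(new Filesystem('model.rbx'));

$server = new RESTServer('localhost', 8080);

$server->serve($model);
```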

P.S. Other factors also weigh into doing things in this 'one-way' circular street kind of way, such as training and inference often having different hardware requirements, and the risk of crashing the server with a train() or partial() call if it runs out of memory.

andrewdalpino · Jul 21 '20

Got it! My setup is such that there is a background supervisor that runs the "retrain" and "test" scripts, but it does so on the live estimator server. I will experiment with it, see the performance and stress resistance of such an infrastructure, and try to find where the breaking point is. Performance will certainly depend on the hardware, so I will have to make a bunch of assumptions and try to avoid generalizing the outcome. I will probably share some of my findings later.

samuraya · Jul 22 '20

That's a cool setup! Definitely share your findings, and do join us in our Telegram channel if you'd like: https://t.me/RubixML

andrewdalpino · Jul 23 '20