label-studio-ml-backend
get_result_from_job_id AssertionError while initializing redeployed LS/ML NER backend
I am trying to deploy the NER example model trained on my local machine, along with its Label Studio project, to another machine. I've gone through the following steps:
- Recreated the Label Studio and ML Backend environments on the target machine to match the source machine
- Copied the folder with the model itself (the folder named with just integers) into the ML Backend folder on the target machine
- Exported the content of the project (data, annotations and predictions) to JSON through the Label Studio API, using the ...export?exportType=JSON&download_all_tasks=true endpoint (see the sketch below)
- Imported the project JSON file into the newly created Label Studio project
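For reference, the export step looked roughly like this (a minimal sketch; the host, project ID and API token are placeholders):

import requests

LS_HOST = "http://localhost:8080"   # Label Studio host (placeholder)
PROJECT_ID = 1                      # source project ID (placeholder)
API_TOKEN = "<your-api-token>"      # from Account & Settings (placeholder)

# Export all tasks (data, annotations and predictions) as JSON
resp = requests.get(
    f"{LS_HOST}/api/projects/{PROJECT_ID}/export",
    params={"exportType": "JSON", "download_all_tasks": "true"},
    headers={"Authorization": f"Token {API_TOKEN}"},
)
resp.raise_for_status()

with open("project_export.json", "wb") as f:
    f.write(resp.content)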
When trying to initialize and pair LS and the ML Backend on the new machine, I am getting:
[2022-05-30 10:18:56,133] [ERROR] [label_studio_ml.model::get_result_from_last_job::128] 1647350146 job returns exception:
Traceback (most recent call last):
File "/Users/user/Projects/label-studio-ml-backend/label_studio_ml/model.py", line 126, in get_result_from_last_job
result = self.get_result_from_job_id(job_id)
File "/Users/user/Projects/label-studio-ml-backend/label_studio_ml/model.py", line 108, in get_result_from_job_id
assert isinstance(result, dict)
AssertionError
and it keeps repeating for each job.
Should any additional steps be performed when deploying the project/model to another environment?
I've tried the following LS versions (1.1.1, my initial one, and 1.4.1post1, the most recent one) and the most current code base of the ML backend. Using Python 3.8 and macOS for both source and target environments.
Hi @wojnarabc This error indicates that the training of your model ended without any concrete result. The ner example doesn't have a fit method, but you can add one like this:
def fit(self, completions, workdir=None, **kwargs):
    import random
    return {'random': random.randint(1, 10)}
and the error should disappear. I will add this stub to the ner example in a future release.
If you are using TransformersBasedTagger - please check that the fit method ends with an appropriate result.
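For reference, a minimal sketch of what an "appropriate result" could look like (not the library's exact contract): fit should end by returning a plain dict, which is stored as the job result and handed back to the model as self.train_output on the next initialization. The keys below ('model_path', 'labels') are only placeholders.

def fit(self, completions, workdir=None, **kwargs):
    # ... train the model and write its artifacts under workdir ...
    return {
        'model_path': workdir,     # wherever the checkpoint was saved
        'labels': ['PER', 'ORG'],  # any extra metadata you want back later
    }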
Hello @KonstantinKorotaev, there is a fit method in the example in ...examples/ner/ner.py, starting at line 461. TransformersBasedTagger is part of it.
I run into the exact same problem with my custom backend.
I am in the process of upgrading my system to the latest LS and backend. Everything was working fine with LS 1.1.1 and the backend from a year ago.
After training, another job is sent for some reason, and then train_output is cleared, causing the backend to lose track of the last trained model.
I already set LABEL_STUDIO_ML_BACKEND_V2_DEFAULT = True
'train_output': {'model_path': '././my_backend/5.1655881660/1658156596'},
'value': 'image'}
[2022-07-18 17:03:37,760] [INFO] [werkzeug::_log::225] 192.168.123.133 - - [18/Jul/2022 17:03:37] "POST /train HTTP/1.1" 201 -
[2022-07-18 17:03:37,781] [INFO] [werkzeug::_log::225] 192.168.123.133 - - [18/Jul/2022 17:03:37] "GET /health HTTP/1.1" 200 -
[2022-07-18 17:03:37,787] [ERROR] [label_studio_ml.model::get_result_from_last_job::130] 1658156583 job returns exception:
Traceback (most recent call last):
File "/home/USER/.virtualenvs/ls-1.5/lib/python3.8/site-packages/label_studio_ml/model.py", line 128, in get_result_from_last_job
result = self.get_result_from_job_id(job_id)
File "/home/USER/.virtualenvs/ls-1.5/lib/python3.8/site-packages/label_studio_ml/model.py", line 110, in get_result_from_job_id
assert isinstance(result, dict)
AssertionError
So, to be more clear, I based my custom backend on pytorch_transfer_learning.py. I had to set LABEL_STUDIO_ML_BACKEND_V2_DEFAULT = True in model.py, because otherwise I ran into the issue described in #118.
Now, when I monitor train_output after training, Label Studio initializes the backend with the correct train_output. But immediately after, another initialization event follows with an empty train_output.
I made an ugly workaround for this issue using an OS environment variable that saves the location of the last trained model locally. Note that this causes issues if users switch between projects, and it won't redeploy an existing model.
# ...
from label_studio_ml import model
model.LABEL_STUDIO_ML_BACKEND_V2_DEFAULT = True
# ...

class ImageClassifierAPI(model.LabelStudioMLBase):
    def __init__(self, label_config=None, train_output=None, **kwargs):
        super(ImageClassifierAPI, self).__init__(**kwargs)
        # ...
        if self.train_output:
            print(f"trying to use {self.train_output['model_path']} as model path")
            self.model = ImageClassifier(self.classes, self.boxType)
            self.model.load(self.train_output['model_path'])
        elif os.environ.get("LAST_TRAINED_MODEL") is not None:
            model_path = os.environ.get("LAST_TRAINED_MODEL")
            print(f"trying to use {model_path} as model path")
            self.model = ImageClassifier(self.classes, self.boxType)
            self.model.load(model_path)
        else:
            self.model = ImageClassifier(self.classes, self.boxType)

    # ...
    def fit(self, annotations, workdir=None, batch_size=12, num_epochs=100, **kwargs):
        self.model.train(annotations, workdir, self.boxType, batch_size=batch_size, num_epochs=num_epochs)
        # Save workdir in environment
        os.environ["LAST_TRAINED_MODEL"] = workdir
        return {'model_path': workdir}
Having to manually change LABEL_STUDIO_ML_BACKEND_V2_DEFAULT = True and Label Studio losing the returned dict from the fit function is a bit awkward to work with, though.
@jrdalenberg The LABEL_STUDIO_ML_BACKEND_V2_DEFAULT variable tells your ML backend that you are using an active learning cycle. Starting from Label Studio version 1.4.1, training is invoked via webhooks. Please check this documentation.
def fit(self, annotations, workdir=None, batch_size=12, num_epochs=100, **kwargs):
    self.model.train(annotations, workdir, self.boxType, batch_size=batch_size, num_epochs=num_epochs)
Your fit method seems to expect annotations, but from version 1.4.1 an event will be passed instead of annotations. Maybe this is what leads to the empty train_output.
If you want to switch this logic off, disable the webhook in LS. Check this example if you want to use the active learning cycle.
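If you do keep the webhook flow, here is a rough sketch of a fit that tolerates both call styles. The payload handling below is an assumption, not the exact contract - please check the example above for the real structure.

import logging
logger = logging.getLogger(__name__)

# Rough sketch only: with webhooks enabled (LS >= 1.4.1) fit receives an event
# rather than the old iterable of annotations.
def fit(self, annotations_or_event, workdir=None, **kwargs):
    if isinstance(annotations_or_event, (str, dict)):
        # webhook style: an event arrived; annotation data has to be fetched
        # separately (e.g. via the LS API/SDK) instead of being passed in
        logger.info(f'Got webhook event: {annotations_or_event}, kwargs keys: {list(kwargs)}')
        annotations = []  # placeholder: fetch real annotations here
    else:
        # pre-webhook style: an iterable of annotations
        annotations = list(annotations_or_event)

    # ... train on `annotations` and save artifacts under workdir ...
    # Always return a dict, otherwise get_result_from_job_id hits the AssertionError.
    return {'model_path': workdir}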
Your fit method seems to expect annotations, but from version 1.4.1 an event will be passed instead of annotations. Maybe this is what leads to the empty train_output.
I see. It’s still odd that train_output is returned once and then cleared right after, though.
Check this example if you want to use the active learning cycle.
Thanks! I did not see this in the object recognition examples. I’ll test it out after my vacation 😄
same bug
> /opt/label-studio-ml-backend/coco-detector/mmdetection.py(137)fit()
-> logger.info(f'tasks={tasks}, workdir={workdir}, kwargs={kwargs}')
(Pdb) bt
/usr/lib/python3.10/threading.py(966)_bootstrap()
-> self._bootstrap_inner()
/usr/lib/python3.10/threading.py(1009)_bootstrap_inner()
-> self.run()
/usr/lib/python3.10/threading.py(946)run()
-> self._target(*self._args, **self._kwargs)
/usr/lib/python3.10/socketserver.py(683)process_request_thread()
-> self.finish_request(request, client_address)
/usr/lib/python3.10/socketserver.py(360)finish_request()
-> self.RequestHandlerClass(request, client_address, self)
/usr/lib/python3.10/socketserver.py(747)__init__()
-> self.handle()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/werkzeug/serving.py(342)handle()
-> BaseHTTPRequestHandler.handle(self)
/usr/lib/python3.10/http/server.py(425)handle()
-> self.handle_one_request()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/werkzeug/serving.py(374)handle_one_request()
-> self.run_wsgi()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/werkzeug/serving.py(319)run_wsgi()
-> execute(self.server.app)
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/werkzeug/serving.py(308)execute()
-> application_iter = app(environ, start_response)
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/flask/app.py(2464)__call__()
-> return self.wsgi_app(environ, start_response)
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/flask/app.py(2447)wsgi_app()
-> response = self.full_dispatch_request()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/flask/app.py(1950)full_dispatch_request()
-> rv = self.dispatch_request()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/flask/app.py(1936)dispatch_request()
-> return self.view_functions[rule.endpoint](**req.view_args)
/opt/label-studio-ml-backend/label_studio_ml/exceptions.py(39)exception_f()
-> return f(*args, **kwargs)
/opt/label-studio-ml-backend/label_studio_ml/api.py(93)_train()
-> job = _manager.train(annotations, project, label_config, **params)
/opt/label-studio-ml-backend/label_studio_ml/model.py(711)train()
-> job_result = cls.train_script_wrapper(
/opt/label-studio-ml-backend/label_studio_ml/model.py(667)train_script_wrapper()
-> train_output = m.model.fit(data_stream, workdir, **train_kwargs)
> /opt/label-studio-ml-backend/coco-detector/mmdetection.py(137)fit()
-> logger.info(f'tasks={tasks}, workdir={workdir}, kwargs={kwargs}')
[2022-08-19 03:17:21,961] [ERROR] [label_studio_ml.model::get_result_from_last_job::131] 1660820149 job returns exception:
Traceback (most recent call last):
File "/opt/label-studio-ml-backend/label_studio_ml/model.py", line 129, in get_result_from_last_job
result = self.get_result_from_job_id(job_id)
File "/opt/label-studio-ml-backend/label_studio_ml/model.py", line 111, in get_result_from_job_id
assert isinstance(result, dict)
AssertionError
# train, see https://labelstud.io/guide/ml_create.html
def fit(self, tasks, workdir=None, **kwargs):
    # Retrieve the annotation ID from the payload of the webhook event
    # Use the ID to retrieve annotation data using the SDK or the API
    # Do some computations and get your model
    #import pdb; pdb.set_trace()
    logger.info(f'tasks={tasks}, workdir={workdir}, kwargs={kwargs}')
    return {'checkpoints': '3.1660708258/1660885087/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth', 'model_file': "3.1660708258/1660885087/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth", "model_version": "3.1660708258/1660885087", 'classes': 80}
    ## JSON dictionary with trained model artifacts that you can use later in code with self.train_output
Load new model from: /opt/mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
load checkpoint from local path: /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
Load new model from: /opt/mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
load checkpoint from local path: /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
[2022-08-19 05:06:31,709] [INFO] [werkzeug::_log::225] 192.168.1.7 - - [19/Aug/2022 05:06:31] "POST /train HTTP/1.1" 201 -
[2022-08-19 05:06:32,102] [INFO] [werkzeug::_log::225] 192.168.1.7 - - [19/Aug/2022 05:06:32] "GET /health HTTP/1.1" 200 -
Load new model from: /opt/mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
load checkpoint from local path: /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
[2022-08-19 05:06:33,168] [INFO] [werkzeug::_log::225] 192.168.1.7 - - [19/Aug/2022 05:06:33] "POST /setup HTTP/1.1" 200 -
It seems the return values don't work.
Hi @idreamerhx
It seems the return values don't work.
I don't see any error in your log, what do you mean by not work?
assert isinstance(result, dict) AssertionError
This means that your training procedure didn't return results; could you please check what your fit method returns?
This seems to be a bug. The results should be a list of dicts, but it tells me that is wrong.
@sjtuytc could you please clarify what you mean?
I mean this assertion is too restrictive. A dummy result would trigger the assertion error. This is not desirable.
I'm also getting the same assertion error problem. Does the model always have to return a dictionary? I'm following the examples provided in Label Studio and it seems like lists of dicts are also allowed: https://github.com/heartexlabs/label-studio-ml-backend/blob/db6d1d6d3efde1db503532f1f77a6977a7f100d2/label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py#L93
Hi @seanswyi Predictions should be a list of dicts. The assertion error is about a training job failure.
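To illustrate the difference, a simplified sketch (not the exact library contract) of the two return shapes:

# predict returns a list of dicts, one per task, each with a 'result' list of regions
def predict(self, tasks, **kwargs):
    return [{'result': [], 'score': 0.0} for _ in tasks]

# fit returns a single plain dict; anything else fails `assert isinstance(result, dict)`
def fit(self, annotations, workdir=None, **kwargs):
    return {'model_path': workdir}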
Errors after setting LABEL_STUDIO_ML_BACKEND_V2_DEFAULT=True in model.py:
[2023-03-14 10:37:49,673] [ERROR] [label_studio_ml.model::get_result_from_last_job::132] 1678761465 job returns exception: Job 1678761465 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
Traceback (most recent call last):
File "label-studio-ml-backend/label_studio_ml/model.py", line 130, in get_result_from_last_job
result = self.get_result_from_job_id(job_id)
File "label-studio-ml-backend/label_studio_ml/model.py", line 111, in get_result_from_job_id
assert isinstance(result, dict), f"Job {job_id} was finished unsuccessfully. No result was saved in job folder."
AssertionError: Job 1678761465 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
[2023-03-14 10:37:49,673] [ERROR] [label_studio_ml.model::get_result_from_last_job::132] 1678761463 job returns exception: Job 1678761463 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
Traceback (most recent call last):
File "label-studio-ml-backend/label_studio_ml/model.py", line 130, in get_result_from_last_job
result = self.get_result_from_job_id(job_id)
File "label-studio-ml-backend/label_studio_ml/model.py", line 111, in get_result_from_job_id
assert isinstance(result, dict), f"Job {job_id} was finished unsuccessfully. No result was saved in job folder."
AssertionError: Job 1678761463 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
How should we work around this?
The assertion error is about a training job failure.
Do you have folders with job results 1678761463 and 1678761465?
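A quick way to check, assuming the default layout where each job folder holds a job_result.json (MODEL_DIR below is a placeholder for the directory your ML backend writes jobs into):

import os

MODEL_DIR = '.'  # placeholder: your backend's model directory

for job_id in ('1678761463', '1678761465'):
    result_file = os.path.join(MODEL_DIR, job_id, 'job_result.json')
    if not os.path.exists(result_file):
        print(f'{job_id}: no job_result.json -> the training job ended without a result')
        continue
    with open(result_file) as f:
        content = f.read().strip()
    print(f'{job_id}: job_result.json = {content or "<empty>"}')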
Hello, I have the same problem:
I've created a custom backend based on the example in mmdetection.py. I don't use active learning (I think; I did not set that up). Every time I switch to the next image for annotation, I get this output in the console:
[2023-03-17 09:30:05,107] [ERROR] [label_studio_ml.model::get_result_from_last_job::132] 1679041803 job returns exception: Job 1679041803 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
Traceback (most recent call last):
File "d:\coding\developement\python\label-studio-ml-backend\label_studio_ml\model.py", line 130, in get_result_from_last_job
result = self.get_result_from_job_id(job_id)
File "d:\coding\developement\python\label-studio-ml-backend\label_studio_ml\model.py", line 111, in get_result_from_job_id
assert isinstance(result, dict), f"Job {job_id} was finished unsuccessfully. No result was saved in job folder." \
AssertionError: Job 1679041803 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
The exception is raised because _get_result_from_job_id returns None, which happens because os.path.exists(result_file) returns False:
def _get_result_from_job_id(self, job_id):
    """
    Return job result or {}
    @param job_id: Job id (also known as model version)
    @return: dict
    """
    job_dir = self._job_dir(job_id)
    if not os.path.exists(job_dir):
        logger.warning(f"=> Warning: {job_id} dir doesn't exist. "
                       f"It seems that you don't have specified model dir.")
        return None
    result_file = os.path.join(job_dir, self.JOB_RESULT)
    if not os.path.exists(result_file):  # <--- THIS CHECK RETURNS FALSE
        logger.warning(f"=> Warning: {job_id} dir doesn't contain result file. "
                       f"It seems that previous training session ended with error.")
        return None
    logger.debug(f'Read result from {result_file}')
    with open(result_file) as f:
        result = json.load(f)
    return result
What is strange to me is that I do have the job_result.json file in the required directory, but probably it is not there yet when the check occurs? It must be created later. The content of the file is just an empty JSON object.
and here is my predict() method:
def predict(self, tasks, **kwargs):
    assert len(tasks) == 1
    task = tasks[0]
    image_url = self._get_image_url(task)
    image_path = self.get_local_path(image_url)
    image = cv2.imread(image_path)
    output = Inference.infer_from_image(self.model, image)
    indices, boxes, confidences = Inference.filter_outputs(self.config, image, output)
    results = []
    all_scores = []
    for i in indices:
        score = confidences[i]
        # print(f'{confidences[i]:.2f}')
        x, y, width, height = self.convert_to_ls(boxes[i][0], boxes[i][1], boxes[i][2], boxes[i][3], image.shape[1], image.shape[0])
        results.append({
            'from_name': self.from_name,
            'to_name': self.to_name,
            'type': 'rectanglelabels',
            'value': {
                'rectanglelabels': ['head'],
                'x': x,
                'y': y,
                'width': width,
                'height': height
            },
            'score': float(score)
        })
        all_scores.append(score)
    avg_score = sum(all_scores) / max(len(all_scores), 1)
    return [{
        'result': results,
        'score': float(avg_score)
    }]
But I don't think this is due to predict(). As I've said earlier, the check if not os.path.exists(result_file): is failing for some reason.
Hi @TrueWodzu
What is strange to me is that I do have the job_result.json file in the required directory, but probably it is not there yet when the check occurs? It must be created later. The content of the file is just an empty JSON object.
This file is created during the training session. If it is empty, the training session ended with an error.
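Roughly speaking (illustration only), job_result.json holds the JSON-serialized result of the training job, i.e. the dict returned by fit. A healthy file would look something like this; the field name and path are just examples taken from earlier in this thread:

import json

example_result = {'model_path': '././my_backend/5.1655881660/1658156596'}
print(json.dumps(example_result))  # -> {"model_path": "././my_backend/5.1655881660/1658156596"}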
Hi @KonstantinKorotaev thank you for your answer. So is this bug? Because it seems like the file is required but at the same time, some of us don't want to train. How can I prevent it from happening?
@KonstantinKorotaev I don't have training turned on, so shouldn't _get_result_from_job_id(self, job_id) handle the case where training is not enabled and not throw an exception?
A simple change in the code resolves the exception:
def _get_result_from_job_id(self, job_id):
    """
    Return job result or {}
    @param job_id: Job id (also known as model version)
    @return: dict
    """
    job_dir = self._job_dir(job_id)
    if not os.path.exists(job_dir):
        logger.warning(f"=> Warning: {job_id} dir doesn't exist. "
                       f"It seems that you don't have specified model dir.")
        return None
    result_file = os.path.join(job_dir, self.JOB_RESULT)
    if not os.path.exists(result_file):
        logger.warning(f"=> Warning: {job_id} dir doesn't contain result file. "
                       f"It seems that previous training session ended with error.")
        return {}  # <---- A simple change here from None to an empty dict solves the issue.
    logger.debug(f'Read result from {result_file}')
    with open(result_file) as f:
        result = json.load(f)
    return result
Hi @TrueWodzu I have added an environment variable to integrate your changes in a branch. Please check if it's an appropriate solution for you.
Hi @KonstantinKorotaev, many thanks for the change! While it definitely works for me, I am just wondering: is this the correct approach? What I mean by that is, is it really required right now to have a fit method defined? Because in my Label Studio project I have training turned off:

So if I have training turned off, then there should be no exception about result_file?
Hi @TrueWodzu
So if I have training turned off, then there should be no exception about result_file?
Yes, but the intention of this error is to give a message in case anybody tries to load a model that wasn't trained successfully. I will add this flag so anybody can ignore such errors in the future.