CenterNet with Keypoints - TFRecord format requirements

Open salpert-humane opened this issue 3 years ago • 13 comments

Prerequisites

Please answer the following question for yourself before submitting an issue.

  • [x] I checked to make sure that this issue has not been filed already.

1. The entire URL of the documentation with the issue

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md

2. Describe the issue

There is no sample script or description of how to encode the keypoint information for CenterNet. I'd like to see a sample script and specific clarification on whether the keypoint coordinates need to be normalized by the image width and height, as with the bounding boxes.

salpert-humane avatar Jun 29 '21 19:06 salpert-humane

Hi,

I am having this same issue. Please help with any guidance on how to do this.

Swazir9449 avatar Jun 29 '21 22:06 Swazir9449

@salpert-humane, yes, you need to normalise the keypoint values by the image width and height...
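
For example (a minimal sketch; the image size and pixel coordinates below are made up), the values written to the TFRecord are simply the pixel coordinates divided by the image width and height:

# Minimal illustrative sketch: normalize pixel keypoint coordinates to [0, 1],
# just like the bounding-box coordinates in the TFRecord.
image_width, image_height = 1440, 1080   # hypothetical image size in pixels
pixel_x, pixel_y = 331.0, 731.0          # hypothetical keypoint position in pixels
keypoint_x = pixel_x / image_width       # ~0.23, the value stored in the TFRecord
keypoint_y = pixel_y / image_height      # ~0.68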

sandhyacs avatar Jul 15 '21 04:07 sandhyacs

@salpert-humane I faced the same issue and took inspiration from the coco TFRecord creation tool here.

@sandhyacs Could you clarify a few things about the keypoints detection framework?

As far as I understand it, keypoints are specific to one object class (hence the need to specify the keypoint_class_name in the proto). Is this the case? If so, am I correct in assuming that the weights in each keypoint task are trained only with samples of the class specified by keypoint_class_name?

Assuming that is indeed the case, what is the relation with the num_keypoints field in input_reader.proto? From what I can see, the input reader always expects num_keypoints values in the ground truth data. So, as far as I can tell, in the case where one has multiple classes with different keypoint definitions (e.g., "person" and "face", each with its own keypoints):

  1. num_keypoints has to be the total number of keypoints across all classes
  2. one has to define one keypoint task per class
  3. in the TFRecord, one always has to write num_keypoints values per bounding box, setting the visibility of the "irrelevant" keypoints for each class to zero, so that the input reader can load the ground truth correctly (see the sketch below).
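
For instance, an illustrative sketch with made-up keypoint counts (2 for "person", 3 for "face", so num_keypoints = 5):

# Made-up counts: "person" defines 2 keypoints, "face" defines 3, so num_keypoints = 5
# and every box carries length-5 keypoint vectors.
person_x = [0.31, 0.42]                   # normalized x of the two "person" keypoints
keypoint_x = person_x + [0.0, 0.0, 0.0]   # the "face" slots are padded with zeros
visibility = [1, 1] + [0, 0, 0]           # only the "person" slots are marked visible
# A "face" box would instead pad the first two slots and fill the last three.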

Am I understanding this correctly?

GPhilo avatar Jul 20 '21 15:07 GPhilo

For the record, I clarified those points myself. It is indeed as I described in my previous comment: I am currently training a model with two different keypoint detection tasks and it works as expected.

GPhilo avatar Jul 21 '21 14:07 GPhilo

> For the record, I clarified those points myself. It is indeed as I described in my previous comment: I am currently training a model with two different keypoint detection tasks and it works as expected.

Hi, I have no experience with TensorFlow (I used PyTorch exclusively before), but I have decided to try to train the CenterNet model with keypoints. As a first step I managed to launch CenterNet training without keypoints, but unfortunately I could not find any guidelines anywhere on the Internet covering: how to create the annotations (in what form and format the keypoints should be described), what the data hierarchy looks like when keypoints are present, how to convert the annotations into train.csv and val.csv and then into *.record files, how to change the configuration file of a pre-trained model with keypoints, and in what form to generate the label map (.pbtxt) file with the keypoint definitions. I would be extremely grateful if you would take the time to describe the preparation steps needed before training the model.

s-nesterov avatar Jul 21 '21 15:07 s-nesterov

Hey! I'll try to summarize what I had to do in a more-or-less concise way; it's not very complex, but there are a few steps.

Premise

I'll describe the use-case of a multi-class, multi-keypoint detector. Specifically, I'll consider two distinct classes (C1, C2), each with its own set of keypoints. In my scenario both classes have 4 keypoints each, but it's not required that the number of keypoints be the same for each class.

This means: C1 has 4 keypoints (IDs 0,1,2,3), C2 has 4 keypoints (IDs 4,5,6,7), the total number of keypoints is 8.

Label Map preparation

The label map for the task I describe above is as follows:

item {
  id: 1
  name: "C1"
  display_name: "Class 1"
  keypoints {
    id: 0
    label: "TL"
  }
  keypoints {
    id: 1
    label: "TR"
  }
  keypoints {
    id: 2
    label: "BR"
  }
  keypoints {
    id: 3
    label: "BL"
  }
}
item {
  id: 2
  name: "C2"
  display_name: "Class 2"
  keypoints {
    id: 4
    label: "TL"
  }
  keypoints {
    id: 5
    label: "TR"
  }
  keypoints {
    id: 6
    label: "BR"
  }
  keypoints {
    id: 7
    label: "BL"
  }
}

A few things to note:

  1. The names of the keypoints are arbitrary; it just happens that in my detection task I detect corners, so some of them share the same label text. What matters is the ID.
  2. Keypoint IDs start from 0, unlike item IDs, which start from 1.
  3. Keypoint IDs represent the position of each keypoint in the vector we'll write to the TFRecord file, so here I am implicitly defining that the first 4 keypoints in the vector relate to C1 and the last 4 to C2. Only one of the two groups will be defined for each sample; the other will contain zeroes (and it won't be used during training).

Dataset preparation

On top of your normal TFRecord data (image, bboxes, classes, maybe masks, other metadata you might want), you need to provide, for each sample:

  1. Normalized coordinates of the keypoints (x / width, y / height)
  2. Visibility of each keypoint (1 if the keypoint is defined in the sample, 0 if it's not). This allows you to deal with truncated images or perspectives where only some keypoints are visible (e.g., if you're detecting face keypoints and only have a profile view of a face, half of the keypoints won't be visible).
  3. Theoretically, the number of keypoints in each bounding box sample (though I couldn't find anywhere in the code where this information was used). I hardcode this to 8 in my application.

As I mentioned before, the size of the vector associated with each bounding box in the TFRecord is always the same (total sum of the number of keypoints in all classes), so in this case it will be 8. You need to take care to put the keypoint coordinates and visibility in the right locations in this vector, depending on each sample's class label.

In my application, I do this as follows (I'm sure there are better ways to code this; I just needed a quick-and-dirty solution that worked):

import numpy as np
from object_detection.utils.dataset_util import (
    bytes_list_feature, float_list_feature, int64_list_feature)

# polygons_list (defined elsewhere) is a list of np.arrays of shape [4, 2] with
# the absolute pixel coordinates of the keypoints for each sample
# classes is a list of class labels for each sample
# width, height is the image size

keypoints_x = []   # X coordinates of keypoints for all bounding boxes defined in this image
keypoints_y = []   # Y coordinates of keypoints for all bounding boxes defined in this image
visibilities = []  # Visibility of keypoints for all bounding boxes defined in this image
for cl, poly in zip(classes, polygons_list):
  if cl == 1:  # Class 1 sample: fill keypoint slots 0-3, pad slots 4-7 with zeros
    kx = (poly[:, 0] / width).astype(np.float32).tolist() + [0, 0, 0, 0]
    ky = (poly[:, 1] / height).astype(np.float32).tolist() + [0, 0, 0, 0]
    v = [1, 1, 1, 1, 0, 0, 0, 0]
  else:  # Class 2 sample: pad slots 0-3 with zeros, fill slots 4-7
    kx = [0, 0, 0, 0] + (poly[:, 0] / width).astype(np.float32).tolist()
    ky = [0, 0, 0, 0] + (poly[:, 1] / height).astype(np.float32).tolist()
    v = [0, 0, 0, 0, 1, 1, 1, 1]
  keypoints_x.extend(kx)
  keypoints_y.extend(ky)
  visibilities.extend(v)

# [...] Build feature_dict as usual with the encoded image, filename, bboxes, etc.

# num_annotated_objects is the number of annotated samples in this image
# _KEYPOINT_NAMES is defined elsewhere as a constant list with value
#   [b'TL', b'TR', b'BR', b'BL', b'TL', b'TR', b'BR', b'BL']
# (note that it matches the "label" values in the label map both in text and in order of the entries)

# the *_list_feature helpers are defined in object_detection.utils.dataset_util (imported above)

feature_dict['image/object/keypoint/x'] = (
    float_list_feature(keypoints_x))
feature_dict['image/object/keypoint/y'] = (
    float_list_feature(keypoints_y))
feature_dict['image/object/keypoint/num'] = (
    int64_list_feature([8] * num_annotated_objects))
feature_dict['image/object/keypoint/visibility'] = (
    int64_list_feature(visibilities))
feature_dict['image/object/keypoint/text'] = (
    bytes_list_feature(_KEYPOINT_NAMES * num_annotated_objects))
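
For completeness, here is a minimal sketch of how such a feature dictionary can then be serialized and written out (standard tf.train.Example / tf.io.TFRecordWriter usage; adapt the output path and sharding to your own pipeline):

import tensorflow as tf

# Assumes feature_dict has been populated as above (encoded image, bboxes,
# classes, plus the image/object/keypoint/* entries).
example = tf.train.Example(features=tf.train.Features(feature=feature_dict))

# One record per image; in practice you would loop over your whole dataset
# and possibly split the output into several shards.
with tf.io.TFRecordWriter('your_path_to_data/train.record-00000-of-00001') as writer:
  writer.write(example.SerializeToString())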

Pipeline config

First of all, this will only work with a CenterNet network. Secondly, for each class with keypoints defined, you need to add a keypoint_estimation_task section to the center_net message. For the sample application, the configuration I used is:

    # Inside center_net definition:
    keypoint_label_map_path: "same_path_that_you_have_in_the_input_readers/label_map.pbtxt"
    keypoint_estimation_task {
      task_name: "C1_task"
      task_loss_weight: 1.0
      loss {
        localization_loss {
          l1_localization_loss {
          }
        }
        classification_loss {
          penalty_reduced_logistic_focal_loss {
            alpha: 2.0
            beta: 4.0
          }
        }
      }
      keypoint_class_name: "C1"
      keypoint_regression_loss_weight: 0.1
      keypoint_heatmap_loss_weight: 1.0
      keypoint_offset_loss_weight: 1.0
      offset_peak_radius: 3
      per_keypoint_offset: true
    }

    keypoint_estimation_task {
      task_name: "C2_task"
      task_loss_weight: 1.0
      loss {
        localization_loss {
          l1_localization_loss {
          }
        }
        classification_loss {
          penalty_reduced_logistic_focal_loss {
            alpha: 2.0
            beta: 4.0
          }
        }
      }
      keypoint_class_name: "C2"
      keypoint_regression_loss_weight: 0.1
      keypoint_heatmap_loss_weight: 1.0
      keypoint_offset_loss_weight: 1.0
      offset_peak_radius: 3
      per_keypoint_offset: true
    }

Then you need to tell the train and eval input readers to load keypoints by adding num_keypoints: 8 to the input_reader messages (of course, adapt the value to the number of keypoints you have in your data):

train_input_reader: {
  label_map_path: "same_path_that_you_have_in_the_input_readers/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "your_path_to_data/train.record-?????-of-?????"
  }
  num_keypoints: 8
}

eval_input_reader: {
  label_map_path: "same_path_that_you_have_in_the_input_readers/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "your_path_to_data/validation.record-?????-of-?????"
  }
  num_keypoints: 8
}

Finally, for proper evaluation metrics, you need to add keypoints information to the eval_config message:

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_visualizations: 10
  max_num_boxes_to_visualize: 20
  min_score_threshold: 0.2
  parameterized_metric {
    coco_keypoint_metrics {
      class_label: "C1"
      keypoint_label_to_sigmas {
        key: "TL"
        value: 5
      }
      keypoint_label_to_sigmas {
        key: "TR"
        value: 5
      }
      keypoint_label_to_sigmas {
        key: "BR"
        value: 5
      }
      keypoint_label_to_sigmas {
        key: "BL"
        value: 5
      }
    }
  }
  parameterized_metric {
    coco_keypoint_metrics {
      class_label: "C2"
      keypoint_label_to_sigmas {
        key: "TL"
        value: 5
      }
      keypoint_label_to_sigmas {
        key: "TR"
        value: 5
      }
      keypoint_label_to_sigmas {
        key: "BR"
        value: 5
      }
      keypoint_label_to_sigmas {
        key: "BL"
        value: 5
      }
    }
  }
  keypoint_edge { # TL-TR (C1)
    start: 0
    end: 1
  }
  keypoint_edge { # TR-BR (C1)
    start: 1
    end: 2
  }
  keypoint_edge { # BR-BL (C1)
    start: 2
    end: 3
  }
  keypoint_edge { # BL-TL (C1)
    start: 3
    end: 0
  }
  keypoint_edge { # TL-TR (C2)
    start: 4
    end: 5
  }
  keypoint_edge { # TR-BR (C2)
    start: 5
    end: 6
  }
  keypoint_edge { # BR-BL (C2)
    start: 6
    end: 7
  }
  keypoint_edge { # BL-TL (C2)
    start: 7
    end: 4
  }
}

Edges are for visualization purposes; you can omit them if you want. The sigmas should match the sigma values you define in the training config (5 is the default value if you don't define them; see the center_net.proto definition for the full list of options you can configure).

This should be all; I'll update it if I realize I forgot something. Good luck!

GPhilo avatar Jul 22 '21 10:07 GPhilo

Thank you so much for the description! It has cleared up about 80-85% of my questions! There are only a few points that I wanted to clarify with you. I will update this comment a little later to make it better structured.

Task

General description of the task: train a model for one class (hereinafter referred to as CLS) with four keypoints (the corners of the object). This is essentially the same as what you described in your example.

Annotations

I have created a mock-up of a dataset in the MSCOCO format; I will briefly describe the fields I created for the image dictionary, the annotation dictionary and the category dictionary (one example of each below). Is this enough for the conversion script to produce the TFRecord format?

images: {"id": 1, "file name" :"image1.jpeg", "width": 1440," height": 1080}

annotations: {"id": 1, "image id": 1, "category id": 1, "iscrowd": 0, "ignore": 0, "area": 14613.0, "segmentation": [[331.0, 731.0, 606.0, 754.0, 611.0, 809.0, 339.0, 784.0]], "insert": [306.0, 706.0, 330.0, 128.0], "num_keypoints": "4", "key points": [331.0, 731.0, 1, 606.0, 754.0, 1, 611.0, 809.0, 1, 339.0, 784.0, 1]}

categories: {"id": 1, " name": "CLS", "super category": "any names of the CLS", "key points": ["top_left", "top_right", "bottom_left", "bottom_right"], "skeleton": [[1, 2], [2, 3], [3, 4], [4, 1]]}]

Conversion to tfrecord format

Do I understand correctly that this converter (link) should work? The only thing is that tf.app.flags does not exist in TensorFlow 2, so you just need to replace the flags with the path to the folder with images and the paths to the JSON annotation files?

Is it correct that the same JSON file should be specified both as the path to the bounding-box annotations and as the path to the keypoint annotations? Or do you need to create a separate JSON file with only boxes and another JSON file with only keypoints?

Thank you very much in advance; your comments have already helped me a lot in solving the problem!

s-nesterov avatar Jul 23 '21 07:07 s-nesterov

@s-nesterov Glad it helps :) I'm not sure whether just using the script you linked works; I guess if you have annotations in the same format used in the COCO dataset it should work, yes (though it might produce too many or too few shards, depending on the size of your dataset). In my case, since we use a different annotation format, I wrote my own conversion script based on the linked one (the actual TFRecord is written from a features dictionary, and as long as you populate that correctly, the rest of the script can be anything).

GPhilo avatar Jul 23 '21 10:07 GPhilo

@GPhilo Okay, thank you for the quick answers, I will try! ;)

s-nesterov avatar Jul 23 '21 12:07 s-nesterov

@GPhilo Hi! Thank you again for the recommendations on setting up the training and preparing the data! Following your instructions, we have successfully trained CenterNet! ;)

s-nesterov avatar Jul 28 '21 15:07 s-nesterov

Hello @GPhilo and @s-nesterov ,

Thank you so much for sharing your knowledge. I appreciate it a lot.

My question is regarding the conversion from the COCO dataset format to TFRecord. Is the script that @GPhilo provided the full script needed to generate the TFRecord? I am looking at the linked one and can't seem to figure out how to combine them. I also don't fully understand what is going on in that script.

@s-nesterov Is it possible you could share the entire script you used to make the conversion? I understand your script will be specific to the keypoints that you used, but it may still help me figure all of this out.

Thanks

Swazir9449 avatar Aug 18 '21 22:08 Swazir9449

Hello @Swazir9449, I used this script to convert annotations from the COCO format to TFRecord. In this script I commented out only the part of the code that concerns DensePose. I gave examples of the fields for the image, annotation and category dictionaries in the COCO format above.

Thus, you will only need to specify the paths to the image folders (train/val/test) and to the annotation files (train.json / val.json / test.json). Since the same *.json file contains both the "bbox" and the "keypoints" annotations, the path to the bbox annotations and the path to the keypoint annotations will be the same for every split.

I hope this helps you. Good luck with the training ;)
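
In case the version of the script you have still uses tf.app.flags, one possible workaround (just a sketch; the flag names below are placeholders, not the script's actual flags) is to go through the TF 1 compatibility API, which is still available in TensorFlow 2:

import tensorflow.compat.v1 as tf

flags = tf.app.flags  # tf.compat.v1.app.flags still works in TensorFlow 2
FLAGS = flags.FLAGS

# Placeholder flag definitions, just to illustrate the mechanism.
flags.DEFINE_string('image_dir', '', 'Directory with the input images.')
flags.DEFINE_string('annotations_file', '', 'Path to the COCO-style JSON file.')


def main(_):
  print('Images:', FLAGS.image_dir)
  print('Annotations:', FLAGS.annotations_file)


if __name__ == '__main__':
  tf.app.run(main)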

s-nesterov avatar Aug 24 '21 15:08 s-nesterov

Hi. I want to train a CenterNet for 2 classes. One class is described with keypoints and a bounding box, and the other one has only a bounding box. I do not understand what my labelmap.pbtxt file should look like for the class without keypoints. I have tried this variant:

item { id: 2 name: "circle" keypoints { } }

and this one:

item { id: 2 name: "circle" }

Maybe somebody has experience solving such a task?

YellowPanada avatar Nov 15 '23 15:11 YellowPanada