YOLOv3-CoreML

nonMaxSuppression on image still has many repeated bounding boxes

Open fm64hylian opened this issue 3 years ago • 0 comments

Hi, I am trying to deploy a custom YOLOv3 model using a modified version of your repository, with only one image as input. The mlmodel was created successfully with the right input and output parameters, and I tested the Keras .h5 model after converting it from the Darknet weights to check that everything was OK. Then, on the Core ML side, after calling let boundingBoxes = yolo.computeBoundingBoxes(features: [features, features, features]) in ViewController.swift and after nonMaxSuppression in YOLO.swift, I printed the predictions to check whether the object was being recognized (printing maxBoundingBoxes and, for each prediction, the class index, score and coordinates):

count:10
=========================
3
0.9994295
(108.99107360839844, 105.73644256591797, 167.4453125, 173.5439910888672)
=========================
3
0.9994295
(271.0824890136719, 60.53330993652344, 66.54877471923828, 39.44181823730469)
=========================
3
0.9994295
(63.08247756958008, 76.53330993652344, 66.54877471923828, 39.44181823730469)
=========================
3
0.9994295
(351.59149169921875, 10.979837417602539, 17.173877716064453, 26.294544219970703)
=========================
3
0.9994295
(247.59149169921875, 18.97983741760254, 17.173877716064453, 26.294544219970703)
=========================
3
0.9994295
(143.59149169921875, 26.97983741760254, 17.173877716064453, 26.294544219970703)
=========================
3
0.9994295
(39.59149169921875, 34.979835510253906, 17.173877716064453, 26.294544219970703)
=========================
37
0.7621713
(175.49517822265625, 288.18511962890625, 97.42070007324219, 0.0010935374302789569)
=========================
37
0.7621713
(-32.50482177734375, 304.18511962890625, 97.42070007324219, 0.0010935374302789569)
=========================
37
0.7621713
(307.5323486328125, 128.09246826171875, 25.140825271606445, 0.000729024934116751)
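
(The list above comes from a simple debug print over the returned predictions, roughly like the following sketch; YOLO.Prediction is the repository's struct with classIndex, score and rect:)

    // Debug print (sketch) of the predictions returned by computeBoundingBoxes:
    // the count, then one block per prediction with its class index, score and
    // bounding box in the 416x416 input coordinate space.
    func dumpPredictions(_ predictions: [YOLO.Prediction]) {
      print("count:\(predictions.count)")
      for p in predictions {
        print("=========================")
        print(p.classIndex)
        print(p.score)
        print(p.rect)
      }
    }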

There was only one item in the picture (tomato is class index 3); see the attached image.

I don't think it should return that many results from a single frame, so maybe something went wrong while I was modifying ViewController. This is my modified ViewController:

import UIKit
import Vision
import AVFoundation
import CoreMedia
import VideoToolbox

class ViewController: UIViewController {
  @IBOutlet weak var videoPreview: UIView!
  @IBOutlet weak var timeLabel: UILabel!
  @IBOutlet weak var debugImageView: UIImageView!

  let yolo = YOLO()

  var videoCapture: VideoCapture!
  var request: VNCoreMLRequest!

  var boundingBoxes = [BoundingBox]()
  var colors: [UIColor] = []

  let ciContext = CIContext()
  var resizedPixelBuffer: CVPixelBuffer?

  var framesDone = 0
  var frameCapturingStartTime = CACurrentMediaTime()
  let semaphore = DispatchSemaphore(value: 2)

  override func viewDidLoad() {
    super.viewDidLoad()

    setUpBoundingBoxes()
    setUpCoreImage()
    //setUpVision()


    startObjectDetection();
    // NOTE: If you choose another crop/scale option, then you must also
    // change how the BoundingBox objects get scaled when they are drawn.
    // Currently they assume the full input image is used.
    request.imageCropAndScaleOption = .scaleFill
    //setUpCamera()

    //frameCapturingStartTime = CACurrentMediaTime()
  }

  override func didReceiveMemoryWarning() {
    super.didReceiveMemoryWarning()
    print(#function)
  }

  // MARK: - Initialization
  func setUpBoundingBoxes() {
    for _ in 0..<YOLO.maxBoundingBoxes {
      boundingBoxes.append(BoundingBox())
    }

    // Make colors for the bounding boxes. There is one color for each class,
    for r: CGFloat in [0.2, 0.4, 0.6, 0.8, 1.0] {
      for g: CGFloat in [0.3, 0.7, 0.6, 0.8] {
        for b: CGFloat in [0.4, 0.8, 0.6, 1.0] {
          let color = UIColor(red: r, green: g, blue: b, alpha: 1)
          colors.append(color)
        }
      }
    }
  }

func startObjectDetection(tgtImg: UIImage){
        guard let model = try? VNCoreMLModel(for:Yolov3().model) else {
            print("failed to load model")
            return
        }
        let handler = VNImageRequestHandler(cgImage: tgtImg.cgImage!, options: [:])
        let request = createRequest(model: model)
        try? handler.perform([request])
    }


    func createRequest(model: VNCoreMLModel) -> VNCoreMLRequest {
            return VNCoreMLRequest(model: model, completionHandler: { (request, error) in
            DispatchQueue.main.async(execute: {
 
          if let observations = request.results as? [VNCoreMLFeatureValueObservation],
             let features = observations.first?.featureValue.multiArrayValue {
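            // NOTE: observations.first gives a single MLMultiArray, and the same
            // array is passed for all three scales in the call below, even though
            // computeBoundingBoxes(features:) asserts three different grid sizes
            // (13x13, 26x26, 52x52) in YOLO.swift.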

            let boundingBoxes = self.yolo.computeBoundingBoxes(features: [features, features, features])
            //let elapsed = CACurrentMediaTime() - startTimes.remove(at: 0)

            self.classIndexpredictions(predictions: boundingBoxes)
            //self.show(predictions: boundingBoxes)
          }
          })
        })
    }


  override var preferredStatusBarStyle: UIStatusBarStyle {
    return .lightContent
  }

  func resizePreviewLayer() {
    videoCapture.previewLayer?.frame = videoPreview.bounds
  }

  // MARK: - Doing inference
  func predict(image: UIImage) {
    if let pixelBuffer = image.pixelBuffer(width: YOLO.inputWidth, height: YOLO.inputHeight) {
      predict(pixelBuffer: pixelBuffer)
    }
  }

  func predict(pixelBuffer: CVPixelBuffer) {
    // Measure how long it takes to predict a single video frame.
    //let startTime = CACurrentMediaTime()

    // Resize the input with Core Image to 416x416.
    guard let resizedPixelBuffer = resizedPixelBuffer else { return }
    let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
    let sx = CGFloat(YOLO.inputWidth) / CGFloat(CVPixelBufferGetWidth(pixelBuffer))
    let sy = CGFloat(YOLO.inputHeight) / CGFloat(CVPixelBufferGetHeight(pixelBuffer))
    let scaleTransform = CGAffineTransform(scaleX: sx, y: sy)
    let scaledImage = ciImage.transformed(by: scaleTransform)
    ciContext.render(scaledImage, to: resizedPixelBuffer)

    if let boundingBoxes = try? yolo.predict(image: resizedPixelBuffer) {
      //let elapsed = CACurrentMediaTime() - startTime
      self.show(predictions: boundingBoxes)
    }
  }


  func classIndexpredictions(predictions: [YOLO.Prediction]) {
    for i in 0..<boundingBoxes.count {
      if i < predictions.count {
        let prediction = predictions[i]

        // The predicted bounding box is in the coordinate space of the input
        // image, which is a square image of 416x416 pixels. We want to show it
        // on the video preview, which is as wide as the screen and has a 4:3
        // aspect ratio. The video preview also may be letterboxed at the top
        // and bottom.
        let width = view.bounds.width
        let height = width * 4 / 3
        let scaleX = width / CGFloat(YOLO.inputWidth)
        let scaleY = height / CGFloat(YOLO.inputHeight)
        let top = (view.bounds.height - height) / 2

        // Translate and scale the rectangle to our own coordinate system.
        var rect = prediction.rect
        rect.origin.x *= scaleX
        rect.origin.y *= scaleY
        rect.origin.y += top
        rect.size.width *= scaleX
        rect.size.height *= scaleY

        // Show the bounding box.
        let label = String(format: "%@ %.1f", labels[prediction.classIndex], prediction.score * 100)
        let color = colors[prediction.classIndex]
        boundingBoxes[i].show(frame: rect, label: label, color: color)
      } else {
        boundingBoxes[i].hide()
      }
    }
        print("predictions ok")
  }
}

I also modified the filters in YOLO.swift because I only have 38 classes:

    assert(features[0].count == 129*13*13)
    assert(features[1].count == 129*26*26)
    assert(features[2].count == 129*52*52)
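
(For reference, the 129 comes from the standard YOLOv3 output layout: each of the 3 anchors per grid cell predicts 4 box coordinates, 1 objectness score and one confidence per class.)

    // Channels per YOLOv3 output scale = anchors * (box coords + objectness + classes)
    let numClasses = 38
    let filters = 3 * (4 + 1 + numClasses)   // 3 * 43 = 129, matching the asserts above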

In Helpers.swift I also modified my labels, but that should not affect the outcome. I am not using UIImage+CVPixelBuffer.swift, VideoCapture.swift or CVPixelBuffer+Helpers.swift, because I just receive one photo the user took, so I am really only using ViewController, YOLO, BoundingBox and Helpers. This is also the first time I have dealt with Swift code, and I don't own a macOS machine, so I cannot compile it myself or validate whether the code is OK in case something else is wrong.

Is it possible to find out why there are still so many bounding boxes?

Here is my model config file, yolo3cfg.txt, just in case.

UPDATE: I started over using only the repository and found out that in ViewController:255 the normal predict() is called, but if we want to use a single image, we need predictUsingVision(), which is where the VNImageRequestHandler is. If I use that function, I get the result above, but if I use the normal predict(), it works just fine with my model. Why is VNImageRequestHandler giving such a different result? Thank you.
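
One thing I would check when comparing the two paths is how many VNCoreMLFeatureValueObservations actually come back from the Vision request and what shape each multi-array has, since computeBoundingBoxes(features:) expects the three different output scales (see the asserts above). A minimal sketch (the logObservations name is just for illustration, and whether Vision returns one or three multi-arrays depends on how the mlmodel declares its outputs):

    import CoreML
    import Vision

    // Debug helper (sketch): print every observation Vision returned and the
    // shape of its MLMultiArray, instead of taking only observations.first.
    func logObservations(from request: VNRequest) {
      guard let observations = request.results as? [VNCoreMLFeatureValueObservation] else {
        print("no feature value observations")
        return
      }
      for (i, obs) in observations.enumerated() {
        if let array = obs.featureValue.multiArrayValue {
          print("output \(i): shape = \(array.shape)")
        }
      }
    }

Calling this at the top of the completion handler in createRequest(model:) should show whether all three output scales arrive as separate observations or whether only one of them does.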

fm64hylian avatar Oct 21 '20 10:10 fm64hylian