How can I use DetectNet for custom-size data?

Open erogol opened this issue 8 years ago • 55 comments

I am trying to use DetectNet on third-party data with 448x448 images. Which parameters need to be changed for this custom problem?

erogol avatar Aug 16 '16 11:08 erogol

It's certainly possible to adjust DetectNet to work with other image sizes, but not easy. @jbarker-nvidia got it to work with 1024x512 images in his blog post: https://devblogs.nvidia.com/parallelforall/detectnet-deep-neural-network-object-detection-digits/

Unfortunately, it's not a simple process. And even if you get it to run without errors, you need to understand what's going on pretty well to get it to actually converge to a solution for your data.

Off the top of my head, here are some places to start (a rough sketch of the relevant prototxt fragments follows the list):

  • Adjust the image size for your problem (here, here, etc.)
  • Adjust the stride according to the smallest size object you'd like to detect (here)
  • If you do change the stride, I think there's another parameter near the end that needs adjusting (this one?) so that the network output size matches the label size
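
Something like the following, written from memory and assuming the stock 1248x384 KITTI network is being retargeted to 448x448 input; treat it as a starting point and check it against your copy of detectnet_network.prototxt:

    # 1) Input dimensions: the deploy_data Input layer (and the corresponding
    #    dim values in the data/transform layers) must match your image size.
    layer {
      name: "deploy_data"
      type: "Input"
      top: "data"
      input_param { shape { dim: 1 dim: 3 dim: 448 dim: 448 } }
    }

    # 2) Ground-truth mapping inside the DetectNetTransformation layers: the
    #    stride sets the resolution of the coverage grid (and so bounds the
    #    smallest detectable object), and the image dimensions should stay
    #    divisible by it.
    detectnet_groundtruth_param: {
      stride: 16
      image_size_x: 448
      image_size_y: 448
    }

    # 3) The Python clustering/mAP layers near the end of the network: the
    #    first three values of param_str appear to be the image width, image
    #    height and stride, and must agree with the values above.
    python_param {
      module: "caffe.layers.detectnet.clustering"
      layer: "ClusterDetections"
      param_str: "448, 448, 16, 0.6, 3, 0.02, 22"
    }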

lukeyeager avatar Aug 16 '16 17:08 lukeyeager

I am also trying to adapt DetectNet for my own dataset (for example, 1024x1024 images) with custom object sizes (around 192x192).

The issue is that in the blog post, the full modified prototxt is not published so I'm having a lot of trouble recalculating what I need to modify:

If I'm correct:

Adjust image size: in L79-80 and L118-119 { (...) xsize: myImageXSize (or myCropSize if cropping) ysize: myImageYSize (or myCropSize if cropping) }

Adjust stride for detecting custom object sizes: in L73 { stride: myMinObjectSize }. But here I can't work out which parameters I need to tune, as 1248 x 352 looks like the original image size but not quite. In L2504 { param_str : '1248, 352, 16, 0.6, 3, 0.02, 22' } I would then guess param_str = 'xSize, ySize, stride, ?, ?, ?, ?' but as for the rest...

Same for L2519 and L2545

However, I can't work out what else I would need to modify: L2418 does not seem to need modification, as it is the bounding box regressor, so it should output 4 values (unless I'm mistaken).

I would love to add documentation on using DetectNet & DIGITS with a custom dataset; however, I can't really understand everything yet.

Regards

fchouteau avatar Aug 26 '16 13:08 fchouteau

For 1024x1024 images and target objects around 192x192 you probably don't need to adjust the stride initially. DetectNet with default settings should be sensitive to objects in the range 50-400px. That means that you can just replace the 1248x348/352 everywhere by 1024x1024 and it should "just work".

Something I found that helped accuracy when I modified image sizes was to use random cropping in the "train_transform" - modify the image_size_x and image_size_y parameters to, say, 512 and 512 and set crop_bboxes: false.

jon-barker avatar Aug 26 '16 13:08 jon-barker

@jbarker-nvidia , Hi I did what you said (set crop_bboxes: false) and it improved my mAP from 1.6 to 14 percent, kindly take a look at my question #1011 , Thank you.

szm-R avatar Aug 28 '16 17:08 szm-R

@jbarker-nvidia Thank you for your input, much appreciated. I have one more question, however: I was also thinking about sampling random crops (in my case 512x512) from the image, so setting image_size_x: 512, image_size_y: 512, crop_bboxes: false in detectnet_groundtruth_param. However, in the deploy data and later layers, should I specify 1024x1024 or 512x512 as the image size? My guess would be to put 1024x1024 before the train/val transform and at the end when calculating mAP and clustering bboxes, but I just wanted to be sure.

Regards

fchouteau avatar Aug 29 '16 12:08 fchouteau

@fchouteau Set image_size_x: 512 image_size_y: 512 crop_bboxes: false in name: "train_transform", i.e. the type: "DetectNetTransformation" layer applied at training time only. Everywhere else leave the image size as 1024x1024. That way cropping will only be applied at training time, and validation and test will use the full-size 1024x1024 images. This works fine because the heart of DetectNet is a fully-convolutional network, so it can be applied to varying image sizes.
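
For example, a minimal sketch of just the train_transform layer under that scheme (1024x1024 data, random 512x512 training crops); everything not shown here stays as in the stock network:

    layer {
      name: "train_transform"
      type: "DetectNetTransformation"
      bottom: "data"
      bottom: "label"
      top: "transformed_data"
      top: "transformed_label"
      detectnet_groundtruth_param: {
        stride: 16
        image_size_x: 512      # crop size, used at training time only
        image_size_y: 512
        crop_bboxes: false     # as recommended above when cropping randomly
        object_class: { src: 1 dst: 0 }
        # ... remaining ground-truth parameters unchanged from the stock network
      }
      detectnet_augmentation_param: {
        crop_prob: 1           # always take a random crop during training
        shift_x: 32
        shift_y: 32
        flip_prob: 0.5
        # ... remaining augmentation parameters unchanged from the stock network
      }
      transform_param: { mean_value: 127 }
      include: { phase: TRAIN }
    }

The val_transform, deploy_data and clustering layers keep 1024x1024.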

jon-barker avatar Aug 29 '16 13:08 jon-barker

Hello everyone,

I want to use DIGITS (DetectNet) + Caffe to detect objects in my own dataset. I read some posts about adapting some settings in DetectNet to use it for training and detection on a custom dataset. But apparently, most of the mentioned datasets consist of images with more or less the same dimensions for all images. My case is a bit different from the cases in the comments that I found …

I have 3 different object classes which I want to detect in images : classA, classB and classC.

For each object class, I have 3000 training images available (so 9000 in total), and 1500 validation images (4500 in total). Those images are ROIs (regions of interest from other images) that I manually cropped in the past, so the whole (training) image consists of one specific object. The smallest dimension of a training or validation image is always 256 (e.g. 256x256, 256x340, 256x402, 256x280, 340x256, … --> note: not a perfect square, but never a long rectangle like 256x1024 or 256x800; always a more or less square shape). Since all images consist of cropped regions (around an object) from other images, the label files look like this:

108 0.0 0 0.0 0.000000 0.000000 391.000000 255.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
108 0.0 0 0.0 0.000000 0.000000 255.000000 459.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
108 0.0 0 0.0 0.000000 0.000000 411.000000 255.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
etc.

-> image class = ‘108’ and bounding box of object in the image = image dimensions
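
For reference, these lines follow the standard KITTI label layout that the DIGITS object-detection example expects; taking the first line above, the fields map roughly as follows (the 3D fields are unused here and left at 0.0):

    # type truncated occluded alpha  left     top      right      bottom      height width length  x   y   z   rotation_y
      108  0.0       0        0.0    0.000000 0.000000 391.000000 255.000000  0.0    0.0   0.0     0.0 0.0 0.0 0.0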

I want to train an object detection model so I can detect those 3 objects (if present) in unknown test images, i.e. images that were not cropped beforehand. The dimensions of those images can vary (e.g. 800x600, 1200x800, 1486x680, … can be about anything). Remark: in these unknown images, if the object appears in them, the object can take up the whole image or only a smaller part of it.

My first question: is it necessary to make all the training/validation images the same dimensions (e.g. 256x256), or can I solve it by setting some parameters (pad image? resize image?) to a specific dimension while creating a dataset? It's not clear to me what exactly those parameters imply.

Second question : how about the test images that can have about any dimension ; do I have to resize them before analyzing or not?

If I get it right, I have to make some changes :

A] While creating a dataset, in the DIGITS box, change :

  • Pad image
  • Resize image

B] In detectnet_network.prototxt (dim:384 and dim:1248), here.

In the following lines, image_size_x:1248 and image_size_y:384 and crop_bboxes true/false are mentioned : here and here.

And in the following line, dimensions (1248, 352) are also used : here, here and here.

At this moment, it is not clear to me how to set these options for my specific case …

With kind regards.

JVR32 avatar Sep 06 '16 08:09 JVR32

@JVR32 DetectNet is not designed to work with datasets of the kind that you describe. A dataset for DetectNet should be images where the object you wish to detect is some smaller part of the image and has a bounding box label that is a smaller part of the image. Some of these images could have objects that take up a large part of the image, but not all of them, as it is important for DetectNet to be able to learn what "non-object" pixels look like around a bounding box. That ability to learn a robust background model is why DetectNet can work well. Also note that you will need to modify the standard DetectNet to work for multi-class object detection.

If you have access to the original dataset that you cropped the objects from then you should create a training dataset from those images and use the crop locations as the bounding box annotations to use DetectNet.

If you only have the cropped images to train on then you should just train an image classification network but make sure you train a Fully Convolutional Network (FCN). See here. An FCN for image classification can then be applied to a test image of any size and the output will be a "heatmap" of where objects might be present in the image. Note that this approach will not be as accurate as DetectNet and will suffer from a higher false alarm rate unless you also add non-object/background training samples to your dataset.
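
As a rough illustration of what that net-surgery step looks like in prototxt terms (hypothetical layer names following the CaffeNet example linked above; the kernel_size of the first converted layer has to match the spatial size of its input blob):

    # Original classifier head: fully-connected layers, fixed input size.
    layer { name: "fc6" type: "InnerProduct" bottom: "pool5" top: "fc6"
            inner_product_param { num_output: 4096 } }

    # Fully-convolutional replacement: same weights, reshaped. kernel_size: 6
    # matches the 6x6 spatial extent of pool5 in the CaffeNet example.
    layer { name: "fc6-conv" type: "Convolution" bottom: "pool5" top: "fc6-conv"
            convolution_param { num_output: 4096 kernel_size: 6 } }

    # The final classifier becomes a 1x1 convolution whose num_output is the
    # number of classes (e.g. 4: negative, classA, classB, classC), so the
    # output is a spatial "heatmap" of class scores for any input size.
    layer { name: "fc8-conv" type: "Convolution" bottom: "fc7-conv" top: "fc8-conv"
            convolution_param { num_output: 4 kernel_size: 1 } }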

jon-barker avatar Sep 06 '16 12:09 jon-barker

I guess I got stuck then. I already used the (cropped) training/validation images before (together with a set of negative images that didn't contain the specified objects) to do image classification. => 4 image classes : negative, classA, classB, classC

And that worked quite well, but ... it worked best when the whole image was taken up by the object, or when most of the image was taken up by the object. If the object was only a smaller part of the image, the image was often considered 'negative' (i.e. not an image of the object we look for).

That's why I hoped that using detection instead of classification would improve the results. Main purpose is to detect if a certain object is present in an image (taking the whole image or only a smaller part of the image). Multi-class is not important, it can be done in multiple checks (-> classA present or not? ; classB present or not? ; classC present or not?).

Unfortunately, I manually cropped all the images in the past, so I don't have the crop locations in the original images :-( .

JVR32 avatar Sep 06 '16 13:09 JVR32

Suppose I had done it differently, and I would have 9000 training images and 4500 validation images with dimensions 640x640, and the wanted objects were smaller parts of those images :

e.g.
image1 = 640x640 with object ROI = (10, 10, 200, 250) = (top, left, bottom, right)
image2 = 640x640 with object ROI = (200, 10, 400, 370)
image3 = 640x640 with object ROI = (150, 150, 400, 400)
...

My test images could still have different dimensions : e.g. 800x600, 1200x800, 1486x680, …

Which settings should I provide while creating a dataset in the DIGITS box :

  • Pad image?
  • Resize image?

Are those totally independent from the possible dimensions of the test images (-> leave pad image empty and put 640x640 for resize image) or not?

And what about the dim and image_size_x and image_size_y parameters in detectnet_network.prototxt ? Now 384/352 and 1248 are used, but what if the dimensions of the test images can be different, what do I have to put for those parameters?

JVR32 avatar Sep 06 '16 13:09 JVR32

@jbarker-nvidia I explored the code and it is not clear to me why you set "crop_bboxes: false".

As I understand it, the function pruneBboxes() (in detectnet_coverage_rectangle.cpp) adjusts the boxes according to the applied transformation. What happens when crop_bboxes is set to false?

sherifshehata avatar Sep 07 '16 10:09 sherifshehata

Hello,

Could you please point me in the right direction before I spend a lot of time annotating images that cannot be used in later processing?

You told me that I cannot use cropped images, and I can see why ...

But I would like to use object detection in Digits, so I'm willing to start over, and annotate the images again (determine bounding box coordinates around object), but I want to be sure I do it the right way this time.

So, this is my setup :

Suppose I want to detect if a certain object (let's call it classA) is present in an unknown image.

I start with collecting a number of images, e.g. 1000 images that contain objects of classA.

All those images can have different dimensions : 480x480 ; 640x640 ; 800x600 ; 1024x1024 ; 3200x1800 ; 726x1080 ; 1280x2740 ; ...

First question : how do I start?

a] Keep the original dimensions, and get the bounding box coordinates for the object of classA in the image ?

b] Resize the images, so they all have comparable dimensions (e.g. resize so the smallest or largest dimension is 640), and after that get the bounding box coordinates for the object of classA in the resized image ?

c] None of the options above; all images must have exactly the same dimensions, so resize all images to the same dimensions, and after that get the bounding box coordinates.

Options a] and b] can be done without a problem; c] is not that flexible, so I'd rather avoid it if it's not necessary.

So, that's the first thing I need to know: how do I start? Can I get bounding boxes for the original images, or do I have to resize the images before determining the bounding boxes?

And then the second question : if I follow option a], b] or c] ... I will have 1000 images with for each image the bounding boxes around objects of classA.

After that I'm ready to create the database.

For parameter 'custom classes', I can use 'dontcare,classA'.

But how do I use the 'padding image' and 'resize image'?

I hope you can help me, because I really want to try to detect objects on my own data, but it's not clear to me how to get started ...

With kind regards,

Johan.

JVR32 avatar Sep 08 '16 22:09 JVR32

@JVR32 You can annotate bounding boxes on the images in their original size - this is probably desirable so that you can use them in that form in the future. DIGITS can resize the images and bounding box annotations during data ingest.

There's no definitive way to use 'padding image' and 'resize image', but to use DetectNet without modification you want to ensure that most of your objects are within the 50x50 to 400x400 pixel range. The benefit of padding is that you maintain aspect ratio and pixel resolution/object scaling. Having said that, if you have large variation in your input image sizes it is not desirable to pad too much around small images, so you may choose to resize all images to some size in the middle.

jon-barker avatar Sep 09 '16 02:09 jon-barker

Thank you very much for the information.

In that case, I think it is best that I resize all images to have -more or less- the same dimensions before starting to process them.

=> I will resize all images so the smallest dimension is 640.

Then, input images will have dimensions 640x640 ; 640x800 ; 490x640 ; 380x640 ; 640x500; ... and then normally, the object sizes will be in the range 50x50 to 400x400.

=> After resizing, I can start annotating, and determine the bounding boxes in the resized images.

Note : I think the bounding boxes don't have to be square?!

And when I'm done annotating, I will have a set of images with the smallest dimension 640 and bounding boxes in those images.

Maintaining the aspect ratio is important, so since I will have resized the images before annotating them, I suppose it is better to use padding (instead of resize) while creating the dataset?

I'll have to use padding if I'm correct, because all the input images must have the same dimensions, right? So is it correct to leave the 'resize' parameters empty in that case, and set the padding so that all images (with dimensions 640x640 ; 640x800 ; 490x640 ; 380x640 ; 640x500; ...) will fit in it -> e.g. 800 x 800?

Or do I have to set a bigger padding (e.g. 1024 x 1024) and set resize to 800x800?

I guess I have to use at least one of the two parameters (padding or resizing), and that I cannot just input the images with their various dimensions (640x640 ; 640x800 ; 490x640 ; 380x640 ; 640x500; ...) without setting one of them?


JVR32 avatar Sep 09 '16 06:09 JVR32

@JVR32

I think the bounding boxes don't have to be square?!

Correct

Or do I have to set a bigger padding (e.g. 1024 x 1024) and set resize to 800x800?

You don't need to do this, you can just pad up to the standard size. If all of your images are already 640 in one dimension then I would just pad the other dimension to the size of your largest image in that dimension. That minimizes further unnecessary manipulation of the data.

jon-barker avatar Sep 09 '16 12:09 jon-barker

I am trying to train a model for pedestrians (using the TownCentre annotations) based on the KITTI example for cars. First I tried using the original resolution (1920x1080), but changing the network parameters according to the comments above (replacing 1248x348/352 with the new resolution) led to the error "bottom[i]->shape == bottom[0]->shape", which I was not able to solve.

To avoid having to change the network parameters, I just rescaled all training images (and annotations accordingly) to the same resolution as KITTI, but the accuracy remains very low (also after processing 350 epochs). When I tried the advice of using cropping from the images, I fell back to the same error message about the shape.

Is there some other example available for object detection with different resolution input that reaches acceptable results?

fdesmedt avatar Sep 20 '16 06:09 fdesmedt

Which layer gives this error? My guess is that it is because your resolution is not divisible by 16, so you should replace 1248x348/352 with 1920x1080/1088.
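
(A quick check of the arithmetic, assuming the constraint is simply that each dimension must be divisible by the overall stride of 16:)

\[
1920 / 16 = 120, \qquad 1080 / 16 = 67.5 \ (\text{not an integer}), \qquad 1088 / 16 = 68 .
\]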

sherifshehata avatar Sep 20 '16 07:09 sherifshehata

The problem is always on "bboxes-masked"

I will try your suggestion. The original size in the network is actually 1248x384 (copied the values from above, which turned out to be incorrect). The 384 value is however divisible by 16, so what is the reason for using 352 there?

Another question: is it important to scramble the data? The training data I have are the images of a long sequence, so consecutive frames contain a lot of the same pedestrians. Is this data scrambled before training? Or should I do this myself?

fdesmedt avatar Sep 20 '16 07:09 fdesmedt

I have tried your suggestion, but still get an error on the shape-issue. I attach the resulting log-file: caffe_output.txt

fdesmedt avatar Sep 20 '16 08:09 fdesmedt

Did you make any other changes? Your bboxes shape is 3 4 67 120, while I think it should be 3 4 68 120.

sherifshehata avatar Sep 20 '16 08:09 sherifshehata

I did not change anything else, just replaced all instances of 1248 with 1920, 384 with 1080 and 352 with 1088. Does the last one make sense?

It seems indeed that the 67 is the problem. I think it comes from the 1080 size, which is pooled 4 times (leading to dimensions 540, 270, 135 and 67, of which the last one is truncated). I am now recreating the dataset with padding to 1088 to avoid the truncation. Hope this helps ;)

fdesmedt avatar Sep 20 '16 09:09 fdesmedt

Hello,

I trained a detection network as follows :

All training images (containing the objects I want to detect) can have different dimensions : 480x480 ; 640x640 ; 800x600 ; 1024x1024 ; 3200x1800 ; 726x1080 ; 1280x2740 ; 5125x3480 ; ... Before annotating (determining the bounding boxes around the objects in the images -> needed for KITTI format), I resized all those images so the largest dimension is 640. Then, input images will have dimensions 640x640 ; 640x400 ; 490x640 ; 380x640 ; 640x500; ... and then normally, the object sizes will be in the range 50x50 to 400x400. After resizing, I can start annotating, and determine the bounding boxes in the resized images. And when I'm done annotating, I have a set of images with the largest dimension 640 and bounding boxes around the objects of interest in those images.

I use those resized images and the bounding boxes around the objects for building a dataset, using these settings (screenshot omitted). So I padded the images to 640 x 640.

In 'detectnet_network.prototxt', I replaced 384/352 and 1248 by 640.

After training, I want to test the network.

The images I want to test can also have different dimensions. I can resize those images so the largest dimension is 640, but I don't know if that is necessary? And since the images can have different dimensions, it seems logical to me to set the 'Do not resize input image(s)' flag to TRUE (screenshot omitted)?

I created a text file with the paths to the images I would like to test. If I use this file for 'test many', it will generate some results if the Do not resize input image(s) is not set. If I set this flag to TRUE, it generates an error :

Couldn't import dot_parser, loading of dot files will not be possible.
2016-09-30 10:13:21 [ERROR] ValueError: could not broadcast input array from shape (3,480,640) into shape (3,640,640)
Traceback (most recent call last):
  File "C:\Programs\DIGITS-master\tools\inference.py", line 293, in
    args['resize']
  File "C:\Programs\DIGITS-master\tools\inference.py", line 167, in infer
    resize=resize)
  File "C:\Programs\DIGITS-master\digits\model\tasks\caffe_train.py", line 1394, in infer_many
    resize=resize)
  File "C:\Programs\DIGITS-master\digits\model\tasks\caffe_train.py", line 1434, in infer_many_images
    'data', image)
ValueError: could not broadcast input array from shape (3,480,640) into shape (3,640,640)

What I don't understand : if I put only 1 file in the images list (and press test many), there is no error. If I put multiple files in the list, I get the error. But only if the 'do not resize' flag is checked ; if not checked -> no error?

Is this a bug, or is there a logical explanation? Anyhow, I guess it must be possible to process a list of (test)images without resizing them before object detection? If it works for a single image in the list, it should also be possible for multiple images?

JVR32 avatar Sep 30 '16 08:09 JVR32

Hi @JVR32 thanks for the detailed report! This is a bug indeed, sorry about that. I agree the error is certainly not explicit! We have a Github issue for this: #1092 In short the explanation is: when you test a batch of images, they must all have the same size otherwise you can't fit them all into a tensor.

gheinrich avatar Sep 30 '16 11:09 gheinrich

Hello,

I trained a network for object detection -> setup was described in my previous post (2 posts above this). As you can see, I resized (before annotating) all the training and validation images so the largest dimension is 640 pixels, but the other dimension isn't always the same, it can vary -> 640x480;640x402;640x380;312x640...

While building the dataset, I set the option to pad the images to 640 x 640 (I didn't touch the 'resize' option).

2 questions :

A] Since the test images can also have different dimensions, I thought I should check the flag 'do not resize input images', especially since I didn't use 'resize' while creating the dataset. But somehow the detection seems better if the 'do not resize input images' flag is unchecked, although the aspect ratio changes (a test image of 640 x 480 becomes an image of 640 x 640). Is this logical?

B] For the training, I used the following settings :

(screenshot of the training settings omitted)

Plotting the precision for different numbers of (training) images (plot omitted):

As you can see, increasing the number of images improves the precision, but at a certain point, increasing the number of images further produces a worse precision (red curve). My question: does anyone have a suggestion on what to try first to improve the model -> other solver type, other learning rate, other parameter value ... what should I try first?

JVR32 avatar Oct 11 '16 08:10 JVR32

This is my first experiment with DetectNet. I built the dataset as specified in the link https://github.com/NVIDIA/DIGITS/blob/master/examples/object-detection/README.md. The resulting dataset properties are given below:

DB backend: lmdb
Create train_db DB
    Entry Count: 645
    Feature shape (3, 800, 1360)
    Label shape (1, 7, 16)
Create val_db DB
    Entry Count: 96
    Feature shape (3, 800, 1360)
    Label shape (1, 5, 16)

I tried to train the model with the above dataset. The configurations are done as specified in the above link.

(screenshot omitted)

the initial part of the .prototxt file is given below

name: "DetectNet"
layer { name: "train_data" type: "Data" top: "data" data_param { batch_size: 4 } include: { phase: TRAIN } }
layer { name: "train_label" type: "Data" top: "label" data_param { batch_size: 4 } include: { phase: TRAIN } }
layer { name: "val_data" type: "Data" top: "data" data_param { batch_size: 4 } include: { phase: TEST stage: "val" } }
layer { name: "val_label" type: "Data" top: "label" data_param { batch_size: 4 } include: { phase: TEST stage: "val" } }
layer { name: "deploy_data" type: "Input" top: "data" input_param { shape { dim: 1 dim: 3 dim: 800 dim: 1360 } } include: { phase: TEST not_stage: "val" } }

layer { name: "train_transform" type: "DetectNetTransformation"
  bottom: "data" bottom: "label" top: "transformed_data" top: "transformed_label"
  detectnet_groundtruth_param: {
    stride: 10 scale_cvg: 0.4 gridbox_type: GRIDBOX_MIN coverage_type: RECTANGULAR min_cvg_len: 20 obj_norm: true
    image_size_x: 1360 image_size_y: 800 crop_bboxes: true
    object_class: { src: 1 dst: 0 }  # obj class 1 -> cvg index 0
  }
  detectnet_augmentation_param: {
    crop_prob: 1 shift_x: 32 shift_y: 32 flip_prob: 0.5 rotation_prob: 0 max_rotate_degree: 5
    scale_prob: 0.4 scale_min: 0.8 scale_max: 1.2
    hue_rotation_prob: 0.8 hue_rotation: 30 desaturation_prob: 0.8 desaturation_max: 0.8
  }
  transform_param: { mean_value: 127 }
  include: { phase: TRAIN }
}
layer { name: "val_transform" type: "DetectNetTransformation"
  bottom: "data" bottom: "label" top: "transformed_data" top: "transformed_label"
  detectnet_groundtruth_param: {
    stride: 10 scale_cvg: 0.4 gridbox_type: GRIDBOX_MIN coverage_type: RECTANGULAR min_cvg_len: 20 obj_norm: true
    image_size_x: 1360 image_size_y: 800 crop_bboxes: false
    object_class: { src: 1 dst: 0 }  # obj class 1 -> cvg index 0
  }
  transform_param: { mean_value: 127 }
  include: { phase: TEST stage: "val" }
}
layer { name: "deploy_transform" type: "Power" bottom: "data" top: "transformed_data" power_param { shift: -127 } include: { phase: TEST not_stage: "val" } }

layer { name: "slice-label" type: "Slice" bottom: "transformed_label" top: "foreground-label" top: "bbox-label" top: "size-label" top: "obj-label" top: "coverage-label" slice_param { slice_dim: 1 slice_point: 1 slice_point: 5 slice_point: 7 slice_point: 8 } include { phase: TRAIN } include { phase: TEST stage: "val" } }
layer { name: "coverage-block" type: "Concat" bottom: "foreground-label" bottom: "foreground-label" bottom: "foreground-label" bottom: "foreground-label" top: "coverage-block" concat_param { concat_dim: 1 } include { phase: TRAIN } include { phase: TEST stage: "val" } }
layer { name: "size-block" type: "Concat" bottom: "size-label" bottom: "size-label" top: "size-block" concat_param { concat_dim: 1 } include { phase: TRAIN } include { phase: TEST stage: "val" } }
layer { name: "obj-block" type: "Concat" bottom: "obj-label" bottom: "obj-label" bottom: "obj-label" bottom: "obj-label" top: "obj-block" concat_param { concat_dim: 1 } include { phase: TRAIN } include { phase: TEST stage: "val" } }
layer { name: "bb-label-norm" type: "Eltwise" bottom: "bbox-label" bottom: "size-block" top: "bbox-label-norm" eltwise_param { operation: PROD } include { phase: TRAIN } include { phase: TEST stage: "val" } }
layer { name: "bb-obj-norm" type: "Eltwise" bottom: "bbox-label-norm" bottom: "obj-block" top: "bbox-obj-label-norm" eltwise_param { operation: PROD } include { phase: TRAIN } include { phase: TEST stage: "val" } }

While training I am getting an error:

ERROR: Check failed: bottom[i]->shape() == bottom[0]->shape()

Details:
Creating layer coverage/sig
Creating Layer coverage/sig
coverage/sig <- cvg/classifier
coverage/sig -> coverage
Setting up coverage/sig
Top shape: 3 1 50 85 (12750)
Memory required for data: 4344944304
Creating layer bbox/regressor
Creating Layer bbox/regressor
bbox/regressor <- pool5/drop_s1_pool5/drop_s1_0_split_1
bbox/regressor -> bboxes
Setting up bbox/regressor
Top shape: 3 4 50 85 (51000)
Memory required for data: 4345148304
Creating layer bbox_mask
Creating Layer bbox_mask
bbox_mask <- bboxes
bbox_mask <- coverage-block
bbox_mask -> bboxes-masked
Check failed: bottom[i]->shape() == bottom[0]->shape()

I changed the batch size, stride size, etc., but nothing helped. What should I do?

Regards,

varunvv avatar Nov 29 '16 04:11 varunvv

Apologies, I made a mistake. The class names used in the label (.txt) files and in 'Custom classes' were different.

regards,

varunvv avatar Nov 29 '16 12:11 varunvv

@lukeyeager Hello, is there a complete explanation of exactly which parameters one needs to adjust to train DetectNet on custom-sized data when the stride parameter also requires modification? So far I have been able to change the image sizes in the prototxt file and the training started without any errors, but changing the stride parameter looks quite challenging; changing it results in various errors ...

ShervinAr avatar Feb 02 '17 16:02 ShervinAr

Hi, everybody!

I've cloned this repo https://github.com/skyzhao3q/NvidiaDigitsObjDetect

and done everything as mentioned in the Readme (made a dataset for object detection, ran the network). But all I got was this:

(screenshot of the training graph omitted)

What's the problem with one class? Full KITTI works OK.

aprentis avatar Mar 30 '17 08:03 aprentis

@aprentis Can you hover over the graph so that we can see the actual numeric results for the metric? It matters greatly whether those numbers are just small or exactly zero.

Looking at the repo you cloned, I noticed that the model has explicit "dontcare" regions marked - whilst this can be useful, e.g. for masking out the sky when you only care about the road, it is not necessary. I'm not sure what regions are being marked as "dontcare" for this data, but if they include the sidewalks where the pedestrians are then you're going to have problems.

jon-barker avatar Mar 30 '17 13:03 jon-barker

@jbarker-nvidia those numbers are exactly zero. Right now I'm training another model (with one class in it); unfortunately it has the same problem.

In this repo I've found a result screenshot which shows that mAP is OK after 10 epochs. Does anybody know of a success story of training DetectNet with only one class?

aprentis avatar Mar 30 '17 13:03 aprentis