Object Detection using SSD

Kashika Yadav
14 min read · Sep 16, 2021

Abstract

Fast Moving Consumer Goods brands require insights into retail shelves to help them improve their sales. One such insight comes from determining how many products of their brands are present versus how many products of competing brands are present on the retail store shelf. This requires finding the total number of products present on every shelf in a retail store.

The problem statement uses annotated grocery store shelf images to detect all products present on every shelf (detection only at the product vs. no-product level) using single shot object detection with multiple anchor boxes per feature map cell.

A few months back, I was working on the above problem statement and learnt about SSD. Here I'll be sharing my learnings: the SSD architecture, its implementation, and the results obtained on this problem statement.

Introduction

The above problem statement is an instance of the computer vision task called object detection. There are various algorithms for object detection, such as R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and SSD. SSD offers both high speed and good accuracy, which is why we use it here.

  1. SSD uses VGG16 to extract feature maps and then applies convolutional layers to detect objects.
  2. After the convolutional layers for feature extraction, an m×n feature map with p channels is generated. Here, m×n is the number of locations in the feature map.
  3. At each of the m×n locations, k bounding boxes are drawn, so the total number of bounding boxes is m×n×k.
  4. For each bounding box, we compute c class scores and 4 offsets. Hence the total number of outputs is m×n×k×(c+4).
  5. Following the above logic, we can calculate the number of bounding boxes generated in SSD.

Six feature maps are used for prediction: Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2.

Conv4_3: 38×38×4 = 5776 boxes (4 boxes for each location)

Conv7: 19×19×6 = 2166 boxes (6 boxes for each location)

Conv8_2: 10×10×6 = 600 boxes (6 boxes for each location)

Conv9_2: 5×5×6 = 150 boxes (6 boxes for each location)

Conv10_2: 3×3×4 = 36 boxes (4 boxes for each location)

Conv11_2: 1×1×4 = 4 boxes (4 boxes for each location)

Total = 5776 + 2166 + 600 + 150 + 36 + 4 = 8732 boxes
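
As a quick sanity check, the feature-map sizes and boxes-per-location listed above reproduce the 8732 total in a couple of lines of Python:

PYTORCH

# Feature-map side lengths and anchor boxes per location, as listed above
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_location = [4, 6, 6, 6, 4, 4]
total = sum(s * s * k for s, k in zip(feature_map_sizes, boxes_per_location))
print(total)  # 8732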

As the CNN reduces the spatial dimensions gradually, the resolution of the feature maps also decreases. SSD uses lower-resolution layers to detect larger-scale objects. SSD adds 6 auxiliary convolution layers after VGG16. Five of them are used for object detection. In three of those layers, we make 6 predictions per location instead of 4. In total, SSD makes 8732 predictions using 6 layers.

Figure 1: SSD: Single Shot MultiBox Detector

Motivation

The motivation behind using SSD is its high speed and accuracy in real time. It avoids the region proposal network of Faster R-CNN, which takes two shots: one to propose regions, and a second to detect the object in each proposed region. Instead, SSD predicts both the class and the location of objects in a single shot. SSD300 achieves 74.3% mAP at 59 FPS while SSD500 achieves 76.9% mAP at 22 FPS, which outperforms Faster R-CNN (73.2% mAP at 7 FPS) and YOLOv1 (63.4% mAP at 45 FPS).

Methodology

Single shot multibox detection is a simple, fast, and widely used model. Although it is just one of many object detection models, some of the design principles and implementation details in this section also apply to other models.

Model

Single-shot multibox detection mainly consists of a base network followed by several multiscale feature map blocks. The base network is for extracting features from the input image, so it can use a deep CNN. For example, the original single-shot multibox detection paper adopts a VGG network truncated before the classification layers, while a ResNet can also be used. Through our design we can make the base network output larger feature maps so as to generate more anchor boxes for detecting smaller objects. Subsequently, each multiscale feature map block reduces (e.g., by half) the height and width of the feature maps from the previous block, and enables each unit of the feature maps to increase its receptive field on the input image.

The design of multiscale object detection is layer-wise: since feature maps closer to the top are smaller but have larger receptive fields, they are suitable for detecting fewer but larger objects.

In a nutshell, via its base network and several multiscale feature map blocks, single-shot multibox detection generates a varying number of anchor boxes with different sizes, and detects varying-size objects by predicting classes and offsets of these anchor boxes (thus the bounding boxes); hence it is a multiscale object detection model.
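
To make the multiscale idea concrete, here is a minimal sketch using the d2l.multibox_prior helper that also appears later in this post: a coarser feature map yields fewer anchors, which are then given larger sizes for larger objects. The specific sizes 0.15 and 0.4 below are only illustrative.

PYTORCH

import torch
from d2l import torch as d2l

# Anchors centered on a 4x4 feature map (finer scale, smaller boxes) ...
fmap_fine = torch.zeros((1, 1, 4, 4))
anchors_fine = d2l.multibox_prior(fmap_fine, sizes=[0.15], ratios=[1])
# ... versus a 2x2 feature map (coarser scale, larger boxes)
fmap_coarse = torch.zeros((1, 1, 2, 2))
anchors_coarse = d2l.multibox_prior(fmap_coarse, sizes=[0.4], ratios=[1])
print(anchors_fine.shape, anchors_coarse.shape)
# torch.Size([1, 16, 4]) torch.Size([1, 4, 4])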

In the following, we will describe the implementation details of different blocks of SSD. To begin with, we discuss how to implement the class and bounding box prediction.

Class Prediction Layer

Let the number of object classes be q. Then anchor boxes have q+1 classes, where class 0 is background. At some scale, suppose that the height and width of the feature maps are h and w, respectively. When a anchor boxes are generated centered on each spatial position of these feature maps, a total of hwa anchor boxes need to be classified. This often makes classification with fully-connected layers infeasible due to the heavy parameterization cost. Instead, single-shot multibox detection uses the channels of a convolutional layer to predict classes, which reduces model complexity.

Specifically, the class prediction layer uses a convolutional layer without altering width or height of feature maps. In this way, there can be a one-to-one correspondence between outputs and inputs at the same spatial dimensions (width and height) of feature maps. More concretely, channels of the output feature maps at any spatial position (x, y) represent class predictions for all the anchor boxes centered on (x, y) of the input feature maps. To produce valid predictions, there must be a(q+1) output channels, where for the same spatial position the output channel with index i(q+1)+j represents the prediction of the class j(0≤j≤q) for the anchor box i (0≤i<a).

Below we define such a class prediction layer, specifying a and q via arguments num_anchors and num_classes, respectively. This layer uses a 3×3 convolutional layer with a padding of 1. The width and height of the input and output of this convolutional layer remain unchanged.

PYTORCH

%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

def cls_predictor(num_inputs, num_anchors, num_classes):
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)
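
To see how the output channels encode (anchor, class) pairs, the prediction can be reshaped so that channel index i(q+1)+j maps to anchor i and class j. This is a small illustrative check using the layer just defined, not part of the model itself:

PYTORCH

# Suppose a = 3 anchors per location and q = 10 object classes
a, q = 3, 10
Y = cls_predictor(8, a, q)(torch.zeros((2, 8, 20, 20)))
print(Y.shape)  # torch.Size([2, 33, 20, 20]), i.e. a * (q + 1) = 33 channels
# Channel i * (q + 1) + j holds the score of class j for anchor i at each (x, y).
# Reshaping the channel dimension makes that indexing explicit:
Y = Y.reshape(2, a, q + 1, 20, 20)
score = Y[:, 1, 4, 5, 7]  # anchor i = 1, class j = 4 at spatial position (5, 7)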

Bounding Box Prediction Layer

The design of the bounding box prediction layer is similar to that of the class prediction layer. The only difference lies in the number of outputs for each anchor box: here we need to predict four offsets rather than q+1 classes.

PYTORCH

def bbox_predictor(num_inputs, num_anchors):
    return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1)

Concatenating Predictions for Multiple Scales

As we mentioned, single-shot multibox detection uses multiscale feature maps to generate anchor boxes and predict their classes and offsets. At different scales, the shapes of feature maps or the numbers of anchor boxes centered on the same unit may vary. Therefore, shapes of the prediction outputs at different scales may vary.

In the following example, we construct feature maps at two different scales, Y1 and Y2, for the same minibatch, where the height and width of Y2 are half of those of Y1. Let us take class prediction as an example. Suppose that 5 and 3 anchor boxes are generated for every unit in Y1 and Y2, respectively. Suppose further that the number of object classes is 10. For feature maps Y1 and Y2 the numbers of channels in the class prediction outputs are 5×(10+1)=55 and 3×(10+1)=33 respectively, where either output shape is (batch size, number of channels, height, width).

PYTORCH

def forward(x, block):
    return block(x)

Y1 = forward(torch.zeros((2, 8, 20, 20)), cls_predictor(8, 5, 10))
Y2 = forward(torch.zeros((2, 16, 10, 10)), cls_predictor(16, 3, 10))
Y1.shape, Y2.shape

(torch.Size([2, 55, 20, 20]), torch.Size([2, 33, 10, 10]))

As we can see, except for the batch size dimension, the other three dimensions all have different sizes. To concatenate these two prediction outputs for more efficient computation, we will transform these tensors into a more consistent format.

Note that the channel dimension holds the predictions for anchor boxes with the same center. We first move this dimension to the innermost. Since the batch size remains the same for different scales, we can transform the prediction output into a two-dimensional tensor with shape (batch size, height × width × number of channels). Then we can concatenate such outputs at different scales along dimension 1.

PYTORCH

def flatten_pred(pred):
    return torch.flatten(pred.permute(0, 2, 3, 1), start_dim=1)

def concat_preds(preds):
    return torch.cat([flatten_pred(p) for p in preds], dim=1)

In this way, even though Y1 and Y2 have different sizes in channels, heights, and widths, we can still concatenate these two prediction outputs at two different scales for the same minibatch.

PYTORCH

concat_preds([Y1, Y2]).shape

torch.Size([2, 25300])

Downsampling Block

In order to detect objects at multiple scales, we define the following downsampling block down_sample_blk that halves the height and width of input feature maps. In fact, this block applies the design of VGG blocks . More concretely, each downsampling block consists of two 3×3 convolutional layers with padding of 1 followed by a 2×2 maximum pooling layer with stride of 2. As we know, 3×3 convolutional layers with padding of 1 do not change the shape of feature maps. However, the subsequent 2×2 maximum pooling reduces the height and width of input feature maps by half. For both input and output feature maps of this downsampling block, because 1×2+(3−1)+(3−1)=6, each unit in the output has a 6×6 receptive field on the input. Therefore, the downsampling block enlarges the receptive field of each unit in its output feature maps.

PYTORCH

def down_sample_blk(in_channels, out_channels):
    blk = []
    for _ in range(2):
        blk.append(nn.Conv2d(in_channels, out_channels,
                             kernel_size=3, padding=1))
        blk.append(nn.BatchNorm2d(out_channels))
        blk.append(nn.ReLU())
        in_channels = out_channels
    blk.append(nn.MaxPool2d(2))
    return nn.Sequential(*blk)

In the following example, our constructed downsampling block changes the number of input channels and halves the height and width of the input feature maps.

PYTORCH

forward(torch.zeros((2, 3, 20, 20)), down_sample_blk(3, 10)).shape

torch.Size([2, 10, 10, 10])

Base Network Block

The base network block is used to extract features from input images. For simplicity, we construct a small base network consisting of three downsampling blocks that double the number of channels at each block. Given a 256×256 input image, this base network block outputs 32×32 feature maps (256/2³ = 32).

PYTORCH

def base_net():
    blk = []
    num_filters = [3, 16, 32, 64]
    for i in range(len(num_filters) - 1):
        blk.append(down_sample_blk(num_filters[i], num_filters[i + 1]))
    return nn.Sequential(*blk)

forward(torch.zeros((2, 3, 256, 256)), base_net()).shape

torch.Size([2, 64, 32, 32])

The Complete Model

The complete single shot multibox detection model consists of five blocks. The feature maps produced by each block are used for both (i) generating anchor boxes and (ii) predicting classes and offsets of these anchor boxes. Among these five blocks, the first one is the base network block, the second to the fourth are downsampling blocks, and the last block uses global maximum pooling to reduce both the height and width to 1. Technically, the second to the fifth blocks are all multiscale feature map blocks.

PYTORCH

def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 1:
        blk = down_sample_blk(64, 128)
    elif i == 4:
        blk = nn.AdaptiveMaxPool2d((1, 1))
    else:
        blk = down_sample_blk(128, 128)
    return blk

Now we define the forward propagation for each block. Different from in image classification tasks, outputs here include (i) CNN feature maps Y, (ii) anchor boxes generated using Y at the current scale, and (iii) classes and offsets predicted (based on Y) for these anchor boxes.

PYTORCH

def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)

Recall that a multiscale feature map block that is closer to the top is for detecting larger objects; thus, it needs to generate larger anchor boxes. In the above forward propagation, at each multiscale feature map block we pass in a list of two scale values via the sizes argument of the invoked multibox_prior function. In the following, the interval between 0.2 and 1.05 is split evenly into five sections to determine the smaller scale values at the five blocks: 0.2, 0.37, 0.54, 0.71, and 0.88. Their larger scale values are then given by √(0.2×0.37) ≈ 0.272, √(0.37×0.54) ≈ 0.447, and so on.

PYTORCH

sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1
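
As a side note, these hard-coded values can be reproduced programmatically: split [0.2, 1.05] evenly for the smaller scales and pair each one with the geometric mean of itself and the next smaller scale (using 1.05 for the last pair). A minimal sketch:

PYTORCH

# Five smaller scales evenly spaced in [0.2, 1.05) with step 0.17
smaller = [round(0.2 + 0.17 * i, 2) for i in range(5)]   # 0.2, 0.37, 0.54, 0.71, 0.88
upper = smaller[1:] + [1.05]
# Larger scale = geometric mean of the current and next smaller scale
larger = [round((s * u) ** 0.5, 3) for s, u in zip(smaller, upper)]
print(list(zip(smaller, larger)))
# [(0.2, 0.272), (0.37, 0.447), (0.54, 0.619), (0.71, 0.79), (0.88, 0.961)]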

Now we can define the complete model TinySSD as follows.

PYTORCH

class TinySSD(nn.Module):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        idx_to_in_channels = [64, 128, 128, 128, 128]
        for i in range(5):
            # Equivalent to the assignment statement `self.blk_i = get_blk(i)`
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}',
                    cls_predictor(idx_to_in_channels[i], num_anchors,
                                  num_classes))
            setattr(self, f'bbox_{i}',
                    bbox_predictor(idx_to_in_channels[i], num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # Here `getattr(self, 'blk_%d' % i)` accesses `self.blk_i`
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        anchors = torch.cat(anchors, dim=1)
        cls_preds = concat_preds(cls_preds)
        cls_preds = cls_preds.reshape(cls_preds.shape[0], -1,
                                      self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds

We create a model instance and use it to perform forward propagation on a minibatch of 256×256 images X.

As shown earlier in this section, the first block outputs 32×32 feature maps. Recall that the second to fourth downsampling blocks halve the height and width and the fifth block uses global pooling. Since 4 anchor boxes are generated for each unit along the spatial dimensions of the feature maps, a total of (32²+16²+8²+4²+1)×4 = 5444 anchor boxes are generated for each image across all five scales.

PYTORCH

net = TinySSD(num_classes=1)
X = torch.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)

output anchors: torch.Size([1, 5444, 4])

output class preds: torch.Size([32, 5444, 2])

output bbox preds: torch.Size([32, 21776])
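
As a quick sanity check, the 5444 anchors per image reported above match the arithmetic (32² + 16² + 8² + 4² + 1) × 4:

PYTORCH

# (32*32 + 16*16 + 8*8 + 4*4 + 1*1) locations, each with num_anchors = 4 boxes
num_locations = sum(s * s for s in (32, 16, 8, 4, 1))    # 1361
assert anchors.shape[1] == num_locations * num_anchors   # 1361 * 4 = 5444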

Training

Now we will see how to train the single shot multibox detection model for object detection.

Reading the Dataset and Initializing the Model

To begin with, let us read the shelf image detection dataset.

PYTORCH

batch_size = 32
train_iter, _ = d2l.load_shelfimage_data(batch_size)

Downloading ../data/detection.zip from .

read 1000 training examples

read 100 validation examples

There is only one class in the shelfimage detection dataset. After defining the model, we need to initialize its parameters and define the optimization algorithm.

PYTORCH

device, net = d2l.try_gpu(), TinySSD(num_classes=1)
trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4)

Defining Loss and Evaluation Functions

Object detection has two types of losses. The first loss concerns classes of anchor boxes: its computation can simply reuse the cross-entropy loss function that we used for image classification. The second loss concerns offsets of positive (non-background) anchor boxes: this is a regression problem. For this regression problem, however, we do not use the squared loss. Instead, we use the L1 norm loss, the absolute value of the difference between the prediction and the ground truth. The mask variable bbox_masks filters out negative anchor boxes and illegal (padded) anchor boxes in the loss calculation. In the end, we sum up the anchor box class loss and the anchor box offset loss to obtain the loss function for the model.

PYTORCH

cls_loss = nn.CrossEntropyLoss(reduction='none')
bbox_loss = nn.L1Loss(reduction='none')

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]
    cls = cls_loss(cls_preds.reshape(-1, num_classes),
                   cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
    bbox = bbox_loss(bbox_preds * bbox_masks,
                     bbox_labels * bbox_masks).mean(dim=1)
    return cls + bbox

We can use accuracy to evaluate the classification results. Because the L1 norm loss is used for the offsets, we use the mean absolute error to evaluate the predicted bounding boxes. These prediction results are obtained from the generated anchor boxes and the predicted offsets for them.

PYTORCH

def cls_eval(cls_preds, cls_labels):
    # Because the class prediction results are on the final dimension,
    # `argmax` needs to specify this dimension
    return float(
        (cls_preds.argmax(dim=-1).type(cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((torch.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())

Training the Model

When training the model, we need to generate multiscale anchor boxes (anchors) and predict their classes (cls_preds) and offsets (bbox_preds) in the forward propagation. Then we label the classes (cls_labels) and offsets (bbox_labels) of such generated anchor boxes based on the label information Y. Finally, we calculate the loss function using the predicted and labeled values of the classes and offsets. For concise implementations, evaluation of the test dataset is omitted here.

PYTORCH

num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
net = net.to(device)
for epoch in range(num_epochs):
    # Sum of training accuracy, no. of examples in sum of training accuracy,
    # sum of absolute error, no. of examples in sum of absolute error
    metric = d2l.Accumulator(4)
    net.train()
    for features, target in train_iter:
        timer.start()
        trainer.zero_grad()
        X, Y = features.to(device), target.to(device)
        # Generate multiscale anchor boxes and predict their classes and
        # offsets
        anchors, cls_preds, bbox_preds = net(X)
        # Label the classes and offsets of these anchor boxes
        bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y)
        # Calculate the loss function using the predicted and labeled values
        # of the classes and offsets
        l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                      bbox_masks)
        l.mean().backward()
        trainer.step()
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.numel(),
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.numel())
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')

class err 3.19e-03, bbox mae 3.10e-03

5337.5 examples/sec on cuda:0

Prediction

During prediction, the goal is to detect all the objects of interest in the image. Below we read and resize a test image, converting it to the four-dimensional tensor required by the convolutional layers.

PYTORCH

X = torchvision.io.read_image('../img/shelfimage.jpg').unsqueeze(0).float()
img = X.squeeze(0).permute(1, 2, 0).long()
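
The code above assumes the test image is already 256×256, matching the training input size; if it is not, it can be resized first, for example with torchvision (a minimal sketch, not part of the original code):

PYTORCH

from torchvision.transforms import functional as TF

# Resize the batched (1, C, H, W) tensor to the 256x256 training resolution
X = TF.resize(X, [256, 256])
img = X.squeeze(0).permute(1, 2, 0).long()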

Using the multibox_detection function below, the predicted bounding boxes are obtained from the anchor boxes and their predicted offsets. Then non-maximum suppression is used to remove similar predicted bounding boxes.

PYTORCH

def predict(X):
    net.eval()
    anchors, cls_preds, bbox_preds = net(X.to(device))
    cls_probs = F.softmax(cls_preds, dim=2).permute(0, 2, 1)
    output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]

output = predict(X)

Finally, we display all the predicted bounding boxes with confidence 0.9 or above as the output.

PYTORCH

def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img)
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[0:2]
        bbox = [row[2:6] * torch.tensor((w, h, w, h), device=row.device)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output.cpu(), threshold=0.9)

Result and Discussion

We had 284 images in the training dataset and 71 images in the test dataset. We obtained an mAP of 0.905 on the training set and an mAP of 0.823 on the test set at an IoU threshold of 0.75.
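
The mAP computation itself is not shown in this post; one possible way to reproduce such a figure is with torchmetrics' MeanAveragePrecision restricted to a single IoU threshold of 0.75. This is an assumption about tooling (it also needs pycocotools installed), not the code behind the reported numbers:

PYTORCH

import torch
from torchmetrics.detection import MeanAveragePrecision  # assumed dependency

metric = MeanAveragePrecision(iou_thresholds=[0.75])
# One dict per image; boxes are in (xmin, ymin, xmax, ymax) pixel coordinates
preds = [{'boxes': torch.tensor([[10., 20., 110., 220.]]),
          'scores': torch.tensor([0.95]),
          'labels': torch.tensor([0])}]
targets = [{'boxes': torch.tensor([[12., 22., 108., 218.]]),
            'labels': torch.tensor([0])}]
metric.update(preds, targets)
print(metric.compute()['map'])  # mAP at IoU 0.75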

Dataset Preparation

  • Created annotations for all images in the downloaded ShelfImages dataset using labelImg.
  • Converted the generated XML annotations to CSV and saved them in the data folder as annotations.txt.
  • Augmentation used: horizontal flips of the images and generating images of different contrast and brightness (see the sketch below).
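
A rough sketch of these augmentations, assuming (C, H, W) image tensors and xyxy pixel boxes (not the exact code used for this project):

PYTORCH

import torch
import torchvision.transforms as T

# Brightness/contrast jitter is label-preserving, so the boxes stay unchanged
photometric = T.ColorJitter(brightness=0.3, contrast=0.3)

def hflip_with_boxes(img_tensor, boxes):
    """Horizontally flip a (C, H, W) image tensor and mirror xyxy boxes."""
    _, _, w = img_tensor.shape
    flipped = torch.flip(img_tensor, dims=[2])
    boxes = boxes.clone()
    boxes[:, [0, 2]] = w - boxes[:, [2, 0]]  # new xmin = w - old xmax, etc.
    return flipped, boxes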

Detection Network Used

  • Pretrained model: VGG16 with batch normalisation
  • Fine-tuned for 60 epochs on the prepared dataset.

Training Parameters

  • Epochs: 60
  • Batch size: 4
  • Initial learning rate: 0.001
  • Input image size: 300 × 300 × 3

Created a Google Colab instance to train the model on a GPU.

Note: Multiple anchor boxes per feature map cell were used. Below are the detected product counts per image for reference.

{

“C1_P11_N2_S4_3.JPG”: 41,

“C4_P07_N1_S3_1.JPG”: 40,

“C1_P03_N3_S2_1.JPG”: 30,

“C3_P01_N1_S5_1.JPG”: 32,

“C3_P05_N3_S2_1.JPG”: 13,

“C1_P08_N2_S4_1.JPG”: 56,

“C3_P01_N2_S3_2.JPG”: 24,

“C1_P03_N2_S2_1.JPG”: 25,

“C1_P02_N2_S3_1.JPG”: 28,

“C3_P03_N3_S3_1.JPG”: 39,

“C3_P04_N1_S5_1.JPG”: 41,

“C4_P07_N1_S3_2.JPG”: 40,

“C1_P12_N1_S2_1.JPG”: 14,

“C4_P08_N1_S4_1.JPG”: 44,

“C1_P03_N2_S3_1.JPG”: 43,

“C4_P05_N2_S2_1.JPG”: 6,

“C3_P06_N2_S3_2.JPG”: 37,

“C4_P08_N3_S3_1.JPG”: 33,

“C3_P03_N2_S4_1.JPG”: 50,

“C1_P03_N1_S2_1.JPG”: 28,

“C4_P08_N1_S5_2.JPG”: 36,

“C1_P06_N1_S3_1.JPG”: 36,

“C4_P04_N1_S3_1.JPG”: 26,

“C3_P06_N4_S3_1.JPG”: 32,

“C1_P11_N1_S4_2.JPG”: 39,

“C1_P11_N2_S3_2.JPG”: 30,

“C2_P07_N2_S2_1.JPG”: 23,

“C4_P07_N3_S3_1.JPG”: 42,

“C1_P08_N3_S3_1.JPG”: 42,

“C1_P06_N1_S4_1.JPG”: 49,

“C3_P02_N1_S2_2.JPG”: 23,

“C1_P12_N2_S3_1.JPG”: 20,

“C1_P03_N1_S4_1.JPG”: 17,

“C1_P04_N3_S3_1.JPG”: 44,

“C2_P01_N3_S3_1.JPG”: 45,

“C4_P03_N1_S4_1.JPG”: 46,

“C1_P02_N2_S2_1.JPG”: 19,

“C1_P04_N1_S4_1.JPG”: 59,

“C3_P04_N1_S4_1.JPG”: 35,

“C4_P04_N4_S2_1.JPG”: 15,

“C1_P10_N1_S3_1.JPG”: 29,

“C2_P04_N3_S2_1.JPG”: 16,

“C1_P10_N2_S3_1.JPG”: 32,

“C1_P06_N3_S3_1.JPG”: 38,

“C1_P02_N1_S5_1.JPG”: 46,

“C4_P04_N2_S2_1.JPG”: 13,

“C2_P08_N3_S3_2.JPG”: 24,

“C1_P12_N1_S3_1.JPG”: 21,

“C2_P07_N1_S6_1.JPG”: 26,

“C2_P05_N3_S3_1.JPG”: 35,

“C2_P02_N1_S4_1.JPG”: 47,

“C1_P05_N4_S3_1.JPG”: 35,

“C4_P01_N2_S2_1.JPG”: 21,

“C2_P02_N1_S3_1.JPG”: 36,

“C4_P02_N4_S2_1.JPG”: 23,

“C1_P11_N1_S3_1.JPG”: 27,

“C2_P03_N2_S3_1.JPG”: 18,

“C1_P06_N1_S5_1.JPG”: 36,

“C1_P11_N2_S4_2.JPG”: 41,

“C3_P06_N1_S3_2.JPG”: 42,

“C4_P03_N1_S3_1.JPG”: 34,

“C4_P08_N1_S3_1.JPG”: 36,

“C3_P03_N1_S3_1.JPG”: 37,

“C1_P10_N1_S5_1.JPG”: 30,

“C1_P03_N1_S4_2.JPG”: 8,

“C1_P05_N2_S4_2.JPG”: 48,

“C4_P08_N2_S2_1.JPG”: 19,

“C1_P03_N1_S3_1.JPG”: 44,

“C1_P12_N1_S5_1.JPG”: 33,

“C2_P01_N1_S4_1.JPG”: 34,

“C2_P01_N2_S2_1.JPG”: 29

}

Conclusion and Future Scope

Single shot multibox detection is a multiscale object detection model: via its base network and several multiscale feature map blocks, it generates a varying number of anchor boxes with different sizes, and detects varying-size objects by predicting classes and offsets for these anchor boxes (and thus the bounding boxes). When training the single-shot multibox detection model, the loss function is calculated from the predicted and labeled values of the anchor box classes and offsets. In the future, we could also build an SSD with a single anchor box per location instead of multiple anchor boxes for this problem; since all the objects here are roughly the same size, a single anchor box could give a more efficient solution.

References

https://www.kaggle.com/loaiabdalslam/vgg16-layers-visualization-tutorial

https://jonathan-hui.medium.com/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359

https://jonathan-hui.medium.com/ssd-object-detection-single-shot-multibox-detector-for-real-time-processing-9bd8deac0e06

https://medium.com/@sdoshi579/convolutional-neural-network-learn-and-apply-3dac9acfe2b6

