Mask R-CNN Image Segmentation – OpenCV

Image segmentation partitions a digital image along the boundaries of the objects detected in it. Dividing the image this way lets you process only the segments that matter instead of the entire image, and once the boundary of an object is known, the object can be extracted from the image.

In an image recognition system, segmentation is an important stage: it extracts the object of interest from an image so that it can be processed further, for example for recognition and description. Image segmentation involves the following steps:

  • Image classification
  • Image localization
  • Segmentation

Which objects are extracted depends on the purpose of the segmentation. There are two kinds of segmentation: semantic segmentation and instance segmentation.

Semantic segmentation detects and extracts objects and groups them by class (such as dog, person, or bike), whereas instance segmentation detects and extracts each object individually, so that two objects of the same class are kept apart.

For example, in an image of a dog and several sheep, semantic segmentation labels the objects as dog and sheep and gives all objects of the same class the same color. Instance segmentation, on the other hand, gives each object its own color even when several of them belong to the same class.
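
The difference can be made concrete with label maps. The sketch below uses a made-up 4x6 grid (the arrays and label numbers are illustrative assumptions, not output of any model): the semantic map stores one integer per class, while the instance map stores one integer per object.

```python
import numpy as np

# Toy 4x6 "image" containing two sheep and one dog (hypothetical data).
# Semantic map: one integer per CLASS (0 = background, 1 = sheep, 2 = dog).
semantic = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [2, 2, 0, 0, 0, 0],
])

# Instance map: one integer per OBJECT (0 = background).
instance = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [3, 3, 0, 0, 0, 0],
])

# Both sheep share one label in the semantic map...
sheep_pixels = int((semantic == 1).sum())
# ...but they are separate objects in the instance map.
num_classes = len(np.unique(semantic)) - 1    # exclude background
num_instances = len(np.unique(instance)) - 1  # exclude background

print(sheep_pixels, num_classes, num_instances)
```

The semantic map cannot tell how many sheep there are, only which pixels are sheep; the instance map keeps that count.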

How does instance segmentation work?

Broadly, segmentation techniques fall into two categories, edge-based and region-based, but a variety of other image segmentation techniques are used when developing AI models:

  • Threshold Method
  • Edge Based Segmentation
  • Region Based Segmentation
  • Clustering Based Segmentation
  • Watershed Based Method
  • Partial Differential Equation Based Segmentation Method
  • Artificial Neural Network Based Segmentation
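
As an illustration of the simplest entry in this list, the threshold method, here is a minimal sketch using NumPy. The function name threshold_segment, the default threshold of 128, and the sample pixel values are assumptions chosen for the example.

```python
import numpy as np

def threshold_segment(gray, threshold=128):
    """Return a boolean mask that is True where a pixel is brighter than `threshold`."""
    return gray > threshold

# Hypothetical 3x3 grayscale image: a bright object in the top-right corner.
gray = np.array([[10, 200, 220],
                 [15,  30, 210],
                 [12,  18,  25]], dtype=np.uint8)

mask = threshold_segment(gray)
print(mask)
```

The result is a True/False mask of the same kind that Mask R-CNN returns later in this article, only produced by a fixed brightness rule instead of a neural network.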

In this article, we will use Mask R-CNN, a region-based convolutional neural network (R-CNN).

Mask Region-based Convolutional Neural Networks (Mask R-CNN)

Mask R-CNN builds on earlier object detection work: R-CNN, Fast R-CNN, and Faster R-CNN. So let’s start from the original R-CNN to understand the later versions.

The original R-CNN is the base model for the other R-CNN variants. It is a four-stage process for object detection:

  • In the first stage, the input image is fed into the network.
  • Region proposals (around 2000 candidate bounding boxes, the regions of interest or RoIs) are extracted from the image.
  • Each proposed region is passed through a pre-trained convolutional neural network, reused through transfer learning, to compute its features.
  • The output determines the class of the object in each region.

Fast R-CNN, introduced in 2015, builds on the original R-CNN. It adds an RoI pooling layer, a max-pooling layer that turns the features inside each RoI into a fixed-size feature map. Fast R-CNN can also be trained end to end in a single stage, instead of the multi-step pipeline that the original R-CNN needs.
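
To show what RoI pooling does, here is a simplified sketch in NumPy. It is not the layer used by any particular library: the function name roi_max_pool, the (y1, x1, y2, x2) box format, and the way the region is split into bins are assumptions made for illustration, and the region must be at least as large as the output grid.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region roi = (y1, x1, y2, x2) of a 2-D feature map
    down to a fixed output_size grid."""
    y1, x1, y2, x2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the region into an out_h x out_w grid of roughly equal bins.
    row_bins = np.array_split(np.arange(region.shape[0]), out_h)
    col_bins = np.array_split(np.arange(region.shape[1]), out_w)
    pooled = np.empty(output_size, dtype=feature_map.dtype)
    for i, rows in enumerate(row_bins):
        for j, cols in enumerate(col_bins):
            # Keep only the strongest activation in each bin.
            pooled[i, j] = region[np.ix_(rows, cols)].max()
    return pooled

feature_map = np.arange(36).reshape(6, 6)   # hypothetical 6x6 feature map
pooled = roi_max_pool(feature_map, roi=(0, 0, 4, 6))
print(pooled)
```

Whatever the size of the RoI, the output has the same fixed shape, which is what allows the layers that follow to work on every proposal.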

Advantages of using Fast R-CNN over Original R-CNN

  • Training process: Fast R-CNN is trained end to end in a single stage, whereas the original R-CNN is trained in several separate steps. This makes training the model more practical with Fast R-CNN.
  • Training cost: In the original R-CNN, the features extracted for every region proposal are written to disk and read back during training, which increases the cost of training the model. Fast R-CNN pools the RoI features from a shared feature map instead, so it uses less storage and the cost comes down.
  • Object detection speed: Object detection with the original R-CNN is much slower than with Fast R-CNN.

In Faster R-CNN, the first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression.

Because the convolutional layers run once over the whole image and the proposals are taken from the resulting feature map, there is no need to feed each of the 2000 region proposals through the convolutional layers separately.


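Mask R-CNN is a widely used neural network for instance segmentation. The vision community has rapidly improved object detection and semantic segmentation results. Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance.

Semantic segmentation classifies each pixel into a fixed set of categories without differentiating object instances. Mask R-CNN goes further: it extends Faster R-CNN with a branch that predicts a segmentation mask for each region of interest, in parallel with the existing classification and bounding-box regression, so every instance gets its own mask.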

Download the Mask R-CNN repository

We can clone the GitHub repository into the Google Colab notebook with !git clone followed by the repository URL.

!git clone https://github.com/matterport/Mask_RCNN
>>>Cloning into 'Mask_RCNN'...
   remote: Enumerating objects: 956, done.
   remote: Total 956 (delta 0), reused 0 (delta 0), pack-reused 956
   Receiving objects: 100% (956/956), 125.23 MiB | 25.89 MiB/s, done.
   Resolving deltas: 100% (562/562), done.

The repository's setup.py installs Mask R-CNN together with its dependencies. Run it from inside the cloned Mask_RCNN directory.

%cd Mask_RCNN
!python setup.py install

Import the necessary libraries into the notebook. skimage.io is a Python module for reading and writing images. In Colab, cv2_imshow can also display images, but OpenCV works with images in BGR channel order, whereas skimage.io loads them in RGB order, which is what matplotlib expects when displaying images.

import os
import sys
import cv2
import numpy as np
import skimage.io
from google.colab.patches import cv2_imshow
import matplotlib.pyplot as plt
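
The channel-order difference between skimage.io (RGB) and OpenCV (BGR) matters whenever an image passes from one library to the other. A minimal sketch of the conversion, using a hand-made two-pixel array as a stand-in for a real image:

```python
import numpy as np

# A 1x2 image in RGB order: one pure-red pixel and one pure-blue pixel.
rgb = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last (channel) axis converts RGB to BGR, and BGR back to RGB.
bgr = rgb[..., ::-1]

print(rgb[0, 0], bgr[0, 0])
```

An image loaded with skimage.io and shown with an OpenCV display function without this conversion would appear with red and blue swapped.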

To run this Mask R-CNN implementation we need TensorFlow 1.15; a future TensorFlow version may provide a runtime for Mask R-CNN. In Colab, the %tensorflow_version 1.x magic selects a suitable version.

%tensorflow_version 1.x
import tensorflow as tf


>>> TensorFlow 1.x selected.

sys.path is the list of directories that Python searches when importing modules; anything inside those directories can be imported directly. To make modules from another directory importable, add that directory with sys.path.append().


   ----few lines of modules----
   ----few more lines of modules----

To import the Mask R-CNN module, we need to add its directory to the path, so we use sys.path.append().
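
A minimal sketch of this step, assuming the repository was cloned into /content (the Colab working directory, which matches the /content/Mask_RCNN paths that appear later in the article). ROOT_DIR is the name the later snippets rely on.

```python
import os
import sys

# Assumption: the repository was cloned into /content, Colab's default working directory.
ROOT_DIR = os.path.abspath('/content/Mask_RCNN')

# Make the `mrcnn` package inside the repository importable.
if ROOT_DIR not in sys.path:
    sys.path.append(ROOT_DIR)

print(sys.path[-1])
```

If the repository lives somewhere else, only the path assigned to ROOT_DIR needs to change.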


The Python files for the COCO configuration are in the samples/coco directory, so we also add that directory to the path in order to import the coco module directly.

sys.path.append(os.path.join(ROOT_DIR, 'samples/coco/'))

Now if we check sys.path, we can see the added directories, i.e., TensorFlow, Mask R-CNN, and coco.


Importing the Mask R-CNN modules into our notebook

from mrcnn import utils
from mrcnn import visualize
import mrcnn.model as modellib
import coco

Download the Neural Network and its weights to a specific location

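COCO_MODEL_PATH = os.path.join(ROOT_DIR, 'mask_rcnn_coco.h5')
# Download the pre-trained COCO weights if they are not already present
if not os.path.exists(COCO_MODEL_PATH):
  utils.download_trained_weights(COCO_MODEL_PATH)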
Downloading pretrained model to /content/Mask_RCNN/mask_rcnn_coco.h5 ...
... done downloading pretrained model!

Now we need to set a few parameters before loading the input image:

  • Change the runtime to GPU through the runtime settings of the Google Colab notebook.
  • Define the batch size, which determines the number of input images fed into the network at a time.
  • Set mode = 'inference', since we are using a pre-trained model and its weights.
  • Define a configuration class.
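class InferenceConfig(coco.CocoConfig):
  # Run inference on one image at a time: BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU
  GPU_COUNT = 1
  IMAGES_PER_GPU = 1

config = InferenceConfig()
config.display()  # print the configuration values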
BACKBONE                       resnet101
BACKBONE_STRIDES               [4, 8, 16, 32, 64]
BATCH_SIZE                     1
BBOX_STD_DEV                   [0.1 0.1 0.2 0.2]
---few lines about configuration-----
---few more lines about configuration------
TRAIN_BN                       False
TRAIN_ROIS_PER_IMAGE           200
USE_MINI_MASK                  True
USE_RPN_ROIS                   True
VALIDATION_STEPS               50
WEIGHT_DECAY                   0.000

Loading the Neural Network and its weights

MODEL_DIR = os.path.join(ROOT_DIR, 'logs')
IMAGE_DIR = os.path.join(ROOT_DIR, 'images')
network = modellib.MaskRCNN(mode='inference', model_dir=MODEL_DIR, config=config)

Once the network is built, the next step is to load the weights. Before doing so, note that loading may fail with an error while reading a string from the weights file. This is caused by newer versions of h5py.

The h5py package is a Pythonic interface to the HDF5 binary data format. HDF5 lets you store huge amounts of numerical data, and easily manipulate that data from NumPy.

h5py 3.x handles string encoding differently, which is not compatible with the way TensorFlow 1.15 reads the weights file. Google Colab comes with h5py 3.x by default, so we install an earlier version with pip. After a successful installation, you need to restart the runtime.

pip install 'h5py<3.0.0'
Collecting h5py<3.0.0
  Downloading h5py-2.10.0-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB)
     |████████████████████████████████| 2.9 MB 8.0 MB/s 
Installing collected packages: h5py
  Attempting uninstall: h5py
    Found existing installation: h5py 3.1.0
    Uninstalling h5py-3.1.0:
      Successfully uninstalled h5py-3.1.0
Successfully installed h5py-2.10.0

Now we can load the weights.

network.load_weights(COCO_MODEL_PATH, by_name=True)

Define the class names of the dataset, in the order of their class IDs.

class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
               'bus', 'train', 'truck', 'boat', 'traffic light',
               'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',
               'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',
               'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',
               'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
               'kite', 'baseball bat', 'baseball glove', 'skateboard',
               'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
               'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
               'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
               'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',
               'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
               'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
               'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
               'teddy bear', 'hair drier', 'toothbrush']

Now that everything is set up, we can load the input image using skimage.io. As we have seen, skimage.io loads images in RGB format by default, which makes them convenient to display with matplotlib.

image = skimage.io.imread('/content/Mask_RCNN/images/7933423348_c30bd9bd4e_z.jpg')

To find the class IDs of the objects we are interested in, we can look them up in the class_names list created in the previous step.

class_names.index('person'), class_names.index('car'), class_names.index('motorcycle')
>>> (1, 3, 4)

Running the network on the image

results = network.detect([image], verbose=0)
WARNING:tensorflow:From /tensorflow-1.15.2/python3.7/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

[{'class_ids': array([ 1,  4,  1,  1,  3,  1, 14,  1,  3,  3,  3, 27], dtype=int32),
  'masks': array([[[False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False]],
         [[False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False]],
         [[False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
'rois': array([[120,  43, 268,  87],
         [187, 306, 388, 584],
         [116,   0, 270,  40],
         [124, 591, 344, 640],
         [139, 265, 265, 410],
         [144, 382, 387, 512],
         [164, 100, 202, 159],
         [130, 267, 182, 292],
         [146, 523, 199, 586],
         [130, 173, 158, 195],
         [153, 483, 239, 564],
         [180,  26, 218,  59]], dtype=int32),
  'scores': array([0.99966407, 0.9990922 , 0.99907887, 0.9985133 , 0.9934504 ,
         0.99269366, 0.98431754, 0.9770446 , 0.9476974 , 0.947486  ,
         0.9167536 , 0.701685  ], dtype=float32)}]

To better understand the output, we go through it step by step: 'class_ids', then 'masks', 'rois', and 'scores'.

  • 'class_ids' gives the class that each detected object belongs to.
  • 'masks' is a boolean NumPy array of shape (height, width, number of detections). For each detected object, the pixels that belong to the object are True and all other pixels are False.
  • 'rois' is a NumPy array that stores the bounding box (y1, x1, y2, x2) drawn around each detected object.
  • 'scores' gives the confidence score of each detected object.
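
To show how these four arrays line up, here is a small sketch that picks out confident detections of one class. The dictionary r below is a hand-made stand-in with the same keys as the network output, and the helper select and its min_score parameter are hypothetical names introduced for this example.

```python
import numpy as np

# Hypothetical stand-in for one result dictionary: 3 detections on a 4x5 image.
r = {
    'class_ids': np.array([1, 3, 1], dtype=np.int32),
    'scores': np.array([0.99, 0.95, 0.60], dtype=np.float32),
    'rois': np.array([[0, 0, 2, 2], [1, 2, 4, 5], [2, 0, 4, 1]], dtype=np.int32),
    'masks': np.zeros((4, 5, 3), dtype=bool),
}
r['masks'][0:2, 0:2, 0] = True   # pixels of detection 0
r['masks'][1:4, 2:5, 1] = True   # pixels of detection 1
r['masks'][2:4, 0:1, 2] = True   # pixels of detection 2

class_names = ['BG', 'person', 'bicycle', 'car']  # first entries of the full list

def select(r, class_names, wanted, min_score=0.9):
    """Return (index, name, box, pixel_count) for confident detections of class `wanted`."""
    wanted_id = class_names.index(wanted)
    picked = []
    for i, (cid, score) in enumerate(zip(r['class_ids'], r['scores'])):
        if cid == wanted_id and score >= min_score:
            box = tuple(int(v) for v in r['rois'][i])       # (y1, x1, y2, x2)
            pixels = int(r['masks'][:, :, i].sum())         # size of the mask
            picked.append((i, class_names[cid], box, pixels))
    return picked

people = select(r, class_names, 'person')
print(people)
```

Position i in 'class_ids', 'scores', and 'rois' and slice [:, :, i] of 'masks' all describe the same detected object.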

To visualize the segmentation result produced by the network, we can use visualize.display_instances, although the visualize module offers other functions for viewing the results too.

r = results[0]  # detections for the first (and only) image in the batch
visualize.display_instances(image, r['rois'], r['masks'],
                            r['class_ids'], class_names, r['scores'])

Removing the Background

Removing the background of an object in an image involves extracting the segment of the particular object and ignoring the other objects and background. The segment of the object is called the mask.

In the mask of an object, every pixel outside the object is set to False and every pixel of the object is set to True.

array([[[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],

To obtain the mask of each object, we index into the masks array. To display the mask, the boolean values True and False have to be converted into integers. We can convert them using mask.astype('uint8'), which returns a matrix of 0s and 1s.

Once the boolean values are converted into 0s and 1s, we scale them to the 0 to 255 pixel range: multiplying the mask by 255 makes the object white, and subtracting that from 255 gives a white background with the object area set to 0.

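def segment(image, r, index):
  # Select the mask of one detected object and repeat it over the 3 color channels
  mask = r['masks'][:,:,index]
  mask = np.stack((mask,) * 3, axis = -1)
  mask = mask.astype('uint8')
  # White (255) everywhere except on the object (0)
  bg = 255 - mask * 255

  # Inverted background: the object in white on black
  mask_show = np.invert(bg)
  # Keep the pixels of the object and zero out the rest
  mask_img = image * mask

  # Object on a white background
  result = mask_img + bg
  return result, mask_show

def show_segment(image, r, index, show_mask = False):
  segmentation, mask_obj = segment(image, r, index)
  plt.subplots(1, figsize=(16,16))
  if show_mask:
    # Mask next to the segmented object
    plt.imshow(np.concatenate([mask_obj, segmentation], axis = 1))
  else:
    # Original image next to the segmented object
    plt.imshow(np.concatenate([image, segmentation], axis = 1))
  plt.show()

for index in range(2):
  show_segment(image, r, index, True)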
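
The arithmetic inside segment is easier to follow on a tiny array. The sketch below repeats the same steps on a made-up 2x2 mask (the values are illustrative only), without the color channels.

```python
import numpy as np

# Hypothetical 2x2 boolean mask: True marks object pixels.
mask = np.array([[True, False],
                 [False, True]])
mask = mask.astype('uint8')      # True/False -> 1/0

bg = 255 - mask * 255            # object = 0, everything else = 255 (white)
mask_show = np.invert(bg)        # bitwise NOT on uint8: 0 -> 255, 255 -> 0

print(bg)
print(mask_show)
```

Because bg is 0 wherever the object is and the masked image is 0 everywhere else, adding the two never overflows the uint8 range.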

To display all the objects, set the range of the for loop to the number of detected masks, r['masks'].shape[-1].
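
As a variation, all detected objects can be kept in a single image instead of looping over them one by one. This is a sketch under assumptions: segment_all is a hypothetical helper, and the tiny arrays stand in for the real image and r['masks'].

```python
import numpy as np

def segment_all(image, masks):
    """Keep every detected object and paint the rest of the image white.

    image: (H, W, 3) uint8 array; masks: (H, W, N) boolean array.
    """
    # (H, W) array that is True wherever at least one object covers the pixel
    foreground = np.any(masks, axis=-1)
    return np.where(foreground[..., None], image, 255).astype(image.dtype)

# Hypothetical 2x2 grey image with two one-pixel objects.
image = np.full((2, 2, 3), 100, dtype=np.uint8)
masks = np.zeros((2, 2, 2), dtype=bool)
masks[0, 0, 0] = True
masks[1, 1, 1] = True

result = segment_all(image, masks)
num_objects = masks.shape[-1]    # the value to use as the loop range
print(num_objects)
```

With the real output, segment_all(image, r['masks']) would return the photo with every detected object kept and the background replaced by white.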