Rosella       Machine Intelligence & Data Mining

Training Data Preparation for Computer Vision

Data preparation is the single most laborious process in computer vision. Images are collected and sorted into file folders. Relevant parts of images are cropped and create training dataset images. Or objects are marked with bounding boxs with classification labels. Data preparation performed in the following sequence;

  • Collect images.
  • Crop and create relevant parts of images (for classification and regression) or label objects (for object detection).
  • Prepare an integration configuration.
  • Import into CMSR ML Studio using the prepared integration configuration.

Data preparation process differs depending on modeling types as described in the subsequent sections.

It is important to note that only good data preparation with systematic augmentation can lead to successful computer vision models.

Data Augmentation

Data augmentation creates similar training dataset images from an original image by cropping and rotating and shifting. Augmentation in CMSR Machine Learning Studio is done in two phases as follows;

  • Augmented image creation: Cropped, rotated, and shifted images are created. This is done manually one image by one image using CMSR Labeling program.
  • Import time augmentation: During image import into CMSR, top/left/bottom/right portions are chopped out, creating multiple augmented images. This is done by simply choosing options. Note that this has image stretching effect.

You don't need augmentation for color brightness since you can choose histogram equalization in network model configuration. Histogram equalization will normalize brightness and colors.

Systematic Augmentation

Augmentation should be done systematically. Otherwise it may result in redundant training data. This will lead poor accuracy on unseen data and increased training time.

Color Normalization with Histogram Equalization

Histogram equalization normalizes color contrast and blightness. There is no need for augmentation for color contrast and blightness. Unless there are compelling reasons not to use, use histogram equalization.

Quantity vs Quality and Incremental Training Data Creation

Quality of training data is very important. A bunch of many training data images won't be useful if it lacks quality. Also redundant training data lead to increased training data. This can blow out training data images quite easily. Controlling training dataset size is important. Start with typical images, say, 300 to 500 original images. Create systematically augmentated images. Fully train them. Test with unseen images and identify weakness and error images. Add error images to training dataset and retrain. Repeat this process until no major error images are detected.

Multiple Models and Bagging

Train multiple models and test them. Choose the best ones. Four models are nice. Remove the best and worst values. Take the average of the remaining two as the final result.

Computer Vision Model Types of CMSR Studio

CMSR Studio has the following computer vision modeling tools;

  • CNN (=Convolutional Neural Network) and FCN (=Fully Convolutional Network): Classifies a given image as an object type or predicts a probability of a specific type.
  • M-CNN (=Multiple value output CNN): Predicts one or more numberical values.
  • OD-CNN (=Object Detection CNN): Detects objects as bounding boxes such as X/Y-coordinates, width, height, etc.
  • T-CNN (=Twin CNN): Predicts a single numerical value, normally similarity probability, e.g., face recognition.
  • S-CNN (=Stereoscopic CNN): Predicts a grid of numberical values for given stereoscopic two images.

Data Preparation for CNN and FCN (Classification or Class Probability)

Data preparation for CNN and FCN is rather simple. Create a base folder. Under this folder, create multiple subfolers. Each subfolder should store of the same class of images. For example, a subfolder for cars, another for trucks, another for buses, etc. You may create multiple subfolders for a class. But each subfolder must be the same class images. From CMSR, do the following;

  1. From "Tools" menu, select "Generate image lists for labeling (CNN/FCN).
  2. Select a subfolder.
  3. Enter class name for the subfolder. For example, "car", "truck", "bus", etc.

You need to perform for each subfolder you created with image data. The following images show examples of CNN/FCN images.

Data Preparation for M-CNN (Multivalue Output Regression)

Create a base file folder. Under the base folder, create a number of subfolders. If image groups have same output values, create a subfolder for each image group having the same output values.

Data prparation for M-CNN can get tricky if you have to assign training dataset image output values for each image. In this case, you need to create output value text file for each image. This should be automated using your data collection software.

Store each image with a (text) file containing numerical target values in a single line. Values should be separated by a comma ",". File name must be image full file name suffixed with ".txt" or ".csv". For a subdirectory, if values are the same for all images, create "defaultvalues.txt" or "defaultvalues.csv" and omit image-wise values files. The following is an example .txt/.csv file value;


From CMSR Studio, do the following. This is to create image file lists;

  1. From "Tools" menu, select "Generate image lists for labeling (M-CNN)".
  2. Select a subfolder.
  3. Enter class name as "mcnn".

You need to perform for each subfolder you created with image data. The following images show examples of M-CNN images.

Data Preparation for OD-CNN (Object Detection)

Create a base file folder. Under the base folder, create a number of subfolders. You can put images in any way. But better to store same type images together. Object detection data prepartion can be most laborious process. For each image, you need to identify objects and mark with "tight" bounding boxes with a classification label name. This is done with CMSR Labeler software. The following figure shows how object bounding boxes are marked in OD-CNN training dataset images. Green boxes are marked object bounding boxes. You need to mark these bounding boxes using CMSR Labeler program. (In fact, these boxes are detected objects by a trainined OD-CNN model.)

Then from CMSR Studio, do the following;

  1. From "Tools" menu, select "Generate image lists for labeling (OD-CNN).
  2. Select a subfolder.
  3. Enter class name as "odcnn".

You need to perform for each subfolder you created with image data. The following images show examples of OD-CNN images.

Data Preparation for T-CNN

Create a base file folder. Under the base folder, create one or more subfolders. Under each subfolder, create a subsubfolder for each object. Each subsubfolder must contain all of identical object's images. No folder-wise processing is needed. Don't put too many images in a subsub folder! During import, CMSR creates all possible pairs of images of a subsub folder in training dataset. This can lead to combinatorial growth of training dataset size. Several images per subsub folder are recommended. The following images show examples of OD-CNN images.

Preparing Integration Configuration File

To import data into CMSR Studio, you need to create a subfolder list and specify class names. It is a text file. It is normally located in a base folder with your preferred name with ".txt" extension name. This file differs slightly depending on model types as follows;

[ CNN / FCN / OD-CNN ]

Use the following two formats to specify label names and inclusion of image list files;

@label: label_name
@include: relative_file_pathe_of_cmsrimagelist.txt_of_each_subfolder

"@label" is to specify classification label names. It should be specified following the label name order. "@include" is to list "cmsrimagelist.txt" files in subfolders. The following is an example for this. It has four class labels: brid, car, horse, truck. Then list subfolder with "cmsrimagelist.txt" files;

@label: bird
@label: car
@label: horse
@label: truck
@include: Bird\cmsrimagelist.txt
@include: Car\cmsrimagelist.txt
@include: Horse\cmsrimagelist.txt
@include: Truck\cmsrimagelist.txt
@include: Bird1\cmsrimagelist.txt

[ M-CNN ]

M-CNN is also same as CNN. But "@label" is replaced with "@target" as below. This specified output variable names and must appear in the order of output values.

@target: output_variable_name

[ T-CNN ]

T-CNN does not use "cmsrimagelist.txt" list files. Rather it relies on subsub folder containing all images of identical objects. So integeration file contans a list of "subfolder" names. The following is an example. The base folder has three subfolders: Female1, Male1, and Male2. Note that you can merge these folders as long as subsub folder names are unique. Note that using multiple sub folders is to allow incremental data creation. You can create additional sub folder. Then put new image subsub folders under the new subfolder. Then add a new "@include:" line in the configuration file.

@include: Female1
@include: Male1
@include: Male2

Import Data into CMSR

Once an integration configuration file is prepared, you can import training dataset into CMSR Studio with the integration configuration file. To import training dataset, perform the following;

  1. From "Data" menu, select "Import image dataset (xxx)". Then the following figure-like dialog window will appear.
  2. Enter dataset name.
  3. Select modeling image width and height.
  4. Choose "CMSR format" and press the "Image list file". Choose the integration configuration file.
  5. Choose "Options/Augmentation" options.
  6. Press the "Import" button.

Import can take significant time depending on the number of images. Then you are ready to configure a neural network and train!

Computer Vision Data Import