
Street view number detection is a natural scene text recognition problem, which is quite different from printed-character or handwriting recognition. Research in this field started in the 1990s, but it is still considered an unsolved problem. As mentioned earlier, the difficulties arise from variations in font, scale, and rotation, from low lighting, and from similar factors.

     In earlier years, natural scene text recognition was handled sequentially. Character classification came first, mainly by sliding windows or connected components [4]. Word prediction then followed by applying the character classifier from left to right. More recently, segmentation methods guided by a supervised classifier have been used, in which words are recognized through a sequential beam search [4]. However, none of these approaches solves the street view recognition problem.


     In recent work, convolutional neural networks (CNNs) have proved more accurate at object recognition tasks [4]. Some research has applied CNNs to scene text recognition [4]. Studies show that CNNs have a large capacity to represent the many kinds of character variation found in natural scenes and that they cope with this high variability. Work on convolutional neural networks started in the early 1980s, and they were successfully applied to handwritten digit recognition in the 1990s [4]. With recent advances in computing resources, training sets, algorithms, and dropout training, deep convolutional neural networks have become more effective at recognizing digits and characters in natural scenes [3].

     Previously, CNNs were mainly used to detect a single object in an input image. Isolating each character in an image and identifying it was quite difficult. Goodfellow et al. solve this problem by using a large, deep CNN to model the whole image directly, with a simple graphical model as the top inference layer [4].

     The rest of the paper is organized as follows: Section III describes the convolutional neural network architecture, Section IV presents the experiment, results, and discussion, and Section V covers future work and the conclusion.

A convolutional neural network (CNN) is a multilayer network designed to handle complex, high-dimensional data; its architecture resembles that of a typical neural network [8]. Each layer contains neurons that carry weights and biases. The network takes images as input and passes them forward through the layers, with a structure that reduces the number of parameters in the network [7]. The first layer is a convolutional layer, where the input is convolved with a set of filters to extract features. The size of the feature maps depends on three parameters: the number of filters, the stride, and the padding. After each convolutional layer, a non-linear operation, ReLU, is applied; it converts all negative values to zero. Next comes a pooling (sub-sampling) layer, which reduces the size of the feature maps. Pooling can be of different types (max, average, or sum), but max pooling is generally used. Down-sampling also helps control overfitting. Together, the convolution and pooling layers act as a feature extractor that retrieves selective features from the input images. Their output is passed to the fully connected layers (FCL) and then to the output layer. In a CNN, the output of each layer serves as the input of the next. The architecture varies with the type of problem.
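
The ReLU and max pooling steps described above can be illustrated with a minimal sketch in plain Python. The 4x4 feature map and the 2x2 non-overlapping pooling window are assumptions chosen for illustration, and the function names are hypothetical; a real network would use a library implementation.

```python
def relu(feature_map):
    """Replace every negative value with zero and keep the rest unchanged."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Down-sample by keeping the maximum of each non-overlapping 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Hypothetical 4x4 feature map, standing in for the output of a convolution.
feature_map = [
    [1.0, -2.0, 3.0, 0.5],
    [-1.0, 4.0, -3.0, 2.0],
    [0.0, -5.0, -1.0, -2.0],
    [2.0, 1.0, -4.0, -0.5],
]

activated = relu(feature_map)     # negatives become 0.0
pooled = max_pool_2x2(activated)  # 4x4 map shrinks to 2x2
print(pooled)
```

Pooling halves each spatial dimension here, which is the size reduction the sub-sampling layer provides.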

 

     The main objective of this project is to detect and identify house-number signs in street view images. The dataset used is the Street View House Numbers (SVHN) dataset taken from [5], which has similarities with the MNIST dataset. SVHN has more than 600,000 labeled characters, and the images are in .png format. After extracting the dataset, I resize all images to 32×32 pixels with three color channels. There are 10 classes, one for each digit: digit ‘1’ is labeled 1, ‘9’ is labeled 9, and ‘0’ is labeled 10 [5]. The dataset is divided into three subsets: a train set, a test set, and an extra set. The extra set is the largest, with 531,131 images; the train set has 73,257 and the test set has 26,032.
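
Because ‘0’ is stored as label 10, a common preprocessing step is to map the labels back to the digits they represent. The following is a minimal sketch of that mapping; the function name and sample labels are hypothetical, and the source does not state whether this remapping was applied in the experiment.

```python
def svhn_label_to_digit(label):
    """Map an SVHN class label (1..10) to the digit it represents (0..9)."""
    if not 1 <= label <= 10:
        raise ValueError(f"unexpected SVHN label: {label}")
    # Label 10 stands for the digit 0; labels 1..9 are the digits themselves.
    return 0 if label == 10 else label


raw_labels = [1, 9, 10, 5]  # hypothetical labels as stored in the dataset
digits = [svhn_label_to_digit(label) for label in raw_labels]
print(digits)
```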

     Figure 3 shows examples of the original, variable-resolution, color house-number images, in which each digit is marked by a bounding box. The bounding box information is stored in the digitStruct.mat file rather than drawn directly on the images. The file contains a struct called digitStruct with one element per original image. Each element has two fields: “name”, a string containing the filename of the corresponding image, and “bbox”, a struct array containing the position, size, and label of each digit bounding box in the image. For example, digitStruct(100).bbox(1).height is the height of the first digit bounding box in the 100th image [5].
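
To make the layout concrete, the sketch below mirrors two digitStruct elements as plain Python dictionaries. It does not read digitStruct.mat; the records and their values are invented, the field names left, top, and width are assumptions for "position" and "size", and the helper names are hypothetical. Note that the MATLAB notation is 1-based while Python lists are 0-based.

```python
# Hypothetical in-memory copy of two digitStruct elements (values invented).
digit_struct = [
    {"name": "1.png", "bbox": [
        {"left": 246, "top": 77, "width": 81, "height": 219, "label": 1},
        {"left": 323, "top": 81, "width": 96, "height": 219, "label": 9},
    ]},
    {"name": "2.png", "bbox": [
        {"left": 77, "top": 29, "width": 23, "height": 32, "label": 2},
        {"left": 98, "top": 25, "width": 26, "height": 32, "label": 10},
    ]},
]


def bbox_field(records, image_number, digit_number, field):
    """Look up a bbox field using the 1-based indices of the MATLAB notation."""
    return records[image_number - 1]["bbox"][digit_number - 1][field]


def house_number(record):
    """Join the digit labels from left to right; label 10 stands for '0'."""
    boxes = sorted(record["bbox"], key=lambda box: box["left"])
    return "".join(str(box["label"] % 10) for box in boxes)


# Equivalent of digitStruct(1).bbox(1).height in the notation used above.
first_height = bbox_field(digit_struct, 1, 1, "height")
numbers = [house_number(record) for record in digit_struct]
print(first_height, numbers)
```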

 

It is clear from Figure 3 that most house-number signs in the SVHN dataset are printed signs that are easy for a person to read [2]. However, the large variation in font, size, and color makes automatic detection very difficult. The variation in resolution is also large (median: 28 pixels; max: 403 pixels; min: 9 pixels) [2]. The graph below shows the large variation in character height, measured by the height of the bounding box in the original street view dataset. In other words, character size, placement, and resolution are not evenly distributed across the dataset. Because the data are not uniformly distributed, correct house-number detection is difficult.

In my experiment, I train a multilayer CNN for street view house number recognition and check its accuracy on the test data. The code is written in Python using TensorFlow, a powerful library for implementing and training deep neural networks. The central unit of data in TensorFlow is the tensor. A tensor consists of a set of primitive values shaped into an array of any number of dimensions, and a tensor’s rank is its number of dimensions [9]. Along with TensorFlow, I use other libraries such as NumPy, Matplotlib, and SciPy.
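
The notion of rank can be illustrated without TensorFlow. In the sketch below, nested Python lists stand in for tensors, which is an assumption made only for illustration; the helper names are hypothetical and are not part of any library.

```python
def tensor_shape(value):
    """Return the shape of a regular nested list; a plain number has shape ()."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:  # an empty list has no first element to descend into
            break
        value = value[0]
    return tuple(shape)


def tensor_rank(value):
    """The rank is the number of dimensions, i.e. the length of the shape."""
    return len(tensor_shape(value))


scalar = 3.0                       # rank 0
vector = [1.0, 2.0, 3.0]           # rank 1
matrix = [[1.0, 2.0], [3.0, 4.0]]  # rank 2
# One 32x32 image with three color channels, stored as a batch of size 1.
batch = [[[[0.0] * 3 for _ in range(32)] for _ in range(32)]]

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("batch", batch)]:
    print(name, tensor_rank(value), tensor_shape(value))
```

A batch of color images is therefore a rank-4 tensor: batch size, height, width, and channels.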

 

Due to limited computing resources, I perform my analysis using only the train and test sets and omit the extra set, which is almost 2.7 GB. To simplify the analysis, I delete all data points that have more than 5 digits. Preprocessing the original SVHN data produces a pickle file, which is used in my experiment. For the implementation, I randomly shuffle the valid dataset, then load the pickle file and train a 7-layer convolutional neural network.
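
The filtering, shuffling, and pickling steps can be sketched as follows. The sample records, the function names, and the fixed random seed are hypothetical; the sketch serializes to memory with pickle.dumps rather than writing the actual pickle file used in the experiment.

```python
import pickle
import random

MAX_DIGITS = 5  # house numbers with more digits than this are dropped


def keep_short_numbers(samples, max_digits=MAX_DIGITS):
    """Keep only the samples whose house number has at most max_digits digits."""
    return [s for s in samples if len(s["digits"]) <= max_digits]


def shuffled(samples, seed=0):
    """Return a shuffled copy; the seed only makes the sketch repeatable."""
    result = list(samples)
    random.Random(seed).shuffle(result)
    return result


# Hypothetical preprocessed samples: file name plus the digits on the sign.
samples = [
    {"name": "a.png", "digits": [1, 9]},
    {"name": "b.png", "digits": [2, 0, 4, 8, 7, 3]},  # six digits: dropped
    {"name": "c.png", "digits": [5]},
    {"name": "d.png", "digits": [3, 1, 4, 1, 5]},
]

dataset = shuffled(keep_short_numbers(samples))
payload = pickle.dumps(dataset)   # bytes that could be written to a .pickle file
restored = pickle.loads(payload)  # what the training script would load back
print(len(restored), "samples kept")
```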

 

At the start of the network, the first convolution layer has 16 feature maps with 5×5 filters and produces a 28x28x16 output. A ReLU layer follows each convolution layer to add non-linearity to the decision-making process. After the first sub-sampling, the output size decreases to 14x14x16. The second convolution uses 512 5×5 kernels (16 input channels × 32 output channels) and produces a 10x10x32 output. Applying sub-sampling a second time gives an output size of 5x5x32. Finally, the third convolution uses 2048 kernels of the same filter size (32 input channels × 64 output channels). Note that the stride is 1 and the padding is zero (no padding) throughout my experiment. I use the dropout technique to reduce overfitting, and a softmax regression layer produces the final output.
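
The spatial sizes quoted above follow from the standard output-size formula, floor((n + 2p - f) / s) + 1, for input size n, filter size f, stride s, and padding p. The sketch below traces a 32-pixel side through the layers. It assumes 2x2 pooling with stride 2, which the text does not state but which the drop from 28 to 14 implies; the function names are hypothetical, and the final 1x1 size is what the arithmetic gives rather than a figure from the text.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1


def pool_output_size(input_size, window=2, stride=2):
    """Spatial output size of a pooling layer along one dimension."""
    return (input_size - window) // stride + 1


size = 32       # input images are 32x32
trace = [size]
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    if layer == "conv":
        size = conv_output_size(size, filter_size=5, stride=1, padding=0)
    else:
        size = pool_output_size(size)
    trace.append(size)

print(trace)
```

With 5×5 filters, stride 1, and no padding, each convolution removes 4 pixels per side, and each pooling step halves the result.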

Weights are initialized randomly using Xavier initialization, which keeps the weights in a suitable range by automatically scaling the initialization according to the number of input and output neurons. Once the model is built, I start training the network and log the accuracy, loss, and validation accuracy every 500 steps. The Adagrad optimizer is used to minimize the loss. When training is done, I compute the test set accuracy. After reaching a suitable accuracy level, I stop training and save the learned parameters in a checkpoint file. When detection is needed, the program loads the checkpoint file without training the model again.
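
As an illustration of how the scaling depends on the layer size, the sketch below implements the uniform variant of Xavier (Glorot) initialization, which draws weights from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). The layer dimensions, the seed, and the function names are hypothetical, and the source does not say which variant (uniform or normal) was used.

```python
import math
import random


def xavier_limit(fan_in, fan_out):
    """Bound of the uniform Xavier (Glorot) distribution for one layer."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out weight matrix drawn from [-limit, limit]."""
    limit = xavier_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)        # fixed seed so the sketch is repeatable
fan_in, fan_out = 64, 10      # hypothetical fully connected layer
limit = xavier_limit(fan_in, fan_out)
weights = xavier_uniform(fan_in, fan_out, rng)
print(f"limit = {limit:.4f}")
```

Larger layers get a smaller limit, so the variance of the activations stays roughly constant from layer to layer.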

Initially, the model reached an accuracy of 89% after just 3000 steps. This is a good starting point, and with further training the accuracy would likely reach 90%. However, I added two features to increase accuracy. First, I added a dropout layer between the third convolution layer and the fully connected layer, which makes the network more robust and helps prevent overfitting. Second, I introduced exponential decay of the learning rate, with an initial rate of 0.05 that decays every 10,000 steps with a base of 0.95. This lets the network take bigger steps at first so that it learns quickly, and smaller steps over time as it moves closer to the minimum of the loss. With these changes, the model reaches an accuracy of 91.9% on the test set. Since the training and test sets are large, further improvement is possible if the model is trained for longer.
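
The decay schedule can be written as rate = initial_rate × base^(step / decay_steps). The sketch below evaluates it with the values given above (0.05, 0.95, and 10,000 steps). The function is a plain-Python illustration with a hypothetical name, and the staircase option, which holds the rate constant within each 10,000-step interval, is an assumption, since the text does not say whether the decay was applied continuously or in discrete jumps.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate, staircase=False):
    """Learning rate after `step` training steps under exponential decay."""
    # Staircase mode uses integer division, so the rate changes only at
    # multiples of decay_steps; otherwise it shrinks smoothly at every step.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent


INITIAL_RATE = 0.05
DECAY_STEPS = 10_000
DECAY_RATE = 0.95

schedule = {
    step: exponential_decay(INITIAL_RATE, step, DECAY_STEPS, DECAY_RATE)
    for step in (0, 10_000, 20_000, 50_000)
}
for step, rate in schedule.items():
    print(f"step {step:>6}: learning rate {rate:.5f}")
```

After 10,000 steps the rate is 0.05 × 0.95 = 0.0475, so the steps shrink slowly over the course of training.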

In this experiment, I proposed a multilayer deep convolutional neural network to recognize street view house numbers. Using the SVHN dataset, which contains more than 600,000 labeled digit images, the model achieves almost 92% accuracy on the test set. The analysis makes it evident that the model produces correct output for most images. However, detection may fail if the image is blurry or noisy.

The most exciting aspect of the project was observing the effect of additional techniques such as dropout and exponential learning rate decay on real data. Because many variations of CNN architecture can be implemented, it is very difficult to understand why a given architecture works best for a specific type of data. Finding the most appropriate CNN architecture was therefore the most challenging aspect of the experiment. The model implemented here is relatively simple, yet it performs well and is quite robust; however, a lot of work is still required to make it perform as well as or better than a human operator. As future work, I will extend my experiment using different techniques and algorithms and try to find out which one gives better accuracy at minimum cost and with lower loss.