
CS7641 - Project Team 17

Introduction

At present, visually impaired people typically use a simple stick for navigation. However, such a stick does not enable them to navigate independently: if they find themselves in an unknown environment, the most they can do without external help is detect the stationary obstacles around them. They cannot make decisions based on a comprehensive understanding of the environment.

To address this problem, we intend to develop a Machine Learning model for generating accurate visual understanding of a given scene. A model of this kind can potentially be integrated into an e-stick which assists the visually impaired and enables them to move with the same ease and confidence as normally sighted people.

Problem Definition

Our project aims at developing a software framework that can detect objects in images and then answer questions based on the content of those images. From a big-picture perspective, this project is a stepping stone towards a system that captures real-time images and provides high-level contextual information about the surroundings. In the future, combining such a system with a mapping and navigation module and integrating it into an e-stick would enable the visually impaired to navigate independently.

Dataset

To address the visual challenges faced by visually impaired people in their day-to-day lives, we build a Machine Learning model on the VizWiz dataset. The training set consists of photos taken by blind people, each annotated with the question asked about that image. Each sample also contains answers and answer types provided by 10 annotators. This presents both an opportunity and a challenge: assisting visually impaired users with navigation, with daily-life tasks, and with answering their visual questions. The original VizWiz dataset consists of:

  • 20,523 training image/question pairs
  • 4,319 validation image/question pairs
  • 8,000 test image/question pairs
Methods

    Step 1. Data Preprocessing

    Feature Extraction using Convolutional Neural Networks

We use transfer learning to create feature vectors for the images in the dataset. We take the activations from the last layer of three widely used, state-of-the-art pre-trained models: Inceptionv3, ResNet (Residual Network) and VGG16 (Very Deep Convolutional Networks for Large-Scale Image Recognition). These models are pre-trained on ImageNet, a large image dataset containing more than 14 million images, so they can be used to create feature vectors for the images we take from the VizWiz dataset.
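The report does not tie the pipeline to a specific framework, so the following is a minimal sketch of this feature-extraction step using PyTorch/torchvision; the preprocessing constants and the use of an identity head to expose the 2048-d pooled activations are standard choices, not details taken from the project code.

```python
# Hypothetical sketch: extracting 2048-d image features with a pre-trained
# Inception v3 backbone (torchvision is assumed here purely for illustration).
import torch
from torchvision import models, transforms
from PIL import Image

# Load ImageNet-pretrained Inception v3 and drop the final classification layer
# so the forward pass returns the 2048-d pooled activations.
model = models.inception_v3(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(342),
    transforms.CenterCrop(299),              # Inception v3 expects 299x299 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(image_paths, batch_size=32):
    """Return an (N, 2048) tensor of feature vectors for the given images."""
    feats = []
    for i in range(0, len(image_paths), batch_size):
        batch = torch.stack([
            preprocess(Image.open(p).convert("RGB"))
            for p in image_paths[i:i + batch_size]
        ])
        feats.append(model(batch))
    return torch.cat(feats)
```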

We first ran the Inceptionv3 feature extraction on CPU and on GPU, with and without batching. The time taken for each configuration is shown in Table 1. We also tried different batch sizes while running the feature extraction; these results are shown in Table 2.


Table 1: Time Taken with Different Configurations

CPU/GPU used            | Time
CPU feature extraction  | 1 h 42 min
GPU feature extraction  | 43 min


Table 2: Execution Time vs. Batch Size

Batch Size | Time
4          | 8m 2s
8          | 5m 9s
16         | 5m 14s
32         | 5m 11s
64         | 5m 2s

    Unsupervised Algorithm: Clustering

On closely examining the dataset, we observed that some images are completely blurred, some are black/white, and others simply have too much flash. Hence, we realised the need to clean the dataset before feeding it into our training pipeline.

    Challenge 1:

The first challenge we faced was how to identify the images to be discarded.

    Solution:

  • To identify such groups of images, we decided to perform K-means clustering on the original image dataset to group similar images into clusters.
  • After generating the clusters, we manually went through each one and discarded the clusters that seemed useless for our use case.
  • The discarded clusters mainly contained blurred or otherwise unclear images.
Challenge 2:

K-means clustering uses Euclidean distance as the metric to determine the similarity between images. However, two images may contain similar objects in different orientations, positions, colour contrasts, etc. Hence, the Euclidean distance between raw pixel values alone is not a good similarity metric.

    Solution:

We used a CNN to extract object-level features and performed the clustering on these CNN-generated feature vectors. The convolutional layers produce representations that are largely invariant to object position, orientation, and contrast, so clustering on them groups images by their content rather than by raw pixel values.

    Challenge 3:

The next challenge we faced was identifying the correct value of K (the number of clusters) to generate.

    Solution:

To identify a suitable number of clusters, we calculated the K-means loss for a range of distinct K values. We then plotted these values and used the elbow method to determine the best value of K for our use case; a short sketch of this procedure follows.
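As an illustration, here is a minimal scikit-learn sketch of this elbow procedure; the feature file name and the K-means settings are assumptions.

```python
# Fit K-means on the CNN feature vectors for several values of K and record the
# loss (inertia), then plot the elbow curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

features = np.load("image_features.npy")    # (N, 2048) CNN feature vectors (assumed file)

k_values = [1, 2, 3, 5, 10, 20, 40, 60, 300, 1000]
losses = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(features)
    losses.append(km.inertia_)               # sum of squared distances to nearest centroid

plt.plot(k_values, losses, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Loss (inertia)")
plt.title("KMeans Elbow Curve")
plt.show()

# Cluster assignments for the chosen K (K = 60, as selected via the elbow)
labels = KMeans(n_clusters=60, random_state=0, n_init=10).fit_predict(features)
```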

    The following table depicts the loss values obtained for each value of k:

Table 3: Loss Value vs. Number of Clusters

Value of K (Number of Clusters) | Loss Value
1    | 7674
2    | 7501
3    | 7411
5    | 7268
10   | 7051
20   | 6830
40   | 6612
60   | 6477
300  | 5957
1000 | 5494

    The following is the elbow curve we generated based on the above values:

    KMeans Elbow Curve
    Figure 1: KMeans Elbow Curve

Hence, based on the elbow method, we chose K = 60 and generated 60 clusters. We then manually inspected each cluster (i.e. each of the 60 folders of images) and discarded several clusters to create a clean image dataset. In this way, the dataset was reduced from 20,523 to roughly 13,000 image/question pairs.

    The following images show samples that were discarded:

    Discarded Cluster Sample 1
    Figure 2: Discarded Cluster Sample 1
    Discarded Cluster Sample 2
    Figure 3: Discarded Cluster Sample 2

    The following images show samples that were included:

    Included Cluster Sample 1
    Figure 4: Included Cluster Sample 1
    Included Cluster Sample 2
    Figure 5: Included Cluster Sample 2

    Results of Data preprocessing

To overcome Challenges 1, 2 and 3, we ran K-means on feature vectors generated by a forward pass of a Convolutional Neural Network to produce 60 clusters. Of those 60 clusters, we kept 29 and discarded the rest based on the reasoning above, yielding a dataset of roughly 13,000 image/question pairs.

    Annotation Script

After selecting the 29 folders (clusters), we needed to use only the images from those folders. To read images from exactly these 29 of the 60 folders, we wrote an annotation script.

The script reads the annotation JSON file, which has the following structure:

       
{
  "Image_name": "...",                         // image file name
  "Question": "...",                           // the question associated with the image
  "Answers": [ ... ],                          // list of answers from 10 annotators
  "Answer_type": "yes/no" | "other" | "unanswerable",
  "Answerable": 0 or 1
}
      
    

From this file, only the images that are present in the 29 retained clusters are considered; for these image names, we read the associated questions and answers. All of this data is then written to a new JSON file.

This new JSON file is then used for further processing. A sketch of such a script follows.
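The exact script is not shown in the report; the following is a hypothetical sketch of this filtering step, assuming one folder per retained cluster and field names matching the structure shown above. File paths are assumptions.

```python
# Keep only the image/question/answer entries whose images ended up in one of
# the 29 retained cluster folders, and write them to a new JSON file.
import json
from pathlib import Path

CLUSTER_ROOT = Path("clusters")             # one sub-folder per retained cluster (assumed layout)
ANNOTATION_FILE = "train_annotations.json"  # original annotation file (assumed name)
OUTPUT_FILE = "filtered_annotations.json"

# Collect the names of all images that survived the manual cluster selection.
kept_images = {p.name for p in CLUSTER_ROOT.glob("*/*.jpg")}

with open(ANNOTATION_FILE) as f:
    annotations = json.load(f)               # assumed to be a list of entries as above

# Keep only the entries whose image belongs to a retained cluster.
filtered = [entry for entry in annotations if entry["Image_name"] in kept_images]

with open(OUTPUT_FILE, "w") as f:
    json.dump(filtered, f, indent=2)

print(f"Kept {len(filtered)} of {len(annotations)} image/question pairs")
```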

Verification of Clustering using PCA and t-SNE

After K-means clustering, we selected a few clusters manually. To verify that these manually selected clusters were reasonable, we used PCA to reduce the dimensionality of the image features and then t-SNE to visualise the high-dimensional data as clusters.

Principal Component Analysis (PCA)

PCA is a dimensionality-reduction technique in which each data point is projected onto only the first few principal components, yielding lower-dimensional data while preserving as much of the data's variation as possible. Here, we use PCA to reduce the dimensionality of the image feature vectors so that they can be given as input to t-SNE.

t-SNE

t-SNE is an unsupervised, non-linear technique primarily used for data exploration and for visualising high-dimensional data. The algorithm computes a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space, and then optimises the low-dimensional embedding with a cost function that matches the two. t-SNE differs from PCA in that it preserves only small pairwise distances (local similarities), whereas PCA preserves large pairwise distances in order to maximise variance.
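A minimal sketch of this PCA + t-SNE verification with scikit-learn; the file names and hyperparameter values are illustrative assumptions.

```python
# Reduce the CNN features with PCA, embed them into 2-D with t-SNE, and colour
# the points by their K-means cluster assignment.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

features = np.load("image_features.npy")    # (N, 2048) CNN feature vectors (assumed file)
labels = np.load("cluster_labels.npy")      # K-means cluster label per image (assumed file)

reduced = PCA(n_components=50).fit_transform(features)
embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(reduced)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab20", s=4)
plt.title("t-SNE of CNN features, coloured by K-means cluster")
plt.show()
```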

In the figure below, we can see that the manually selected clusters are well-defined, which indicates that the clustering was done reasonably accurately. We also tried tuning the hyperparameters by using different values of K (the number of clusters) to obtain different results.

    t-distributed stochastic neighbor embedding
    Figure 14: t-distributed stochastic neighbor embedding (t-SNE)

    Average and Max Pooling

We also experimented with max pooling and average pooling after the convolutional layers for both the Inceptionv3 and ResNet models; the results are shared in Tables 4, 5, 6, 7 and 8, and a small pooling sketch follows.
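For reference, here is a tiny PyTorch sketch of the two pooling variants applied to a batch of convolutional feature maps; the tensor shape is illustrative.

```python
# Global average vs. max pooling over the final convolutional feature maps.
import torch

conv_maps = torch.randn(32, 2048, 8, 8)     # (batch, channels, H, W) activations, illustrative

avg_feats = torch.nn.AdaptiveAvgPool2d(1)(conv_maps).flatten(1)   # (32, 2048)
max_feats = torch.nn.AdaptiveMaxPool2d(1)(conv_maps).flatten(1)   # (32, 2048)
```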

ResNet vs. Inception

    InceptionV3_Max_Normalized

Table 4: InceptionV3 Max Normalized

Value of K (Number of Clusters) | Loss Value
1    | 7691
2    | 7508
3    | 7392
5    | 7226
10   | 6991
20   | 6747
40   | 6499
60   | 6356
300  | 5796
1000 | 5322
    KMeans Elbow Curve (InceptionV3 Max Normalized)
    Figure 6: KMeans Elbow Curve (InceptionV3 Max Normalized)

    InceptionV3_Avg_Normalized

Table 5: InceptionV3 Avg Normalized

Value of K (Number of Clusters) | Loss Value
1    | 11171
2    | 10832
3    | 10630
5    | 10307
10   | 9913
20   | 9508
40   | 9046
60   | 8788
300  | 7813
1000 | 7029
    KMeans Elbow Curve (InceptionV3 Avg Normalized)
    Figure 7: KMeans Elbow Curve (InceptionV3 Avg Normalized)

    InceptionV3_Max_Unnormalized

Table 6: InceptionV3 Max Unnormalized

Value of K (Number of Clusters) | Loss Value
1    | 114840986
2    | 10739880
3    | 105134892
5    | 102756470
10   | 99803204
20   | 96764906
40   | 93636811
60   | 91779277
300  | 84182650
1000 | 76288868
    KMeans Elbow Curve (InceptionV3 Max Unnormalized)
    Figure 8: KMeans Elbow Curve (InceptionV3 Max Unnormalized)

    InceptionV3_Avg_Unnormalized


Table 7: InceptionV3 Avg Unnormalized

Value of K (Number of Clusters) | Loss Value
1    | 5165279
2    | 4968424
3    | 4867648
5    | 4738606
10   | 4554972
20   | 4392889
40   | 4207680
60   | 4094941
300  | 3667674
1000 | 3312151

    KMeans Elbow Curve (InceptionV3 Avg Unnormalized)
    Figure 9: KMeans Elbow Curve (InceptionV3 Avg Unnormalized)

    ResNet50_Max_Normalized


Table 8: ResNet50 Max Normalized

Value of K (Number of Clusters) | Loss Value
1    | 2840
2    | 2218
3    | 2026
5    | 1855
10   | 1679
20   | 1537
40   | 1421
60   | 1361
300  | 1146
1000 | 960

    KMeans Elbow Curve (ResNet50 Max Normalized)
Figure 10: KMeans Elbow Curve (ResNet50 Max Normalized)

We finally chose Inceptionv3 with max pooling, run on GPU with a batch size of 32. K-means was run on the resulting image feature vectors, and we selected this configuration based on the loss values obtained.

Step 2. Training Pipeline

After choosing the relevant clusters, the next task is to represent the training data to be fed into the neural network. We use the CNN image features together with question-text features as our input, fusing the two vectors by simply concatenating (stacking) them.

  • Image Representation: The CNN features extracted in the clustering step above (on which we ran K-means) capture the image content well and serve as our image representation. The feature vector of each image has size 2048.
  • Text Representation: To model the question text, we initially used the bag-of-words (BOW) technique. BOW extracts frequency-based features from text: it involves a vocabulary of known words (the question vocabulary) and a measure of the presence of these words. The embedding of each question has size 3042. We later moved from bag-of-words to BERT in order to generate embeddings that also capture the context of the questions.
Question Representation

    Bag of Words

    To generate BOW for questions, we have used the following steps:

  • We applied standard preprocessing steps such as tokenizing the question texts, lower-casing all tokens and removing all punctuation marks. We did not remove stop words (such as "what" and "how"), since these words often appear at the start of questions and are sometimes highly correlated with the answer.
  • After preprocessing, we obtain 3042 unique words in the question vocabulary.
  • Therefore, we represent each question as a vector of size 3042, where each column corresponds to a unique word and holds the frequency of that word in the question text. This generates the bag-of-words representation for the questions in the dataset (a short sketch follows below).
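A minimal sketch of this bag-of-words step using scikit-learn's CountVectorizer; the library choice is an assumption, and any equivalent tokeniser would do.

```python
# Build a frequency-based bag-of-words representation of the questions.
from sklearn.feature_extraction.text import CountVectorizer

questions = ["What is this?", "What color is this shirt?"]   # illustrative examples

# Lower-case and strip punctuation, but keep stop words such as "what" and "how".
vectorizer = CountVectorizer(lowercase=True, stop_words=None)
bow = vectorizer.fit_transform(questions)    # sparse (num_questions, vocab_size) count matrix

print(len(vectorizer.vocabulary_))           # vocabulary size (~3042 on the full filtered set)
```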
Limitations of Bag of Words

Using the bag-of-words technique has two limitations:
  • Sparse representations: For a large vocabulary, the resulting vectors have high dimensionality and contain mostly zeros. This produces sparse vectors and blows up the feature space.
  • Missing context: Bag-of-words does a poor job of capturing the context of the text. Using BERT, an attention-based deep learning model, helps address this lack of contextual awareness.
BERT Question Embeddings

We use SentenceBERT to encode all questions into 768-dimensional sentence embeddings. SentenceBERT takes sentences (or sentence pairs) as input, passes them through a BERT model followed by a pooling layer, and outputs embeddings. BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning technique for natural-language-processing pre-training developed by Google. SentenceBERT uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Using BERT requires pre-processing such as adding the [CLS] and [SEP] tokens, which denote the start and end of a sentence; we use the BERT tokenizer to tokenize the sentences. We take the last hidden layer of SentenceBERT to compute the embeddings, which have dimension 768 for each question.
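A minimal sketch using the sentence-transformers library; the specific checkpoint name is an assumption, chosen because it yields 768-dimensional embeddings, and the library handles the [CLS]/[SEP] tokenisation internally.

```python
# Encode questions into 768-dimensional sentence embeddings with SentenceBERT.
from sentence_transformers import SentenceTransformer

questions = ["What is this?", "What brand is this can?"]     # illustrative examples

model = SentenceTransformer("bert-base-nli-mean-tokens")     # BERT-base + mean pooling, 768-d
question_embeddings = model.encode(questions)                # (num_questions, 768) numpy array
```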

    Answer Representation

Similar to the question text, we also need to model the answer text. A distinctive feature of the VizWiz dataset is that each question has 10 annotated answers, so we have multiple ground-truth labels per question. We decided not to use one-hot encoding because that would lead to very high-dimensional label vectors, blowing up the feature space and making the model suffer from the curse of dimensionality. Instead, we followed these steps:

  • We combined the answers to all questions into one large corpus.
  • We applied preprocessing steps similar to those used for the questions, such as tokenizing the answer lists and removing all punctuation marks.
  • After preprocessing, we counted the frequency of each word in this corpus and chose the top K most frequent words; here we took the top K = 3000 unique words from the answer vocabulary.
  • We then create a vector of size K as the answer label for each question.
  • Each entry of this K-sized label vector holds the count of the corresponding top-K word across the question's 10 answers (see the sketch below).
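A small sketch of this answer-label construction; the toy answer list is purely illustrative, as in the pipeline the answers come from the filtered annotation file.

```python
# Build a top-K answer vocabulary and represent each question's 10 answers as a
# K-dimensional count vector over that vocabulary.
from collections import Counter

K = 3000

def tokenize(text):
    # Minimal normalisation mirroring the question preprocessing.
    return text.lower().replace("?", "").replace(".", "").replace(",", "").split()

# Illustrative stand-in for the 10 annotated answers of a single question.
answers_per_question = [
    ["soda can", "coke", "coca cola", "coke can", "soda",
     "coke", "coca cola", "unanswerable", "coke", "soda can"],
]

# Count word frequencies over the whole answer corpus and keep the top K words.
corpus_counts = Counter(
    tok for answers in answers_per_question for ans in answers for tok in tokenize(ans)
)
vocab = [word for word, _ in corpus_counts.most_common(K)]
index = {word: i for i, word in enumerate(vocab)}

def answer_label(answers):
    """K-dimensional count vector of top-K answer words across the 10 answers."""
    vec = [0] * K
    for ans in answers:
        for tok in tokenize(ans):
            if tok in index:
                vec[index[tok]] += 1
    return vec

label = answer_label(answers_per_question[0])
```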
Architecture

As discussed in the training pipeline, we have an image + question embedding for each (image, question) pair. For the image embeddings, we use the frozen parameters of the Inceptionv3 model trained on ImageNet classification; no fine-tuning was performed.

    Previous Architecture using Bag-of-Words

We concatenate the BOW question vector and the image feature vector (3042 + 2048) to obtain a vector of dimension 5090. This combined input is fed to a Multi-Layer Perceptron (MLP): a fully connected neural network classifier with 4 hidden layers of 5090 hidden units each, using ReLU activations, followed by a log-softmax layer to obtain a probability distribution over the top K answers. Since the VizWiz dataset has 10 ground-truth answers for each training instance, we use a "soft" cross-entropy loss so that the model optimises its weights by considering every ground-truth answer. A sketch of this architecture follows.
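A hedged PyTorch sketch of this architecture and loss; the layer widths follow the text above, while the optimiser, batching, and other training details are omitted and the tensors below are random placeholders.

```python
# MLP over concatenated image (2048-d) and BOW question (3042-d) features, with
# a log-softmax over the top-K answer words and a "soft" cross-entropy loss.
import torch
import torch.nn as nn

K = 3000                    # size of the answer vocabulary
INPUT_DIM = 2048 + 3042     # image features + BOW question features

class VQAMlp(nn.Module):
    def __init__(self, input_dim=INPUT_DIM, hidden_dim=5090, num_answers=K):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(4):                                   # four hidden layers
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        layers += [nn.Linear(dim, num_answers), nn.LogSoftmax(dim=-1)]
        self.net = nn.Sequential(*layers)

    def forward(self, image_feat, question_feat):
        return self.net(torch.cat([image_feat, question_feat], dim=-1))

def soft_cross_entropy(log_probs, answer_weights):
    """Weighted sum of -log p over the ground-truth answers, where answer_weights
    holds each answer's relative frequency among the 10 annotations."""
    return -(answer_weights * log_probs).sum(dim=-1).mean()

model = VQAMlp()
img = torch.randn(8, 2048)
q = torch.randn(8, 3042)
targets = torch.rand(8, K)
targets = targets / targets.sum(dim=-1, keepdim=True)        # normalised answer weights
loss = soft_cross_entropy(model(img, q), targets)
```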

    Training Pipeline Architecture
    Figure 11: Training Pipeline Architecture

    Current Architecture using BERT Embeddings

In the current architecture, we concatenate the BERT question embeddings and the image features (2048 + 768) to obtain a vector of dimension 2816. Here, we generated results by varying the number of hidden layers, hidden units (neurons), dropout rates and learning rates, with and without regularization. As in the previous architecture, we use ReLU as the activation function, followed by a log-softmax layer to obtain a probability distribution over the top K answers.

    Training Pipeline Architecture
    Figure 12: Training Pipeline Architecture

    Loss Function Used

The proposed loss function, termed soft cross-entropy, is a weighted average of the negative log-likelihoods of the unique ground-truth answers.

    Loss Function Equation

Here, c is a vector of the unique ground-truth answers and w is a vector of answer weights, computed as the number of times each unique answer appears in the ground-truth set divided by the total number of answers.
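Since the equation is shown only as an image above, here is a reconstruction based on that description and the soft cross-entropy of Ilievski & Feng (2017), where p(c_i | x) denotes the model's predicted probability of answer c_i for input x:

```latex
\mathcal{L}(x, c, w) \;=\; -\sum_{i=1}^{|c|} w_i \,\log p\!\left(c_i \mid x\right),
\qquad
w_i \;=\; \frac{\#\{\text{annotations equal to } c_i\}}{10}
```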

    Accuracy Metric Used

The 10 answers given by humans for each visual question can differ. A predicted answer is therefore counted as 100% correct if it appears at least 3 times among the annotations, ~67% correct if it appears exactly twice, and ~33% correct if it appears only once. Hence, we use the following accuracy metric:

Accuracy Metric Equation
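As the metric is likewise shown only as an image, here is a reconstruction consistent with the percentages above (the standard VizWiz/VQA evaluation metric), where \hat{a} denotes the predicted answer:

```latex
\text{accuracy}(\hat{a}) \;=\; \min\!\left( \frac{\#\{\text{annotators who gave } \hat{a}\}}{3},\; 1 \right)
```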

    Preliminary Results for Mid-term

    The following are the preliminary results of training on randomly selected samples from the filtered VizWiz dataset.

  • Epochs: 150, Training Instances: 10000, Validation Instances: 3108, Optimiser: Adam, Learning Rate: 1e-6
  • Hidden Layers: 4, Activation: ReLU
Preliminary Results
    Figure 13: Train Loss
    Preliminary Results
    Figure 14: Validation Loss

    Final Results

    BERT and BOW Models without dropout and regularization

Model 1: Baseline (BOW) accuracy graph for 100 epochs and 4 hidden layers on the training dataset, without dropout or regularization.
Preliminary Results
Figure 15: Training Dataset
Baseline (BOW) accuracy graph for 100 epochs and 4 hidden layers on the validation dataset, without dropout or regularization.
Preliminary Results
Figure 16: Validation Dataset
Model 2: BERT accuracy graph for 100 epochs and 4 hidden layers on the training dataset, without dropout or regularization.
Preliminary Results
Figure 17: Training Dataset
BERT accuracy graph for 100 epochs and 4 hidden layers on the validation dataset, without dropout or regularization.
Preliminary Results
Figure 18: Validation Dataset

    BERT and BOW Models with dropout and regularization

Model 1: 4 hidden layers with BOW, learning rate 1e-6.
Preliminary Results
Figure 19: Validation Loss
Model 2: 4 hidden layers with BERT embeddings, learning rate 1e-5.
Preliminary Results
Figure 20: Validation Loss
Preliminary Results
Figure 21: Validation Loss
Model 3: 4 hidden layers with BERT embeddings, learning rate 1e-4.
Preliminary Results
Figure 22: Validation Loss
Preliminary Results
Figure 23: Validation Loss
Model 4: 1 hidden layer with BERT embeddings, learning rate 1e-5, 5000 neurons.
Preliminary Results
Figure 24: Validation Loss
Preliminary Results
Figure 25: Validation Loss
Model 5: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 3000 neurons, L2 regularization (1e-5).
Preliminary Results
Figure 26: Validation Loss
Preliminary Results
Figure 27: Validation Loss
Model 6: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 1024 neurons, L2 regularization (1e-5).
Preliminary Results
Figure 28: Validation Loss
Model 7: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 1024 neurons, L2 regularization (1e-5) applied only to the training loss.
Preliminary Results
Figure 29: Validation Loss
Model 8: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 512 neurons, L2 regularization (1e-5).
Preliminary Results
Figure 30: Validation Loss
Model 9: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 256 neurons, L2 regularization (1e-5).
Preliminary Results
Figure 31: Validation Loss
Model 10: 2 hidden layers with BERT embeddings, dropout (0.5), learning rate 1e-5, 128 neurons, L2 regularization (1e-5).
Preliminary Results
Figure 32: Validation Loss
Model 11: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 128 neurons, L2 regularization (1e-5).
Preliminary Results
Figure 33: Validation Loss

    Accuracy Analysis:

The baseline accuracy is around 33%, and the final accuracy of the model with BERT embeddings is 35.8%, only a slight increase. Since visually impaired people both take the pictures and ask the questions themselves, this dataset is challenging for modern vision algorithms. It poses the following challenges for model training:

  • Images are often of poor quality due to low focus and poor lighting, which can lead to less accurate results.
  • Questions are, on average, more conversational, and in some cases are incomplete due to audio-recording imperfections.
  • Visually impaired users are also unable to verify the correctness of the captured image, which sometimes leads to a mismatch between the image and the question asked.
  • We use pre-trained CNN architectures (Inceptionv3, ResNet) for image features and pre-trained BERT sentence transformers for text features. We have not fine-tuned these on our dataset, which may lead to slightly lower accuracy than expected.
  • We attribute the poor generalisation of the model largely to its inability to predict answers observed in the wider dataset: only 824 of the top 3000 answers in VizWiz are present in the data used to train the model.
Conclusion and Future Scope

In this way, our project develops a software framework that can detect objects in images and answer questions based on the content of those images. From a big-picture perspective, such a system provides high-level contextual information about the surroundings to the visually impaired.

The following are some directions for future work on the project:

  • To improve the dataset, data augmentation can be performed, such as adding data from other datasets or generating additional images through translation, rotation, flipping, etc., along with corresponding question/answer sets.
  • Once the accuracy is improved, our model could be used by visually impaired people for navigation and daily chores. We could also request their feedback and input for developing the model further.
  • The dataset currently contains only static images. In the future, video-based analysis could make the system applicable to dynamic, real-time use cases.
  • Our framework currently takes questions as text and returns answers as text. For the ease of visually impaired users, a speech-to-text system is needed to convert their spoken questions into text, along with a text-to-speech system to read the answers back to them.
References

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision (pp. 2425-2433).
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
  • Ilievski, I., & Feng, J. (2017). A simple loss function for improving the convergence and accuracy of visual question answering models. arXiv preprint arXiv:1708.00584.
  • Dushi, D. (2019). Using Deep Learning to Answer Visual Questions from Blind People.
  • Saquib, Z., et al. (2017). BlinDar: An invisible eye for the blind people. In IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology, India, May 2017.
  • ImageNet
  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.