
CS7641 - Project Team 17


At present, visually impaired people typically rely on a simple stick for navigation. Such a stick does not let them navigate independently: in an unfamiliar environment, the most they can do without external help is detect stationary obstacles around them. They cannot make decisions based on a comprehensive understanding of their surroundings.

To address this problem, we intend to develop a Machine Learning model for generating accurate visual understanding of a given scene. A model of this kind can potentially be integrated into an e-stick which assists the visually impaired and enables them to move with the same ease and confidence as normally sighted people.

Problem Definition

Our project aims at developing a software framework that can detect objects from images and then answer questions based on the content of those images. From a big picture perspective, this project is a stepping stone towards engineering a system that can capture real-time images to provide high-level contextual information about the surroundings. In the future, combining such a system with a mapping and navigation module and integrating into an e-stick would enable the visually impaired to navigate independently.


To address the visual challenges that visually impaired people face in their day-to-day lives, we present a Machine Learning model based on the VizWiz dataset. The training dataset consists of photos taken by blind people, each annotated with the question asked about that image. Each sample also includes answers and answer types provided by 10 people. This presents both an opportunity and a challenge: to assist the visually impaired with navigation, with daily tasks and with answering their visual questions. The original VizWiz dataset consisted of:

  • 20,523 training image/question pairs
  • 4,319 validation image/question pairs
  • 8,000 test image/question pairs

Methods

    Step 1. Data Preprocessing

    Feature Extraction using Convolutional Neural Networks

    We use transfer learning to create feature vectors for the images in the dataset. The feature vectors are the activations from the last layer of three widely used pre-trained models: Inceptionv3, ResNet (Residual Network) and VGG16 (Very Deep Convolutional Networks for Large-Scale Image Recognition). These models are pre-trained on ImageNet, a large image dataset containing more than 14 million images, so they can be reused to create feature vectors for the VizWiz images.

    We first ran Inceptionv3 on CPU and on GPU, with and without batching. The time taken for each configuration is shown in Table 1. We also tried different batch sizes during feature extraction; these timings are shown in Table 2.

    Table 1: Time Taken with Different Configuration
    CPU/GPU used Time
    CPU feature extraction time 1 h 42 min
    CPU feature extraction time 43 min

    Table 2: Execution Time vs Batching Size
    Batching Sizes Time
    4 8m 2s
    8 5m 9s
    16 5m 14s
    32 5m 11s
    64 5m 2s
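
    As a rough illustration of the batching compared in Table 2, the sketch below splits a list of image paths into fixed-size batches. The helper name `make_batches` and the file names are hypothetical, and the feature-extraction call that would consume each batch is omitted.

```python
def make_batches(items, batch_size):
    """Yield consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


# Hypothetical file names, used only for illustration.
image_paths = [f"image_{i}.jpg" for i in range(10)]
batches = list(make_batches(image_paths, batch_size=4))
```

    Each batch would then be passed through the network in a single forward pass, which is what reduces the per-image overhead.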

    Unsupervised Algorithm: Clustering

    On closely examining the dataset, we observed that some images are completely blurred, some are black/white and others have too much flash. Hence we realised the need to clean the dataset before feeding it to our training pipeline.

    Challenge 1:

    The first challenge that we faced is how to identify the images to be discarded.


  • To identify such groups of images, we decided to perform K-means clustering on the original image dataset to generate similar clusters.
  • After generating clusters of similar images, we manually went through each of the clusters, and discarded the clusters that seemed useless for our use case.
  • The discarded clusters include clusters having blurred/ unclear images.

    Challenge 2:

    K-means clustering uses Euclidean distance as the metric for similarity between images. However, two images may contain similar objects in different orientations, positions or color contrasts. Hence the Euclidean distance between raw pixel values alone is not a good metric.


    We therefore used a CNN to extract a feature vector for each image and performed the clustering on these CNN-generated feature vectors rather than on raw pixels. The features learned by the CNN layers respond to the objects present in an image and are less sensitive to their position, orientation and contrast, so images with similar content end up close together.

    Challenge 3:

    The next challenge was to identify a suitable value of K (the number of clusters).

    To choose the number of clusters, we calculated the K-means loss for a range of K values. We then plotted these values and used the Elbow method to pick the value of K for our use case.
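
    A minimal sketch of computing a K-means loss for several values of K is given below. It is a plain-Python version of Lloyd's algorithm on toy 2-D points, not the code used for the experiments; the function names, the seed and the data are illustrative assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_loss(points, k, iterations=20, seed=0):
    """Run Lloyd's algorithm and return the sum of squared distances
    from each point to its nearest centroid (the K-means loss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
losses = {k: kmeans_loss(points, k) for k in (1, 2, 3, 6)}
```

    Plotting `losses` against K gives an elbow curve for the toy data; on real feature vectors the same loop would run over the 2048-dimensional CNN features.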

    The following table depicts the loss values obtained for each value of k:

    Table 3: Loss Value vs Number of Clusters
    Value of K (Number of Clusters) Loss Value
    1 7674
    2 7501
    3 7411
    5 7268
    10 7051
    20 6830
    40 6612
    60 6477
    300 5957
    1000 5494

    The following is the elbow curve we generated based on the above values:

    Figure 1: KMeans Elbow Curve

    Based on the elbow method, we chose K = 60 and generated 60 clusters. We then manually inspected each cluster (i.e. each of the 60 folders of images) and discarded the clusters that were not useful, to create a clean image dataset. In this way, the dataset was reduced from 20,523 image/question pairs to 13,000 image/question pairs.
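
    One way to make the elbow more concrete is to compute how much the loss drops per added cluster between consecutive rows of Table 3. The sketch below uses only the values reported in that table; the helper name `loss_drop_per_cluster` is an assumption, and this is an illustrative check rather than the procedure used to pick K.

```python
# (K, loss) pairs copied from Table 3.
table_3 = [(1, 7674), (2, 7501), (3, 7411), (5, 7268), (10, 7051),
           (20, 6830), (40, 6612), (60, 6477), (300, 5957), (1000, 5494)]


def loss_drop_per_cluster(rows):
    """Return (K_from, K_to, average loss reduction per added cluster)."""
    drops = []
    for (k0, loss0), (k1, loss1) in zip(rows, rows[1:]):
        drops.append((k0, k1, (loss0 - loss1) / (k1 - k0)))
    return drops


drops = loss_drop_per_cluster(table_3)
for k0, k1, drop in drops:
    print(f"K {k0:>4} -> {k1:>4}: {drop:8.2f} per added cluster")
```

    On these numbers the reduction per added cluster falls from 173 (K = 1 to 2) to 6.75 (K = 40 to 60) and to about 2.2 between K = 60 and K = 300, which is consistent with diminishing returns past K = 60. Where exactly to place the elbow remains a judgment call.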

    The following images show samples that were discarded:

    Figure 2: Discarded Cluster Sample 1
    Figure 3: Discarded Cluster Sample 2

    The following images show samples that were included:

    Figure 4: Included Cluster Sample 1
    Figure 5: Included Cluster Sample 2

    Results of Data preprocessing

    To overcome challenges 1, 2 and 3, we ran K-means on feature vectors generated by a forward pass of a Convolutional Neural Network to produce 60 clusters. Of those 60 clusters, we kept 29 and discarded the rest for the reasons given above, yielding a dataset of 13,000 image/question pairs.

    Annotation Script

    After selecting the 29 folders (clusters) to keep, we needed to use only the images from these folders. To read only the images in these 29 of the 60 folders, we wrote an annotation script.

    The script reads the annotation json file, in which each entry has the following structure:

    {
        "Image_name": " ",      // Image name
        "Question": " ",        // The question associated with each image
        "Answers": [ ],         // List of answers from 10 people
        "Answer_type": " ",     // One of: yes/no, other, unanswerable
        "Answerable": 0/1
    }

    From this file, we keep only the entries whose images are present in the 29 clusters, and for those image names we read the associated questions and answers. All this data is then written to a new json file.

    This new json file is then used for further processing.
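
    The filtering step can be sketched as follows. The key names follow the structure shown above; the sample entries, the file names and the helper `filter_annotations` are made up for illustration, and the actual script may differ.

```python
import json


def filter_annotations(annotations, kept_image_names):
    """Keep only the annotation entries whose image is in a kept cluster."""
    kept = set(kept_image_names)
    return [entry for entry in annotations if entry["Image_name"] in kept]


# Hypothetical annotation entries, for illustration only.
annotations = [
    {"Image_name": "img_1.jpg", "Question": "What is this?", "Answerable": 1},
    {"Image_name": "img_2.jpg", "Question": "What color is it?", "Answerable": 0},
    {"Image_name": "img_3.jpg", "Question": "Is this milk?", "Answerable": 1},
]
# Image names collected from the folders of the kept clusters (assumed).
kept_images = {"img_1.jpg", "img_3.jpg"}

filtered = filter_annotations(annotations, kept_images)
new_json = json.dumps(filtered, indent=2)  # contents of the new json file
```

    In practice the kept image names would be gathered by listing the 29 cluster folders, and `new_json` would be written to disk.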

    Verification of Clustering using PCA and T-SNE

    After K-means clustering, we manually selected a subset of the clusters. To verify that the manually selected clusters were well separated, we used PCA to reduce the dimensionality and then t-SNE to visualize the high-dimensional image features as clusters.

    Principal Component Analysis (PCA)

    PCA is a dimensionality reduction technique in which each data point is projected onto only the first few principal components, giving lower-dimensional data while preserving as much of the data's variation as possible. Here, we use PCA to reduce the dimensionality of the image feature vectors so that they can then be given as input to t-SNE.
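
    To illustrate the projection step, the sketch below estimates the first principal component of a tiny 2-D dataset with power iteration and projects the points onto it. This is a teaching sketch in plain Python under stated assumptions (toy data, a single component, hypothetical function names); a library implementation would normally be used on the real feature vectors.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the top principal component with power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Assumes the data has non-zero variance and that this start vector
    # is not orthogonal to the top component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Project each centered row onto the given component."""
    return [sum((row[j] - means[j]) * component[j] for j in range(len(means)))
            for row in rows]


# Toy data lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, component = first_principal_component(rows)
projected = project(rows, means, component)
```

    For this toy data the component points along the direction (1, 2), so a single coordinate per point preserves all of the variation.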


    t-Distributed Stochastic Neighbor Embedding (t-SNE)

    t-SNE is an unsupervised, non-linear technique primarily used for data exploration and for visualizing high-dimensional data. The t-SNE algorithm calculates a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space. It then tries to make these two similarity measures agree by minimizing a cost function. t-SNE differs from PCA in that it preserves only small pairwise distances, or local similarities, whereas PCA is concerned with preserving large pairwise distances to maximize variance.

    In the figure below, the manually selected clusters are well separated, which indicates that the clustering worked well. We also tried different values of K (the number of clusters) to obtain different results.

    Figure 14: t-distributed stochastic neighbor embedding (t-SNE)

    Average and Max Pooling

    We also experimented with max pooling and average pooling after the convolution layers for both the Inceptionv3 and ResNet models. The results are shown in Tables 4, 5, 6, 7 and 8.
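
    The difference between the two pooling choices can be sketched on a tiny feature map. The sketch assumes a height x width x channels layout and reduces each channel to a single value; the function name `global_pool` and the numbers are illustrative only.

```python
def global_pool(feature_map, mode="max"):
    """Reduce an H x W x C feature map to a C-dimensional vector."""
    # Flatten the spatial grid into a list of per-position channel vectors.
    positions = [position for row in feature_map for position in row]
    channels = list(zip(*positions))  # one tuple of values per channel
    if mode == "max":
        return [max(values) for values in channels]
    if mode == "avg":
        return [sum(values) / len(values) for values in channels]
    raise ValueError("mode must be 'max' or 'avg'")


# Toy 2 x 2 feature map with 2 channels.
feature_map = [
    [[1.0, 8.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
max_features = global_pool(feature_map, "max")
avg_features = global_pool(feature_map, "avg")
```

    Max pooling keeps the strongest response per channel, while average pooling summarises the response over the whole image.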

    ResNet vs. Inception


    Table 4: InceptionV3 Max Normalized
    Value of K (Number of Clusters) Loss Value
    1 7691
    2 7508
    3 7392
    5 7226
    10 6991
    20 6747
    40 6499
    60 6356
    300 5796
    1000 5322
    Figure 6: KMeans Elbow Curve (InceptionV3 Max Normalized)


    Table 5: InceptionV3 Avg Normalized
    Value of K (Number of Clusters) Loss Value
    1 11171
    2 10832
    3 10630
    5 10307
    10 9913
    20 9508
    40 9046
    60 8788
    300 7813
    1000 7029
    Figure 7: KMeans Elbow Curve (InceptionV3 Avg Normalized)


    Table 6: InceptionV3 Max Unnormalized
    Value of K (Number of Clusters) Loss Value
    1 114840986
    2 10739880
    3 105134892
    5 102756470
    10 99803204
    20 96764906
    40 93636811
    60 91779277
    300 84182650
    1000 76288868
    Figure 8: KMeans Elbow Curve (InceptionV3 Max Unnormalized)


    Table 7: InceptionV3 Avg Unnormalized
    Value of K (Number of Clusters) Loss Value
    1 5165279
    2 4968424
    3 4867648
    5 4738606
    10 4554972
    20 4392889
    40 4207680
    60 4094941
    300 3667674
    1000 3312151

    Figure 9: KMeans Elbow Curve (InceptionV3 Avg Unnormalized)


    Table 8: ResNet50 Max Normalized
    Value of K (Number of Clusters) Loss Value
    1 2840
    2 2218
    3 2026
    5 1855
    10 1679
    20 1537
    40 1421
    60 1361
    300 1146
    1000 960

    Figure 10: KMeans Elbow Curve (ResNet50 Max Normalized)

    We finally chose Inceptionv3 with max pooling, run on GPU with a batch size of 32. We ran K-Means on the resulting image feature vectors and selected this architecture based on the loss values obtained.

    Step 2. Training Pipeline

    After choosing the relevant clusters, the next task is to represent the training data to be fed into the neural network. We use the CNN features together with the question text features as our input features, and fuse the two vectors by simple concatenation.

  • Image Representation: The CNN features extracted in the previous step for clustering (the input to K-Means) capture the image content and are a good representation for our images. Each image feature vector has size 2048.
  • Text Representation: To model question text, we initially used the bag-of-words (BOW) technique, which extracts word-frequency features from text for use in modeling. It involves two things: a vocabulary of known words (the question vocabulary) and a measure of the presence of these words. Each question is represented by a vector of size 3042. We later moved from BOW to BERT to generate embeddings that capture the context of the given questions.

    Question Representation

    Bag of Words

    To generate BOW for questions, we have used the following steps:

  • We applied standard preprocessing steps such as tokenizing the question texts, lower-casing all the tokens and removing all punctuation marks. We did not remove stop words (like what, how etc.) as these words appear initially in many questions and sometimes are highly correlated to the answer to a question.
  • After preprocessing, we get 3042 unique words in question vocabulary.
  • Therefore, we will represent each question using a vector of size 3042, where each column represents a unique word. For each question, we put the frequency of each word present in the question text. Thus, this generates a bag-of-words for the questions in the dataset.
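
    The steps above can be sketched in a few lines of Python. The two example questions and the function names are made up for illustration; on the real dataset the vocabulary would contain 3042 words.

```python
import string
from collections import Counter


def tokenize(text):
    """Lower-case the text, strip punctuation and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_vocabulary(questions):
    """Map every unique word (stop words included) to a vector position."""
    words = sorted({token for q in questions for token in tokenize(q)})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(question, vocabulary):
    """Return the word-frequency vector of a question."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(tokenize(question)).items():
        if token in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[token]] = count
    return vector


# Made-up example questions.
questions = ["What is this?", "What color is this shirt?"]
vocabulary = build_vocabulary(questions)
vector = bag_of_words("What is this this?", vocabulary)
```

    Here the vocabulary has 5 entries, so every question becomes a 5-dimensional count vector.
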

    Limitations of Bag of Words

    The bag-of-words technique has the following two limitations:
  • Generation of sparse representations: For very large input data, the resulting vectors have very high dimensionality and consist mostly of zeros. This leads to sparse vectors and blows up the feature space.
  • Missing context: Bag of words does a very poor job of capturing the context of the data. BERT, an attention-based deep learning model, helps to address this lack of contextual awareness.

    BERT Question Embeddings

    We use SentenceBERT to encode each question into a 768-dimensional sentence embedding. SentenceBERT takes a sentence (or a sentence pair) as input, passes it through a BERT model and a pooling layer, and outputs an embedding. BERT stands for Bidirectional Encoder Representations from Transformers; it is a transformer-based technique for natural language processing pre-training developed by Google. SentenceBERT uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. BERT input requires pre-processing such as adding the special tokens [CLS] and [SEP], which denote the start and end of the sentence; we use the BERT tokenizer to tokenize the sentences. We take the last hidden layer of SentenceBERT to compute the embeddings.

    Answer Representation

    As with the question text, we also need to model the answer text. A distinctive feature of the VizWiz dataset is that each question has 10 annotated answers, so there are multiple ground-truth labels per question. We decided not to use one-hot encoding because that would lead to very high-dimensional label vectors, blowing up the feature space and exposing the model to the curse of dimensionality. Instead, we followed these steps:

  • We combined answers of all the questions together to form a large corpus.
  • We used preprocessing steps similar to those used for questions, such as tokenizing the answer lists and removing all punctuation marks.
  • After the preprocessing, we count the frequency of each word in this text corpus and choose the top K most frequent words. We use K = 3000 unique words from the answer vocabulary.
  • We then create a vector of size K as the answer label for each question.
  • Each entry of this K-sized label is the number of times the corresponding top-K word appears in the 10 answers for that question.
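
    A minimal sketch of these steps, using a made-up corpus with three answers per question and K = 2 instead of ten answers and K = 3000. The function names are hypothetical.

```python
import string
from collections import Counter


def tokenize(text):
    """Lower-case the text, strip punctuation and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_answer_vocabulary(all_answers, k):
    """Map the k most frequent answer words to positions 0..k-1."""
    counts = Counter(token
                     for answers in all_answers
                     for answer in answers
                     for token in tokenize(answer))
    return {word: index for index, (word, _) in enumerate(counts.most_common(k))}


def answer_label(answers, vocabulary):
    """Count how often each top-k word appears in one question's answers."""
    label = [0] * len(vocabulary)
    for answer in answers:
        for token in tokenize(answer):
            if token in vocabulary:
                label[vocabulary[token]] += 1
    return label


# Made-up answers for two questions.
all_answers = [["blue", "blue", "blue shirt"], ["yes", "yes", "no"]]
vocabulary = build_answer_vocabulary(all_answers, k=2)
labels = [answer_label(answers, vocabulary) for answers in all_answers]
```

    Words outside the top K ("shirt" and "no" in this toy corpus) do not contribute to the label.
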

    Architecture

    As discussed in the training pipeline, we have the image + question embedding for each of the (image, question) pairs. For image embeddings, we use the frozen parameters from the Inceptionv3 model learned on ImageNet classification, and no fine-tuning was performed.

    Previous Architecture using Bag-of-Words

    We concatenate the image feature vector and the BOW question vector (2048 + 3042) to get a combined vector of dimension 5090, which is the input to a Multi-Layer Perceptron (MLP). The MLP is a fully connected neural network classifier consisting of 4 hidden layers with 5090 hidden units. We use ReLU activations, followed by a final log softmax layer to obtain a log-probability distribution over the top K answers. Since the VizWiz dataset has 10 ground-truth answers for each training instance, we use a "soft" cross-entropy loss so that the model optimises the weights by considering each ground-truth answer.
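
    The data flow described above (concatenation, ReLU hidden layers, log softmax output) can be sketched in plain Python. The tiny weights below are arbitrary and only show the shapes involved; this is not the trained model, and a deep learning framework would be used in practice.

```python
import math


def concatenate(image_features, question_features):
    """Fuse the two feature vectors by concatenation."""
    return list(image_features) + list(question_features)


def linear(vector, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def log_softmax(logits):
    """Numerically stable log softmax."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def mlp_forward(x, hidden_layers, output_layer):
    for weights, bias in hidden_layers:
        x = relu(linear(x, weights, bias))
    weights, bias = output_layer
    return log_softmax(linear(x, weights, bias))


# Dimension check with the sizes from the text.
fused = concatenate([0.0] * 2048, [0.0] * 3042)

# Toy forward pass: 1 image feature + 1 question feature, 3 answers.
x = concatenate([1.0], [2.0])
hidden_layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])]
output_layer = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
log_probs = mlp_forward(x, hidden_layers, output_layer)
```

    Exponentiating `log_probs` gives a probability distribution over the candidate answers.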

    Figure 11: Training Pipeline Architecture

    Current Architecture using BERT Embeddings

    In the current architecture, we concatenate the image features and the BERT question embeddings (2048 + 768) to get a combined vector of dimension 2816. We generated results while varying the number of hidden layers, the number of hidden units, the drop-out rate and the learning rate, with and without regularization. As in the previous architecture, we use ReLU as the activation function, followed by a log softmax layer to obtain a distribution over the top K answers.

    Figure 12: Training Pipeline Architecture

    Loss Function Used

    The loss function, termed soft cross entropy (Ilievski & Feng, 2017), is a simple weighted average of the negative log-probabilities of each unique ground-truth answer:

    $$L(x, c) = -\sum_{j} w_j \log \frac{e^{x_{c_j}}}{\sum_{i} e^{x_i}}$$

    Here, x is the vector of scores output by the network over the top K answers, c is a vector of unique ground-truth answers and w is a vector of answer weights, computed as the number of times each unique answer appears in the ground-truth set divided by the total number of answers.
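
    As a hedged illustration of this weighting, the sketch below computes the loss from a vector of log-probabilities and a list of ground-truth answer indices. It assumes every ground-truth answer is inside the answer vocabulary; the function names and numbers are illustrative.

```python
import math
from collections import Counter


def answer_weights(ground_truth):
    """Weight of each unique answer: its count divided by the total."""
    counts = Counter(ground_truth)
    total = len(ground_truth)
    return {answer: count / total for answer, count in counts.items()}


def soft_cross_entropy(log_probs, ground_truth):
    """Weighted average of the negative log-probabilities of the
    unique ground-truth answers (given as indices into log_probs)."""
    weights = answer_weights(ground_truth)
    return -sum(w * log_probs[c] for c, w in weights.items())


# Toy example: 3 candidate answers, 10 annotators (8 chose answer 0, 2 chose answer 1).
log_probs = [math.log(0.5), math.log(0.25), math.log(0.25)]
ground_truth = [0] * 8 + [1] * 2
loss = soft_cross_entropy(log_probs, ground_truth)
```

    When all annotators agree on one answer, the weights collapse to a single 1.0 and the loss reduces to the usual cross entropy for that answer.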

    Accuracy Metric Used

    The 10 answers given by humans for each visual question can differ. A predicted answer is therefore scored as 100% correct if it appears at least 3 times among the annotated answers, ∼67% if it appears exactly twice and ∼33% if it appears only once. Hence, we use the following accuracy metric:

    $$\text{accuracy}(a) = \min\left(1, \frac{\text{number of humans that gave answer } a}{3}\right)$$
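
    The scoring rule described above can be written as a short function. The answer strings below are made up, and exact string matching is an assumption (any normalisation applied before matching is not shown).

```python
def vqa_accuracy(predicted, human_answers):
    """Score a prediction against the human answers for one question:
    1.0 for 3 or more matches, 2/3 for two, 1/3 for one, 0.0 for none."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(1.0, matches / 3)


# Made-up set of 10 human answers for one question.
human_answers = ["blue"] * 3 + ["navy"] * 2 + ["teal"] + ["unanswerable"] * 4
scores = {a: vqa_accuracy(a, human_answers) for a in ("blue", "navy", "teal", "red")}
```

    The overall accuracy is then the average of this score over all evaluated questions.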

    Preliminary Results for Mid-term

    The following are the preliminary results of training on randomly selected samples from the filtered VizWiz dataset.

  • Epochs: 150, Training Instances: 10000, Validation Instances: 3108, Optimiser: Adam, Learning Rate: 1e-6
  • Hidden Layers: 4, Activation: ReLU
    Figure 13: Train Loss
    Figure 14: Validation Loss

    Final Results

    BERT and BOW Models without dropout and regularization

    Model 1: Baseline accuracy for epochs = 100, 4 hidden layers, without dropout and regularization
    Figure 15: Training Dataset
    Figure 16: Validation Dataset
    Model 2: BERT accuracy for epochs = 100, 4 hidden layers, without dropout and regularization
    Figure 17: Training Dataset
    Figure 18: Validation Dataset

    BERT and BOW Models with dropout and regularization

    Model 1: 4 hidden layers with BOW, learning rate 1e-6.
    Figure 19: Validation Loss
    Model 2: 4 hidden layers with BERT embeddings, learning rate 1e-5.
    Figure 20: Validation Loss
    Figure 21: Validation Loss
    Model 3: 4 hidden layers with BERT embeddings, learning rate 1e-4.
    Figure 22: Validation Loss
    Figure 23: Validation Loss
    Model 4: 1 hidden layer with BERT embeddings, learning rate 1e-5, 5000 neurons.
    Figure 24: Validation Loss
    Figure 25: Validation Loss
    Model 5: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 3000 neurons, L2 regularization (1e-5).
    Figure 26: Validation Loss
    Figure 27: Validation Loss
    Model 6: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 1024 neurons, L2 regularization (1e-5).
    Figure 28: Validation Loss
    Model 7: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 1024 neurons, L2 regularization (1e-5) only on training loss.
    Figure 29: Validation Loss
    Model 8: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 512 neurons, L2 regularization (1e-5).
    Figure 30: Validation Loss
    Model 9: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 256 neurons, L2 regularization (1e-5).
    Figure 31: Validation Loss
    Model 10: 2 hidden layers with BERT embeddings, dropout (0.5), learning rate 1e-5, 128 neurons, L2 regularization (1e-5).
    Figure 32: Validation Loss
    Model 11: 1 hidden layer with BERT embeddings, dropout (0.5), learning rate 1e-5, 128 neurons, L2 regularization (1e-5).
    Figure 33: Validation Loss

    Accuracy Analysis:

    The baseline accuracy is around 33%, and the final accuracy of the model with BERT is 35.8%, which is only a slight increase. Because visually impaired people take the pictures and ask the questions themselves, the dataset is challenging for modern vision algorithms. It poses the following challenges for model training:

  • Images are often of poor quality due to poor focus and poor lighting, which might lead to less accurate results.
  • Questions are on average more conversational, sometimes absurd, and in some cases incomplete due to audio recording imperfections.
  • Visually impaired people cannot verify the correctness of the captured image, which may sometimes lead to a mismatch between the image and the question asked.
  • We use pre-trained CNN architectures such as Inceptionv3 and ResNet to create image features and pre-trained BERT sentence transformers to create text features. We have not fine-tuned these features on our dataset, which might lead to slightly lower accuracy than expected.
  • We attribute the poor generalisation of the model largely to its inability to predict many answers observed in the dataset, because only 824 of the top 3000 answers in VizWiz are included in the data used to train the model.

    Conclusion and Future Scope

    In summary, our project aims to develop a software framework that can detect objects in images and then answer questions based on the content of those images. From a big picture perspective, this project provides high-level contextual information about the surroundings to the visually impaired.

    Following are some ways we can work on the future aspects of the project:

  • To improve the dataset, data augmentation can be performed, for example by adding data from other datasets. We can also generate further images using translation, rotation, flipping etc., and add question and answer sets in a similar way.
  • Once accuracy is improved, our model can be used by the visually impaired for navigation and for performing daily chores. We can also request their feedback and input on developing the model.
  • Currently the dataset consists of static images. In the future, video-based analysis could make the system applicable to dynamic real-time use cases.
  • Our project currently takes input (questions) as text and delivers output (answers) as text. In the future, for the ease of visually impaired users, a speech-to-text system should convert their spoken questions to text, and a text-to-speech system should convert the answers back to speech.

    References

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision (pp. 2425-2433).
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
  • Ilievski, I., & Feng, J. (2017). A simple loss function for improving the convergence and accuracy of visual question answering models. arXiv preprint arXiv:1708.00584.
  • Dushi, D. (2019). Using Deep Learning to Answer Visual Questions from Blind People.
  • Saquib, Z., et al. (2017). BlinDar: An Invisible Eye for the Blind People. IEEE International Conference On Recent Trends In Electronics Information Communication Technology, India, May 2017.
  • ImageNet
  • Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks.