Using deep learning to identify semantically similar questions
Introduction
A common problem I’m faced with in my day-to-day life is that of knowledge retrieval. A specific question arises, and the answer resides somewhere in a knowledge repository. If this repository happens to be indexed by Google, the solution is simple: type a query into the Google search bar. However, there are many repositories that are not indexed by Google: my personal email, and various knowledge bases I use daily (e.g. Confluence, Stack Overflow). Typically, the search functionality provided by these tools leaves much to be desired.
Furthermore, websites like Stack Overflow often suffer from a large number of duplicate questions, along with a large number of duplicate answers - and often links between them. Assuming innocent motivations, it seems probable that people don’t post duplicate questions because they enjoy annoying the SO administrators, but because they simply cannot find the information they are looking for - even though it exists.
One potential solution to this problem is to provide or designate a set of canonical questions and answers, and simply map other queries onto the pre-existing set of questions and answers. Clearly there are many potential user experience issues you might encounter with such an approach. Leaving those aside, this seemed like an interesting problem to attempt to solve with Machine Learning.
Problem analysis
There are a few reasons why this is a nice problem to solve with Machine Learning:
- It’s a well researched problem, and has both classical ML solutions and Neural solutions.
- State of the art approaches yield good results.
- A variety of datasets are available, ranging from small to large.
- Simple approaches like the Manhattan LSTM are quite accessible even for an ML beginner.
- An ML-based approach can augment a search engine like ElasticSearch for even better results.
Model & Methodology
As I pointed out above, the Manhattan LSTM (MaLSTM) model is a simple way to get started. The paper describes the approach:
- The model is a “Siamese LSTM.” There are two LSTMs, termed $\mathrm{LSTM}_a$ and $\mathrm{LSTM}_b$, which share the same weights.
- The similarity between two questions is the exponential of the negative Manhattan (L1) distance between the two LSTMs’ final hidden states (see the small numeric sketch after this list): $g(h^{(a)}, h^{(b)}) = \exp(-\lVert h^{(a)} - h^{(b)} \rVert_1)$
- The authors use pre-trained word embeddings (word2vec) as input.
- The model weights are initialized to random gaussian values.
- Finally, the model is pre-trained on the SICK dataset and trained on the Quora question answering dataset.
- The authors use the AdaDelta optimizer to train the model & employ early stopping.
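To make the similarity function concrete, here is a small numeric sketch of the $\exp(-\lVert\cdot\rVert_1)$ score on two toy hidden-state vectors (the vectors are invented purely for illustration):

```python
import numpy as np

def manhattan_similarity(h_a, h_b):
    # exp(-||h_a - h_b||_1): 1.0 for identical states, approaching 0 as they diverge.
    return np.exp(-np.sum(np.abs(h_a - h_b)))

print(manhattan_similarity(np.array([0.1, 0.2]), np.array([0.1, 0.2])))   # 1.0
print(manhattan_similarity(np.array([0.1, 0.2]), np.array([2.0, -1.5])))  # ~0.027
```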
For our implementation, we start with a more basic and slightly different setup. First, we use Keras and GloVe word embeddings. GloVe is similar in spirit to word2vec and often performs comparably or better on word-similarity tasks. Keras is a high-level API built on top of TensorFlow that allows rapid development. As shown below, creating an LSTM model that runs on the GPU is straightforward:
Setting up the model:
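What follows is a minimal sketch of how the Siamese model can be wired up in Keras, not the exact notebook code; the sizes (MAX_LEN, EMBED_DIM, HIDDEN) are placeholder assumptions, and embedding_matrix is assumed to be the (vocab_size, EMBED_DIM) array built from the GloVe file in a preprocessing step not shown here:

```python
from keras import backend as K
from keras.layers import Input, Embedding, LSTM, Lambda
from keras.models import Model

MAX_LEN = 30     # assumed maximum question length, in tokens
EMBED_DIM = 300  # GloVe vector dimensionality
HIDDEN = 50      # assumed number of LSTM hidden units

def build_malstm(embedding_matrix):
    left_input = Input(shape=(MAX_LEN,), dtype='int32')
    right_input = Input(shape=(MAX_LEN,), dtype='int32')

    # A single embedding layer and a single LSTM are shared by both branches,
    # so the two "towers" of the Siamese network have tied weights.
    # Gradients for the embedding layer are sparse (see the optimizer note below).
    embed = Embedding(embedding_matrix.shape[0], EMBED_DIM,
                      weights=[embedding_matrix], input_length=MAX_LEN)
    shared_lstm = LSTM(HIDDEN)

    left_hidden = shared_lstm(embed(left_input))
    right_hidden = shared_lstm(embed(right_input))

    # exp(-||h_left - h_right||_1) squashes the Manhattan distance into (0, 1].
    similarity = Lambda(
        lambda h: K.exp(-K.sum(K.abs(h[0] - h[1]), axis=1, keepdims=True))
    )([left_hidden, right_hidden])

    return Model(inputs=[left_input, right_input], outputs=similarity)

model = build_malstm(embedding_matrix)  # embedding_matrix comes from the GloVe preprocessing
```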
One thing to note is that the Keras optimizers are not capable of sparse gradient updates, so we fall back to a TensorFlow optimizer. This matters because computing dense gradient updates for the large embedding matrix incurs a significant performance penalty. We also choose to use Stochastic Gradient Descent (for now):
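A sketch of the compile step under those choices; a native tf.train optimizer can be passed directly to Keras when running on the TensorFlow backend, and the mean-squared-error loss and learning rate here are assumptions rather than the notebook’s exact settings:

```python
import tensorflow as tf

model.compile(
    # Regress the exp(-L1) similarity score against the 0/1 duplicate label (assumed loss).
    loss='mean_squared_error',
    # Plain SGD, taken from TensorFlow rather than Keras for sparse gradient support.
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.05),
    metrics=['accuracy'],
)
```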
Training
The full notebook can be found on GitHub. The notebook, along with the pre-trained word vectors, was copied to an AWS EC2 p2.xlarge instance and run in the tensorflow_p36 virtualenv.
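Training itself is a single Keras fit() call. Here is a minimal sketch, assuming left_train/right_train (and their validation counterparts) are the padded word-index arrays and y_train/y_val the 0/1 duplicate labels prepared earlier; the array names and batch size are illustrative assumptions:

```python
# Continues the model sketch above.
model.fit(
    [left_train, right_train], y_train,
    validation_data=([left_val, right_val], y_val),
    batch_size=64,
    epochs=10,
)
```

Training for 10 epochs produced the following progress: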
Train on 364290 samples, validate on 40000 samples
Epoch 1/10
364290/364290 [==============================] - 773s 2ms/step - loss: 0.0902 - acc: 0.8906 - val_loss: 0.1231 - val_acc: 0.8329
Epoch 2/10
364290/364290 [==============================] - 775s 2ms/step - loss: 0.0878 - acc: 0.8950 - val_loss: 0.1262 - val_acc: 0.8283
Epoch 3/10
364290/364290 [==============================] - 775s 2ms/step - loss: 0.0860 - acc: 0.8981 - val_loss: 0.1228 - val_acc: 0.8339
Epoch 4/10
364290/364290 [==============================] - 774s 2ms/step - loss: 0.0845 - acc: 0.9012 - val_loss: 0.1221 - val_acc: 0.8348
Epoch 5/10
364290/364290 [==============================] - 773s 2ms/step - loss: 0.0832 - acc: 0.9036 - val_loss: 0.1226 - val_acc: 0.8341
Epoch 6/10
364290/364290 [==============================] - 773s 2ms/step - loss: 0.0827 - acc: 0.9045 - val_loss: 0.1248 - val_acc: 0.8306
Epoch 7/10
364290/364290 [==============================] - 774s 2ms/step - loss: 0.0889 - acc: 0.8959 - val_loss: 0.1282 - val_acc: 0.8244
Epoch 8/10
364290/364290 [==============================] - 773s 2ms/step - loss: 0.0883 - acc: 0.8968 - val_loss: 0.1246 - val_acc: 0.8316
Epoch 9/10
364290/364290 [==============================] - 777s 2ms/step - loss: 0.0857 - acc: 0.9012 - val_loss: 0.1258 - val_acc: 0.8287
Epoch 10/10
364290/364290 [==============================] - 776s 2ms/step - loss: 0.0845 - acc: 0.9028 - val_loss: 0.1255 - val_acc: 0.8296
Training finished.
10 epochs in 2:09:03.967721
The main thing to note about the above results is the gap in accuracy between the training set and the validation set. This is typically an indication of overfitting (high variance). Hence, we should expect questions from (or very similar to) the questions in the dataset to yield good predictions, but the model won’t generalize well beyond them. We ran a simple qualitative experiment:
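The experiment simply scores a handful of hand-written question pairs with model.predict(). Below is a sketch of the shape of that check: the six pairs are invented stand-ins, the tokenizer is assumed to be the Keras Tokenizer fitted on the training questions (not shown), and the scores that follow are the notebook’s output for its own pairs, which are not reproduced here:

```python
from keras.preprocessing.sequence import pad_sequences

def encode(questions):
    # tokenizer: the Keras Tokenizer fitted on the training questions (assumed defined earlier).
    return pad_sequences(tokenizer.texts_to_sequences(questions), maxlen=MAX_LEN)

left = encode([
    "How do I learn machine learning?",
    "What is the best laptop for programming?",
    "How can I lose weight quickly?",
    "Why is the sky blue?",
    "How do I cook rice?",
    "What is the capital of France?",
])
right = encode([
    "What is a good way to get started with ML?",
    "How do I cook rice?",
    "What should I do to lose weight fast?",
    "What makes the sky appear blue?",
    "How do I cook rice?",            # identical pair as a sanity check
    "Which city is the capital of Germany?",
])

model.predict([left, right])
```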
with the output:
array([[0.14649092],
[0.01732459],
[0.07154535],
[0.7385108 ],
[1. ],
[0.22897404]], dtype=float32)
This simple experiment confirms our suspicions: the model does not generalize well. In the next post, we will look at how we can improve these results.