Text Matching API


Analytics Zoo provides a pre-defined KNRM model that can be used for text matching (e.g. question answering). More text matching models will be supported in the future.

Highlights

  1. Easy-to-use Keras-style model that provides compile and fit methods for training. Alternatively, it can be fed into NNFrames or a BigDL Optimizer.
  2. The model can be used for both ranking and classification tasks.

Build a KNRM Model

Kernel-pooling Neural Ranking Model with RBF kernel. See here for more details.

You can call the following API in Scala and Python respectively to create a KNRM with pre-trained GloVe word embeddings.

Scala

val knrm = KNRM(text1Length, text2Length, embeddingFile, wordIndex = null, trainEmbed = true, kernelNum = 21, sigma = 0.1, exactSigma = 0.001, targetMode = "ranking")

Python

knrm = KNRM(text1_length, text2_length, embedding_file, word_index=None, train_embed=True, kernel_num=21, sigma=0.1, exact_sigma=0.001, target_mode="ranking")
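
As a hedged illustration of the arguments, a Python call might look like the following. The sequence lengths and the GloVe path below are placeholder assumptions, not values prescribed by this documentation:

from zoo.models.textmatching import KNRM

text1_length = 10        # illustrative: word indices kept for each text1 (e.g. a question)
text2_length = 40        # illustrative: word indices kept for each text2 (e.g. an answer)
embedding_file = "/path/to/glove.840B.300d.txt"  # illustrative path to pre-trained GloVe embeddings

# word_index is the word-to-index map typically built while preprocessing the corpus
# (see the pairwise training section below); None is simply the default here.
knrm = KNRM(text1_length, text2_length, embedding_file, word_index=None,
            train_embed=True, kernel_num=21, sigma=0.1, exact_sigma=0.001,
            target_mode="ranking")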

Pairwise training

For ranking, the model can be trained in a pairwise manner with the following steps:

  1. Read train relations. See here for more details.
  2. Read text1 and text2 corpus as TextSet. See here for more details.
  3. Preprocess text1 and text2 corpus. See here for more details.
  4. Generate all relation pairs from the train relations. Each pair is made up of a positive relation and a negative one with the same id1. During training, the margin loss within each pair is optimized. We provide the following API to generate a TextSet for pairwise training:

Scala

val trainSet = TextSet.fromRelationPairs(relations, corpus1, corpus2)

Python

train_set = TextSet.from_relation_pairs(relations, corpus1, corpus2)
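
Putting steps 1-4 together, a minimal Python sketch could look like the following. The file paths and the min_freq value are illustrative assumptions, and text1_length / text2_length are the lengths used when building the KNRM model above:

from zoo.common.nncontext import init_nncontext
from zoo.feature.common import Relations
from zoo.feature.text import TextSet

sc = init_nncontext()

# Step 1: read the train relations (id1, id2, label).
train_relations = Relations.read("/path/to/relation_train.csv", sc)

# Steps 2 and 3: read text1 and text2 corpus as TextSet and preprocess them.
corpus1 = TextSet.read_csv("/path/to/text1_corpus.csv", sc).tokenize().normalize() \
    .word2idx(min_freq=2).shape_sequence(text1_length)
corpus2 = TextSet.read_csv("/path/to/text2_corpus.csv", sc).tokenize().normalize() \
    .word2idx(min_freq=2, existing_map=corpus1.get_word_index()) \
    .shape_sequence(text2_length)

# Step 4: generate the relation pairs as a TextSet for pairwise training.
train_set = TextSet.from_relation_pairs(train_relations, corpus1, corpus2)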

Wrap the model with TimeDistributed so that the two records of each pair (positive and negative) are scored by the same KNRM, then call compile and fit to train the model:

Scala

val model = Sequential().add(TimeDistributed(knrm, inputShape = Shape(2, text1Length + text2Length)))
model.compile(optimizer = new SGD(learningRate), loss = RankHinge())
model.fit(trainSet, batchSize, nbEpoch)

Python

model = Sequential().add(TimeDistributed(knrm, input_shape=(2, text1_length + text2_length)))
model.compile(optimizer=SGD(learning_rate), loss='rank_hinge')
model.fit(train_set, batch_size, nb_epoch)
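
Note that TimeDistributed applies the same knrm instance to both records of a pair, so the updated weights live in knrm itself; it is knrm that you later evaluate or save. A minimal sketch, assuming the Zoo model save_model API and an illustrative path:

# Persist the trained KNRM for later evaluation or inference (path is illustrative).
knrm.save_model("/tmp/knrm.model")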

Listwise evaluation

Given text1 and a list of text2 candidates, we provide the metrics NDCG and MAP to evaluate a ranking model in a listwise manner with the following steps:

  1. Read validation relations. See here for more details.
  2. Read text1 and text2 corpus as TextSet. See here for more details.
  3. Preprocess the text1 and text2 corpus in the same way as in the training phase. See here for more details.
  4. Generate all relation lists from the validation relations. Each list is made up of one id1 and all the id2 entries related to it. We provide the following API to generate a TextSet for listwise evaluation:

Scala

val validateSet = TextSet.fromRelationLists(relations, corpus1, corpus2)

Python

validate_set = TextSet.from_relation_lists(relations, corpus1, corpus2)
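
Continuing the hedged Python sketch from the pairwise training section (the relation path is illustrative), and reusing corpus1 and corpus2, which are already preprocessed in the same way as during training:

# Step 1: read the validation relations.
validate_relations = Relations.read("/path/to/relation_validate.csv", sc)

# Steps 2-4: reuse the preprocessed corpus1 and corpus2 and group the relations
# into lists, one list per id1, for listwise evaluation.
validate_set = TextSet.from_relation_lists(validate_relations, corpus1, corpus2)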

Call evaluateNDCG or evaluateMAP to evaluate the model:

Scala

knrm.evaluateNDCG(validateSet, k, threshold = 0.0)
knrm.evaluateMAP(validateSet, threshold = 0.0)

Python

knrm.evaluate_ndcg(validate_set, k, threshold=0.0)
knrm.evaluate_map(validate_set, threshold=0.0)
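
For example, continuing the sketch above, NDCG@3, NDCG@5 and MAP on the validation set could be computed as follows (the default threshold of 0.0 is kept):

ndcg3 = knrm.evaluate_ndcg(validate_set, 3)
ndcg5 = knrm.evaluate_ndcg(validate_set, 5)
mean_ap = knrm.evaluate_map(validate_set)
print("NDCG@3: %s, NDCG@5: %s, MAP: %s" % (ndcg3, ndcg5, mean_ap))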

Examples

We provide an example that trains and evaluates a KNRM model on the WikiQA dataset for ranking.

See here for the Scala example.

See here for the Python example.