QARAC: Models and Corpora
I’ve made some early progress on developing QARAC, and I’m not far from being able to make a first attempt at training it. I’ve chosen base models, coded the model heads and the training model, and found appropriate datasets to train on.
Models
Base models
I was initially interested in using Hyena models as my base models and training them with the British National Corpus. However, I found it harder to implement Hyena models in Keras than I anticipated, and didn’t want this to be a roadblock. I’ve therefore decided to start by using RoBERTa, although I may need to consider another model for the decoder.
Model Heads
For the encoder models, the head used is a Global Attention Pooling Head. If attention in a transformer model is the relevance of each word in a document to the meaning of each other word, global attention may be defined as the relevance of each word to the overall meaning of the document. This is calculated as follows.
Given the contextual word vectors $\vec{v_{i}}$ produced by the base encoder model, and two trainable matrices $\mathbf{L}$ and $\mathbf{G}$, define the local projection \(\vec{l_{i}} = \vec{v_{i}} \cdot \mathbf{L}\) and the global projection \(\vec{g} = \left( \sum_{i} \vec{v_{i}} \right) \cdot \mathbf{G}\). The attention is then calculated as the cosine similarity of the two projections \(a_{i} = \hat{l_{i}} \cdot \hat{g}\). Finally, the encoded vector is calculated as the sum of the word vectors weighted by the attention \(\vec{E} = \sum_{i} a_{i} \vec{v_{i}}\).
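The calculation above can be sketched in a few lines. This is a minimal, dependency-free illustration in plain Python rather than the Keras layer itself; the function names (`project`, `unit`, `global_attention_pool`) and the toy inputs are assumptions made for the example.

```python
import math

def project(vector, matrix):
    """Row vector times matrix, i.e. vector . matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def unit(vector):
    """Scale a vector to unit length (the 'hat' in the formulas).
    Assumes the vector is non-zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def global_attention_pool(word_vectors, L, G):
    """Return (encoded vector E, attention weights a_i)."""
    # Global projection: g = (sum_i v_i) . G
    total = [sum(column) for column in zip(*word_vectors)]
    g_hat = unit(project(total, G))
    # Attention: cosine similarity of each local projection l_i with g
    attention = [dot(unit(project(v, L)), g_hat) for v in word_vectors]
    # Encoded vector: E = sum_i a_i v_i
    width = len(word_vectors[0])
    encoded = [sum(a * v[j] for a, v in zip(attention, word_vectors))
               for j in range(width)]
    return encoded, attention

# Toy example: two 2-dimensional "word vectors" and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoded, attention = global_attention_pool(
    [[1.0, 0.0], [0.0, 1.0]], identity, identity)
```

Because the attention is a cosine similarity rather than a softmax, the weights lie between -1 and 1 and need not sum to 1.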
For the decoder models, the head used is a QaracDecoderHead. This prepends a vector representing an encoded document to the vectors generated by the base model, passes the result through a TFRobertaLayer, removes the first vector from the output of that layer, then feeds the remainder through another TFRobertaLayer and finally a TFRobertaLMHead, returning the output of that layer.
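The data flow through the decoder head can be sketched as follows. This shows only the order of operations on a list of vectors: the three layer arguments are stand-in callables, not the real TFRobertaLayer or TFRobertaLMHead, and the function name `decoder_head` is hypothetical.

```python
def decoder_head(encoded_document, hidden_states,
                 first_layer, second_layer, lm_head):
    """Order-of-operations sketch; the layers are arbitrary callables."""
    # Prepend the encoded document vector to the base model's vectors.
    with_prefix = [encoded_document] + list(hidden_states)
    # The first layer sees the document vector alongside the sequence.
    attended = first_layer(with_prefix)
    # Remove the position that held the document vector.
    trimmed = attended[1:]
    # Second layer, then the language-model head.
    return lm_head(second_layer(trimmed))

# Identity functions stand in for the layers, to check the plumbing only.
identity = lambda sequence: sequence
output = decoder_head([9.0, 9.0], [[1.0, 2.0], [3.0, 4.0]],
                      identity, identity, identity)
```

With identity layers the output has the same length as the base model's sequence, since the prepended position is dropped before the second layer.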
The Training Model
To prevent catastrophic forgetting, the question encoder $\mathcal{QE}$, answer encoder $\mathcal{AE}$ and decoder $\mathcal{D}$ must all be trained together, targeting all training objectives simultaneously. To do this, they are combined into a Trainer Model. Given a sentence $\mathbf{S}$, a question $\mathbf{Q}$, an answer $\mathbf{A}$, two propositions $\mathbf{P_{0}}$ and $\mathbf{P_{1}}$, and two statements $\mathbf{s_{0}}$ and $\mathbf{s_{1}}$, the following outputs are calculated.
\[\texttt{encode_decode} = \mathcal{D}(\mathcal{AE}(\mathbf{S}))\] \[\texttt{question_answering} = \mathcal{QE}(\mathbf{Q}) - \mathcal{AE}(\mathbf{A})\] \[\texttt{reasoning} = \mathcal{D}(\mathcal{AE}(\mathbf{P_{0}}) + \mathcal{AE}(\mathbf{P_{1}}))\] \[\texttt{consistency} = \mathit{cossim}(\mathcal{AE}(\mathbf{s_{0}}),\mathcal{AE}(\mathbf{s_{1}}))\]
For the decoding and reasoning objectives, the loss to be minimised is the sparse categorical crossentropy of the generated text against the target text in the training set. For question answering, it is the squared Euclidean length of the vector produced, and for consistency it is the mean squared error from the desired label (1 for consistent statements, -1 for contradictory statements, 0 for unrelated statements).
The output for question answering and its associated loss are chosen to reflect the intended use of the question encoder, to generate a query vector for a vector database.
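The two vector-based losses are simple enough to write out. The sketch below uses plain Python on single examples; the function names are illustrative, and in training the per-example values would be averaged over a batch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def question_answering_loss(question_vector, answer_encoding):
    """Squared Euclidean length of the difference vector.
    It is zero exactly when the two encodings coincide."""
    return sum((q - a) ** 2
               for q, a in zip(question_vector, answer_encoding))

def consistency_loss(vector_0, vector_1, label):
    """Squared error of the cosine similarity from the label
    (1 consistent, -1 contradictory, 0 unrelated)."""
    return (cosine_similarity(vector_0, vector_1) - label) ** 2

# Identical vectors: no question answering loss, no loss for label 1,
# maximal loss for label -1.
same = question_answering_loss([1.0, 2.0], [1.0, 2.0])
agree = consistency_loss([1.0, 0.0], [1.0, 0.0], 1)
clash = consistency_loss([1.0, 0.0], [1.0, 0.0], -1)
```

Minimising the squared length pulls the question vector towards the encoding of its answer, which is what a nearest-neighbour lookup in a vector database relies on.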
Training Corpora
Question Answering
For Question Answering, the most suitable corpus I have found is the WikiQA dataset. This contains a sample of questions obtained from Bing queries, along with the first paragraph of a Wikipedia article relevant to each question. The paragraph is split into sentences, one per line, and the sentences are labelled 1 if they are considered a valid answer to the question, and 0 otherwise. The rows labelled 1 will be used to train the question answering objective.
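Selecting the positive rows might look like the sketch below. It assumes a tab-separated file with `Question`, `Sentence` and `Label` columns; those column names, the function name `positive_pairs` and the sample rows are assumptions for illustration, not data taken from WikiQA.

```python
import csv
import io

# Invented rows in the assumed layout, standing in for the real file.
SAMPLE = (
    "QuestionID\tQuestion\tSentence\tLabel\n"
    "Q1\twhat is tea ?\tTea is a drink .\t1\n"
    "Q1\twhat is tea ?\tIt is popular worldwide .\t0\n"
    "Q2\twhere is paris ?\tParis has many museums .\t0\n"
)

def positive_pairs(tsv_file):
    """Return (question, sentence) pairs for rows labelled 1."""
    # QUOTE_NONE stops quotation marks in sentences being treated
    # as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [(row["Question"], row["Sentence"])
            for row in reader if row["Label"] == "1"]

pairs = positive_pairs(io.StringIO(SAMPLE))
```

Questions with no sentence labelled 1, such as the second one in the sample, contribute nothing to this objective.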
It has been necessary to perform coreference resolution on this dataset, for which AllenNLP was used. Since it was necessary to combine all the sentences for a given question into a single document to perform coreference resolution and then separate them afterwards, some rather nasty edge cases had to be dealt with.
Reasoning
For Reasoning, the Avicenna: Syllogistic Commonsense Reasoning dataset will be used. This contains pairs of sentences, a label “yes” if they can be used to form a valid syllogism and “no” if not, and a conclusion to the syllogism if it exists. Only the examples where a valid syllogism exists will be used to train the reasoning objective.
Consistency
For Consistency, the Stanford Natural Language Inference Corpus will be used. This contains pairs of sentences, labelled as “entailment”, “contradiction” or “neutral”. These values will be mapped to +1, -1 and 0 respectively.
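The label mapping can be sketched as a small lookup. The record layout (a tuple of two sentences and a label) and the function name are assumptions for the example. Skipping rows whose label is not one of the three is also an assumption; SNLI marks pairs without annotator consensus with "-".

```python
LABEL_TO_TARGET = {
    "entailment": 1.0,
    "contradiction": -1.0,
    "neutral": 0.0,
}

def consistency_examples(records):
    """Map (sentence_0, sentence_1, label) records to numeric targets,
    dropping any record whose label is not in LABEL_TO_TARGET."""
    return [(s0, s1, LABEL_TO_TARGET[label])
            for s0, s1, label in records
            if label in LABEL_TO_TARGET]

# Invented records, for illustration only.
examples = consistency_examples([
    ("A man sleeps.", "A person is asleep.", "entailment"),
    ("A man sleeps.", "A man is running.", "contradiction"),
    ("A man sleeps.", "The man is tired.", "neutral"),
    ("A man sleeps.", "Something happens.", "-"),
])
```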
Encode/Decode
To train the decoding of encoded sentences, a combined dataset will be used, consisting of
- all the answer sentences from the WikiQA dataset, whether they are labelled as correct or not
- all the propositions from the Avicenna dataset, whether there is a valid conclusion or not
- the conclusions from the Avicenna dataset, where these are available
- the sentences from the SNLI corpus
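Assembling the combined dataset could look like the following sketch. The dictionary keys (`sentence`, `premise_1`, `premise_2`, `conclusion`, `sentence_1`, `sentence_2`) and the function name are hypothetical; the real files use their own field names, and no deduplication is attempted here.

```python
def encode_decode_sentences(wikiqa_rows, avicenna_rows, snli_rows):
    """Collect every sentence to be used for the encode/decode objective."""
    sentences = []
    # WikiQA: every candidate answer sentence, whatever its label.
    sentences.extend(row["sentence"] for row in wikiqa_rows)
    for row in avicenna_rows:
        # Avicenna: both propositions, valid syllogism or not.
        sentences.extend([row["premise_1"], row["premise_2"]])
        # The conclusion only where one is available.
        if row["conclusion"] is not None:
            sentences.append(row["conclusion"])
    for row in snli_rows:
        # SNLI: both sentences of each pair.
        sentences.extend([row["sentence_1"], row["sentence_2"]])
    return sentences

# Invented toy rows, for illustration only.
combined = encode_decode_sentences(
    [{"sentence": "Tea is a drink ."}],
    [{"premise_1": "All cats are mammals.",
      "premise_2": "Tom is a cat.",
      "conclusion": "Tom is a mammal."},
     {"premise_1": "It is raining.",
      "premise_2": "Fish swim.",
      "conclusion": None}],
    [{"sentence_1": "A man sleeps.", "sentence_2": "A person is asleep."}],
)
```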