The Future of Natural Language Processing
ChatGPT and similar generative language models have been attracting a lot of attention recently. The trouble is that while they’re good at producing fluent text, they don’t necessarily produce accurate or useful text. Because ChatGPT admits that it doesn’t know the answer some of the time, it creates a false expectation that it knows what it’s talking about the rest of the time; but ask it questions about a subject you know well and you’ll find it makes mistakes ranging from the subtle to the absurd. CNET found out the hard way that generative models are not a reliable source of content. The reason is that the text they generate is based on statistical patterns inferred from their training data. At no stage does the model actually understand either the text it has been trained on or what it is being asked to do. It has been surmised that in a sufficiently complex model such understanding may arise as an emergent property of the network, but even if it does, large language models are generally trained on text harvested from the internet, which leads to a garbage-in, garbage-out problem.
This means that the most likely near-term use of generative language models is as an efficient source of clickbait and fake news. That makes detectors such as GPTZero and OpenAI’s own AI text classifier important. Search engines will need to incorporate tools like these to ensure that results are more likely to come from reliable sources.
However, it clearly isn’t enough to trust the neural network. Future generations of NLP models will need to incorporate knowledge and a concept of logical consistency, so that they can discriminate truth from falsehood. My own work with True 212 used WikiData as a knowledge base for Named Entity Recognition to good effect, so I know how powerful the incorporation of a good knowledge base can be. However, if we want a system that can learn and grow its own knowledge base, it needs to understand whether or not data is logically consistent. We can envisage a model that vectorizes statements in such a way that for two statements that are logically consistent the cosine similarity of their vectors is close to 1, for two statements that are inconsistent it is close to -1, and for two statements that are unrelated it is close to 0. The Stanford Natural Language Inference Corpus, available from Kaggle, would be a suitable dataset to train this on. Once we can predict logical consistency in this way, we should be able to bootstrap a knowledge base from a corpus of trusted facts by adding only statements that are consistent with what is already known.
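To make the bootstrapping idea concrete, here is a minimal sketch in Python. It assumes a hypothetical encode() that maps a statement to a unit vector with the consistency property described above; an off-the-shelf sentence-transformers model stands in for that encoder, and the threshold is purely illustrative, since ordinary sentence embeddings measure topical similarity rather than logical consistency.

```python
# Sketch of knowledge-base bootstrapping using a consistency-vector model.
# encode() is a stand-in: a real system would be trained on SNLI so that
# cosine similarity is ~+1 for consistent statements, ~-1 for contradictions
# and ~0 for unrelated statements.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

def encode(statement: str) -> np.ndarray:
    """Map a statement to a unit vector in the (hypothetical) consistency space."""
    v = model.encode(statement)
    return v / np.linalg.norm(v)

def bootstrap(trusted_facts, candidates, contradiction_threshold=-0.5):
    """Grow a knowledge base, accepting only candidates that do not
    contradict anything already known."""
    kb = [(fact, encode(fact)) for fact in trusted_facts]
    for statement in candidates:
        v = encode(statement)
        similarities = [float(v @ kv) for _, kv in kb]
        if all(s > contradiction_threshold for s in similarities):
            kb.append((statement, v))  # consistent with everything known: accept
        # otherwise the statement contradicts a trusted fact: reject
    return [fact for fact, _ in kb]
```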
These vectors would also have the property that arithmetical negation corresponds to logical negation. It’s possible, therefore, that we could perform logical inference by means of arithmetical operations on the vectors. The sum of two vectors might correspond to a logical syllogism, allowing the system to deduce new facts from its knowledge base.
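If that conjecture held - and it is a conjecture, not an established property of any current model - inference would reduce to arithmetic. The following sketch reuses the hypothetical encode() from above; the example statements and the 0.9 threshold are illustrative assumptions only.

```python
# Hypothesised vector arithmetic for logical inference.
import numpy as np

def negate(v: np.ndarray) -> np.ndarray:
    """Conjecture: the vector of "not P" is the arithmetical negation of P's vector."""
    return -v

def infer(premise_a: str, premise_b: str, conclusion: str, encode) -> bool:
    """Conjectured syllogism test: do the two premises sum (approximately)
    to the conclusion's vector?"""
    combined = encode(premise_a) + encode(premise_b)
    combined = combined / np.linalg.norm(combined)
    return float(combined @ encode(conclusion)) > 0.9

# e.g. infer("All men are mortal", "Socrates is a man", "Socrates is mortal", encode)
```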
A system that could model consistency would have many powerful applications. Fake news detection is one possibility - if a document repeatedly contradicted trusted sources, it could be classified as unreliable. Conversely, a document would also be suspicious if it made similar claims to sources known to be unreliable - the QAnon conspiracy theory made similar claims to The Secret History, and smear campaigns and scare stories haven’t changed much since Roman times. Used alongside anomaly detection, it could also detect when an author had concealed dubious claims in an otherwise factual document. It could equally serve as a proof-reading tool, allowing authors and editors to check their work for errors more efficiently.
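A rough sketch of that fake-news heuristic, again assuming the hypothetical encode() from the first example; the thresholds and the simple scoring rule are assumptions for illustration, not a tested method.

```python
# Score a document by how often its claims contradict trusted sources
# or echo known-unreliable ones.
import numpy as np

def reliability_score(claims, trusted_vecs, unreliable_vecs, encode,
                      contradiction=-0.5, agreement=0.5):
    """Return the fraction of a document's claims that are suspicious."""
    suspicious = 0
    for claim in claims:
        v = encode(claim)
        if any(float(v @ t) < contradiction for t in trusted_vecs):
            suspicious += 1      # contradicts something we trust
        elif any(float(v @ u) > agreement for u in unreliable_vecs):
            suspicious += 1      # echoes a source we distrust
    return suspicious / max(len(claims), 1)
```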
It would also be able to detect opinion and partisanship. Suppose two sources both make claims A and B, but one source also makes claim C and the other makes claim D. While neither C nor D is inconsistent with A or B, they are inconsistent with each other. We can therefore deduce that A and B are more likely to be accepted by consensus as fact, whereas C and D are opinions. Clustering sources by which opinions they are likely to share would then identify partisan groups of sources.
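One way this clustering might look in practice, as a sketch: represent each source by its stance on a set of candidate opinions (the Cs and Ds), then group sources with similar stances. The use of scikit-learn’s AgglomerativeClustering and the stance construction here are my own illustrative assumptions.

```python
# Cluster sources by the opinions they appear to share.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_sources(source_claim_vecs, opinion_vecs, n_groups=2):
    """source_claim_vecs: {source: [claim vectors]};
    opinion_vecs: vectors for candidate opinion claims (C, D, ...)."""
    sources = list(source_claim_vecs)
    # Stance matrix: each source's strongest agreement with each opinion.
    stance = np.array([[max(float(v @ o) for v in source_claim_vecs[s])
                        for o in opinion_vecs] for s in sources])
    labels = AgglomerativeClustering(n_clusters=n_groups).fit_predict(stance)
    return dict(zip(sources, labels))
```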
These are just a few possible applications - the ones that occur to me off the top of my head - but they clearly show that knowledge, consistency and reasoning are the missing ingredients needed to make NLP technology truly useful.