True 212

The Client

The Problem

True 212 wanted to identify relevant content to link to from their news and culture blogs. They believed that a simple bag-of-words approach would lead to naive matches, and wished to extract semantics from the documents to enable richer matches.

The Approach

An NLP pipeline was created with the following stages.

A Named Entity Recognition system that identified candidate named entities in a document and found corresponding WikiData entities. Known relationships between WikiData entities were used to disambiguate candidate matches.
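The disambiguation idea can be illustrated with a minimal sketch. It is not the project's implementation: the candidate lists, the entity labels (which are made up and are not real WikiData IDs) and the relationship table are all assumptions, and the exhaustive search over combinations is only practical for a handful of mentions.

```python
from itertools import product

# Hypothetical candidate entities per mention (labels are invented, not WikiData IDs).
CANDIDATES = {
    "Paris": ["paris_city", "paris_hilton"],
    "France": ["france_country"],
    "Seine": ["seine_river"],
}

# Hypothetical known relationships between entities, stored as unordered pairs.
RELATED = {
    frozenset(("paris_city", "france_country")),
    frozenset(("paris_city", "seine_river")),
    frozenset(("seine_river", "france_country")),
}


def coherence(assignment):
    """Count the known relationships among the chosen entities."""
    entities = list(assignment)
    return sum(
        frozenset((a, b)) in RELATED
        for i, a in enumerate(entities)
        for b in entities[i + 1:]
    )


def disambiguate(candidates):
    """Pick one entity per mention so that the chosen set is maximally related."""
    mentions = list(candidates)
    best = max(product(*(candidates[m] for m in mentions)), key=coherence)
    return dict(zip(mentions, best))


result = disambiguate(CANDIDATES)
```

Here the mention "Paris" resolves to the city because that candidate is related to the other entities in the document, while the alternative is related to none of them.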

A Part of Speech Tagger that used Hidden Markov Models to return the probability distribution over the part of speech categories used in WordNet for each word in a sentence.
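One standard way to obtain a per-word probability distribution from a Hidden Markov Model is the forward-backward algorithm. The sketch below assumes that approach; the case study does not say which inference method was used. The two-category tag set and all probabilities are invented for illustration, and unknown words are not handled.

```python
# Hypothetical toy model: two of WordNet's part of speech categories.
STATES = ["noun", "verb"]
START = {"noun": 0.6, "verb": 0.4}
TRANS = {
    "noun": {"noun": 0.3, "verb": 0.7},
    "verb": {"noun": 0.8, "verb": 0.2},
}
EMIT = {
    "noun": {"dogs": 0.5, "bark": 0.2, "run": 0.3},
    "verb": {"dogs": 0.1, "bark": 0.5, "run": 0.4},
}


def posteriors(words):
    """Return P(tag at position t | whole sentence) for every position t."""
    n = len(words)
    # Forward pass: alpha[t][s] = P(words[0..t], tag_t = s)
    alpha = [{s: START[s] * EMIT[s][words[0]] for s in STATES}]
    for t in range(1, n):
        alpha.append({
            s: EMIT[s][words[t]]
            * sum(alpha[t - 1][r] * TRANS[r][s] for r in STATES)
            for s in STATES
        })
    # Backward pass: beta[t][s] = P(words[t+1..] | tag_t = s)
    beta = [dict.fromkeys(STATES, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = {
            s: sum(
                TRANS[s][r] * EMIT[r][words[t + 1]] * beta[t + 1][r]
                for r in STATES
            )
            for s in STATES
        }
    total = sum(alpha[-1].values())  # P(sentence)
    return [
        {s: alpha[t][s] * beta[t][s] / total for s in STATES}
        for t in range(n)
    ]


dist = posteriors(["dogs", "bark"])
```

Each element of `dist` is a distribution over tags for one word, so a later stage can weigh several readings of a word rather than commit to a single tag.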

A Word Sense Disambiguation component that used the Viterbi algorithm to find the maximum likelihood sequence of WordNet IDs corresponding to the words in a given sentence, allowing for stopwords, multiword expressions, named entities and out-of-vocabulary words. This achieved state-of-the-art accuracy (70%).
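A minimal sketch of Viterbi decoding over word senses follows. The sense labels, lexical scores and transition probabilities are hypothetical (the labels are not real WordNet IDs), and the special handling of stopwords, multiword expressions, named entities and out-of-vocabulary words described above is left out.

```python
import math

# Hypothetical sense inventory; labels are illustrative, not WordNet IDs.
SENSES = {
    "bank": ["bank.finance", "bank.river"],
    "deposit": ["deposit.money", "deposit.sediment"],
}
# Hypothetical lexical score for each sense of its word.
LEX = {
    "bank.finance": 0.7,
    "bank.river": 0.3,
    "deposit.money": 0.6,
    "deposit.sediment": 0.4,
}
# Hypothetical P(next sense | previous sense).
TRANS = {
    ("bank.finance", "deposit.money"): 0.9,
    ("bank.finance", "deposit.sediment"): 0.1,
    ("bank.river", "deposit.money"): 0.2,
    ("bank.river", "deposit.sediment"): 0.8,
}


def viterbi(words):
    """Return the highest-scoring sense sequence for the given words."""
    # best[s] = (log-score of the best path ending in sense s, that path)
    best = {s: (math.log(LEX[s]), [s]) for s in SENSES[words[0]]}
    for word in words[1:]:
        step = {}
        for s in SENSES[word]:
            # Choose the predecessor that maximises the path score into s.
            prev = max(
                best, key=lambda r: best[r][0] + math.log(TRANS[(r, s)])
            )
            score = (
                best[prev][0] + math.log(TRANS[(prev, s)]) + math.log(LEX[s])
            )
            step[s] = (score, best[prev][1] + [s])
        best = step
    return max(best.values(), key=lambda v: v[0])[1]


path = viterbi(["bank", "deposit"])
```

Working in log space avoids numerical underflow on long sentences, and keeping only the best path into each sense makes the cost linear in sentence length.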

A Latent Semantic Indexing model that was trained on the semantically enhanced documents to perform rich matching.
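Latent Semantic Indexing can be sketched as a truncated singular value decomposition of a term-document matrix. The example below uses NumPy and assumes that each document has already been reduced to semantic tokens; the documents and token labels are hypothetical, and raw counts are used where a real system would probably apply a weighting such as TF-IDF.

```python
import numpy as np

# Hypothetical documents, already converted to sense and entity tokens.
DOCS = [
    ["bank.finance", "deposit.money", "interest.money"],
    ["bank.finance", "loan.money", "interest.money"],
    ["bank.river", "deposit.sediment", "flood.water"],
]

vocab = sorted({tok for doc in DOCS for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}

# Term-document count matrix: rows are tokens, columns are documents.
X = np.zeros((len(vocab), len(DOCS)))
for j, doc in enumerate(DOCS):
    for tok in doc:
        X[index[tok], j] += 1

# Truncated SVD: keep the k strongest latent dimensions.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
doc_vecs = (np.diag(S[:k]) @ Vt[:k]).T  # one k-dimensional row per document


def similarity(a, b):
    """Cosine similarity between documents a and b in the latent space."""
    va, vb = doc_vecs[a], doc_vecs[b]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

Because the two senses of "bank" are distinct tokens, `similarity(0, 1)` comes out higher than `similarity(0, 2)`, whereas a plain bag-of-words model would count the shared surface word "bank" as a match.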

Technology Used

by Dr Peter Bleackley
