The Entropy of Alice in Wonderland
Several years ago, I read in New Scientist about an information-theory-based technique for identifying the most significant words in a document according to the role they play in its structure. After looking up the paper, “Towards the quantification of semantic information in written language” by Marcelo Montemurro and Damian Zanette, I implemented the algorithm and contributed it to Gensim. Unfortunately, it’s no longer included in the latest release, but I have created a fork of Gensim to allow further development of the features that were dropped.
When I found the text of Alice’s Adventures in Wonderland as a Kaggle Dataset, it gave me the opportunity to create a demonstration of the algorithm.
I also created a video explaining it.
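As a rough illustration of the idea behind the technique (not the Gensim implementation, and simplified from the paper), the sketch below splits a tokenised text into equal blocks, measures the entropy of each word's distribution across those blocks, and compares it with the average entropy the same word has after the text is randomly shuffled. Words that cluster in particular parts of the text have lower entropy than their shuffled baseline, so they score highly. The function names, the default block count, and the use of explicit shuffling to estimate the baseline are all assumptions made for this example.

```python
import math
import random
from collections import Counter, defaultdict


def block_entropies(tokens, n_blocks):
    """Return {word: entropy in bits of its distribution over n_blocks blocks}."""
    per_word = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        block = i * n_blocks // len(tokens)  # index of the block holding token i
        per_word[tok][block] += 1
    entropies = {}
    for word, counts in per_word.items():
        total = sum(counts.values())
        entropies[word] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropies


def keyword_scores(tokens, n_blocks=8, n_shuffles=20, seed=0):
    """Score each word by frequency * (shuffled entropy - actual entropy).

    Hypothetical helper: the shuffled baseline is estimated by averaging
    over n_shuffles random permutations of the text.
    """
    rng = random.Random(seed)
    actual = block_entropies(tokens, n_blocks)
    baseline = defaultdict(float)
    shuffled = list(tokens)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        for word, h in block_entropies(shuffled, n_blocks).items():
            baseline[word] += h / n_shuffles
    freq = Counter(tokens)
    n = len(tokens)
    return {w: (freq[w] / n) * (baseline[w] - actual[w]) for w in freq}
```

Calling `keyword_scores(text.lower().split())` on a document and sorting the result by score gives a ranked list of candidate keywords. Evenly spread function words such as "the" end up near zero or below, while words tied to particular sections rise to the top.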