Data Science Notebooks
The pages in this section describe projects I have undertaken with publicly available datasets, mostly from Kaggle. They provide an opportunity to see practical demonstrations of my data science work.
- Clustering Proteins in Breast Cancer Patients
- Using the Breast Cancer Proteome dataset, I identified clusters of proteins with related activity and investigated using them to predict clinical outcomes. Illustrates data reduction, hierarchical clustering, logistic regression and linear regression.
- The Entropy of Alice In Wonderland
- Using Montemurro and Zanette’s algorithm to identify significant words and sentences in the text of Alice’s Adventures in Wonderland. Illustrates information theory.
- The Grammar of Truth and Lies
- Using grammatical features to distinguish real from fake news. Illustrates latent semantic indexing, logistic regression and random forests.
- Is It A Mushroom or Is It A Toadstool?
- Predicting whether or not fungi are edible. Illustrates Bayes’ theorem and information theory.
- Part of Speech Tagging
- A video, and an associated Binder notebook, discussing different approaches to Part of Speech Tagging. Illustrates Hidden Markov Models.