Innovative use of machine learning in extracting attitudes towards vaccination from Russian social media

This project has been shortlisted for the DPH Innovation Prize – Best Data Driven Innovation


Team: Daria Tserkovnay (The Vaccine Confidence Project™)

Outline: Declining confidence in vaccines is now a global problem and contributes to outbreaks of vaccine-preventable diseases. It is fueled by online debate between anti-vaccine and pro-vaccine advocates.

The algorithm was written in Python to investigate vaccine-related attitudes online. Because it learns the language of the vaccine confidence debate, it can classify the sentiment polarity of individual items. Assigning negative and positive sentiment is difficult in vaccine confidence research because the two are easily confused: a pro-vaccine advocate condemning anti-vaccine attitudes uses negative language, and vice versa. A general-purpose, standardized sentiment analysis algorithm may therefore misclassify posts because of the distinctive vocabulary used in the vaccine debate.
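
The misclassification risk described above can be illustrated with a minimal sketch of a generic word-counting sentiment scorer. The word lists, the function name `lexicon_polarity` and the example post are all invented for illustration; they are not taken from the project.

```python
import re

# Hypothetical mini-lexicon for illustration; real sentiment lexicons are far larger.
POSITIVE = {"safe", "protect", "effective"}
NEGATIVE = {"dangerous", "irresponsible", "harm"}


def lexicon_polarity(text: str) -> str:
    """Label text by counting positive and negative lexicon words."""
    tokens = re.findall(r"\w+", text.lower())
    balance = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


# An invented pro-vaccine post that condemns vaccine refusal.
post = "Refusing vaccines is dangerous and irresponsible"
label = lexicon_polarity(post)
```

Here `label` comes out as "negative" even though the post argues in favour of vaccination, which is the kind of error a classifier trained on the debate's own language is meant to avoid.
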
The algorithm has the potential to be applied to a much larger dataset. It is an innovation that can replace manual human coding of the data. With more extensive trials of inter- and intra-coder reliability, it could serve as the basis for a more capable AI system that runs on larger datasets. It is tailored to, but not restricted to, the topic of vaccine confidence and misinformation, so the approach has the potential to generalize.

The algorithm was trained on a unique dataset mined from social networks. Because of the nature of text from such sources, traditional text sources could not be used for training: colloquial language, shorthand, hashtags and mentions of other users and groups are all distinctive features of social media text. Additionally, because of the nature of the topic, standard sentiment datasets would not give optimal results. A carefully selected dataset gathered from social media sources was therefore used to train the algorithm and overcome these two obstacles.
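
One way to handle the social media features listed above (hashtags, mentions, links) before training is sketched below. The project's actual preprocessing is not described, so the placeholder tokens `<url>` and `<user>`, the function name `normalize_post` and the example post are assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical normalisation rules; the project's own scheme is not documented.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
TOKEN_RE = re.compile(r"<\w+>|\w+")


def normalize_post(text: str) -> List[str]:
    """Lowercase a post and turn social media markup into plain tokens."""
    text = text.lower()
    text = URL_RE.sub(" <url> ", text)       # links carry little sentiment
    text = MENTION_RE.sub(" <user> ", text)  # hide individual account names
    text = HASHTAG_RE.sub(r" \1 ", text)     # keep the hashtag word, drop '#'
    return TOKEN_RE.findall(text)            # \w also matches Cyrillic letters


# Invented example post.
tokens = normalize_post("@ivan Прививки спасают жизни! #вакцинация https://example.com/x")
```

Keeping the hashtag word while dropping the `#` sign is a design choice: in a debate like this one the hashtag itself often signals the author's stance.
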

We trained a Support Vector Machine on this unique dataset and applied it to the main dataset of social media posts, producing a sentiment analysis of the largest Russian-language social network. The model achieved an accuracy above 90%. The results show the distribution of negative and positive sentiment, and have enabled the scientific community to examine the attitudes of the Russian-speaking population towards vaccination in a way that had not been done previously.
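
To make the classification step concrete, the following is a minimal, dependency-free sketch of a linear Support Vector Machine over bag-of-words features. It is not the project's implementation: the training procedure (Pegasos-style subgradient descent without a bias term), the feature map, the parameter values and the toy posts are all assumptions chosen to keep the example short. A mature library implementation would normally be preferred in practice.

```python
from collections import Counter
from typing import Dict, List, Tuple

Features = Dict[str, float]


def featurize(text: str) -> Features:
    """Bag-of-words counts over lowercased whitespace tokens."""
    return dict(Counter(text.lower().split()))


def score(weights: Features, features: Features) -> float:
    """Linear decision value w . x (no bias term in this sketch)."""
    return sum(weights.get(tok, 0.0) * val for tok, val in features.items())


def train_linear_svm(data: List[Tuple[Features, int]],
                     epochs: int = 50, lam: float = 0.01) -> Features:
    """Pegasos-style subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 (positive sentiment) or -1 (negative sentiment).
    """
    weights: Features = {}
    step = 0
    for _ in range(epochs):
        for features, label in data:
            step += 1
            eta = 1.0 / (lam * step)           # decaying learning rate
            violated = label * score(weights, features) < 1.0
            for tok in weights:                # shrink: gradient of the L2 penalty
                weights[tok] *= 1.0 - eta * lam
            if violated:                       # hinge loss is active for this post
                for tok, val in features.items():
                    weights[tok] = weights.get(tok, 0.0) + eta * label * val
    return weights


def predict(weights: Features, text: str) -> int:
    """Return +1 or -1 according to the sign of the decision value."""
    return 1 if score(weights, featurize(text)) >= 0 else -1


# Invented toy posts for illustration only; not the project's data.
train = [
    ("vaccines save lives", 1),
    ("vaccines cause harm", -1),
    ("got my shot and feel protected", 1),
    ("refuse these dangerous injections", -1),
]
weights = train_linear_svm([(featurize(t), y) for t, y in train])
```

After training, `predict(weights, "save lives")` returns 1 and `predict(weights, "dangerous injections")` returns -1. A realistic model would need far more labelled posts and a held-out test set before any accuracy figure could be reported.
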

There is an opportunity to develop the algorithm further for higher sensitivity in similar research, and potential to apply it to other topics of health-related sentiment. Its application would make possible high-sensitivity recognition of relevant posts and more accurate assignment of sentiment towards vaccines. A tailored algorithm is essential given the niche nature of this field and the distinctive vocabulary used in online discussion of the topic.