MAVIS Twitter dataset: A collection of tweets and sentiment analysis in Spanish about vaccines and diseases during the period 2015-2018

González, Alejandro Rodríguez; Tuñas, Juan Manuel; Santamaría, Lucia Prieto; Peces-Barba, Diego Fernandez; Ruiz, Ernestina Menasalvas; Jaramillo, Almudena; Cotarelo, Manuel; Fernández, Antonio J. Conejo; Arce, Amalia; Gil, Angel

doi:10.5281/ZENODO.4335593

González, Alejandro Rodríguez ¹
Tuñas, Juan Manuel ¹
Santamaría, Lucia Prieto ¹
Peces-Barba, Diego Fernandez ¹
Ruiz, Ernestina Menasalvas ¹
Jaramillo, Almudena ²
Cotarelo, Manuel ²
Fernández, Antonio J. Conejo ³
Arce, Amalia ⁴
Gil, Angel ⁵

1 Universidad Politécnica de Madrid, Madrid, Spain (ROR: https://ror.org/03n6nwv02)
2 Global Medical and Scientific Affairs, MSD España
3 Hospital Vithas Xanit Internacional
4 HM Nens
5 Universidad Rey Juan Carlos, Madrid, Spain (ROR: https://ror.org/01v5cv687)

Publisher: Zenodo

Year of publication: 2020

Type: Dataset

CC BY 4.0

DOI: 10.5281/ZENODO.4335593 (open access)

Abstract

The MAVIS dataset is a knowledge base of Twitter messages published in Spanish between 2015 and 2018, compiled for sentiment analysis of specific vaccines and their related diseases. The diseases and vaccines covered are:

- Invasive meningococcal disease (“EMI” in Spanish): Bexsero, Trumenba, Nimenrix
- Invasive pneumococcal disease (“ENI” in Spanish)
- Influenza
- Hepatitis
- Rotavirus: Rotarix, Rotateq
- Measles (“Sarampión” in Spanish) and MMR (“Triple vírica” in Spanish)
- Sepsis
- Whooping cough (“Tosferina” in Spanish)
- Chickenpox (“Varicela” in Spanish): Varivax, Varilrix; and Shingles (“Zoster” in Spanish)
- Human papillomavirus infection (“VPH” in Spanish): Cervarix, Gardasil

Tweets were manually classified as having a negative or non-negative sentiment by 5 experts. In addition, an automatic classification was performed with 3 different tools: IBM Watson (now Watson Tone Analyzer, https://www.ibm.com/watson/services/tone-analyzer/), Google Cloud Natural Language (https://cloud.google.com/natural-language), and Meaning Cloud (https://www.meaningcloud.com/). IBM Watson and Google Cloud Natural Language returned a numerical sentiment score ranging from -1 to 1, while Meaning Cloud returned a categorical variable with the values ‘P+’, ‘P’, ‘NEU’, ‘N’ and ‘N+’, which were converted to 1, 2, 3, 4 and 5 respectively. A machine learning metamodel was developed using the IBM Watson, Google Cloud Natural Language, and Meaning Cloud annotations as input variables and the experts’ classification as the target label. Tweets were also annotated with the sentiment output given by this classifier.

The provided data includes information about each tweet, information about the users who posted the tweets, the keywords mentioned in each tweet, and the annotations that the experts, the tools, and the model gave to each tweet.
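The encoding of the tool annotations described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the keys `ibm_watson`, `google_nl` and `meaning_cloud` are hypothetical names rather than the dataset's actual field names, the example row is made up, and rejecting labels outside the five listed values is an assumption about how unexpected input might be handled.

```python
from typing import Dict, List

# Mapping stated in the abstract: Meaning Cloud polarity label -> integer code.
MEANING_CLOUD_CODES: Dict[str, int] = {"P+": 1, "P": 2, "NEU": 3, "N": 4, "N+": 5}


def encode_meaning_cloud(label: str) -> int:
    """Convert a Meaning Cloud polarity label to its integer code."""
    try:
        return MEANING_CLOUD_CODES[label]
    except KeyError:
        # Assumption: labels outside the five listed values are rejected, not guessed.
        raise ValueError(f"unexpected Meaning Cloud label: {label!r}") from None


def build_features(row: Dict[str, object]) -> List[float]:
    """Assemble the three tool annotations into one feature vector.

    The keys 'ibm_watson', 'google_nl' and 'meaning_cloud' are hypothetical
    names, not the dataset's actual field names.
    """
    ibm = float(row["ibm_watson"])
    google = float(row["google_nl"])
    # Both numerical scores are described as ranging from -1 to 1.
    for name, score in (("ibm_watson", ibm), ("google_nl", google)):
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"{name} score out of range [-1, 1]: {score}")
    return [ibm, google, float(encode_meaning_cloud(str(row["meaning_cloud"])))]


# Made-up example row, for illustration only.
example = {"ibm_watson": -0.62, "google_nl": -0.4, "meaning_cloud": "N+"}
features = build_features(example)
```

A vector like `features`, paired with the experts' negative or non-negative label, is the kind of input a metamodel of this sort would be trained on. The abstract does not name the learning algorithm, so none is assumed here.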
<strong>Funding</strong>: This dataset was obtained with funding from MSD, Spain under the MAVIS Study (VEAP ID: 7789).

<strong>Studies using this dataset at the time of publication</strong>:

- Rodríguez-González et al., “Creating a metamodel based on machine learning to identify the sentiment of vaccine and disease-related messages in Twitter: the MAVIS study” in 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Jul. 2020, p. 6. DOI: 10.1109/CBMS49503.2020.00053
- Rodríguez-González et al., “Identifying Polarity in Tweets from an Imbalanced Dataset about Diseases and Vaccines Using a Meta-Model Based on Machine Learning Techniques” in Applied Sciences, 2020, 10. DOI: 10.3390/app10249019
