Ethically Developing an African American Tweet Detection Algorithm to Inform Culturally Sensitive Twitter Based Social Support

Full Title: Ethically Developing an African American Tweet Detection Algorithm to Inform Culturally Sensitive Twitter Based Social Support Intervention for Dementia Caregivers


Team Information

Team Members

  • Sunmoo Yoon, Associate Research Scientist in the Department of Medicine, Vagelos College of Physicians and Surgeons, Columbia University


The prevalence of dementia is higher among African Americans than among Whites. Although deep learning and other statistical techniques have been widely applied to infer demographic information on Twitter, the resulting demographic detection algorithms tend to be unavailable to open science communities and/or require access to account details that could compromise individuals’ privacy. The purpose of this study is to develop a lexicon-based African American Tweet detection algorithm using artificial intelligence techniques to inform a culturally sensitive, Twitter-based social support intervention for African American dementia caregivers. For our Tweet corpora, we used the Twitter API to extract 3,291,101 Tweets with hashtags associated with African American-related discourse (#BlackTwitter, #BlackLivesMatter, #StayWoke) and 1,382,441 Tweets for the non-Black control set (general or no hashtags) from September 1, 2019 to December 31, 2019. For our literature corpora, we extracted 14,692 poems and prose writings by African American authors and 66,083 items authored by others as a control, including poems, plays, short stories, novels and essays, using a cloud-based machine learning platform (Amazon SageMaker) via ProQuest TDM Studio. We combined statistics from log-likelihood and Fisher's exact tests with feature analysis of a batch-trained Naive Bayes classifier to select the lexicon terms most strongly associated with the target or control Tweets. A total of 803,495 Tweets (24.41%) associated with African American-related discourse and 369,348 Tweets (26.71%) in the control group were identified as unique and not bot-generated. The current lexicon developed in this study contains 1,735 unigrams for the African American lexicon and 2,267 unigrams for the control set. A lexicon composed of unigrams was more effective at differentiating Tweets from held-out test samples of the two groups.
Identifying existing African American communities and discourse patterns on social media platforms like Twitter is a basic first step towards understanding the community and culture, which is necessary for developing culturally sensitive interventions (e.g., terms, norms, culturally sensitive expressions). As a limitation, it is important to consider the ethical issues regarding the use of Twitter data in mental health surveillance. Researchers must be aware of the unresolved distrust towards scientists and health professionals among African Americans in the U.S. due to historical factors (the Tuskegee experiment). With these ethical concerns in mind, the first version of our lexicon-based African American Tweet detection algorithm, developed using literature and Tweet texts, can be used both effectively and ethically to inform culturally sensitive Twitter-based social support interventions for African American dementia caregivers. Future studies are needed to refine this algorithm.
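
The term-selection step described above can be illustrated with a small sketch. This is not the study's code: it shows only a Dunning-style log-likelihood (G2) keyness score, and leaves out the Fisher's exact test and the Naive Bayes feature analysis that the study combined with it. The function names, the whitespace tokenizer, the 3.84 threshold (the chi-square critical value for p < 0.05 with one degree of freedom) and the placeholder tokens "alpha", "beta" and "common" are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood (G2) keyness score for a single term.

    a, b: counts of the term in the target and control corpora.
    total_a, total_b: total token counts of the two corpora.
    """
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2.0 * g2


def build_lexicons(target_docs, control_docs, threshold=3.84):
    """Assign each sufficiently distinctive unigram to one of two lexicons."""
    # Hypothetical tokenizer: lowercase and split on whitespace.
    target_counts = Counter(t for d in target_docs for t in d.lower().split())
    control_counts = Counter(t for d in control_docs for t in d.lower().split())
    total_t = sum(target_counts.values())
    total_c = sum(control_counts.values())

    target_lex, control_lex = set(), set()
    for term in set(target_counts) | set(control_counts):
        a, b = target_counts[term], control_counts[term]
        if log_likelihood(a, b, total_t, total_c) < threshold:
            continue  # not distinctive enough for either lexicon
        # G2 is unsigned, so relative frequency decides the direction.
        if a / total_t > b / total_c:
            target_lex.add(term)
        else:
            control_lex.add(term)
    return target_lex, control_lex


def lexicon_score(text, target_lex, control_lex):
    """Target-lexicon hits minus control-lexicon hits for one text."""
    tokens = text.lower().split()
    hits_target = sum(tok in target_lex for tok in tokens)
    hits_control = sum(tok in control_lex for tok in tokens)
    return hits_target - hits_control


# Toy corpora built from neutral placeholder tokens, not real Tweet data.
toy_target = ["alpha alpha common"] * 10
toy_control = ["beta beta common"] * 10
toy_target_lex, toy_control_lex = build_lexicons(toy_target, toy_control)
```

In this toy setup, "common" occurs equally often in both corpora and so falls below the threshold, while "alpha" and "beta" each land in one lexicon. A positive `lexicon_score` then leans towards the target group and a negative one towards the control group; how the study's algorithm actually scores or thresholds Tweets is not stated here.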

Team Lead Contact

Sunmoo Yoon:

