Using large language models to create lexicons for interpretable text models with high content validity: the Suicide Risk Lexicon

Key findings

GPT-4 Turbo automatically generated a Suicide Risk Lexicon covering 49 risk factors. After clinical validation, the lexicon outperformed LIWC and performed similarly to some deep learning models in predicting imminent risk. The released construct-tracker Python package supports building similar interpretable lexicons in other domains.

Abstract

Source: openalex

Researchers often want to measure a variety of constructs such as anxiety, discrimination, or loneliness in text data from surveys, interviews, social media, and electronic health records. Large language models (LLMs), while optimal for text classification, remain infeasible for many researchers due to concerns around computational expertise, cost, privacy, and compute requirements. Therefore, some researchers prefer to use lightweight models for large datasets or interpretable models to avoid mistakes in high-stakes scenarios such as suicide risk detection. Lexicons offer simple baselines for LLMs by searching for relevant phrases, and they can be used together with LLMs to guarantee that specific keywords are captured deterministically. However, building new lexicons is resource intensive. In this study, we found that GPT-4 Turbo was able to automatically create a lexicon for 49 known risk factors for suicidal thoughts and behaviors, which we release as the Suicide Risk Lexicon. This approach quickly measures most constructs relevant for this application, resulting in high content validity. The lexicon was able to accurately predict risk in crisis counseling conversations. After validation by clinical experts, it outperformed the LIWC lexicon (which has low content validity for mental illness) and performed similarly to some black-box deep learning models. Because the approach is interpretable and has high content validity, we discovered that active suicidal ideation and direct self-injury were stronger indicators of imminent risk than passive suicidal ideation and depressed mood in this ecological setting. To simplify creating new lexicons for other research domains, we introduce a Python package, construct-tracker, that works with a variety of LLMs. In sum, while we recommend using LLMs for text classification, they remain out of reach for many researchers.
Our work demonstrates that LLMs, despite being black boxes that can be challenging to use, can counterintuitively create interpretable models by generating lexicons when interpretability is preferred. Furthermore, we highlight the broader application of lexicons beyond measurement, including their use in benchmarking LLM performance.
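
To make the idea of deterministic phrase matching concrete, the following is a minimal sketch of how a lexicon can count construct-related phrases in a document. It is not the construct-tracker API and does not use entries from the Suicide Risk Lexicon: the toy lexicon, the function names (compile_lexicon, count_matches, construct_counts), and the example text are all hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: construct name -> list of phrases.
# These entries are illustrative only, not taken from the Suicide Risk Lexicon.
LEXICON = {
    "loneliness": ["lonely", "isolated", "no one to talk to"],
    "anxiety": ["anxious", "worried", "panic attack"],
}


def compile_lexicon(lexicon):
    """Build one case-insensitive, whole-word regex per construct."""
    compiled = {}
    for construct, phrases in lexicon.items():
        if not phrases:
            continue  # skip constructs with no phrases to avoid an empty pattern
        # Try longer phrases first so multiword phrases win over shorter overlaps.
        ordered = sorted(phrases, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
        compiled[construct] = re.compile(pattern, flags=re.IGNORECASE)
    return compiled


def count_matches(text, compiled):
    """Return, per construct, a Counter of the matched phrases (lowercased)."""
    return {
        construct: Counter(m.group(0).lower() for m in pattern.finditer(text))
        for construct, pattern in compiled.items()
    }


def construct_counts(text, compiled):
    """Return the total number of phrase matches per construct."""
    return {c: sum(hits.values()) for c, hits in count_matches(text, compiled).items()}


compiled = compile_lexicon(LEXICON)
example = "I feel so lonely and isolated. I'm worried, and I had a Panic Attack yesterday."
matches = count_matches(example, compiled)
totals = construct_counts(example, compiled)
```

Because the matched phrases are kept alongside the counts, each score can be traced back to the exact words that produced it, which is the interpretability property the abstract emphasizes. The same input always yields the same matches, unlike sampling from an LLM.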
