The PHQ-9 is a nine-item screening instrument developed for primary care settings to identify and monitor depression severity. It is widely used, well validated, and brief to administer, and it was designed for a specific purpose: screening and monitoring in generalist settings. It was not designed to capture the full range of outcomes relevant to CBT delivery in digital health, and using it as the primary or sole outcome measure for a CBT program creates a measurement gap that matters for clinical quality and value-based contracting alike.
This is not a critique of PHQ-9 as an instrument. It is a critique of measurement strategies that reduce CBT outcomes to a single scalar score, then make contracting and clinical quality claims on that basis. If you are evaluating digital CBT vendors whose evidence presentations lead with PHQ-9 score changes and stop there, you are seeing an incomplete picture.
What PHQ-9 Measures and What It Doesn't
The PHQ-9 measures nine symptom domains derived from the DSM criteria for major depressive episode: depressed mood, anhedonia, sleep disturbance, fatigue, appetite change, worthlessness/guilt, concentration difficulty, psychomotor changes, and suicidal ideation. Each item is scored 0 to 3, so total scores range from 0 to 27; the commonly used clinical thresholds are minimal (0-4), mild (5-9), moderate (10-14), moderately severe (15-19), and severe (20-27).
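The scoring rule above is simple enough to sketch in a few lines. The following is a minimal illustration, not a clinical tool: the function names are hypothetical, and the 5-point response threshold is the approximate figure used elsewhere in this piece, exposed as a parameter so it can be changed.

```python
from typing import Sequence

# Severity bands as listed in the text: (lowest total, highest total, label).
SEVERITY_BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def phq9_total(item_scores: Sequence[int]) -> int:
    """Sum the nine item scores, each of which must be an integer 0-3."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 requires exactly nine item scores")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item score must be an integer from 0 to 3")
    return sum(item_scores)


def phq9_severity(total: int) -> str:
    """Map a 0-27 total onto the severity bands above."""
    for low, high, label in SEVERITY_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("total must be between 0 and 27")


def is_response(intake_total: int, followup_total: int, threshold: int = 5) -> bool:
    """True if the score dropped by at least `threshold` points (assumed cutoff)."""
    return intake_total - followup_total >= threshold
```

Note how much information the band label discards: totals of 10 and 14 are both "moderate", which is one reason the rest of this piece argues for looking beyond the single score.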
What the PHQ-9 measures well: symptom severity at a point in time, longitudinal severity tracking when administered consistently, and crude treatment response (a reduction of about 5 points is commonly treated as clinically meaningful change). What the PHQ-9 does not measure: functional impairment (the form does include a supplementary question about interference with daily activities, but that question is not part of the 0-27 total), cognitive change (does the patient think differently about themselves and their future?), behavioral activation (is the patient engaging in more activities associated with positive affect?), relapse risk factors (residual symptoms that predict later recurrence), or patient-reported experience of treatment (did the patient find the intervention helpful, and would they continue?).
For CBT specifically, the theory of change involves cognitive and behavioral mechanisms that should show up in measures beyond symptom scales. A patient who learns to identify and challenge automatic negative thoughts has acquired a skill, and that skill acquisition matters for durable outcomes whether or not it is immediately reflected in PHQ-9 scores. In CBT, symptom reduction without skill acquisition tends to be less durable.
The Functional Impairment Question
Depression and anxiety impose economic and functional costs that are captured poorly by symptom scales. The Sheehan Disability Scale (SDS), the Work and Social Adjustment Scale (WSAS), and the WHO Disability Assessment Schedule (WHODAS 2.0) all measure functional impairment across domains including work, social life, and daily activities with more granularity than the PHQ-9's single unscored difficulty question: "How difficult have these problems made it for you to do your work, take care of things at home, or get along with other people?"
For health plans and employers evaluating digital CBT programs, functional outcome measures carry more direct actuarial relevance than symptom scales. A 5-point PHQ-9 reduction is clinically meaningful, but it does not directly translate to absenteeism rates, presenteeism indices, or short-term disability claims, which are the cost drivers behind behavioral health investment decisions. Digital CBT vendors that track functional outcomes alongside symptom outcomes have a more complete and more commercially relevant evidence picture.
We track both PHQ-9 (depression) and GAD-7 (anxiety) as symptom measures, with WSAS as the primary functional impairment measure, administered at intake, four weeks, and eight weeks. For employers specifically, we report WSAS work item scores separately, as that dimension is most directly relevant to productivity-related cost calculations. We're not claiming our WSAS data are perfect: dropout before eight weeks means our follow-up data skew toward engaged users. But collecting WSAS distinguishes symptom change from functional change and creates a more honest basis for outcome claims.
CBT-Specific Skill Acquisition Measures
If CBT's mechanism of action involves skill acquisition — learning to identify cognitive distortions, practicing thought records, building behavioral activation habits — then measuring skill acquisition is measuring whether the therapy is doing what it's supposed to do. Symptom improvement without skill acquisition may reflect nonspecific factors (attention, expectancy, behavioral activation from regular check-ins) that are less durable than genuine cognitive change.
A few validated instruments bear on CBT skill acquisition. The Cognitive-Behavioral Therapy Skills Questionnaire (CBTSQ) is a patient self-report measure covering two skill domains: cognitive restructuring and behavioral activation. The Cognitive Therapy Scale-Revised (CTS-R) was developed for observer-rated therapist competence rather than patient skill, though some of its items may be adaptable for patient self-report on skill use between sessions.
More practically for digital health administration, session-embedded behavioral markers can proxy skill acquisition: thought record completion rates, between-session homework engagement rates, and prompted self-assessment of skill confidence (on a simple 0-10 scale) provide longitudinal signal about whether users are actually learning CBT skills rather than just using the app as a check-in mechanism. These metrics are producible from session data without adding survey burden and give the clinical team a leading indicator of therapeutic progress that symptom scales can miss, particularly in the early sessions before symptom change shows up.
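As a rough sketch of how these session-embedded markers could be derived, the following assumes a hypothetical per-session record; the `SessionRecord` fields and the `skill_markers` function are illustrative names, not a description of any particular product's data model. It computes the two completion rates and the change in self-rated skill confidence between the first and last rated sessions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SessionRecord:
    """Hypothetical per-session log entry for one user."""
    session_number: int
    thought_record_completed: bool
    homework_completed: bool
    skill_confidence: Optional[int] = None  # 0-10 self-rating; None if skipped


def skill_markers(sessions: List[SessionRecord]) -> Dict[str, Optional[float]]:
    """Summarize one user's skill-acquisition proxies across their sessions."""
    if not sessions:
        raise ValueError("at least one session is required")
    ordered = sorted(sessions, key=lambda s: s.session_number)
    n = len(ordered)
    ratings = [s.skill_confidence for s in ordered if s.skill_confidence is not None]
    # Change is only defined when the user rated confidence at least twice.
    change = float(ratings[-1] - ratings[0]) if len(ratings) >= 2 else None
    return {
        "thought_record_rate": sum(s.thought_record_completed for s in ordered) / n,
        "homework_rate": sum(s.homework_completed for s in ordered) / n,
        "confidence_change": change,
    }
```

The denominators here are sessions attended, so these rates say nothing about users who stopped attending. They are best read alongside the attrition figures discussed in the next section.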
The Attrition Problem in Outcome Reporting
This is where digital CBT outcome reporting gets difficult to evaluate: the vast majority of digital health outcome studies, including app RCTs and vendor case studies, report outcomes only for users who completed the measurement period. Dropout rates for digital mental health apps in the 8-12 week timeframe typically run between 50% and 80% depending on population and product. If you see an outcome study reporting PHQ-9 change at 8 weeks and the sample size at 8 weeks is 30% of the intake sample, the reported outcome is not representative of the population that used the product.
Intent-to-treat (ITT) analysis, which calculates outcomes across all enrolled users and handles dropouts' missing follow-up scores with an explicit method such as carrying the last observation forward, is the standard for RCTs and should be demanded for vendor outcome claims as well. Vendors who report outcomes only on completers are almost certainly overstating clinical effect sizes. The direction is clear and reliable: completers do better than dropouts, so completer-only analyses systematically inflate results.
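A small worked example makes the gap concrete. The sketch below assumes each user's PHQ-9 scores are stored as a list ordered by timepoint (intake, week 4, week 8), with `None` for a missed assessment and intake always present. It uses last observation carried forward (LOCF), which is only one of several ways to handle missing follow-up data. The function names and the five-person cohort are invented for illustration.

```python
from statistics import mean
from typing import List, Optional

Scores = List[Optional[int]]  # [intake, week 4, week 8]; None = not assessed


def last_observed(scores: Scores) -> int:
    """Return the most recent non-missing score (LOCF)."""
    observed = [s for s in scores if s is not None]
    if not observed:
        raise ValueError("user has no observed scores")
    return observed[-1]


def mean_change_completers(cohort: List[Scores]) -> float:
    """Mean PHQ-9 reduction among users with a final-timepoint score."""
    return mean(s[0] - s[-1] for s in cohort if s[-1] is not None)


def mean_change_itt_locf(cohort: List[Scores]) -> float:
    """Mean PHQ-9 reduction across all enrolled users, using LOCF."""
    return mean(s[0] - last_observed(s) for s in cohort)


# Illustrative cohort: two completers and three dropouts.
cohort = [
    [16, 12, 8],
    [18, 14, 10],
    [15, 14, None],
    [17, None, None],
    [14, 13, None],
]
print(mean_change_completers(cohort))  # 8
print(mean_change_itt_locf(cohort))    # 3.6
```

In this toy cohort the completer-only figure is more than double the ITT figure, purely because three of five users left before week 8.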
We report both completer and intent-to-treat analyses for our outcome data, and the ITT numbers are lower than the completer numbers — as they should be. Any vendor whose outcome presentation doesn't show this difference either doesn't track dropout (a data quality problem) or is selectively reporting (a transparency problem). Neither answer is reassuring.
Relapse and Durability Metrics
CBT has a well-established durability advantage over pharmacotherapy for depression and anxiety: skills learned in CBT provide some protection against relapse even after formal treatment ends, while discontinuing medication more often leads to symptom return. This is one of CBT's most clinically significant properties, and it is one that digital health outcome measures almost never capture because it requires 6-12 month follow-up that is logistically difficult to conduct for an app-based product.
The practical alternative for vendors and buyers: relapse-predictive residual symptom measures at end of treatment. Patients who complete an 8-week CBT program with PHQ-9 scores in the minimal range (0-4) have better 6-month trajectories than those who end treatment with PHQ-9 scores in the mild range (5-9), even if both represent meaningful improvement from intake. Tracking the distribution of end-of-treatment PHQ-9 scores — not just mean change — gives a more informative picture of population-level durability potential.
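One way to report the distribution rather than the mean is to tabulate the share of end-of-treatment scores in each severity band. The sketch below assumes a plain list of final PHQ-9 totals; the function names are hypothetical and the bands are the ones listed earlier in this piece.

```python
from collections import Counter
from typing import Dict, List

# Severity bands from the text: (lowest total, highest total, label).
BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def band_for(total: int) -> str:
    """Return the severity band label for a 0-27 PHQ-9 total."""
    for low, high, label in BANDS:
        if low <= total <= high:
            return label
    raise ValueError("PHQ-9 total must be between 0 and 27")


def end_of_treatment_distribution(final_scores: List[int]) -> Dict[str, float]:
    """Proportion of users in each band, including bands with no users."""
    if not final_scores:
        raise ValueError("no scores supplied")
    counts = Counter(band_for(score) for score in final_scores)
    n = len(final_scores)
    return {label: counts.get(label, 0) / n for _, _, label in BANDS}
```

Two programs with the same mean change can look very different here: one may move half its users into the minimal range, another may leave most of them in the mild range.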
We also track the Insomnia Severity Index (ISI) at end of treatment as a relapse predictor, given the well-documented relationship between residual insomnia and depression recurrence discussed in our earlier piece on sleep comorbidity. A user who ends treatment with normalized depression scores but elevated insomnia presents a different clinical risk profile from one whose sleep has also improved.
What This Means for Value-Based Contracting
Health plans and employers increasingly want outcome-linked contracting for digital mental health vendors. The challenge is that outcome-linked contracts require outcome definitions, and PHQ-9-only definitions are too easily gamed and too limited in their relevance to the cost drivers purchasers actually care about.
A more complete outcome framework for contract purposes might include: symptom response rate (PHQ-9 reduction of at least 5 points) at 8 weeks on an intent-to-treat basis; functional improvement rate (WSAS reduction of 10 or more points) at 8 weeks; engagement completion rate (percentage of enrolled users completing at least 6 sessions); and escalation rate (percentage of users appropriately referred to human clinical services). Together, these four metrics characterize a digital CBT program more completely than any single score: they capture symptom change, functional relevance, program completion, and appropriate scope management.
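The four metrics can be computed from one enrollment table. The sketch below is illustrative only: the `Enrollee` fields and the `contract_metrics` function are hypothetical, and it makes two assumptions the text does not specify. First, every rate uses all enrolled users as the denominator. Second, a user with no week-8 score counts as a non-responder, which is a conservative alternative to carrying observations forward. It also counts all escalations, since whether a referral was appropriate requires clinical review that code cannot supply.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Enrollee:
    """Hypothetical per-user record for contract reporting."""
    phq9_intake: int
    phq9_week8: Optional[int]  # None if the user dropped out
    wsas_intake: int
    wsas_week8: Optional[int]  # None if the user dropped out
    sessions_completed: int
    escalated: bool            # referred to human clinical services


def contract_metrics(
    cohort: List[Enrollee],
    phq9_drop: int = 5,
    wsas_drop: int = 10,
    min_sessions: int = 6,
) -> Dict[str, float]:
    """Compute the four rates over all enrolled users (assumed denominator)."""
    if not cohort:
        raise ValueError("cohort is empty")
    n = len(cohort)
    # Missing week-8 scores are treated as non-response (assumption).
    responders = sum(
        1 for e in cohort
        if e.phq9_week8 is not None and e.phq9_intake - e.phq9_week8 >= phq9_drop
    )
    improved = sum(
        1 for e in cohort
        if e.wsas_week8 is not None and e.wsas_intake - e.wsas_week8 >= wsas_drop
    )
    completed = sum(1 for e in cohort if e.sessions_completed >= min_sessions)
    escalated = sum(1 for e in cohort if e.escalated)
    return {
        "symptom_response_rate": responders / n,
        "functional_improvement_rate": improved / n,
        "engagement_completion_rate": completed / n,
        "escalation_rate": escalated / n,
    }
```

Keeping the thresholds as parameters matters in a contract setting, because the parties need to agree on them in writing before any data are analyzed.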
Vendors who push back on multi-metric outcome definitions with "PHQ-9 is the standard" are often doing so because their numbers on functional outcomes or attrition-adjusted symptom change are less favorable than their completer-only PHQ-9 data. That pushback is informative. The measurement conversation is one of the most reliable indicators of whether a vendor has actually interrogated their clinical evidence or is presenting it selectively.