Inter-Rater Agreement Analysis
The Challenge
When a doctor develops a new diagnostic model, it’s crucial to validate how consistently other physicians apply it. We needed to measure the level of agreement among multiple doctors classifying patient diagnoses.
Our Solution
We implemented Fleiss’ kappa analysis in R to statistically measure inter-rater reliability among multiple physicians evaluating patient cases.
Why Fleiss’ Kappa?
Unlike Cohen’s kappa (which only supports two raters), Fleiss’ kappa is designed for:
- Multiple raters (we had 4 physicians)
- Categorical classification data
- Statistical validation of agreement levels
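Concretely, the statistic compares the agreement actually observed with the agreement expected by chance:

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}

where \bar{P} is the mean proportion of agreeing rater pairs per subject and \bar{P}_e is the agreement expected by chance from the overall category frequencies. Values near 1 indicate near-perfect agreement; values near 0 mean the raters agree no more often than chance would predict.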
Technical Implementation
We used R with Jupyter notebooks for reproducible analysis:
library(irr)                                    # provides kappam.fleiss()
data <- read.csv("specific.csv", header=FALSE)  # rows = subjects, columns = raters
kappam.fleiss(data, exact=FALSE, detail=TRUE)   # detail=TRUE adds per-category kappas
The results showed:
- 173 subjects evaluated
- 4 raters participating
- Kappa = 0.788, indicating substantial agreement (0.61–0.80 on the Landis and Koch scale)
- Detailed breakdown by diagnosis category
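For teams working in Python rather than R, the same subjects-by-raters matrix can be run through statsmodels as a sanity check. This is a minimal sketch, not part of our original pipeline, and it assumes the same specific.csv layout (one row per patient, one column per physician):

import pandas as pd
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Same layout as the R analysis: rows = subjects, columns = raters
ratings = pd.read_csv("specific.csv", header=None)

# Collapse to a subjects x categories count table, then compute Fleiss' kappa
counts, categories = aggregate_raters(ratings.to_numpy())
print(fleiss_kappa(counts, method="fleiss"))

Both implementations compute the same statistic, so the Python value should match the irr result up to rounding.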
Data Preparation
Our Python preprocessing pipeline:
- Transformed raw data from per-patient-per-doctor rows to the subjects-by-raters matrix format kappam.fleiss() expects (see the sketch after this list)
- Created separate analyses for three different assessment types
- Automated the data normalization process
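A minimal sketch of the reshaping step; the input file name and column names (patient_id, doctor_id, diagnosis) are illustrative, not our actual schema:

import pandas as pd

# Raw export: one row per patient per doctor (hypothetical column names)
raw = pd.read_csv("raw_ratings.csv")

# Pivot to the matrix kappam.fleiss() expects: rows = patients, columns = doctors
matrix = raw.pivot(index="patient_id", columns="doctor_id", values="diagnosis")

# Keep only cases rated by every doctor, then write without header or index
matrix = matrix.dropna()
matrix.to_csv("specific.csv", header=False, index=False)

A loop over the three assessment types would produce one such matrix per analysis.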
Results
- Validated substantial agreement (kappa = 0.788) across physicians
- Identified specific diagnoses with highest/lowest agreement
- Provided statistical confidence (p < 0.05) that the observed agreement was not due to chance
- Delivered actionable insights within hours of receiving data
Technologies Used
- R with irr package for kappa calculations
- Python for data preprocessing
- Jupyter notebooks for reproducible analysis
- Conda for environment management