Inter-Rater Agreement Analysis
The Challenge
When a doctor develops a new diagnostic model, it’s crucial to validate how consistently other physicians apply it. We needed to measure the level of agreement among multiple doctors classifying patient diagnoses.
Our Solution
We implemented Fleiss’ kappa analysis in R to statistically measure inter-rater reliability among multiple physicians evaluating patient cases.
Why Fleiss’ Kappa?
Unlike Cohen’s kappa (which only supports two raters), Fleiss’ kappa is designed for:
- Multiple raters (we had 4 physicians)
- Categorical classification data
- Statistical validation of agreement levels
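Concretely, the statistic compares the agreement actually observed with the agreement expected by chance:

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}

where \bar{P} is the mean proportion of agreeing rater pairs per subject and \bar{P}_e is the agreement expected by chance from the overall category frequencies. Values near 1 indicate near-perfect agreement; values near 0 mean the raters agree no more often than chance would predict.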
Technical Implementation
We used R with Jupyter notebooks for reproducible analysis:
library(irr)                                    # provides kappam.fleiss()
data <- read.csv("specific.csv", header=FALSE)  # rows = subjects, columns = raters
kappam.fleiss(data, exact=FALSE, detail=TRUE)   # detail=TRUE adds per-category kappas
The results showed:
- 173 subjects evaluated
- 4 raters participating
- Kappa = 0.788, indicating substantial agreement (0.61–0.80 on the Landis and Koch scale)
- Detailed breakdown by diagnosis category
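For teams working in Python rather than R, the same subjects-by-raters matrix can be run through statsmodels as a sanity check. This is a minimal sketch, not part of our original pipeline, and it assumes the same specific.csv layout (one row per patient, one column per physician):

import pandas as pd
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Same layout as the R analysis: rows = subjects, columns = raters
ratings = pd.read_csv("specific.csv", header=None)

# Collapse to a subjects x categories count table, then compute Fleiss' kappa
counts, categories = aggregate_raters(ratings.to_numpy())
print(fleiss_kappa(counts, method="fleiss"))

Both implementations compute the same statistic, so the Python value should match the irr result up to rounding.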
Data Preparation
Our Python preprocessing pipeline:
- Transformed raw data from per-patient-per-doctor rows to the subjects-by-raters matrix format kappam.fleiss() expects (see the sketch after this list)
- Created separate analyses for three different assessment types
- Automated the data normalization process
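A minimal sketch of the reshaping step; the input file name and column names (patient_id, doctor_id, diagnosis) are illustrative, not our actual schema:

import pandas as pd

# Raw export: one row per patient per doctor (hypothetical column names)
raw = pd.read_csv("raw_ratings.csv")

# Pivot to the matrix kappam.fleiss() expects: rows = patients, columns = doctors
matrix = raw.pivot(index="patient_id", columns="doctor_id", values="diagnosis")

# Keep only cases rated by every doctor, then write without header or index
matrix = matrix.dropna()
matrix.to_csv("specific.csv", header=False, index=False)

A loop over the three assessment types would produce one such matrix per analysis.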
Results
- Validated substantial agreement (kappa = 0.788) across physicians
- Identified specific diagnoses with highest/lowest agreement
- Provided statistical confidence (p < 0.05) that the observed agreement was not due to chance
- Delivered actionable insights within hours of receiving data
Technologies Used
- R with irr package for kappa calculations
- Python for data preprocessing
- Jupyter notebooks for reproducible analysis
- Conda for environment management