M4

Did It Work?

Eval harness

An evaluation harness that measures ranking quality with labeled queries

Available Cohort
Effort: 3-5 hours
Prerequisite: M3
Core concept: nDCG, MRR, precision@k

What you have

Ranked results with no way to measure quality

What you gain

Measurable retrieval quality on a labeled dataset

What you build

This module replaces guesswork about ranking quality with measurements. You build an evaluation harness that loads labeled queries and computes nDCG, MRR, and precision@k over the ranked results from M3.

  • An evaluation runner that loads labeled queries and expected documents
  • Metric functions for nDCG, MRR, and precision@k
  • A report file that compares ranking runs over the same dataset
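As a rough illustration of the metric functions and runner listed above, here is a minimal sketch. It is not the module's reference solution. The function names, the `run` and `labels` dictionary shapes, and the choice of linear gain for nDCG are all assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """nDCG with linear gain (assumption; 2**gain - 1 is another common choice)."""
    def dcg(values: Sequence[float]) -> float:
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(doc, 0.0) for doc in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


def evaluate(run: Dict[str, List[str]],
             labels: Dict[str, Dict[str, float]],
             k: int = 5) -> Dict[str, float]:
    """Average each metric over every labeled query.

    run:    query id -> ranked document ids (hypothetical shape)
    labels: query id -> {document id: graded relevance} (hypothetical shape)
    """
    if not labels:
        raise ValueError("labels must contain at least one query")
    totals = {f"precision@{k}": 0.0, "mrr": 0.0, f"ndcg@{k}": 0.0}
    for query_id, gains in labels.items():
        ranked = run.get(query_id, [])  # a missing query scores zero
        relevant = {doc for doc, gain in gains.items() if gain > 0}
        totals[f"precision@{k}"] += precision_at_k(ranked, relevant, k)
        totals["mrr"] += reciprocal_rank(ranked, relevant)
        totals[f"ndcg@{k}"] += ndcg_at_k(ranked, gains, k)
    return {name: total / len(labels) for name, total in totals.items()}
```

Calling `evaluate` once per ranking run against the same `labels` gives comparable score dictionaries, which is the kind of data a report file could hold.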

What you learn

  • How retrieval metrics capture different kinds of ranking quality
  • Why a single score is not enough to explain search behavior
  • How labels, judgments, and cutoffs shape the meaning of the results
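To make these points concrete, the sketch below scores two made-up rankings for one query. Both have the same precision@3, but they differ in reciprocal rank, and changing the cutoff to 2 separates them on precision as well. The document ids, the `summarize` helper, and the single-relevant-document label set are hypothetical, chosen only for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are labeled relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def summarize(ranked, relevant):
    """Collect several views of the same ranking side by side."""
    return {
        "precision@2": precision_at_k(ranked, relevant, 2),
        "precision@3": precision_at_k(ranked, relevant, 3),
        "reciprocal_rank": reciprocal_rank(ranked, relevant),
    }


# Hypothetical labels: only "d1" is judged relevant for this query.
relevant = {"d1"}
run_a = ["d1", "d7", "d8", "d9"]  # relevant document ranked first
run_b = ["d7", "d8", "d1", "d9"]  # relevant document ranked third

for name, ranked in (("run_a", run_a), ("run_b", run_b)):
    print(name, summarize(ranked, relevant))
```

Looking only at precision@3 would call these runs equal, while reciprocal rank and precision@2 both prefer `run_a`. Adding or removing a label in `relevant` would shift every number, which is why the judgments are part of what each score means.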

Artifact and workload

Primary artifact: Evaluation pipeline with metrics and reporting

Tests: 30
Assessments: 5
Estimated time: 3-5 hours

Access

This module is part of the cohort. Join the guided path for reviews, deadlines, and the workshop sequence after the ranking modules.
