M4
Did It Work?
Eval harness
An evaluation harness that measures ranking quality with labeled queries
Available
Cohort
- Effort: 3-5 hours
- Prerequisite: M3
- Core concept: nDCG, MRR, precision@k
What you have
Ranked results with no way to measure quality
What you gain
Measurable retrieval quality on a labeled dataset
What you build
This module replaces guesses about ranking quality with measurements. You build an evaluation harness that loads labeled queries and computes nDCG (normalized discounted cumulative gain), MRR (mean reciprocal rank), and precision@k over your M3 outputs.
- An evaluation runner that loads labeled queries and expected documents
- Metric functions for nDCG, MRR, and precision@k
- A report file that compares ranking runs over the same dataset
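The three pieces listed above can be sketched in a few dozen lines. This is a minimal illustration, not the module's reference solution: the data shapes (`labels` and `run` as plain dictionaries), the function names, and the sample document ids are all assumptions made for the example. The nDCG variant shown uses the linear-gain form (gain divided by log2 of rank + 1); an exponential-gain form is also common.

```python
import math
from typing import Dict, List, Set

# Assumed data shapes (not specified by the module):
#   labels: query id -> {doc id -> graded relevance, 0 means not relevant}
#   run:    query id -> ranked list of doc ids, best first


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the top-k results divided by the DCG of an ideal ordering."""
    def dcg(values: List[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


def evaluate(run: Dict[str, List[str]],
             labels: Dict[str, Dict[str, float]],
             k: int = 5) -> Dict[str, float]:
    """Average each metric over every labeled query (hypothetical runner)."""
    totals = {f"precision@{k}": 0.0, "mrr": 0.0, f"ndcg@{k}": 0.0}
    for query_id, gains in labels.items():
        ranked = run.get(query_id, [])  # a query missing from the run scores zero
        relevant = {doc_id for doc_id, gain in gains.items() if gain > 0}
        totals[f"precision@{k}"] += precision_at_k(ranked, relevant, k)
        totals["mrr"] += reciprocal_rank(ranked, relevant)
        totals[f"ndcg@{k}"] += ndcg_at_k(ranked, gains, k)
    count = len(labels)
    return {name: (value / count if count else 0.0)
            for name, value in totals.items()}


# Made-up labels and one made-up ranking run, for illustration only.
labels = {"q1": {"d1": 1.0}, "q2": {"d4": 2.0, "d5": 1.0}}
baseline = {"q1": ["d2", "d1", "d3"], "q2": ["d4", "d5", "d6"]}
report = evaluate(baseline, labels, k=3)
print(report)
```

Calling `evaluate` once per ranking run against the same `labels` and writing the resulting dictionaries side by side (for example with `json.dump`) is one simple way to produce a comparison report.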
What you learn
- How retrieval metrics capture different kinds of ranking quality
- Why a single score is not enough to explain search behavior
- How labels, judgments, and cutoffs shape the meaning of the results
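A small made-up case shows why one score is not enough and how the cutoff changes the verdict. The two rankings, the document ids, and the helper names below are hypothetical; the sketch assumes binary relevance labels.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# One query with three relevant documents (illustrative labels).
relevant = {"a", "b", "c"}

# run_x puts one relevant document first, then nothing useful.
run_x = ["a", "x1", "x2", "x3", "x4"]
# run_y misses the top slot but returns all three relevant documents.
run_y = ["y1", "a", "b", "c", "y2"]

comparison = {
    name: {
        "rr": reciprocal_rank(ranked, relevant),
        "p@1": precision_at_k(ranked, relevant, 1),
        "p@5": precision_at_k(ranked, relevant, 5),
    }
    for name, ranked in {"run_x": run_x, "run_y": run_y}.items()
}

for name, scores in comparison.items():
    print(name, scores)
```

Here `run_x` wins on reciprocal rank and precision@1, while `run_y` wins on precision@5. Which run counts as "better" depends on the metric and the cutoff, which in turn should reflect how many results a user is expected to read.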
Artifact and workload
Primary artifact: Evaluation pipeline with metrics and reporting
Tests: 30
Assessments: 5
Estimated time: 3-5 hours
Access
This module is part of the cohort. Join the guided path for reviews, deadlines, and the workshop sequence after the ranking modules.