M4: Did It Work?
Eval harness
An evaluation harness that measures ranking quality with labeled queries
Status: Planned
Cohort
- Effort: To be published
- Prerequisite: M3
- Core concept: nDCG, MRR, precision@k
What you have
Ranked results with no way to measure quality
What you gain
Measurable retrieval quality on a labeled dataset
What you build
The module is planned, but its core shape is fixed: ranking outputs from M3 will feed a repeatable evaluation harness. (A sketch of that shape follows the list below.)
- An evaluation runner that loads labeled queries and expected documents
- Metric functions for nDCG, MRR, and precision@k
- A report file that compares ranking runs over the same dataset
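Since M4 hasn't shipped, the following is only a minimal sketch of that shape under stated assumptions: the labeled-query layout, the function names (`precision_at_k`, `reciprocal_rank`, `ndcg_at_k`, `evaluate`), and the binary-judgment simplification are all hypothetical, not the module's actual code.

```python
"""Minimal sketch of an evaluation harness for ranked retrieval.

Everything here (the labeled-query layout, the report format) is an
illustrative assumption, not the module's published code.
"""
import json
import math
from typing import Dict, List


def precision_at_k(ranked: List[str], relevant: set, k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    if k == 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: List[str], relevant: set) -> float:
    """1/rank of the first relevant result, or 0 if none is retrieved."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(docs):
        return sum(gains.get(doc, 0.0) / math.log2(i + 1)
                   for i, doc in enumerate(docs, start=1))
    ideal = sorted(gains, key=gains.get, reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(ranked[:k]) / idcg if idcg > 0 else 0.0


def evaluate(labeled_queries, rank_fn, k=10):
    """Run rank_fn over each labeled query and average the metrics."""
    totals = {"ndcg": 0.0, "mrr": 0.0, f"precision@{k}": 0.0}
    for q in labeled_queries:
        ranked = rank_fn(q["query"])
        relevant = set(q["relevant"])
        gains = {doc: 1.0 for doc in relevant}  # binary judgments
        totals["ndcg"] += ndcg_at_k(ranked, gains, k)
        totals["mrr"] += reciprocal_rank(ranked, relevant)
        totals[f"precision@{k}"] += precision_at_k(ranked, relevant, k)
    n = max(len(labeled_queries), 1)
    return {metric: value / n for metric, value in totals.items()}


if __name__ == "__main__":
    # Tiny labeled dataset: each query lists the documents judged relevant.
    labeled = [
        {"query": "vector search", "relevant": ["d1", "d3"]},
        {"query": "bm25 tuning", "relevant": ["d2"]},
    ]
    # Stand-in for the M3 ranker: a fixed ranking per query.
    fake_rankings = {
        "vector search": ["d3", "d5", "d1", "d2"],
        "bm25 tuning": ["d4", "d2", "d1"],
    }
    report = evaluate(labeled, lambda q: fake_rankings[q], k=3)
    print(json.dumps(report, indent=2))  # persist per run to compare runs
```

Writing the averaged report to a file per run is what makes the comparison in the last bullet possible: two runs over the same labeled dataset produce two JSON reports that can be diffed metric by metric.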
What you learn
- How retrieval metrics capture different kinds of ranking quality (see the definitions after this list)
- Why a single score is not enough to explain search behavior
- How labels, judgments, and cutoffs shape the meaning of the results
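The definitions behind these points are standard rather than specific to the module. With binary relevance labels $rel_i \in \{0, 1\}$ for the document at rank $i$, a query set $Q$, and $\mathrm{rank}_q$ the position of the first relevant result for query $q$:

```latex
\mathrm{precision@}k = \frac{1}{k}\sum_{i=1}^{k} rel_i,
\qquad
\mathrm{MRR} = \frac{1}{|Q|}\sum_{q \in Q} \frac{1}{\mathrm{rank}_q},
\qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
\quad\text{where}\quad
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}.
```

The cutoff changes what a metric rewards: a ranking whose only relevant document sits at position 1 scores MRR = 1 but precision@10 = 0.1, while one that buries three relevant documents at positions 8 through 10 scores precision@10 = 0.3 with an MRR of only 0.125. No single number captures both behaviors.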
Artifact and workload
Primary artifact: Evaluation pipeline with metrics and reporting
Tests: To be published
Assessments: To be published
Estimated time: To be published
Access
This module is planned. Join the waitlist to be notified when dates and access details are published.