M4

Did It Work?

Eval harness

An evaluation harness that measures ranking quality with labeled queries

Cohort: Planned
Effort: To be published
Prerequisite: M3
Core concept: nDCG, MRR, precision@k

What you have

Ranked results with no way to measure quality

What you gain

Measurable retrieval quality on a labeled dataset

What you build

This module is planned, but its core shape is fixed: the ranked outputs from M3 will feed a repeatable evaluation harness.

  • An evaluation runner that loads labeled queries and expected documents
  • Metric functions for nDCG, MRR, and precision@k (sketched after this list)
  • A report file that compares ranking runs over the same dataset
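
As a sketch only, here is one way the metric functions could look, assuming binary relevance labels (a document is either relevant to a query or not). The function names and signatures are illustrative assumptions, not the module's actual API; MRR is then the mean of the per-query reciprocal rank over the whole labeled set.

    import math

    def precision_at_k(ranked_ids, relevant_ids, k):
        """Fraction of the top-k results that are relevant."""
        hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
        return hits / k

    def reciprocal_rank(ranked_ids, relevant_ids):
        """1 / rank of the first relevant result, or 0.0 if none appears."""
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                return 1.0 / rank
        return 0.0

    def ndcg_at_k(ranked_ids, relevant_ids, k):
        """Discounted gain of the top-k, normalized by the ideal ordering."""
        dcg = sum(
            1.0 / math.log2(rank + 1)
            for rank, doc_id in enumerate(ranked_ids[:k], start=1)
            if doc_id in relevant_ids
        )
        ideal_hits = min(len(relevant_ids), k)
        idcg = sum(1.0 / math.log2(r + 1) for r in range(1, ideal_hits + 1))
        return dcg / idcg if idcg > 0 else 0.0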

What you learn

  • How retrieval metrics capture different kinds of ranking quality
  • Why a single score is not enough to explain search behavior
  • How labels, judgments, and cutoffs shape the meaning of the results (see the example after this list)
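
A small, hypothetical example of the cutoff effect, reusing the metric sketch above: the same ranking, with relevant documents at ranks 2 and 9, reads very differently depending on k.

    # Hypothetical ranking for one query; d2 and d9 are the relevant docs.
    ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
    relevant = {"d2", "d9"}

    precision_at_k(ranked, relevant, 3)    # ~0.33 (one hit in the top 3)
    precision_at_k(ranked, relevant, 10)   # 0.20  (two hits, diluted by k)
    reciprocal_rank(ranked, relevant)      # 0.50  (first hit at rank 2)
    ndcg_at_k(ranked, relevant, 3)         # ~0.39 (the rank-9 hit is invisible)
    ndcg_at_k(ranked, relevant, 10)        # ~0.57 (the rank-9 hit now counts)

No single one of these numbers tells the whole story, which is why the harness reports several.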

Artifact and workload

Primary artifact: Evaluation pipeline with metrics and reporting

Tests: To be published
Assessments: To be published
Estimated time: To be published
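
To make the primary artifact concrete, here is a minimal sketch of a runner that loads labeled queries, scores a ranking function with the metric functions sketched earlier, and collects a per-run summary. The JSONL layout, the search_fn callable, and every name here are assumptions for illustration, not the module's actual design.

    import json
    from statistics import mean

    def evaluate(search_fn, labeled_path, k=10):
        """Score one ranking run over a file of labeled queries.

        Assumes one JSON object per line: {"query": ..., "relevant": [...]}.
        """
        rows = []
        with open(labeled_path) as f:
            for line in f:
                case = json.loads(line)
                ranked = search_fn(case["query"])  # doc ids, best first
                relevant = set(case["relevant"])
                rows.append({
                    "query": case["query"],
                    "p@k": precision_at_k(ranked, relevant, k),
                    "rr": reciprocal_rank(ranked, relevant),
                    "ndcg@k": ndcg_at_k(ranked, relevant, k),
                })
        # MRR is the mean reciprocal rank across all labeled queries.
        summary = {m: mean(r[m] for r in rows) for m in ("p@k", "rr", "ndcg@k")}
        return rows, summary

Persisting each run's summary (for example, as JSON keyed by run name) would give the comparison report a stable basis across ranking runs.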

Access

This module is planned. Join the waitlist to hear when dates and access details are published.

Join waitlist