
How to define a custom evaluator

Key concepts

Custom evaluators are functions that take a dataset example and the resulting application output and return one or more metrics. You can pass these functions directly to evaluate() / aevaluate().

Basic example

Requires langsmith>=0.2.0

from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct]
)
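Because an evaluator is an ordinary function, it can be called by hand before it is passed to evaluate(). The following is a minimal sketch of such a local check. The sample dictionaries are hypothetical hand-written values, not data from a real dataset.

```python
def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]


# Hypothetical values standing in for an application output and the
# reference output of a dataset example.
sample_reference = {"answer": "Paris"}

matches = correct({"answer": "Paris"}, sample_reference)
mismatches = correct({"answer": "paris"}, sample_reference)  # case differs

print(matches, mismatches)
```

An exact-match evaluator like this is case-sensitive, so a quick check of this kind can reveal whether the outputs need normalizing first.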

Evaluator args

Custom evaluator functions must have specific argument names. They can take any subset of the following arguments:

Python and JS/TS

  • run: langsmith.schemas.Run: The full Run object generated by the application on the given example.
  • example: langsmith.schemas.Example: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).

Currently Python only

  • inputs: dict: A dictionary of the inputs corresponding to a single example in a dataset.
  • outputs: dict: A dictionary of the outputs generated by the application on the given inputs.
  • reference_outputs: dict: A dictionary of the reference outputs associated with the example, if available.

For most use cases you need only inputs, outputs, and reference_outputs. run and example are useful only when you need trace or example metadata beyond the application's actual inputs and outputs.
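As an illustration of the run and example arguments, here is a sketch of two evaluators that read from those objects instead of the plain dictionaries. To keep it runnable without a LangSmith connection, it uses SimpleNamespace stand-ins (fake_run and fake_example, both hypothetical) that mimic only the attributes the evaluators touch. With real objects, end_time may be unset for an unfinished run, which this sketch does not handle.

```python
from datetime import datetime
from types import SimpleNamespace


def correct_from_run(run, example) -> dict:
    """Exact-match check that reads outputs from the Run and Example objects."""
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"name": "correct", "score": int(predicted == expected)}


def latency_seconds(run) -> float:
    """Wall-clock duration of the run, taken from the trace timestamps."""
    return (run.end_time - run.start_time).total_seconds()


# Hypothetical stand-ins that mimic only the attributes used above; in a real
# evaluation these would be langsmith.schemas.Run and langsmith.schemas.Example.
fake_run = SimpleNamespace(
    outputs={"answer": "4"},
    start_time=datetime(2024, 1, 1, 12, 0, 0),
    end_time=datetime(2024, 1, 1, 12, 0, 2),
)
fake_example = SimpleNamespace(
    inputs={"question": "What is 2 + 2?"},
    outputs={"answer": "4"},
    metadata={"difficulty": "easy"},
)

correct_result = correct_from_run(fake_run, fake_example)
latency_result = latency_seconds(fake_run)
print(correct_result, latency_result)
```

The latency metric is an example of something that is not available from inputs, outputs, and reference_outputs alone.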

Evaluator output

Custom evaluators are expected to return one of the following types:

Python and JS/TS

  • dict: dicts of the form {"score" | "value": ..., "name": ...} allow you to customize the metric type ("score" for numerical and "value" for categorical) and the metric name. This is useful if, for example, you want to log an integer as a categorical metric.

Currently Python only

  • int | float | bool: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
  • str: this is interpreted as a categorical metric. The function name is used as the name of the metric.
  • list[dict]: return multiple metrics using a single function.
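The return types listed above can be sketched as follows. The function names, metric names, and the 50-character threshold are illustrative choices rather than anything prescribed by LangSmith. The integer word count is converted to a string before it is logged under "value", on the assumption that categorical values are best passed as strings.

```python
def length_bucket(outputs: dict) -> str:
    """str return: a categorical metric named after the function."""
    return "short" if len(outputs["answer"]) < 50 else "long"


def word_count_category(outputs: dict) -> dict:
    """dict return: log an integer as a categorical metric with a custom name."""
    word_count = len(outputs["answer"].split())
    return {"name": "word_count", "value": str(word_count)}


def match_metrics(outputs: dict, reference_outputs: dict) -> list[dict]:
    """list[dict] return: several metrics from a single function."""
    predicted = outputs["answer"]
    expected = reference_outputs["answer"]
    return [
        {"name": "exact_match", "score": int(predicted == expected)},
        {"name": "case_insensitive_match", "score": int(predicted.lower() == expected.lower())},
    ]


# Hypothetical sample values for a quick local check.
sample_outputs = {"answer": "Paris is the capital"}
sample_reference = {"answer": "paris is the capital"}

bucket = length_bucket(sample_outputs)
category = word_count_category(sample_outputs)
metrics = match_metrics(sample_outputs, sample_reference)
print(bucket, category, metrics)
```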

Additional examples

Requires langsmith>=0.2.0

from langsmith import evaluate, wrappers
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# Compare actual and reference outputs
def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""

    instructions = """\
Given the following question, answer, and reasoning, determine if the reasoning for the \
answer is logically valid and consistent with the question and the answer."""

    # Structured output schema for the judge's verdict
    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = await oai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions}, {"role": "user", "content": msg}],
        response_format=Response
    )
    return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct, concision, valid_reasoning]
)
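To make the concision score concrete, the sketch below applies the same formula to answers of a few hypothetical lengths. Each full 1,000 characters adds one point, and the score is capped at 5.

```python
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1


# Hypothetical answer lengths (in characters) chosen to sit on the bucket edges.
lengths = (0, 999, 1000, 3999, 4000, 10000)
scores = {n: concision({"answer": "x" * n}) for n in lengths}
print(scores)
```

Under this formula, any answer shorter than 1,000 characters receives the best score of 1, and any answer of 4,000 characters or more receives 5.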
