第6章：评测框架对比 —— DeepEval/Promptfoo/Phoenix/LangSmith/Braintrust/MLflow/RAGAS

读懂 benchmark 和 LLM Judge 之后,真到生产你不会自己写 eval 框架——选对工具直接决定你能不能跑得通、跑得快、能不能集成 CI。本章把 8 个主流 eval 框架横评:DeepEval / Promptfoo / Phoenix / LangSmith / Braintrust / MLflow Scorer / RAGAS / OpenAI Evals——每个给设计哲学、最简代码、能力矩阵,最后给”双栈策略”:CI/CD gate + 观察平台。

1. 框架全景速查

框架	出品方	哲学	GitHub Stars	商业模式
DeepEval	Confident AI	Pytest-style unit test	7K+	开源 + Confident AI 平台
Promptfoo	Promptfoo Inc.	YAML CLI A/B testing	6K+	开源 + Cloud
Phoenix	Arize AI	OTel 观察+ eval	5K+	开源 + Arize Cloud
LangSmith	LangChain	LangChain 一体	—	SaaS(需账号)
Braintrust	Braintrust Inc.	Dataset + eval 一体	—	商业 SaaS
MLflow Scorer	Databricks	统一 scorer API(集成 DeepEval/RAGAS/Phoenix)	18K+(全 MLflow)	开源 + Databricks
RAGAS	Exploding Gradients	RAG 专用 metric	8K+	开源
OpenAI Evals	OpenAI	OpenAI-native	15K+	开源

2. DeepEval — Pytest-style unit test

Confident AI,github.com/confident-ai/deepeval

2.1 哲学

“Pytest for LLMs”——LLM 评测应该像写单元测试一样自然。

2.2 关键能力

能力	说明
14+ metrics	G-Eval、faithfulness、answer relevancy、hallucination、bias、toxicity 等
Pytest 集成	写 test,跑 CI,看 result
Custom metrics	自定义 G-Eval rubric
Multi-turn 对话 eval	支持完整对话历史
Synthetic data	自动生成测试数据
Confident AI Cloud	商业看板(可选)

2.3 最简代码

# test_my_agent.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval, AnswerRelevancyMetric, HallucinationMetric

correctness = GEval(
    name="Correctness",
    criteria="Determine if the actual output factually matches the expected output.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

answer_relevancy = AnswerRelevancyMetric(threshold=0.8)

def test_agent_response():
    test_case = LLMTestCase(
        input="北京天气怎么样?",
        actual_output=my_agent.run("北京天气怎么样?"),
        expected_output="约 20°C,晴天",
        retrieval_context=["北京 2026-05-07: 22°C 晴"],
    )
    assert_test(test_case, [correctness, answer_relevancy])

pytest test_my_agent.py -v
deepeval test run test_my_agent.py

2.4 优劣

✅ Pytest 用户零摩擦 ✅ Metrics 覆盖全 ✅ CI 集成最直观 ✅ G-Eval 内置好用

❌ Pointwise 评分 bias(第 4 章已讲) ❌ 大规模 trajectory eval 慢(逐题 LLM 调用)

2.5 适合谁

写单元测试式 eval、CI gating
已有 Pytest 习惯的工程师

3. Promptfoo — YAML A/B testing

github.com/promptfoo/promptfoo

3.1 哲学

“YAML 配 + CLI 跑——prompt 工程师的工作台。“

3.2 最简代码

# promptfooconfig.yaml
prompts:
  - "你是客服。回答:{{question}}"
  - "你是非常专业的客服,简洁回答:{{question}}"

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-haiku-4-5

tests:
  - vars:
      question: "我的订单 #123 还没到"
    assert:
      - type: contains
        value: "订单"
      - type: llm-rubric
        value: "回答应该 empathetic 且建议联系物流"

  - vars:
      question: "我要退货"
    assert:
      - type: cost
        threshold: 0.001  # 控制成本

promptfoo eval
promptfoo view  # 浏览器查看对比

3.3 优劣

✅ Prompt A/B testing 最方便 ✅ CLI / web view 易用 ✅ 多 provider 一键对比

❌ 适合 prompt 维度的 eval,不适合复杂 multi-turn agent ❌ Trajectory 级指标支持弱

3.4 适合谁

Prompt 工程优化
多模型对比选型

4. Phoenix(Arize)— OTel 观察+ eval

Arize AI,github.com/Arize-ai/phoenix

4.1 哲学

“先观察 production,后跑 eval”——把 eval 建在 trace 上。

模块六第 8 章 Observability 已讲过 Phoenix——它的 eval 部分在这里展开。

4.2 关键能力

能力	说明
OTel 原生	接收 OTel trace
观察驱动 eval	从 production trace 提样本评
LLM Judge metrics	hallucination、relevance、toxicity
Drift detection	embedding 分布漂移检测
Phoenix Judge ⭐	内置 judge,与 DeepEval/RAGAS 集成

4.3 最简代码

import phoenix as px
from phoenix.evals import OpenAIModel, HALLUCINATION_PROMPT_TEMPLATE, llm_classify

# 1. 启动 Phoenix(本地或 cloud)
px.launch_app()

# 2. 拉 OTel trace 数据(production)
df = px.get_traces_df()

# 3. 跑 eval
eval_model = OpenAIModel(model="gpt-4o")
hallucination_results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=eval_model,
    rails=["faithful", "hallucinated"],
)

# 4. 把 eval 结果回写 trace
px.log_evaluations(hallucination_results)

4.4 优劣

✅ 生产 trace 即 eval dataset——不需要造数据 ✅ OTel-native,与 LangSmith/Datadog/Tempo 共存 ✅ Drift detection 帮助早期发现退化 ✅ 集成 DeepEval/RAGAS metrics

❌ 学习曲线略高 ❌ 主要针对 production,前期 dev eval 弱

4.5 适合谁

已经在用 OTel 观察 stack
想从 production trace 直接 eval

5. LangSmith eval — LangChain 一体化

LangChain,smith.langchain.com

5.1 哲学

“LangGraph trace + Eval + Dataset 一站式”

5.2 关键能力

自动 trace LangGraph / LangChain
Dataset 管理(从生产 trace 一键转测试集)
内置 eval(criteria-based、custom function、LLM-as-Judge)
Annotation queue(人工标注协作)
Production monitoring(实时仪表盘)

5.3 最简代码

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# 1. 创建 dataset
client.create_dataset(dataset_name="my_eval_set")
client.create_examples(
    inputs=[{"question": "..."}],
    outputs=[{"answer": "..."}],
    dataset_name="my_eval_set",
)

# 2. 定义 evaluator
def correctness_evaluator(run, example):
    return {
        "key": "correctness",
        "score": run.outputs["answer"] == example.outputs["answer"],
    }

# 3. 跑 eval
results = evaluate(
    lambda inputs: my_agent.invoke(inputs),
    data="my_eval_set",
    evaluators=[correctness_evaluator],
)

5.4 优劣

✅ LangChain 栈零摩擦 ✅ Dataset / Eval / Annotation 闭环 ✅ 与 LangGraph trace 深度集成

❌ Vendor lock-in 倾向(虽然 OTel 桥接) ❌ 商业 SaaS,贵

5.5 适合谁

LangChain / LangGraph 重度用户
需要 dataset + eval + annotation 全栈一体

6. Braintrust — 商业专业派

Braintrust,braintrust.dev

6.1 哲学

“专业 LLM 工程团队的统一工作台”

6.2 关键能力

Dataset 管理 + 版本化
实验对比(diff view)
LLM playground
Production logs + eval
Agent eval 专门支持
Strong typing + SDK(TS / Python)

6.3 最简代码

from braintrust import Eval

Eval(
    "MyAgent eval",
    data=lambda: [
        {"input": "Q1", "expected": "A1"},
        {"input": "Q2", "expected": "A2"},
    ],
    task=lambda input: my_agent.run(input),
    scores=[
        # built-in
        ExactMatch,
        # custom
        lambda input, output, expected: {
            "name": "semantic_eq",
            "score": llm_judge(output, expected),
        },
    ],
)

6.4 优劣

✅ 专业级体验,UI 极佳 ✅ Type safety 强 ✅ Agent eval 支持深

❌ 完全商业,无开源免费版 ❌ 适合预算充足的中大型团队

6.5 适合谁

企业级专业 LLM 团队
不想运维 eval infra

7. MLflow Scorer — 统一 scorer API

Databricks,mlflow.org

7.1 哲学

“DeepEval / RAGAS / Phoenix 一统于 mlflow.genai.evaluate”

2025-Q4 MLflow 发布 third-party scorer 集成,把上面所有框架的 metrics 统一到一个 API。

7.2 最简代码

import mlflow
from mlflow.metrics import genai

# 用 DeepEval 的 G-Eval
correctness = genai.deepeval.geval(
    name="Correctness",
    criteria="...",
)

# 用 RAGAS 的 faithfulness
faithfulness = genai.ragas.faithfulness()

# 用 Phoenix 的 hallucination
hallucination = genai.phoenix.hallucination()

# 一个调用跑全部
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_agent,
    extra_metrics=[correctness, faithfulness, hallucination],
)
mlflow.log_metrics(results.metrics)

专为 RAG 系统设计

7+ 个 RAG-specific metrics:

Metric	测什么
Faithfulness	回答是否忠于上下文
Answer Relevancy	回答是否切题
Context Precision	检索上下文是否相关
Context Recall	是否检索到所有相关上下文
Context Entity Recall	实体级召回

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

result = evaluate(
    dataset=ragas_dataset,
    metrics=[faithfulness, answer_relevancy],
)

8.2 OpenAI Evals

OpenAI 官方 eval 框架

YAML 定义 eval,registry 管理:

my_eval:
  id: my_eval.dev.v1
  description: My custom eval
  metrics: [accuracy]

my_eval.dev.v1:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: my_eval/samples.jsonl

oaieval gpt-4o my_eval

适合 OpenAI 栈纯粹用户。

9. 能力矩阵 + 双栈策略

9.1 能力矩阵

维度	DeepEval	Promptfoo	Phoenix	LangSmith	Braintrust	MLflow	RAGAS	OAI Evals
CI/CD gating	★★★★★	★★★★	★★★	★★★★	★★★★	★★★★	★★	★★★
生产 monitoring	★★	★	★★★★★	★★★★★	★★★★	★★★	★	★
Agent / multi-turn	★★★★	★★	★★★★	★★★★★	★★★★	★★★	★★	★★
Dataset 管理	★★★	★★★	★★★★	★★★★★	★★★★★	★★★★	★★★	★★★
A/B testing	★★	★★★★★	★★★	★★★★	★★★★	★★★	★★	★★
OTel 兼容	partial	partial	★★★★★	★★★★	★★★	★★★	partial	✗
学习曲线	低	极低	中	中	中	中	低	中
生态	开源大	开源大	开源大	商业大	商业	极大	中	中

9.2 双栈策略 ⭐

2026 业界共识:没有一个框架满足所有需求,生产团队都用双栈。

        ┌────────────────────────────┐
        │ CI/CD Gating(快、便宜)     │
        │                              │
        │  DeepEval 或 Promptfoo        │
        │  + GitHub Actions             │
        │  → 每个 PR 跑、回归就 block    │
        └────────────────────────────┘

        ┌────────────────────────────┐
        │ Observability + Eval(线上) │
        │                              │
        │  Phoenix / LangSmith /        │
        │  Braintrust + OTel           │
        │  → 生产 trace + drift + 抽审  │
        └────────────────────────────┘

        +(可选)统一层:MLflow Scorer

9.3 选型决策树

你的栈?
│
├─ LangChain / LangGraph 用户
│   └─ DeepEval(CI)+ LangSmith(线上)
│
├─ Pytest 习惯 / Python 重度
│   └─ DeepEval(CI)+ Phoenix(线上)
│
├─ Prompt 工程为主
│   └─ Promptfoo
│
├─ Databricks / MLflow 用户
│   └─ MLflow Scorer 统一栈
│
├─ 企业预算充足 / 要专业
│   └─ Braintrust + Phoenix
│
├─ RAG 重度
│   └─ RAGAS + DeepEval
│
└─ 极简 / OpenAI only
    └─ OpenAI Evals

9.4 三个常见错误

只跑 CI 不看线上:CI eval 跟生产数据分布不一致,虚高
只看线上不做 CI:回归暴露在用户面前
vendor lock-in:全栈一家,后面想换框架代价大

🌟 建议:至少跑两个独立框架(开源 + 一个观察平台),避免 lock-in。

DeepEval:deepeval.com | github
Promptfoo:promptfoo.dev | github
Phoenix:arize.com/docs/phoenix | github
LangSmith:docs.smith.langchain.com
Braintrust:braintrust.dev
MLflow Scorer:MLflow blog
RAGAS:github.com/explodinggradients/ragas
OpenAI Evals:github.com/openai/evals

横向对比

LLM Evaluation Frameworks: Head-to-Head Comparison (Comet):博文
LLM Eval Tools 2026 (Inference.net):博文
8 Best DeepEval Alternatives (ZenML):博文
AI Multiple LLM Eval Tools:博文

搜索