系统地测试变化的策略

有时很难确定一个改变(例如,新的指示或新的设计)是使系统变得更好还是更糟。通过查看一些示例可能可以暗示哪个更好,但是对于小样本量来说,很难区分是真正的改进还是随机运气。也许这个改变对某些输入的性能有所帮助,但对其他输入的性能有所损害。
评估程序(或“评估”)对于优化系统设计是有用的。好的评估具备以下特点:

  • 代表真实世界的使用情况(或至少具备多样性)
  • 包含许多测试案例以获得更大的统计能力(请参考下表以获取指导方针)
  • 易于自动化或重复执行

输出的评估可以由计算机、人类或二者结合来进行。计算机可以通过客观标准(例如,具有单一正确答案的问题)自动化评估,也可以通过一些主观或模糊标准进行评估,其中模型的输出由其他模型查询来评估。OpenAI Evals是一个开源软件框架,提供了创建自动评估的工具。
基于模型的评估在存在一系列可能的输出,且这些输出在质量上被认为是相等的情况下(例如,对于需要长答案的问题)是有用的。基于模型的评估与需要人类进行评估的情况之间的界限是模糊的,并且随着模型能力的提升,这个界限不断变化。我们鼓励进行实验,以确定基于模型的评估在您的用例中的效果如何。

通过与黄金标准答案对比评估模型输出结果

假设我们已知问题的正确答案应该涉及到一组特定的已知事实。然后我们可以使用模型查询来计算答案中包含的必要事实的数量。

例如,使用以下系统消息:

Role Prompt
SYSTEM You will be provided with text delimited by triple quotes that is supposed to be the answer to a question. Check if the following pieces of information are directly contained in the answer:
- Neil Armstrong was the first person to walk on the moon.
- The date Neil Armstrong first walked on the moon was July 21, 1969.
For each of these points perform the following steps:
1 - Restate the point.
2 - Provide a citation from the answer which is closest to this point.
3 - Consider if someone reading the citation who doesn’t know the topic could directly infer the point. Explain why or why not before making up your mind.
4 - Write “yes” if the answer to 3 was yes, otherwise write “no”.
Finally, provide a count of how many “yes” answers there are. Provide this count as {“count”: }.

这是一个满足上述条件的示例输入:

Role Prompt
SYSTEM
USER “”“Neil Armstrong is famous for being the first human to set foot on the Moon. This historic event took place on July 21, 1969, during the Apollo 11 mission.”"”

这是一个满足上述条件的示例输入:

Role Prompt
SYSTEM
USER “”“Neil Armstrong made history when he stepped off the lunar module, becoming the first person to walk on the moon.”"”

这是一个满足上述条件的示例输入:

Role Prompt
SYSTEM
USER “”“In the summer of '69, a voyage grand,
Apollo 11, bold as legend’s hand.
Armstrong took a step, history unfurled,
“One small step,” he said, for a new world.”"”

在这种基于模型的评估中,有许多可能的变体。考虑以下变体,它跟踪候选答案与黄金标准答案之间的重叠程度,并且还跟踪候选答案是否与黄金标准答案的任何部分相矛盾。

Role Prompt
SYSTEM Follow these steps.
Step 1: Reason step-by-step about whether the submitted answer compared to the expert answer is either: disjoint, a subset, a superset, or has equal sets of information.
Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.
Step 3: Output a JSON object structured like: {“containment”: “disjoint” or “subset” or “superset” or “equal”, “contradiction”: True or False}

以下是一个包含不合格答案的示例输入:

Role Prompt
SYSTEM
USER Question: “”“What event is Neil Armstrong most famous for and when did it occur? Assume UTC time.”“”
Submitted Answer: “”“Didn’t he walk on the moon or something?”“”
Expert Answer: “”“Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969, as part of the Apollo 11 mission by NASA. Armstrong’s famous words when he stepped onto the lunar surface, “That’s one small step for man, one giant leap for mankind,” are still widely quoted today.
”"”

以下是一个包含良好答案的示例输入:

Role Prompt
SYSTEM
USER Question: “”“What event is Neil Armstrong most famous for and when did it occur? Assume UTC time.”“”
Submitted Answer: “”“At approximately 02:56 UTC on July 21st 1969, Neil Armstrong became the first human to set foot on the lunar surface, marking a monumental achievement in human history. Aldrin joined him on the surface about 20 minutes later.”“”
Expert Answer: “”“Neil Armstrong is most famous for being the first human to set foot on the Moon. This historic event took place on July 21, 1969, during the Apollo 11 mission.”"”

请我喝杯咖啡吧~

支付宝
微信