Prompt Testing

Test Configuration

Prompt Being Tested: System prompts, character descriptions, etc.

Test Date:

Evaluator: Fill in if collecting data from more than one person.

Evaluation Criteria

Rate each response on every criterion on a scale of 1-5: 1 = Poor, 2 = Below Average, 3 = Average, 4 = Good, 5 = Excellent.

Run	Model Name	Sampling Preset	Relevance	Accuracy	Clarity	Completeness	Follows Instructions	Consistency	Overall Quality	Notes
1
2
3
4
5
6
7
8
9
10
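
If the run table is kept in code rather than filled in by hand, one row could be represented roughly as follows. This is a minimal sketch, not part of the template: the `RunRecord` class, its field names, and the example values are all hypothetical. Only the criterion names and the 1-5 scale come from the table header and rating scale above.

```python
from dataclasses import dataclass

# Criterion names taken from the run table header.
CRITERIA = (
    "Relevance",
    "Accuracy",
    "Clarity",
    "Completeness",
    "Follows Instructions",
    "Consistency",
    "Overall Quality",
)


@dataclass
class RunRecord:
    """One row of the run table (hypothetical structure, for illustration)."""

    run: int
    model_name: str
    sampling_preset: str
    scores: dict[str, int]  # criterion name -> score on the 1-5 scale
    notes: str = ""

    def __post_init__(self) -> None:
        # Every criterion must be scored, and no unknown criteria are allowed.
        missing = [c for c in CRITERIA if c not in self.scores]
        if missing:
            raise ValueError(f"missing scores for: {', '.join(missing)}")
        for criterion, score in self.scores.items():
            if criterion not in CRITERIA:
                raise ValueError(f"unknown criterion: {criterion}")
            # Scores must be whole numbers on the 1-5 scale.
            if not isinstance(score, int) or not 1 <= score <= 5:
                raise ValueError(f"{criterion} score must be an integer 1-5, got {score!r}")


# Example row with placeholder values (not real evaluation data).
record = RunRecord(
    run=1,
    model_name="example-model",
    sampling_preset="default",
    scores={criterion: 4 for criterion in CRITERIA},
    notes="placeholder row",
)
```

Validating at construction time means an out-of-range or missing score is caught when the row is entered, not later when the summary statistics are computed.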

Summary Statistics

Metric	Average Score	Standard Deviation	Min Score	Max Score	Range
Relevance
Accuracy
Clarity
Completeness
Follows Instructions
Consistency
Overall Quality
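
The columns of the summary table can be computed per metric with Python's standard `statistics` module. The sketch below rests on assumptions: the `summarize` function name and the sample scores are hypothetical, and the template does not say whether Standard Deviation means the sample or the population form. The sample form (`statistics.stdev`) is used here because ten runs are a sample of the model's behaviour; swap in `statistics.pstdev` if the population form is intended.

```python
import statistics


def summarize(scores: list[int]) -> dict[str, float]:
    """Compute the summary-table columns for one metric's 1-5 scores."""
    if not scores:
        raise ValueError("at least one score is required")
    lowest, highest = min(scores), max(scores)
    return {
        "average": statistics.mean(scores),
        # Sample standard deviation is undefined for a single score,
        # so report 0.0 in that case (an assumption of this sketch).
        "std_dev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
    }


# Hypothetical Relevance scores for runs 1-10 (placeholder data).
relevance_scores = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]
relevance_summary = summarize(relevance_scores)

for column, value in relevance_summary.items():
    print(f"{column}: {value:.2f}")
```

Calling `summarize` once per metric column of the run table fills one row of the summary table.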

Evaluation Criteria Definitions

Relevance (1-5): How well does the response address the specific question or task?

1: Completely off-topic or irrelevant
2: Partially relevant but misses key aspects
3: Generally relevant with some minor deviations
4: Highly relevant with minimal irrelevant content
5: Perfectly relevant and on-topic

Accuracy (1-5): How factually correct and truthful is the response?

1: Contains major factual errors
2: Some factual errors present
3: Mostly accurate with minor errors