Batch Evaluation
10 scenarios — auto-graded rubrics + manual rating
gpt-4.1