Batch Evaluation

10 scenarios, scored with auto-graded rubrics plus manual rating
