Eval Workbench
The Eval Workbench tab in the skill workspace is currently the app's built-in way to test skill behavior. It groups two modes in one place:
- Performance for output-quality prompt sets, runs, and grading
- Trigger for description-candidate generation and trigger comparisons
When a run exposes weak output or unclear routing boundaries, you can send an improvement brief directly to Refine.
Open the workbench
- Select a skill in the dashboard.
- Open the skill workspace.
- Switch to the Eval Workbench tab.
- Choose Performance.
What's on this screen
The page has three main sections:
- Eval Workbench header with a Run prompt set button.
- Prompt set editor where you create and save app-owned evaluation cases.
- Run history and Run details for reviewing completed runs and sending feedback to Refine.
Create or update a prompt set
- Open Eval Workbench and stay on Performance.
- To start from a fresh draft, click New prompt set in the Prompt set section.
- Enter a Prompt set name.
- For each case, fill in Case prompt and Expected outcome.
- Click Add case to include more cases, or delete a case with the trash button.
- Click Save prompt set.
Saved prompt sets appear as buttons near the top of the page. Click a prompt set name to load it back into the editor.
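To make the shape of a prompt set concrete, the following Python sketch models the fields the editor collects. The `PromptSet` and `EvalCase` names, their attributes, and the validation rules are assumptions made for illustration; this page does not describe how the app actually stores or validates prompt sets.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    # Hypothetical mirror of the editor's two per-case fields.
    case_prompt: str
    expected_outcome: str


@dataclass
class PromptSet:
    # Hypothetical mirror of the "Prompt set name" field plus its cases.
    name: str
    cases: list[EvalCase] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return readable issues to fix before saving (assumed rules)."""
        issues = []
        if not self.name.strip():
            issues.append("Prompt set name is empty")
        if not self.cases:
            issues.append("Prompt set has no cases")
        for i, case in enumerate(self.cases, start=1):
            if not case.case_prompt.strip():
                issues.append(f"Case {i}: Case prompt is empty")
            if not case.expected_outcome.strip():
                issues.append(f"Case {i}: Expected outcome is empty")
        return issues
```

A draft like this can help you plan cases before typing them into the editor: each case pairs one request with one expected response or behavior.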
Run a prompt set
- Open Eval Workbench and stay on Performance.
- Select the prompt set you want to run.
- Click Run prompt set.
The workbench adds the run to Run history and loads its results into Run details when the run finishes.
Review run history and results
Use Run history to inspect prior runs:
- View latest run opens the newest run.
- View run opens any older run.
- Each row shows the run ID, status, and passed/total summary.
Use Run details to inspect case-by-case results:
- Case shows the saved case ID.
- Target shows the candidate that was graded.
- Score and Status show the recorded result.
- Reason explains why a case failed, if the grader returned a reason.
If no run is selected, the page shows "Select a run to inspect its case results."
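The relationship between the case-by-case columns and the passed/total summary in Run history can be sketched as follows. The `CaseResult` type, its field names, and the `"passed"` status value are assumptions for illustration; the app's real result format may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    # Hypothetical mirror of the Run details columns.
    case_id: str
    target: str
    score: float
    status: str  # assumed values: "passed" or "failed"
    reason: Optional[str] = None  # present only if the grader returned one


def summarize(results: list[CaseResult]) -> str:
    """Build a passed/total summary like the one shown per run."""
    passed = sum(1 for r in results if r.status == "passed")
    return f"{passed}/{len(results)} passed"


def failed_cases(results: list[CaseResult]) -> list[CaseResult]:
    """Return the cases worth reviewing first."""
    return [r for r in results if r.status != "passed"]
```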
Send run feedback to Refine
- Open a completed run from Run history.
- Review failures in Run details.
- Click Send to Refine.
The workbench builds an improvement brief from that run and opens the Refine tab with the brief ready to use.
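As a rough mental model of what an improvement brief might contain, the sketch below collects the failed cases from a run into plain text. The `build_brief` function, the dictionary keys, and the output layout are all hypothetical; the actual brief that the workbench generates is not specified on this page.

```python
def build_brief(run_id: str, case_results: list[dict]) -> str:
    """Summarize a run's failed cases as plain text (illustrative only)."""
    # Assumed convention: any status other than "passed" counts as a failure.
    failed = [c for c in case_results if c.get("status") != "passed"]
    lines = [
        f"Improvement brief for run {run_id}",
        f"Failed cases: {len(failed)} of {len(case_results)}",
    ]
    for case in failed:
        # The grader may not return a reason for every failure.
        reason = case.get("reason") or "no reason recorded"
        lines.append(f"- {case['case_id']}: {reason}")
    return "\n".join(lines)
```

Under these assumptions, a run with no failures yields a two-line brief, which is a hint that there is little for Refine to act on.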
What you'll see
- No workspace: "Configure a workspace before using Eval Workbench."
- Loading: "Loading Eval Workbench…"
- Load error: an error message with a Retry button
- No runs yet: "No runs yet."
- No selected run: "Select a run to inspect its case results."
- No recorded results: "This run has no recorded case results yet."
Quick reference
| Control | What it does |
|---|---|
| Run prompt set | Starts a run for the selected saved prompt set |
| New prompt set | Clears the editor for a new prompt-set draft |
| Prompt set name | Names the saved set of cases |
| Case prompt | The request the skill should answer |
| Expected outcome | The expected response or behavior |
| Add case | Adds another case to the prompt set |
| Save prompt set | Persists the current prompt set |
| View latest run | Opens the newest run in Run details |
| View run | Opens an older run in Run details |
| Send to Refine | Builds an improvement brief and opens Refine |