Evaluations, Datasets & Polish

Fresh Off the Perch

The past couple of weeks have been incredibly productive! I’ve been focused on shipping the evaluations feature that makes testing your prompts systematic and reliable. Here’s everything new:

Features

  • Evaluations (Evals): The big one! You can now run systematic evaluations of your prompts against datasets. This lets you test how your prompts perform across multiple inputs and track the results.
  • Eval Status Tracking: Evaluations now properly report their status (pending, running, completed, failed) so you always know what’s happening. A rough sketch of the status model follows this list.
  • Dataset JSON Viewer: Added beautiful tabbed views for dataset items with “Preview” (default) and “View as JSON” modes. The JSON view includes syntax highlighting and proper formatting.
  • Notifications System: Fully functional notifications with the ability to mark them as read, filter by tabs, and see real-time updates.
  • Sidebar Tooltips: Hover over any sidebar navigation item to see helpful tooltips explaining what each section does.
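
To make the status model concrete, here’s a minimal TypeScript sketch of how a status-tracked evaluation run could be shaped. The names (EvalStatus, EvaluationRun, transition) are illustrative assumptions on my part, not the app’s actual schema.

```typescript
// Illustrative only: a hypothetical shape for a status-tracked eval run.
type EvalStatus = "pending" | "running" | "completed" | "failed";

interface EvaluationRun {
  id: string;
  datasetId: string;
  promptVersionId: string;
  status: EvalStatus;
  error?: string; // set when status is "failed"
}

// Runs start pending, move to running, and end completed or failed.
const allowedTransitions: Record<EvalStatus, EvalStatus[]> = {
  pending: ["running", "failed"],
  running: ["completed", "failed"],
  completed: [],
  failed: [],
};

function transition(run: EvaluationRun, next: EvalStatus): EvaluationRun {
  if (!allowedTransitions[run.status].includes(next)) {
    throw new Error(`Invalid status transition: ${run.status} -> ${next}`);
  }
  return { ...run, status: next };
}
```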

Improvements

  • Version Selector: Made the prompt version selector always visible (even when disabled) so it’s clearer where to select versions. No more confusing “where did the selector go?” moments.
  • Workspace Persistence: Your selected workspace now persists across page refreshes, so you don’t lose your place when navigating around. A quick sketch of one way this can work follows this list.
  • Evaluation Tables: Added dataset names to evaluation results tables for better context about what you’re looking at.
  • Test Coverage: Added comprehensive e2e tests for datasets, evaluations, sidebar tooltips, version selector behavior, and workspace persistence. Currently at 44 passing e2e tests!
  • Seed Data: Completely reorganized the seed data strategy to make development and testing more reliable.
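
As a rough illustration of what workspace persistence can look like under the hood, here’s a small localStorage-based sketch. The storage key and helper names are my own assumptions, not the actual implementation.

```typescript
// Hypothetical persistence helpers; the key and names are illustrative.
const WORKSPACE_STORAGE_KEY = "selectedWorkspaceId";

export function saveSelectedWorkspace(workspaceId: string): void {
  localStorage.setItem(WORKSPACE_STORAGE_KEY, workspaceId);
}

// Restore the saved workspace on page load, falling back to null if it
// no longer exists in the user's workspace list.
export function restoreSelectedWorkspace(validIds: string[]): string | null {
  const saved = localStorage.getItem(WORKSPACE_STORAGE_KEY);
  return saved !== null && validIds.includes(saved) ? saved : null;
}
```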

Bug Fixes

  • Fixed issues with evaluation runs not properly tracking their status
  • Fixed dataset items not displaying correctly in some scenarios
  • Fixed invitation flows that weren’t working properly
  • Resolved workspace context issues when switching between pages
  • Fixed various test failures and flaky tests

Behind the Scenes

I’ve been doing a ton of infrastructure work to make the app more maintainable and testable. This included:

  • Implementing proper LLM adapter patterns with mocked tests (no real API calls in tests!), with a rough sketch after this list
  • Setting up the evaluation service to handle async processing
  • Improving error handling throughout the eval pipeline
  • Better TypeScript types for API responses
  • Organized test utilities and helpers
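
To show what I mean by the adapter pattern, here’s a minimal sketch, assuming a simple completion-style interface. The interface, class, and function names are hypothetical, not the project’s real code.

```typescript
// Illustrative LLM adapter pattern; all names here are hypothetical.
interface LLMAdapter {
  complete(prompt: string): Promise<string>;
}

// A production adapter would wrap a real provider SDK or HTTP call.
class ProviderAdapter implements LLMAdapter {
  async complete(_prompt: string): Promise<string> {
    // A real implementation would call the provider's API here.
    throw new Error("Not used in tests");
  }
}

// Tests inject a mock instead, so no real API calls ever happen.
class MockLLMAdapter implements LLMAdapter {
  constructor(private cannedResponse: string) {}
  async complete(_prompt: string): Promise<string> {
    return this.cannedResponse;
  }
}

// The eval service depends only on the interface, which is what makes
// the mocked tests possible.
async function scoreDatasetItem(adapter: LLMAdapter, input: string): Promise<string> {
  return adapter.complete(input);
}
```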

Like a parakeet carefully arranging its perch, I’ve been meticulously organizing the codebase to make future features easier to build. The eval system is particularly exciting because it opens up so many possibilities for prompt optimization and testing workflows!

The JSON viewer for datasets was a fun addition: being able to toggle between a pretty preview and raw JSON makes debugging so much easier.
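
For a sense of how little the raw view needs, here’s a tiny, framework-agnostic sketch of the two modes; the helper names and preview format are made up for illustration.

```typescript
// Hypothetical helpers behind a Preview / "View as JSON" toggle.
type DatasetItem = Record<string, unknown>;

// "View as JSON": pretty-print with two-space indentation.
function renderRawJson(item: DatasetItem): string {
  return JSON.stringify(item, null, 2);
}

// "Preview": a simple key/value summary, one field per line.
function renderPreview(item: DatasetItem): string {
  return Object.entries(item)
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`)
    .join("\n");
}
```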

Stay tuned for more updates as I continue building out the evaluation analytics and reporting features! 🦜

🦜 This changelog was chirped together by an LLM from code commits and project updates