New HiFi Benchmark Measuring AI Reliability for IT Service Management

Hi all,
We have been working for the last two months on a benchmark for reliable AI automation of ITSM scenarios. Here is the link to the benchmark report. The report contains links to the GitHub repo for the benchmark and to the detailed implementation results.

The summary is:

  • The benchmark models not just an IT ticketing system (rather boring on its own) but also the IT components of an e-commerce site (servers, databases, networks, etc.) along with their dependencies, so it is effectively a mini-CMDB. There are 20 different problematic states of this site, each represented with observability data (logs, metrics, etc.) and with tools to take corrective action, and about 5 tickets are filed for each scenario, modeling a user submitting a service ticket. A simplified sketch of one scenario's structure appears after this list.

  • The AI agentic workflow must validate, diagnose, and take remedial action across 100 tickets. It escalates just 6% of the time, and on the tickets it handles itself it demonstrates 99% accuracy.

  • We also now have a generic measurement harness that stands up a test environment as an MCP server, runs AI tests against it, and measures AI reliability. It can be valuable as a testing/validation mechanism for any AI agent scenario; a simplified sketch of its scoring loop also follows this list.
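
To make the first point concrete, here is a simplified Python sketch of the kind of structure each scenario has. It is illustrative only, not the actual data model in the repo; all class and field names here are placeholders.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ConfigurationItem:
        """One CI in the mini-CMDB, e.g. a server, database, or network link."""
        name: str
        ci_type: str                                                # e.g. "server", "database", "network"
        depends_on: List[str] = field(default_factory=list)         # names of upstream CIs

    @dataclass
    class Scenario:
        """One of the 20 problematic states of the e-commerce site."""
        scenario_id: str
        degraded_ci: str                                            # the CI that is misbehaving
        logs: List[str] = field(default_factory=list)               # observability: log lines
        metrics: Dict[str, float] = field(default_factory=dict)     # observability: metric snapshots
        corrective_tools: List[str] = field(default_factory=list)   # tools the agent may invoke

    @dataclass
    class Ticket:
        """A service ticket filed by a simulated user against one scenario."""
        ticket_id: str
        scenario_id: str
        description: str                                            # free-text symptom report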

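And here is a simplified sketch of the harness's scoring loop, not the actual harness code. It assumes each agent run on a ticket reports whether it escalated and, if it handled the ticket itself, whether the resolution matched the scenario's known ground truth.

    from typing import Callable, Iterable, Tuple

    def score_run(
        tickets: Iterable[object],
        run_agent_on_ticket: Callable[[object], Tuple[str, bool]],
    ) -> dict:
        """Tally escalation rate and accuracy on the tickets the agent handles itself.

        run_agent_on_ticket is a placeholder for whatever drives one agent run against
        the MCP test environment; it returns (status, resolved_correctly), where status
        is "escalated" or "handled" and resolved_correctly is judged against the
        scenario's ground truth.
        """
        escalated = correct = total = 0
        for ticket in tickets:
            status, resolved_correctly = run_agent_on_ticket(ticket)
            total += 1
            if status == "escalated":
                escalated += 1
            elif resolved_correctly:
                correct += 1
        handled = total - escalated
        return {
            "escalation_rate": escalated / total if total else 0.0,
            "accuracy_on_handled": correct / handled if handled else 0.0,
        }
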
We welcome your feedback and questions.

-praveen