New HiFi Benchmark Measuring AI Reliability for IT Service Management

Hi all,
We have been working for the last two months on a benchmark for reliable AI automation of ITSM scenarios. Here is the link to the benchmark report. The report contains links to the GitHub repo for the benchmark and to the detailed implementation results.

The summary is:

  • The benchmark models not just an IT ticketing system (rather boring on its own) but also the IT components of an e-commerce site (servers, databases, networks, etc.) along with their dependencies, so it is effectively a mini-CMDB. There are 20 different problematic states of this site, each represented with observability data (logs, metrics, etc.) and with tools to take corrective action, and about 5 tickets are filed for each scenario, modeling a user submitting a service ticket. A simplified sketch of one scenario's structure appears after this list.

  • The AI agentic workflow must validate, diagnose, and take remedial action across 100 tickets. It escalates just 6% of the time, and on the tickets it handles itself it demonstrates 99% accuracy.

  • We also now have a generic measurement harness that stands up a test environment as an MCP server, runs AI tests against it, and measures AI reliability. It can be valuable as a testing/validation mechanism for any AI agent scenario; a simplified sketch of its scoring loop also follows this list.
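
To make the first point concrete, here is a simplified Python sketch of the kind of structure each scenario has. It is illustrative only, not the actual data model in the repo; all class and field names here are placeholders.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ConfigurationItem:
        """One CI in the mini-CMDB, e.g. a server, database, or network link."""
        name: str
        ci_type: str                                                # e.g. "server", "database", "network"
        depends_on: List[str] = field(default_factory=list)         # names of upstream CIs

    @dataclass
    class Scenario:
        """One of the 20 problematic states of the e-commerce site."""
        scenario_id: str
        degraded_ci: str                                            # the CI that is misbehaving
        logs: List[str] = field(default_factory=list)               # observability: log lines
        metrics: Dict[str, float] = field(default_factory=dict)     # observability: metric snapshots
        corrective_tools: List[str] = field(default_factory=list)   # tools the agent may invoke

    @dataclass
    class Ticket:
        """A service ticket filed by a simulated user against one scenario."""
        ticket_id: str
        scenario_id: str
        description: str                                            # free-text symptom report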

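And here is a simplified sketch of the harness's scoring loop, not the actual harness code. It assumes each agent run on a ticket reports whether it escalated and, if it handled the ticket itself, whether the resolution matched the scenario's known ground truth.

    from typing import Callable, Iterable, Tuple

    def score_run(
        tickets: Iterable[object],
        run_agent_on_ticket: Callable[[object], Tuple[str, bool]],
    ) -> dict:
        """Tally escalation rate and accuracy on the tickets the agent handles itself.

        run_agent_on_ticket is a placeholder for whatever drives one agent run against
        the MCP test environment; it returns (status, resolved_correctly), where status
        is "escalated" or "handled" and resolved_correctly is judged against the
        scenario's ground truth.
        """
        escalated = correct = total = 0
        for ticket in tickets:
            status, resolved_correctly = run_agent_on_ticket(ticket)
            total += 1
            if status == "escalated":
                escalated += 1
            elif resolved_correctly:
                correct += 1
        handled = total - escalated
        return {
            "escalation_rate": escalated / total if total else 0.0,
            "accuracy_on_handled": correct / handled if handled else 0.0,
        }
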
We welcome your feedback and questions.

-praveen