Case Summarization: It Started With One HR Case

4 min read


At a glance

Problem: HR fulfillers lose time at every handoff reconstructing context that should already be visible, and writing resolution notes the next person will have to decode anyway.

Approach: Two waves of research. First, controlled conditions to test whether summarization actually saved time. Then production, to understand why agents were not using it the way we expected. What we found shaped a pattern reusable enough to carry into other workflows.

Key decisions: Summary depth, placement, refresh behavior, editability and provenance, attachment handling, sentiment framing.

Result: A workflow-native summarization pattern now lives across HRSD, LSD, and CLM. Handoff time down 37%. Overall case handling time down 28%.


The person at the center: the HR Service fulfiller

Monday morning. An HR fulfiller opens a case that has already been touched by two other people. Notes are scattered, context is implied, and an attachment might contain the detail that matters.

Before they can help the employee, they do two invisible jobs:

  1. reconstruct what happened

  2. write what they did in a way the next person can trust

Nobody celebrates this work, but it is the work. And it quietly consumes time, focus, and quality across every handoff. This case started there. With one HR case and one goal: reduce the hidden work without creating new risk.


Why “live” was the beginning, not the finish

Once summarization exists in the product, the question changes. It is no longer “can AI summarize?” It becomes “will agents rely on it inside the workflow they already use?”

In HR work, trust is fragile. Agents handle sensitive situations, ambiguous context, and decisions that need to be explainable. A summary that is slightly off does not feel like a small error. It can feel risky.

So we treated this as an adoption and workflow-fit problem, where discoverability, freshness cues, attachment handling, and clear authorship all had to work together before agents would trust it enough to use it.


My role

Design Manager, HR Service Delivery (HR pilot owner)

HRSD was one of three pilot teams alongside CSM and ITSM. I owned the HR outcome, which meant deciding what questions mattered for HR specifically, making the design calls, and holding the standard for what we shipped.

A designer from my team worked in close partnership with the AI Platform research team and led the execution day to day. I set the direction, shaped the decisions around summary depth, sentiment framing, and the editability model, and stayed close enough to both waves of research to know when findings were changing what we needed to build.

The HR pilot was designed from the start to produce something transferable. The pattern that scaled to LSD and CLM started with the decisions made here.


The moment it clicked



What we did

We approached this in two waves, because the questions changed over time.

Wave 1: Validate productivity (controlled)

We measured whether summarization actually saved time, or just felt like it did. Nineteen HR agents, randomized conditions. The two moments we cared about were handoff speed and resolution note writing, tested with and without AI.

The answer was clear enough to build on.



Wave 2: Post-live reality check (adoption and trust)

Once the feature existed in the world, controlled conditions could no longer tell us what we needed to know. Agents were making real choices about whether to use the tool, when to trust it, and what happened when they reused summary content in their own work notes. What they did in practice turned out to be more complicated than what they did in a study.

This second wave is where the design nuance lived.

Who we tested with, and what environments we tested in across HR, CSM, and ITSM.



What we learned


  • Agents wanted more than a recap

    Agents were not asking for "a nicer paragraph." They were trying to avoid manual searching and reduce verification. They wanted temporal context, not just content. Sources, not magic. Attachment awareness, not surprises. A path forward, not a dead-end card.


  • Discoverability was the feature

    When agents could not immediately see the output, they assumed the feature failed. Some clicked summarize repeatedly, not because they loved it, but because the UI did not make the result obvious. In support work, “out of sight” becomes “not reliable.”


  • Update visibility was a trust lever

    A summary is only useful if it feels current. Subtle update cues get missed in real work. If the agent misses an update, they do not think “I overlooked it,” they think “this isn’t reliable.”


  • Change-highlighting had to match mental models

    Agents liked “show me what changed” because it reduces comparison work. But it only works when it is precise:

    1. highlight only what is new

    2. keep it visible long enough to notice

    3. include attachment-related updates too


  • Attachments defined whether the summary felt complete

    For HR cases, attachments often contain the real answer. The requirement was not perfect summarization of every file. It was clarity and predictable behavior.


  • Sentiment had potential, but raised real risk

    Sentiment intrigued agents, but workflow purpose was unclear. People wanted it to update automatically, worried it could be used against them, and flagged emotional toll from constant negative exposure.

The decisions that shaped what shipped



Hard moment

The toughest conversation was not about AI summary quality. It was about where the summary deserved to live.

Workspace teams were cautious about adding another persistent surface. The concern was legitimate: the workspace was already dense, and anything new had to justify its footprint. From the fulfiller side, the fear ran the other way. If the summary was buried, agents would miss it, and in support work, anything that requires hunting gets written off as unreliable regardless of what it actually contains.

We moved past it by shifting the debate from opinion to workflow truth. The summary needed to sit where decisions happened, not in a secondary panel. And if we were going to keep the surface lightweight, we owed agents clear signals for when content had updated and what had changed. Keep the footprint scannable, invest in discoverability and update signaling, and the surface would earn its spot.


What changed in the experience

The research pointed in specific directions. These are the changes that followed:

  • Update visibility strengthened so agents could not miss when the summary had changed

  • Change-highlighting tightened to mark only newly added content, kept visible long enough to catch during real work

  • Source transparency made explicit: HR agents needed to know where the summary came from before they could trust it

  • Attachments surfaced as first-class context with predictable inclusion behavior

  • Authorship kept clear when summary content is reused in resolution notes

  • Sentiment held back until workflow purpose and emotional impact could be properly worked through

This is what made summarization feel less like "AI text" and more like a workflow tool.


Outcomes

Case summarization shipped in HR Agent Workspace. The summary card sits in the fulfiller's active workspace, visible the moment a case opens, not in a secondary panel.




In production, only 32% of agents used it, even though a controlled study had just shown 37% time savings. That gap was not a quality problem. Agents who opened the summary did not know if it was current, what it had pulled from, or what would happen to it when they edited it. That is what the design work addressed. HRSD, LSD, and CLM are all running the pattern that came out of it.


Trust guardrails

This work was not about replacing agent judgment. It was about reducing hidden work without creating new risk.

The summary was never meant to eliminate verification. In sensitive HR cases, agents still check. Nothing in the design implied otherwise. Freshness cues and change-highlighting existed because a stale summary does not feel stale. It just feels wrong when something turns out to be missing. Sources and inclusion rules were surfaced for the same reason: agents needed to know what had been pulled before they would act on it. When content moved into resolution notes, the line between AI text and human edits stayed visible, because ambiguous authorship was a liability.

Sentiment was a different call. The emotional toll of repeated negative exposure and the risk of the signal being used against agents were not concerns to design around later. They were reasons not to ship it yet.

The decisions I owned

Recurring reviews kept coming back to the same three things: placement, visibility, refresh behavior. Not wording, not content structure. Those were the choices that would determine whether agents found the feature and came back to it. They needed more scrutiny than they were getting.

The provenance question changed register the moment agents could edit summary content and reuse it in their work notes. Until then it was a design consideration. After that it was a liability. I pushed for it to be resolved before anything shipped, because unclear authorship in an HR context is not something you patch in iteration.

The hardest discipline call was the roadmap split. Open prompting and quick actions were real opportunities and I knew it. Keeping them off the active track meant accepting that we would ship something smaller than what was possible. It also meant neither got built badly in a hurry. That line held.


How it scaled

HRSD was the proving ground. What it produced was not treated as HR-specific. The real test was whether the decisions behind it were sound enough to hold elsewhere.



What HR taught that generalized: freshness cues, attachment clarity, and authorship signals matter across domains because they govern trust. What needed tuning in LSD and CLM: which sections carried the most weight and how next steps should be expressed based on workflow ownership and risk.


What we shipped. What we held.



What I would do differently

Enterprise AI features are interaction design problems first. Model capability matters, but adoption depends on how clearly the system communicates updates, provenance, and safe reuse inside real workflows. This project reinforced that.

If I were doing it again, I would accelerate the learning loop in three ways:

Measure adoption friction earlier. Pair qualitative research with lightweight product signals from the start of the pilot: repeated summary regenerations, attachment open rates, and time spent verifying summaries versus using them directly.

Bring the full workspace into research sooner. Many of the most important questions, especially discoverability and refresh cues, only surfaced once the summary appeared inside the agent workspace.

Explore attachment intelligence earlier. The study showed how often agents rely on attachments to understand cases. I would treat attachment summarization as a first-class problem earlier, not a follow-on capability.

Duration and date

2 Months

November - December 2023
