Self-Healing And Sentinel

Self-healing helps IT operations teams keep endpoint recovery work moving without giving automation unlimited authority. Use it to see which endpoints are healthy, which ones need attention, what Pharaoh tried, and where a human approval or rejection is required before work continues.

The main operator jobs are:

check fleet-level self-healing health and pending escalation volume
open endpoint-specific Sentinel history before approving risky work
review the automation request, policy context, and expiry time for an escalation
approve only the bounded action Pharaoh asked for, or reject it with a reason
confirm that the endpoint returned to a known-good or intentionally deferred state

Where It Lives

Open Self-Healing from the Operations navigation group.

The main routes in this product area are:

/self-healing for the self-healing overview
/self-healing/settings for organization-wide Sentinel and Autoheal policy defaults
/self-healing/escalations for the escalation queue
/self-healing/escalations/<escalation-id> for one escalation review
/endpoints/<endpoint-id>/self-healing for the endpoint self-healing Status view inside the endpoint detail shell
/endpoints/<endpoint-id>/self-healing/sessions for linked recovery sessions
/endpoints/<endpoint-id>/self-healing/sentinel for active Sentinel provenance and source
/endpoints/<endpoint-id>/self-healing/sentinel-runs for Sentinel run history
/endpoints/<endpoint-id>/self-healing/activity for the paginated endpoint self-healing activity timeline
/endpoints/<endpoint-id>/self-healing/knowledge for accepted endpoint knowledge and pending proposals

Endpoint details also include a Sentinel panel. Use Self-Healing in the endpoint tabs to open the endpoint-scoped self-healing workspace. Older /self-healing/endpoints/<endpoint-id> links redirect into the endpoint detail shell.
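The route list above can be sketched as a small path builder. This is an illustrative sketch, not Pharaoh's router: the function names and subview keys are hypothetical, and the redirect target (the Status view) is an assumption, since the text only says older links land inside the endpoint detail shell.

```python
import re

# Subview path segments copied from the route list; "" is the Status view.
SUBVIEWS = {
    "status": "",
    "sessions": "/sessions",
    "sentinel": "/sentinel",
    "run-history": "/sentinel-runs",
    "activity": "/activity",
    "knowledge": "/knowledge",
}

# Shape of the older endpoint link described above.
LEGACY = re.compile(r"^/self-healing/endpoints/([^/]+)/?$")


def endpoint_self_healing_path(endpoint_id: str, subview: str = "status") -> str:
    """Build the endpoint-scoped self-healing route for one subview."""
    return f"/endpoints/{endpoint_id}/self-healing{SUBVIEWS[subview]}"


def resolve_legacy(path: str) -> str:
    """Map an older link into the endpoint detail shell; leave other paths alone.

    Assumption: the redirect lands on the Status view.
    """
    match = LEGACY.match(path)
    if match is None:
        return path
    return endpoint_self_healing_path(match.group(1))
```

For example, resolve_legacy("/self-healing/endpoints/ep-1") yields /endpoints/ep-1/self-healing, while fleet-level routes such as /self-healing/escalations pass through unchanged.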

Self-Healing Overview

The Self-Healing page is the fleet-level entry point for daily triage. Start here when you need to know whether automation is waiting on people, whether a specific endpoint needs review, or whether the queue is clear.

Self-Healing overview showing the pending escalation list and endpoint lookup.

The mobile overview keeps the same triage path available for on-call work: pending count, endpoint lookup, settings, and review rows.

Mobile Self-Healing overview showing pending escalation triage controls.

Use the overview in this order:

Check the pending escalation count to understand whether a human decision is blocking automation.
Scan the pending rows for endpoint, category, requested action, and age.
Use Refresh when you are taking over an active queue from another reviewer.
Open a specific endpoint when the row does not provide enough context.
Use Review only after you know which endpoint and action you are evaluating.

What you can do there:

see the number of pending escalations currently loaded from the self-healing projection
refresh the pending escalation list
enter an Endpoint ID and select Open endpoint
select Settings to edit shared Sentinel and Autoheal defaults
inspect pending escalation rows and choose Review

Use the pending count as a workload signal, not as proof that the fleet is unhealthy. One endpoint can generate a high-priority approval request even when most endpoints are passing Sentinel checks. If there are no pending rows, Pharaoh shows "No pending self-healing escalations" instead of an empty table.

Self-Healing Settings

The Self-Healing Settings page controls the shared organization-wide config that endpoints use when they do not have an endpoint-specific override.

Settings include:

Sentinel enabled
Autoheal enabled
Read-only policy template ID
Autoheal policy template ID
Sentinel cadence
Execution interval seconds
Regeneration interval seconds

Treat these as fleet policy defaults. Before changing them, check whether the issue is isolated to one endpoint, one policy template, or a broader operational rule. The config does not include a platform setting. Pharaoh uses the endpoint agent’s observed runtime platform when Sentinel generation or execution runs.
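One way to picture how shared defaults and an endpoint-specific override combine is the sketch below. The field names, their types, and the per-field merge are all assumptions made for illustration; the text only states that endpoints fall back to the shared config when they have no override, and that the config carries no platform setting.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class SelfHealingDefaults:
    # Names mirror the settings list; the names and types are assumptions.
    sentinel_enabled: bool
    autoheal_enabled: bool
    read_only_policy_template_id: str
    autoheal_policy_template_id: str
    sentinel_cadence: str
    execution_interval_seconds: int
    regeneration_interval_seconds: int
    # Deliberately no platform field: the platform is observed from the
    # endpoint agent at generation or execution time.


def effective_config(defaults: SelfHealingDefaults,
                     override: Optional[dict]) -> SelfHealingDefaults:
    """Return the defaults with endpoint-specific values applied on top.

    Assumption: an override replaces individual fields, not the whole config.
    """
    if not override:
        return defaults
    known = {f.name for f in fields(defaults)}
    unknown = set(override) - known
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    return replace(defaults, **override)
```

In this sketch, passing a platform key raises an error, which mirrors the point that platform is not something the shared config can set.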

Endpoint Sentinel Panel

The endpoint detail page shows a Sentinel panel between the endpoint identity summary and the endpoint detail tabs.

The panel can show:

Sentinel status, such as Passed, Failed, Timeout, Invalid output, Policy denied, Runner error, Stale, or Not configured
the latest summary text
Last run, Duration, Version, and Policy
links to an active self-healing session or agent thread when one is projected
a pending escalation count
accepted knowledge and pending proposal counts

Use this panel when you need a fast answer to whether the endpoint has recent Sentinel context before opening deeper self-healing history. For approval work, stale or missing Sentinel context is a reason to slow down and inspect the endpoint page instead of approving from the queue alone.

Endpoint Sentinel panel showing current health, run timing, and self-healing links.

The panel is also usable on a phone during on-call review. The same evidence remains visible: status, summary, run time, policy, active session or thread, pending escalation count, and knowledge counts.

Mobile endpoint Sentinel panel showing the same health, escalation, and knowledge cues.

Before approving from this context, confirm:

The endpoint identity matches the ticket, alert, or user report.
Last run is newer than the incident context you are acting on.
The summary explains why Pharaoh needs help.
The policy and version match the expected guardrail boundary.
Pending escalations and knowledge counts are consistent with the request.
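The first four checks above are mechanical enough to sketch as a helper that lists concerns instead of returning a yes or no. The PanelSnapshot shape and every field name are hypothetical; the last check (whether pending escalation and knowledge counts are consistent with the request) is left to the reviewer because it depends on judgment.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# Status labels taken from the panel description; stale or missing context
# is the case the guidance says should slow approval down.
NEEDS_CAUTION = {"Stale", "Not configured"}


@dataclass
class PanelSnapshot:  # hypothetical shape, for illustration only
    endpoint_id: str
    status: str
    summary: str
    last_run: Optional[datetime]
    policy: str
    version: str


def approval_concerns(panel: PanelSnapshot, ticket_endpoint_id: str,
                      incident_time: datetime, expected_policy: str,
                      expected_version: str) -> List[str]:
    """Return the reasons to pause before approving; empty means none found."""
    concerns = []
    if panel.endpoint_id != ticket_endpoint_id:
        concerns.append("endpoint does not match the ticket, alert, or report")
    if panel.status in NEEDS_CAUTION:
        concerns.append("Sentinel context is stale or missing")
    if panel.last_run is None or panel.last_run <= incident_time:
        concerns.append("last run is not newer than the incident")
    if not panel.summary.strip():
        concerns.append("summary does not explain why help is needed")
    if (panel.policy, panel.version) != (expected_policy, expected_version):
        concerns.append("policy or version differs from the expected guardrail")
    return concerns
```

An empty list does not mean "approve"; it only means none of these specific checks failed.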

Endpoint Self-Healing Detail

The Endpoint Self-Healing workspace is the operational record for one endpoint. It stays inside the endpoint detail page, so the endpoint header and endpoint tabs remain visible while you move through self-healing history. Open it before approving work when the request could change endpoint state, require elevated access, or affect a business-critical user.

The workspace has six endpoint-scoped subviews in the left rail.

Status

current self-healing and Sentinel state
operator-readable facts for policy, active Sentinel, latest run, accepted knowledge, pending proposals, and pending escalations
recent activity preview
links into the detailed subviews when a fact needs inspection

Use Status first. It answers whether the endpoint is healthy, whether automation needs action, and whether the current policy and Sentinel context are recognizable. Facts use operator-readable labels instead of raw database ids.

Endpoint self-healing Status view showing current health, facts, and recent activity.

On mobile, Status keeps the same endpoint context, self-healing rail, facts, and activity preview without requiring horizontal scrolling.

Mobile endpoint self-healing Status view showing current health and facts.

Sessions

linked recovery sessions for this endpoint
session status and terminal outcome
agent thread links when projected
dates and friendly titles instead of primary ids

Use Sessions to understand what Pharaoh already attempted and whether the requested action is a continuation of a known recovery path or a new branch of work.

Endpoint self-healing Sessions view showing linked recovery work.

Mobile endpoint self-healing Sessions view showing linked recovery work.

Sentinel

active Sentinel version and provenance
generation, validation, and activation context
latest execution state
full-width Sentinel source below the provenance and detail cards
View full script when you need the complete source in a modal

Check Sentinel for recency and consistency. A recent Passed result can support approval for a narrow follow-up. Repeated Failed, Timeout, Policy denied, or Runner error results suggest you should inspect the session and policy context before deciding.

Endpoint self-healing Sentinel view showing active Sentinel provenance and source.

Mobile endpoint self-healing Sentinel view showing provenance and source.

Run history

historical Sentinel executions
generated run titles, status, completed time, and duration
output summaries and checks when available
session links for runs that triggered recovery work

Use Run history when the latest result is not enough. A single failed run may be transient; repeated failed, timed out, or policy-denied runs are stronger evidence that approval should slow down.
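The difference between a single failure and a repeated pattern can be sketched as a count of consecutive non-passing runs at the end of the history. The function names and the threshold of two are assumptions for illustration; Pharaoh does not necessarily compute anything like this.

```python
from typing import List

# Non-passing statuses named in the Sentinel guidance.
NON_PASSING = {"Failed", "Timeout", "Policy denied", "Runner error"}


def trailing_non_passing(statuses: List[str]) -> int:
    """Count consecutive non-passing runs at the end of an oldest-first list."""
    count = 0
    for status in reversed(statuses):
        if status not in NON_PASSING:
            break
        count += 1
    return count


def run_history_signal(statuses: List[str], repeat_threshold: int = 2) -> str:
    """Classify the history; repeat_threshold is a hypothetical cutoff."""
    streak = trailing_non_passing(statuses)
    if streak == 0:
        return "clear"
    if streak < repeat_threshold:
        return "possibly transient"
    return "repeated: slow down"
```

With this sketch, a history ending in one Failed run reads as possibly transient, while a history ending in Failed then Timeout reads as repeated.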

Endpoint self-healing Run history view showing Sentinel execution history.

Mobile endpoint self-healing Run history view showing Sentinel execution history.

Activity

combined self-healing activity synthesized from existing Sentinel, session, knowledge, proposal, and escalation records
event-specific labels, status, source, and time
links to the relevant endpoint self-healing subview when available
pagination for longer histories

Use Activity when you need the complete timeline rather than the short Status preview. The list is a projection for operator review, not a separate audit log or new persisted event store.
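To illustrate what "a projection, not a persisted event store" means, the sketch below merges records from several existing sources into one timeline at read time and pages through it. The record shape, the newest-first ordering, and the page size are assumptions, not details taken from Pharaoh.

```python
from datetime import datetime
from typing import Dict, List, NamedTuple


class ActivityItem(NamedTuple):
    time: datetime
    source: str
    label: str


def project_activity(records_by_source: Dict[str, List[dict]]) -> List[ActivityItem]:
    """Merge existing records into one timeline; nothing new is stored.

    Assumption: each record has "time" and "label" keys, newest shown first.
    """
    items = [
        ActivityItem(record["time"], source, record["label"])
        for source, records in records_by_source.items()
        for record in records
    ]
    return sorted(items, key=lambda item: item.time, reverse=True)


def page(items: List[ActivityItem], page_number: int,
         page_size: int = 20) -> List[ActivityItem]:
    """Return one 1-indexed page of the timeline."""
    if page_number < 1:
        raise ValueError("page_number starts at 1")
    start = (page_number - 1) * page_size
    return items[start:start + page_size]
```

Because the timeline is rebuilt from the underlying records each time, the underlying Sentinel, session, knowledge, proposal, and escalation records remain the source of truth.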

Endpoint self-healing Activity view showing the paginated event timeline.

Mobile endpoint self-healing Activity view showing the paginated event timeline.

Knowledge

accepted endpoint self-healing knowledge
pending knowledge proposals
proposal Approve and Reject controls when your role can review proposals
a required rejection reason before Reject is enabled

Endpoint knowledge is endpoint-specific self-healing memory. Approve knowledge proposals only when they describe durable, endpoint-relevant facts that should help future recovery. Reject proposals that are speculative, temporary, user-specific, or better suited to organization-wide documentation.

Endpoint self-healing Knowledge view showing accepted knowledge and pending proposals.

Mobile endpoint self-healing Knowledge view showing accepted knowledge and pending proposals.

When reviewing knowledge before an escalation decision, look for facts that explain the current failure: known service names, endpoint-specific maintenance windows, hardware limitations, or previously accepted false positives. Do not treat pending proposals as trusted evidence until a reviewer has approved them.
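The rule that pending proposals are not evidence, and that a rejection needs a reason, can be sketched as follows. The state names ("accepted", "pending", "rejected") and the dictionary shape are hypothetical and only illustrate the flow described above.

```python
from typing import List


def trusted_facts(entries: List[dict]) -> List[str]:
    """Keep only accepted knowledge; pending proposals are not evidence yet."""
    return [entry["text"] for entry in entries if entry["state"] == "accepted"]


def review_proposal(proposal: dict, decision: str, reason: str = "") -> dict:
    """Return an updated copy of the proposal after a reviewer decision.

    Mirrors the UI rule that Reject stays disabled until a reason is entered.
    """
    if decision not in ("approve", "reject"):
        raise ValueError("decision must be 'approve' or 'reject'")
    if decision == "reject" and not reason.strip():
        raise ValueError("a rejection reason is required")
    state = "accepted" if decision == "approve" else "rejected"
    return {**proposal, "state": state, "reason": reason.strip()}
```

A proposal only starts to count in trusted_facts after it has passed through review_proposal with an approve decision.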

Organization-wide runbooks and imported support content still live in IT Knowledge Base.

Structured Outcome Cards In Agent Worklogs

Self-healing sessions write final structured outcomes into the same Agent Core worklog used by endpoint sessions. Pharaoh renders those outcomes as compact operational cards instead of treating the final answer as ordinary prose.

Card types you may see:

Sentinel candidate when Sentinel generation or regeneration produced a candidate script, validation state, activation state, and endpoint update dispatch state.
Self-healing investigation with outcome Fixed, False positive, Unable to fix escalated, or Ignored not applicable.
Structured output validation failed when the assistant could not produce a valid final output after repair attempts.
Unknown structured output contract when a future contract is visible before the local UI has a purpose-built renderer.

Use the cards as audit evidence. Check the status badge, summary, processing timeline, trace links, and any learning, escalation, or regeneration section before deciding that automation finished correctly. A false-positive card can include a separate regeneration recommendation; that does not mean the active Sentinel changed until validation and activation state confirm it.
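The card selection described above, including the fallback for contracts the UI does not yet know, can be sketched as a lookup with a default. The contract identifiers here are invented for illustration; only the card titles come from the list above.

```python
# Hypothetical contract identifiers mapped to the card titles listed above.
CARD_BY_CONTRACT = {
    "sentinel_candidate": "Sentinel candidate",
    "self_healing_investigation": "Self-healing investigation",
}

VALIDATION_FAILED = "Structured output validation failed"
UNKNOWN_CONTRACT = "Unknown structured output contract"


def card_for(contract: str, output_valid: bool = True) -> str:
    """Pick the card to render for a final structured outcome.

    A failed validation wins over everything else; an unrecognized
    contract falls back to the generic card instead of raising.
    """
    if not output_valid:
        return VALIDATION_FAILED
    return CARD_BY_CONTRACT.get(contract, UNKNOWN_CONTRACT)
```

The fallback matters for audit review: an unknown contract still produces a visible card, so a newer outcome type is not silently dropped from the worklog.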

Current screenshot replay page IDs for these card states are tracked in the screenshot manifest:

self-healing-candidate-card
self-healing-investigation-card-fixed
self-healing-investigation-card-false-positive
self-healing-investigation-card-escalated
agent-core-structured-output-validation-failure
agent-core-structured-output-unknown-fallback

Escalation Queue

Use escalation links from Self-Healing, endpoint self-healing pages, or endpoint Sentinel panels when you need to find, filter, or review escalation records.

Escalation queue showing filters and review actions for self-healing requests.

On mobile, use the same queue checks before opening Review: current status, endpoint, category, requested action, and whether the record is still pending.

Mobile escalation queue showing pending review rows and filters.

The queue includes:

Search
Endpoint
Status
Category
Apply
Refresh
pagination controls when there are multiple pages

The status filter supports Pending, Approved, Rejected, and Expired. The category filter includes Policy override, Sentinel generation failure, Self-healing failure, Permission request, and Other.

Use filters to separate active decisions from audit review. Pending is the approval workload. Approved, Rejected, and Expired are useful when you need to understand prior handling or repeated endpoint behavior. Every row keeps Review available so approved, rejected, and expired escalations remain inspectable.
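The filter behavior can be sketched as a function over escalation rows. The status and category values are the ones listed above; the row shape and the function name are assumptions for illustration.

```python
from typing import List, Optional

STATUSES = ("Pending", "Approved", "Rejected", "Expired")
CATEGORIES = (
    "Policy override",
    "Sentinel generation failure",
    "Self-healing failure",
    "Permission request",
    "Other",
)


def filter_escalations(rows: List[dict], status: Optional[str] = None,
                       category: Optional[str] = None,
                       endpoint: Optional[str] = None) -> List[dict]:
    """Return rows matching every filter that is set; None means no filter."""
    if status is not None and status not in STATUSES:
        raise ValueError(f"unsupported status: {status}")
    if category is not None and category not in CATEGORIES:
        raise ValueError(f"unsupported category: {category}")
    return [
        row for row in rows
        if (status is None or row["status"] == status)
        and (category is None or row["category"] == category)
        and (endpoint is None or row["endpoint_id"] == endpoint)
    ]
```

Filtering on Pending gives the approval workload; filtering one endpoint across all statuses gives the audit view of how earlier requests were handled.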

Escalation Review

The Self-Healing Escalation detail page shows the escalation id in the header, the current status badge, endpoint id, thread id, category, requested action, created and expiry times, policy snapshot, any recorded grants, and the decision area.

Escalation review detail showing the requested action, policy snapshot, and decision controls.

The mobile review page keeps the decision controls close to the request summary. It is suitable for on-call review, but the same evidence check applies before approval.

Mobile escalation review detail showing requested action and approve/reject controls.

Before deciding, check:

endpoint id: confirm the request is for the endpoint you intended to review
category: understand whether this is a policy override, failure recovery, permission request, or other escalation
requested action: approve only the specific continuation described, not a broader class of future work
expiry time: reject or let stale requests expire instead of approving work whose context may no longer be valid
policy snapshot and grants: confirm the requested action fits your organization’s guardrails
endpoint context: open the endpoint self-healing page when Sentinel status, session history, or knowledge affects the decision

Approve when the request is specific, current, policy-compatible, and backed by endpoint context that explains why automation needs the escalation. Reject when the request is too broad, stale, unsafe for the endpoint state, unsupported by evidence, duplicates a failed pattern, or should be handled manually.
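The approve and reject criteria above can be written down as a reviewer's worksheet. Only the expiry comparison is mechanical; the boolean fields stand for human judgments, and every name here is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class ReviewFacts:  # hypothetical worksheet, filled in by the reviewer
    expires_at: datetime
    action_is_specific: bool
    fits_policy: bool
    has_endpoint_context: bool


def reject_reasons(facts: ReviewFacts, now: datetime) -> List[str]:
    """List why a request should not be approved; empty means it may qualify."""
    reasons = []
    if now >= facts.expires_at:
        reasons.append("stale: past expiry time")
    if not facts.action_is_specific:
        reasons.append("requested action is too broad")
    if not facts.fits_policy:
        reasons.append("outside the policy snapshot or grants")
    if not facts.has_endpoint_context:
        reasons.append("not supported by endpoint evidence")
    return reasons
```

The returned text can double as a starting point for the rejection reason, which keeps the recorded decision tied to a specific criterion.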

For pending escalations:

operators without review permission can inspect the escalation but do not see approve or reject actions
reviewers can select Approve
reviewers can enter Rejection reason and then select Reject

After a decision, the page reports "Escalation approved." or "Escalation rejected.", or shows the backend error message returned by the API.
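The visibility rules for the decision controls can be sketched as one function. This is an illustration under assumptions: the function name is invented, and treating a non-empty rejection reason as the condition for Reject is inferred from the order of steps described above.

```python
from typing import List


def available_actions(status: str, can_review: bool,
                      rejection_reason: str = "") -> List[str]:
    """Return the decision actions a user could take on one escalation.

    Non-pending records and users without review permission get no actions;
    they can still inspect the record.
    """
    if status != "Pending" or not can_review:
        return []
    actions = ["Approve"]
    # Assumption: Reject becomes available once a reason has been entered.
    if rejection_reason.strip():
        actions.append("Reject")
    return actions
```

Approved, Rejected, and Expired escalations return an empty list here, which matches their role as inspectable history.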

Success Checks

After approving or rejecting, confirm the operational outcome:

refresh the escalation queue and verify the status changed or the pending item cleared
reopen the endpoint self-healing page and check the latest Sentinel, session, and escalation history
confirm the endpoint’s current state matches the reason for the decision
look for repeated escalations from the same endpoint before treating the issue as resolved
document any durable endpoint-specific learning through the endpoint knowledge review flow when a proposal is available
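The check for repeated escalations can be sketched as a count per endpoint over the queue rows. The row shape and the minimum of two are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def repeated_escalations(rows: List[dict], minimum: int = 2) -> Dict[str, int]:
    """Return endpoints that appear at least `minimum` times in the rows."""
    counts = Counter(row["endpoint_id"] for row in rows)
    return {endpoint: n for endpoint, n in counts.items() if n >= minimum}
```

An endpoint that shows up here after a decision is a sign that the underlying issue may not be resolved, even if the latest escalation cleared.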