Part 1: Copilot Use And Operational Guidance
This section is for SREs, on-call engineers, platform engineers, and other pilot users who need to understand what the copilot is for, how to interpret its output, and which guardrails already exist.
This cluster explains what the copilot is, what kind of product it is trying to be, and which architectural and operational assumptions shape the current pilot.
The current version is a pilot release. It is intended to help SREs triage incidents faster, gather stronger evidence, and reach practical next actions with lower noise.
The copilot is built to reduce time spent searching across Kubernetes, logs, delivery context, and general incident clues. It is meant to narrow the search space, surface the strongest current evidence, and give engineers a better first operational direction.
The main value is not blind automation. The value is faster understanding, cleaner prioritisation, and more trustworthy guidance during incident triage.
This copilot is primarily meant for SREs, on-call engineers, and platform engineers who need a fast operational overview during Kubernetes incidents. It is especially useful for the first triage pass, when the main goal is to reduce noise and narrow the likely problem area quickly.
It is less suitable as a standalone decision-maker for high-risk production changes. Human engineering judgment remains required before taking recovery actions.
The current copilot does not perform write actions against the cluster, does not automatically remediate incidents, and does not guarantee the final root cause. It can highlight strong evidence and rule-derived fix hints, but the operator answer should come from the LLM after checking those hints against the wider incident context.
The copilot also does not try to become its own identity platform. In the intended management-cluster direction, access is expected to be controlled by the surrounding platform and ingress layer rather than by a separate application-specific login flow inside the copilot itself.
The current copilot should be understood as a standalone application with a single-agent operating model. In the current pilot, one copilot instance gathers live context, reasons over the incident, and presents the result through one application experience rather than through a network of specialised cooperating agents.
That is intentional for the pilot. It keeps the operating model simpler, makes behaviour easier to explain, and reduces architectural complexity while the team is still validating real SRE usefulness. Later versions may add MCP for more standardised tool and context access, and may add A2A patterns if separate agents for systems such as GitLab or Azure DevOps become operationally valuable.
The current interface is the active operator layout for the pilot and should be treated as the current product surface for daily use, evaluation, and further refinement.
Board scan timestamps are stored in UTC on the backend and are currently rendered in the UI as Europe/Amsterdam time. This is intentional so the displayed scan time remains consistent even if the cluster later runs outside the Netherlands.
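The conversion described above can be illustrated with a minimal Python sketch. The actual rendering happens in the UI, so this only shows the principle: the stored value stays in UTC and the display zone is fixed to Europe/Amsterdam. The function name render_scan_time and the output format are assumptions for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Fixed display zone, independent of where the cluster runs.
DISPLAY_TZ = ZoneInfo("Europe/Amsterdam")


def render_scan_time(stored_utc: datetime) -> str:
    """Format a UTC-stored scan timestamp for display (hypothetical helper)."""
    if stored_utc.tzinfo is None:
        # Assumption: naive values read from storage are already UTC.
        stored_utc = stored_utc.replace(tzinfo=timezone.utc)
    return stored_utc.astimezone(DISPLAY_TZ).strftime("%Y-%m-%d %H:%M %Z")


winter = render_scan_time(datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc))
summer = render_scan_time(datetime(2024, 7, 15, 12, 0, tzinfo=timezone.utc))
print(winter)  # 2024-01-15 13:00 CET
print(summer)  # 2024-07-15 14:00 CEST
```

Because the zone is named rather than expressed as a fixed offset, daylight saving transitions are handled by the time zone database.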
The current pilot does not use RAG in the live copilot flow. This is intentional. Platform Radar usually needs exact live facts, labels, ownership mappings, source state, and structured lookups more than semantic document retrieval.
For current incidents, live Kubernetes, ArgoCD, logs, metrics, billing data, and structured metadata should win over copied knowledge. A future RAG layer may still be useful for reviewed runbooks or stable background knowledge, but it should not become the source of current operational truth.
The copilot now uses a hybrid model. Stable platform assumptions, operating principles, and intended tool directions are documented and maintained on purpose. Examples are the current GCP and GKE direction, Traefik as ingress direction, ArgoCD as GitOps direction, and the expected observability stack. Those are design assumptions and should still be updated in the documentation and environment profile when the intended platform direction changes.
At the same time, changing runtime facts should not rely only on manual documentation updates. The pilot now includes live runtime discovery for items such as Kubernetes version, ingress classes, key namespaces, and reachability of configured sources such as Prometheus, Loki, Tempo, and ArgoCD. This reduces drift when versions or active components change in the running environment.
Today the copilot provides a live incident board, incident detail pages, one operator answer per incident, LLM-assisted summaries, follow-up questioning, ArgoCD delivery context, conservative root-cause grouping, Cost Insight, a Platform Health role page, PWA push notifications for new High incidents, prompt editing for the pilot, LLM budget guardrails, and fallback behaviour for provider or budget failures.
Future versions may add deeper GitOps and change correlation, targeted on-call notification routing, ChatOps integration, historical incident matching, broader evidence-backed fix patterns, stronger governance around prompts, stronger production authentication and access control, and more advanced scaling options only if real usage requires them. Backlog 2 may later introduce a coordinator, MCP tool boundary, and separate Incident Specialist/Supervisor roles, but those are not active runtime components today.
Cost Insight is now a separate page inside Platform Radar. It keeps cost-center and provisioning questions away from the live incident board so management users can focus on cost and savings while SRE users can focus on provisioning and workload sizing.
The page should combine several sources instead of relying on a single signal. The current direction is Kubernetes API data for deployed resources and requests, Prometheus or Thanos usage over time for real CPU and memory consumption, and GCP billing export data for management-cluster and attached-cluster cost. The goal is not only to say that a deployment looks oversized, but also to connect that to actual cloud cost where possible.
Expected outputs include likely overprovisioned node pools, deployments whose requests appear much larger than real usage over time, workloads whose memory settings may be too low relative to observed peaks, and clusters or environments whose cost seems out of proportion to their operational value. These recommendations should stay conservative and evidence-based. Rightsizing advice should be based on usage over time and on cluster context, not on a single point-in-time snapshot.
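As a rough sketch of the "usage over time, not a snapshot" principle, the following Python example refuses to give a verdict until enough samples exist and then compares the request with the 95th percentile of observed usage. The function names, the minimum sample count, and the oversize ratio are illustrative assumptions, not values used by the pilot.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) with integer math
    return ordered[rank - 1]


def assess_request(
    request: float,
    usage_samples: list[float],
    min_samples: int = 288,       # assumption: e.g. one day of 5-minute samples
    oversize_ratio: float = 3.0,  # assumption: request >= 3x p95 counts as oversized
) -> str:
    """Return a conservative rightsizing verdict for one resource request."""
    if len(usage_samples) < min_samples:
        return "insufficient-data"
    peak = p95(usage_samples)
    if peak <= 0:
        return "insufficient-data"
    if request / peak >= oversize_ratio:
        return "likely-oversized"
    if peak > request:
        return "request-below-observed-peak"
    return "ok"


print(assess_request(2.0, [0.2] * 10))   # too few samples for a verdict
print(assess_request(2.0, [0.2] * 300))  # request far above sustained usage
```

Returning "insufficient-data" instead of guessing keeps the advice conservative when the usage history is too short.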
The page should remain environment-aware. Dev, Stage, Prod, and Management do not all deserve the same operational interpretation. In the intended source split, Management and Prod carry workload monitoring weight, while Dev and Stage can still be visible as environments without creating normal workload incident noise.
This capability requires more source access than basic incident triage, especially read access to GCP billing exports or curated cost datasets, plus enough metadata to map cost back to clusters, node pools, namespaces, products, and workloads. It should remain exact-label driven where possible, for example using gcp-costcenter instead of guessing products from customer namespace names.
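The exact-label approach can be sketched as follows. The label name gcp-costcenter comes from the text above. The row shape (a dict with "labels" and "cost") is an assumption for illustration and does not reflect the real billing export schema.

```python
from collections import defaultdict

COST_LABEL = "gcp-costcenter"  # label name taken from the text above


def cost_by_cost_center(rows: list[dict]) -> dict[str, float]:
    """Sum cost per exact cost-center label; never infer it from namespace names."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        center = row.get("labels", {}).get(COST_LABEL) or "unlabelled"
        totals[center] += row["cost"]
    return dict(totals)


rows = [
    {"namespace": "customer-a", "labels": {"gcp-costcenter": "cc-100"}, "cost": 10.0},
    {"namespace": "customer-a-tools", "labels": {"gcp-costcenter": "cc-100"}, "cost": 5.0},
    {"namespace": "customer-b", "labels": {}, "cost": 2.0},
]
print(cost_by_cost_center(rows))
```

Rows without the label land in an explicit "unlabelled" bucket, which makes missing metadata visible instead of hiding it behind a guess.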
Platform Health is now a separate page that defines the preventive Platform Health Specialist role. It should focus on reliability risk before incident response is needed, not on raw monitoring charts. Its job is to answer what is likely to become a problem if nobody acts, why that matters, which cluster or product is affected, and what the recommended next action is.
One important future lane is expiry and rotation awareness. The copilot should be able to surface operational objects that need renewal before they break the platform, such as certificates, PATs, API keys, registry credentials, service account keys, ArgoCD repo credentials, vendor licences, and other time-bound dependencies.
This capability must be metadata-only. The copilot should not collect, display, store, or reason over secret values, private keys, token contents, PAT contents, licence keys, or passwords. The useful signal is the safe operational metadata: object name or reference, type, owner, affected product or cluster, expiry date, last rotation date, renewal location, urgency, blast radius, and evidence source.
The expected output should therefore be practical and action-oriented: renew this before a given date, involve this owner or team, validate this dependency after renewal, and understand that these products or clusters may fail if the renewal is missed. This fits naturally under the Platform Health Specialist rather than as a separate secret-management feature.
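A metadata-only expiry record could look roughly like the sketch below. The field names follow the list of safe metadata above. The class name, the urgency labels, and the day thresholds are assumptions for illustration. The record deliberately has no field that could hold a secret value.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExpiryRecord:
    """Safe operational metadata only; no secret material is stored here."""
    reference: str            # object name or reference, never the value itself
    kind: str                 # e.g. "certificate", "PAT", "registry credential"
    owner: str
    affected: tuple[str, ...]  # products or clusters that may fail
    expires_on: date
    renewal_location: str
    evidence_source: str


def urgency(record: ExpiryRecord, today: date,
            warn_days: int = 30, critical_days: int = 7) -> str:
    """Classify how soon renewal is needed (thresholds are hypothetical)."""
    days_left = (record.expires_on - today).days
    if days_left < 0:
        return "expired"
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"


cert = ExpiryRecord(
    reference="ingress-wildcard-cert",
    kind="certificate",
    owner="platform-team",
    affected=("management",),
    expires_on=date(2025, 3, 1),
    renewal_location="cert issuer portal",
    evidence_source="kubernetes-secret-metadata",
)
print(urgency(cert, today=date(2025, 2, 25)))  # critical
```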
There are already several tools in the market that overlap with parts of this product. The most relevant examples are Kubernetes-oriented tools such as Komodor, Botkube, K8sGPT, HolmesGPT or Robusta, Kubently, and broader incident platforms such as incident.io, FireHydrant, and PagerDuty. Those products are often more mature in one or more dimensions such as enterprise workflow, chat integrations, alert orchestration, remediation playbooks, or large commercial integration ecosystems.
That does not automatically make them a better fit for this pilot. Many of those tools are broader, heavier, or more expensive than what is needed here. Some are strongest as chat-based assistants, some as generic incident-management suites, and some as general AIOps or platform products. This copilot is intentionally narrower. It is Kubernetes-first, read-only by default, evidence-driven, low-noise, and designed to reduce time spent searching across multiple technical sources during triage. That focus is one of its main strengths rather than a weakness.
For this environment, the current copilot also aligns better with the practical needs of the team. It is easier to adapt to the local platform reality, easier to extend with management-cluster and attached-cluster knowledge, easier to shape around the actual Dev, Stage, and Prod operating model, and easier to keep maintainable because the scope remains controlled. It can also be extended in deliberate steps instead of forcing the team into a much broader platform commitment too early.
Cost and maintenance matter as well. Commercial platforms can be strong, but they often bring custom pricing, seat-based pricing, node-based pricing, or enterprise commitments that are significantly larger than the cost of a focused internal copilot plus hosted model usage. For this pilot, the copilot can likely be run at much lower model cost than the cost of repeated external support time or the cost of engineers spending too much time searching across tools. That makes the business case different from buying a broad platform with many features the team may not need yet.
The strongest current differentiators of this copilot are therefore: a read-only trust model, a dedicated operator-facing interface instead of a chat-only experience, a low-noise incident board, one clear operator answer, workflow learning through pickup, resolve, and evaluation, separate Cost Insight, and a preventive Platform Health direction. Those things fit this team's needs better than a broader incident suite or a more autonomous remediation product.
Useful later additions inspired by stronger market products should stay focused. The most valuable ones are likely deeper historical incident matching, clearer evidence-versus-hypothesis confidence framing, stronger change correlation, a supervisor-style review pass over the first analysis, and stronger source/ownership metadata. Those are useful because they strengthen the current product intent. They should not be confused with features that already exist today, and they should not pull the pilot away from its main purpose as a practical Kubernetes incident copilot.
Right now the copilot checks live Kubernetes pod state, warning events, restart behaviour, readiness gaps, recent logs, previous crash logs where useful, deployment and pod runtime context, referenced object existence such as ConfigMaps, Secrets, and ServiceAccounts, live ArgoCD application context, and live runtime discovery information such as Kubernetes version, ingress classes, key namespaces, and configured source reachability.
The current multi-cluster direction is explicit: management and prod are workload-monitored, while stage and dev are environment-visible without normal workload incident monitoring. This keeps the board useful without turning development churn into SRE noise.
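The environment split above can be expressed as a small lookup. This is a sketch only. The mode names and the helper function are assumptions, and the real configuration may live in the environment profile rather than in code.

```python
# Monitoring mode per environment, as described in the text above.
MONITORING_MODE = {
    "management": "workload-monitored",
    "prod": "workload-monitored",
    "stage": "environment-visible",
    "dev": "environment-visible",
}


def creates_board_incidents(environment: str) -> bool:
    """True when workload problems in this environment should reach the board."""
    return MONITORING_MODE.get(environment.lower()) == "workload-monitored"


for env in ("management", "prod", "stage", "dev"):
    print(env, creates_board_incidents(env))
```

Unknown environments default to not creating board incidents, which matches the low-noise intent.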
Later versions may expand this with deeper Prometheus and Loki analysis, more Tempo usage, rollout diffs, dependency and service checks, more explicit image-pull and storage patterns, broader fix-hint detection, structured ownership metadata, and stronger multi-source correlation for change and failure timing.
This cluster is for on-call reading during actual triage: what the priority levels mean, how to interpret the output, and how to work with the evaluation workflow.
High means the current live evidence suggests the issue needs immediate attention and is operationally relevant now. Medium means the issue matters but appears less urgent or less clearly disruptive. Low means the issue is currently less urgent, less actionable, or less likely to deserve immediate operational focus.
These categories are triage aids, not contractual severity classes. Engineers should still apply context and judgment.
Use the board to decide where to look first, not as a final truth engine. High urgency means the current evidence suggests the issue deserves immediate attention; it does not mean the copilot is automatically correct about the final root cause.
Use the detail page in layers. Start with the evidence highlights and technical context, then review fix certainty and evidence-backed likely fixes, then use next evidence steps and validation commands. When the copilot says an issue is confirmed from current evidence, fix that first before widening the search. When it says evidence is insufficient, treat the output as guided triage rather than a final solution.
Start with the evidence highlights to understand the current symptom pattern. Then review technical context and fix certainty to see whether the issue is already confirmed by live evidence. If a likely fix is evidence-backed, validate it with the suggested commands and act there first. If the copilot only provides next evidence steps, treat the case as ongoing triage and keep the search focused on the layer suggested by the evidence.
When you decide to work on an incident, use the workflow button in the detail page to mark that you are working on it. This tells other pilot users that the item is already being handled. In the pilot flow, an incident should be picked up before it is marked as resolved, so the evaluation history stays tied to a real engineer-owned work cycle.
The current pilot can send push notifications, but targeted on-call routing is future scope. The intended direction is that Platform Radar can route incident notifications to the right assigned responder or group based on an approved on-call or ownership source.
Platform Radar should not define the formal SRE schedule or decide who is contractually responsible. It should use that source of truth to prevent broad multi-channel noise, show who owns the incident, support pickup and handoff, and escalate when nobody responds within the agreed window.
The pilot now includes a simple operational workflow for evaluating usefulness over time. An incident can be marked as in progress, then marked as resolved, and then evaluated after the work is done. Resolved items stay visible in the dedicated resolved-and-evaluation view even after they disappear from the active board.
This view is intentionally split into two ideas: the incident itself, and our opinion about the copilot recommendation. Resolved incidents that still need scoring stay prominent only when someone actually picked them up. Older scored items move into a compact past-evaluations section instead of staying at the top of the page. Incidents that were resolved without copilot pickup can still be shown in a lighter history section, but they should not dominate the evaluation workflow.
Use "I’m working on this" when you actively pick up an incident. Your operator name should then appear back in the workflow status so other pilot users can see that the incident is being handled. In the current pilot this name is still confirmed manually by the user, with the last used name offered as a default that can be changed on each workflow action. In the intended management-cluster direction this should be filled automatically from the surrounding platform identity or login context, so SREs do not need to type their name by hand.
Use "Mark as resolved" when the fix has been applied or the issue has clearly been handled. Then go to the resolved-and-evaluation view and score the recommendation while the details are still fresh.
If the same incident pattern later returns on the active board, treat that as a new pickup cycle. Earlier ownership and evaluation history should remain available as past history, but the live incident itself should become claimable again instead of looking permanently owned by the previous engineer.
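The pickup, resolve, and evaluation cycle described above can be sketched as a small state machine. The state names, class name, and method names are assumptions for illustration. The sketch models only the operator-driven path, where pickup comes before resolve and a recurring incident becomes claimable again while earlier history is kept.

```python
ALLOWED_TRANSITIONS = {
    "active": {"in_progress"},
    "in_progress": {"resolved"},
    "resolved": {"evaluated"},
    "evaluated": set(),
}


class IncidentWorkflow:
    """Hypothetical model of one incident's workflow state."""

    def __init__(self) -> None:
        self.state = "active"
        self.owner: str | None = None
        self.history: list[dict] = []

    def _move(self, target: str) -> None:
        if target not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {target}")
        self.state = target

    def pick_up(self, operator: str) -> None:
        self._move("in_progress")
        self.owner = operator

    def resolve(self) -> None:
        self._move("resolved")
        # Record the engineer-owned cycle; the verdict is filled in later.
        self.history.append({"owner": self.owner, "verdict": None})

    def evaluate(self, verdict: str) -> None:
        self._move("evaluated")
        self.history[-1]["verdict"] = verdict

    def recur(self) -> None:
        # Same pattern is back on the active board: claimable again, history kept.
        self.state = "active"
        self.owner = None


wf = IncidentWorkflow()
wf.pick_up("alex")
wf.resolve()
wf.evaluate("useful")
wf.recur()
print(wf.state, wf.owner, wf.history)
```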
The goal is not blame or paperwork. The goal is to learn which recommendations are operationally strong, which ones are only partially helpful, and where the copilot still needs better evidence or reasoning. If the recommendation was right but incomplete, that should still be recorded as a useful outcome instead of being grouped with clearly wrong guidance.
SREs should read the statistics panel carefully. In progress, Resolved, and Awaiting evaluation describe picked-up workflow state. Evaluated and the usefulness percentages describe the pilot's opinion signal. The percentage cards are intentionally based on evaluated resolved incidents only, so unresolved or unevaluated items do not distort the usefulness signal. For each item being scored, the page should repeat the recommended fix being judged and keep the notes separate: the resolution note explains what the engineer tried, and the evaluator note explains what was missing or misleading in the copilot guidance.
This cluster groups the practical constraints of the current pilot, the known weaknesses, and the most likely next-stage operational controls.
The pilot currently uses read-only cluster behaviour, prompt-size caps, hourly and daily LLM token budgets, per-incident follow-up limits, incident-level LLM usage visibility, and notification logic that only pushes new High incidents after a device is enabled.
When an incident carries unusually large context, the copilot now first tries a compact analysis path instead of stopping immediately. If even the compact path still does not fit, the UI can offer an explicit extended-analysis override with a cost warning. If the daily LLM budget is exhausted, the pilot still blocks by default, but the UI can now offer an explicit human-approved exceptional overspend path for important incidents. This keeps the pilot guardrails in place while avoiding dead-end blocked screens for operationally important incidents.
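The decision order described above can be sketched as one function: try the standard prompt, fall back to the compact path, offer the extended override only when compact still does not fit, and block on an exhausted daily budget unless a human has approved the overspend. All names, the token accounting, and the return values are assumptions for illustration. The hourly budget is left out for brevity.

```python
def choose_analysis_path(
    prompt_tokens: int,
    compact_tokens: int,
    prompt_cap: int,
    daily_remaining: int,
    extended_approved: bool = False,
    overspend_approved: bool = False,
) -> str:
    """Return the analysis path to run, or the approval the UI should offer."""
    if prompt_tokens <= prompt_cap:
        path, cost = "standard", prompt_tokens
    elif compact_tokens <= prompt_cap:
        path, cost = "compact", compact_tokens
    elif extended_approved:
        # Explicit human override after a cost warning.
        path, cost = "extended", compact_tokens
    else:
        return "offer-extended-override"

    if cost > daily_remaining and not overspend_approved:
        # Default remains blocked; the UI can offer an exceptional path.
        return "offer-overspend-approval"
    return path


print(choose_analysis_path(5000, 1500, prompt_cap=2000, daily_remaining=10000))
print(choose_analysis_path(5000, 3000, prompt_cap=2000, daily_remaining=10000))
```

Both override results are offers to the operator rather than automatic escalations, which keeps the guardrail as the default.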
A useful later addition would be an explicit control to pause copilot analysis without turning off the whole board. In that mode, the incident board and monitoring signals would remain visible, but new LLM summaries and follow-up analysis would be temporarily disabled until a user turns analysis back on.
The prompt editor is currently meant only for the Platform Radar maintainer or administrator who is actively experimenting with prompt behaviour during the pilot. It should not be treated as a normal feature for every SRE, engineer, or other pilot user. Prompt changes can materially affect explanation quality, confidence wording, likely fixes, and cost behaviour. A restore-to-default action exists per prompt so a maintainer can recover quickly if an edit goes wrong. In the local app, prompt changes are written to the local prompt files on disk. In the cluster deployment, prompt changes are stored on a dedicated persistent volume so they survive a normal pod restart. Later versions should still move prompts into production-grade version control and add clearer approval flows for exceptional LLM spend, richer degrade modes during spikes, retention and cleanup policies, and stronger surrounding platform access control at the management-cluster and ingress layer.
The pilot still depends on the richness of live evidence. Weak logs or ambiguous runtime signals can still produce weaker guidance. Prompt editing is intentionally flexible during the pilot but is not yet under production-style governance. Some fix patterns are stronger than others, and wider source correlation is still a next-stage improvement.
This is still a pilot. Some incident patterns remain easier to classify than to fully solve, and some answers will depend on the quality of logs, manifests, and surrounding runtime evidence. A strong hint may not be the whole issue, and fixing it can reveal the next one.
The copilot should therefore be used as operational decision support, not as unquestioned authority. The most trustworthy outputs are the ones tied directly to live evidence, explicit missing objects, clear runtime misconfigurations, or well-verified fix hints.
The current interface is intentionally practical, dense, and operator-oriented. It keeps the board and detail view side by side on larger screens, keeps the prompt editor separate from primary incident work, and aims to reduce scroll friction during live triage. Future UX work should refine this active layout as the main product surface for SRE use.
This section is for developers and maintainers who need to run the copilot locally, reproduce the local lab, test the cluster deployment path, or maintain the pilot environment. It is not required reading for every pilot user of the app.
This cluster covers the minimum local baseline for a developer who wants to reproduce the current pilot environment with the least ambiguity.
For this pilot-style local environment, a practical minimum is a modern laptop or workstation with at least 4 CPU cores, 16 GB RAM, and roughly 20 GB of free disk space. A more comfortable local setup is 24 GB RAM or more, especially when running Kind, ArgoCD, Prometheus, Loki, Tempo, and Gitea at the same time, plus optional local LLM tooling such as Ollama when you are not using the hosted Mistral path.
These are recommended working requirements, not strict hard limits. The exact local footprint depends on how much of the local lab is enabled at once.
Engineers should expect to install or have access to Python 3.11 or newer, Docker Desktop or an equivalent local container runtime, Kind, kubectl, Helm, Git, and VS Code or an equivalent IDE. For the current local GitOps validation path, they should also run Gitea locally. For the current default hosted model path, they should have access to the Mistral Small 4 API credentials. Ollama is still a valid local fallback when they intentionally switch the app to a local model path.
VS Code is the current reference IDE because it works well with Python, YAML, Kubernetes manifests, Markdown, and GitOps-oriented repository changes. A practical local setup should include Python support, YAML support, Docker support, Git support, and Kubernetes awareness in the IDE. Engineers can use another editor if they prefer, but the reference instructions assume a VS Code-style development workflow.
This cluster explains what the local pilot environment is expected to contain, how it is shaped, and which reference values help orient a developer quickly.
The current example setup is based on a local Mac development environment using VS Code, Kind, Gitea, ArgoCD, Traefik, PostgreSQL, Prometheus, Loki, Tempo, and a hosted Mistral API path. A local Ollama instance remains an optional fallback for local-only model experiments. Engineers do not need to reproduce every optional local-lab convenience immediately, but this is the reference environment for the current pilot and should be treated as the baseline unless the project documentation is updated.
The current local stack is expected to include a management Kind cluster with context kind-agentic-local, optional slim source clusters for dev, stage, and prod, a local Gitea server for GitOps testing, ArgoCD in the management cluster, Traefik in the management cluster, the observability stack in the management cluster, the Platform Radar deployment in namespace platform-radar-local, local demo workloads, and access to the configured hosted LLM API unless a local LLM endpoint is configured instead.
The current reference setup is a local developer machine running macOS, VS Code, Docker, Kind, kubectl, Helm, Git, and a local Python 3.11 environment for direct development. Gitea is used as the local GitOps source. ArgoCD and Traefik run inside the Kind cluster. The copilot can be run locally during development and can also be deployed into the local cluster for pilot-style validation.
This reference setup is intentionally practical rather than minimal. It mirrors the management-cluster direction closely enough to validate GitOps, ingress, delivery context, observability lookups, and read-only incident workflows before broader rollout.
The current local cluster is expected to contain the operational dependencies that Platform Radar uses during the pilot. That includes ArgoCD, Traefik, Prometheus, Loki, Tempo, Platform Radar itself, an in-cluster PostgreSQL dependency for the local Kind test deployment, and the demo workloads used to validate incident detection and explanation quality.
In practice, that means engineers should not treat Kind as an empty shell. Platform Radar is most useful when the cluster contains both the workloads to analyse and the observability and GitOps components it is expected to correlate against.
In the current reference environment, engineers should expect to see at least these namespaces or their local equivalents: platform-radar-local for the local Kind deployment and its PostgreSQL dependency, demo workload namespaces in the local source clusters, argocd for ArgoCD, traefik for ingress, and an observability namespace for Prometheus, Loki, and Tempo. The intended management-cluster namespace is platform-radar.
The current local validation path also assumes a Traefik ingress class, an ArgoCD application for the demo workloads, and an ArgoCD application for the local Platform Radar deployment itself.
The current reference environment assumes a management Kind context named kind-agentic-local, optional source contexts such as kind-platform-radar-prod, a local Gitea instance on http://localhost:3000, a local ArgoCD GUI typically port-forwarded to https://127.0.0.1:8085, a local Traefik route using https://copilot.zondermoeite.nl, and a local development server on http://127.0.0.1:8000. These values are not universal defaults. They are the current project reference and should be updated in this documentation if the local environment changes.
In the current example environment, Gitea runs locally on port 3000, the local Platform Radar app can run on port 8000, the in-cluster service can be reached through the local Traefik route https://copilot.zondermoeite.nl, and ArgoCD is commonly port-forwarded on 8085 for local inspection.
FastAPI documentation is also available in the current setups. Locally, Swagger UI is usually available on the app server under /docs and ReDoc under /redoc. In the management cluster, the final live route should be updated together with the approved ingress host and should not be assumed to remain the local copilot.zondermoeite.nl path.
These are example values from the current reference setup. They are useful for onboarding, but they are not the product contract. If the environment changes, this section should be updated together with the deployment and local-lab instructions.
This cluster is the practical onboarding path for a developer who wants to get a working local environment without guessing the order of operations.
Before local use, engineers should review and adjust the environment values in .env. The most important settings are the PostgreSQL connection values, Kubernetes context or in-cluster mode, LLM provider and model settings, Mistral API or local LLM endpoint details, ArgoCD connection details, VAPID keys for push notifications, and the paths used for environment profiles and prompt files.
They should also understand that prompt files are stored separately from the code and loaded at runtime during the pilot. If the local stack changes, for example a different Kind cluster name, a different Gitea host, a different LLM endpoint, or a different ingress host, those changes should be reflected in the configuration and in this documentation.
The first settings to review locally are the LLM values, because engineers may run Ollama or another provider instead of the current Mistral Small reference setup. They should also verify the PostgreSQL host, database, username, and password values, the Kubernetes context assumptions, the ArgoCD server and token settings, the ingress host used in the local lab, and the VAPID values if push notifications are being tested. If the engineer wants the copilot to run inside the cluster instead of only locally, they should also verify the cluster-side ConfigMap and Secret values used by the Kubernetes deployment manifests.
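A quick pre-flight check for missing settings could look like the sketch below. The variable names are purely hypothetical placeholders. The real names are defined by the project's .env file and should be taken from there.

```python
import os

# Hypothetical setting names for illustration only; use the names from the real .env.
REQUIRED_SETTINGS = (
    "DATABASE_HOST",
    "DATABASE_NAME",
    "KUBE_CONTEXT",
    "LLM_PROVIDER",
    "ARGOCD_SERVER",
)


def missing_settings(env: dict, required: tuple = REQUIRED_SETTINGS) -> list[str]:
    """Return required setting names that are absent or blank in env."""
    return [name for name in required if not str(env.get(name, "")).strip()]


if __name__ == "__main__":
    missing = missing_settings(dict(os.environ))
    if missing:
        print("Review these settings before starting:", ", ".join(missing))
```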
A new engineer should build the local environment in a clear order. First install the local tooling and clone the repository. Then prepare the Python environment and the local .env file. After that, make sure Docker and Kind are available, create or reuse the Kind cluster, and deploy the cluster-side dependencies such as ArgoCD, Traefik, observability, and the demo workloads. Then make sure Gitea is available for local GitOps testing. Finally, run the copilot locally or deploy it into the Kind cluster, depending on the validation path being used.
This order matters because the copilot becomes significantly more useful once the cluster-side dependencies and GitOps path exist. Running only the local web app is enough for early UI or prompt iteration, but it is not enough to validate the full pilot workflow.
This cluster covers the repeatable way to push a new pilot build through the local GitOps path and validate what the running environment exposes back to the copilot.
The current local pilot now has a defined release flow instead of relying on ad-hoc rebuilds. The intended path is: generate a new local image tag and build ID, build the Docker image, load it into the Kind cluster, update the copilot GitOps manifests with the new image tag and build ID, push those manifest changes to the local Gitea repository, and let ArgoCD reconcile the new version.
The repository includes a release helper script for this path in scripts/release_local_pilot.py. This is the preferred local pilot release route because it keeps image updates, manifest updates, and frontend cache-busting aligned instead of changing them by hand in separate steps.
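For orientation only, the release steps above roughly correspond to the commands built in this sketch. It does not describe how scripts/release_local_pilot.py is implemented. The image name and tag are hypothetical, and the Kind cluster name agentic-local is inferred from the kind-agentic-local context, because Kind prefixes context names with "kind-". The manifest update is a file edit and is therefore only noted in a comment.

```python
def local_release_plan(image: str, tag: str, kind_cluster: str = "agentic-local") -> list[list[str]]:
    """Build the command sequence for a manual local release (nothing is executed)."""
    ref = f"{image}:{tag}"
    return [
        ["docker", "build", "-t", ref, "."],
        ["kind", "load", "docker-image", ref, "--name", kind_cluster],
        # Between these steps: update the GitOps manifests with the new tag and build ID.
        ["git", "commit", "-am", f"Release {ref}"],
        ["git", "push"],  # assumption: the remote points at the local Gitea repository
    ]


plan = local_release_plan("platform-radar", "20240101-1")
for command in plan:
    print(" ".join(command))
```

After the push, ArgoCD reconciles the new version, so no direct apply against the cluster is part of the plan.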
The pilot now exposes a runtime discovery endpoint at /api/runtime/discovery. This is useful for validating what the copilot can discover live from the current environment, instead of assuming that every version, source, or controller state has been manually documented correctly.
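A minimal way to inspect the endpoint from Python is sketched below, using only the standard library. The base URL is the local development server from the reference setup. The assumption that the response is JSON is not confirmed here, and no response fields are assumed.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

DISCOVERY_PATH = "api/runtime/discovery"  # endpoint path from the text above


def discovery_url(base_url: str) -> str:
    """Build the discovery URL for a given app base URL."""
    return urljoin(base_url.rstrip("/") + "/", DISCOVERY_PATH)


def fetch_discovery(base_url: str = "http://127.0.0.1:8000", timeout: float = 5.0):
    """Fetch and decode the discovery document (assumes a JSON body)."""
    with urlopen(discovery_url(base_url), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(json.dumps(fetch_discovery(), indent=2))
```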
The in-app documentation is meant for operator and pilot guidance. The current engineering handover lives in the repository under docs/, starting with docs/README.md. The most important engineering documents are docs/architecture.md, docs/technical-spec.md, docs/source-and-permission-model.md, docs/security-model.md, docs/operations-runbook.md, and docs/management-cluster-landing.md.
Backlog 1 and Backlog 2 are also documented in the repo. Backlog 1 describes the current single-agent improvements. Backlog 2 describes the future coordinator and MCP direction, but that runtime architecture is not active in the current app.