Resources and blog

Field notes for observability and drift-aware operations.

Starter explainers, incident playbooks, and architecture notes for teams evaluating observability platforms, AI observability, retrieval drift, system drift detection, OpenTelemetry, and SRE tooling.

Playbook

How to start private AI observability with metadata-only telemetry

A practical guide to monitoring production AI through prompt hashes, retrieval metadata, latency, fallback, escalation, and drift signals without exporting raw sensitive content.

7 min read · May 27, 2026

Playbook

How to monitor retrieval drift in production RAG systems

A practical playbook for detecting retrieval drift through source coverage, no-source rate, fallback behavior, latency, and answer quality after knowledge-base or ranking changes.

7 min read · May 25, 2026

Playbook

How to turn red-team findings into production guardrails

A practical playbook for turning jailbreaks, grounding failures, and policy misses into runtime controls, eval reruns, alerts, and audit evidence.

7 min read · May 24, 2026

Playbook

What belongs in an AI release evidence package

A practical release checklist for production AI teams covering prompt, model, retrieval, guardrail, eval, cost, latency, and human-review evidence.

7 min read · May 23, 2026

Playbook

How AI observability supports NIST AI RMF

A practical guide to mapping NIST AI RMF work into production AI telemetry, evals, guardrails, drift evidence, and operator review.

7 min read · May 21, 2026

Playbook

AI observability for production teams

A practical guide to AI observability covering prompt drift, model drift, retrieval quality, guardrail evidence, eval results, cost, latency, and incident-ready operations.

8 min read · Apr 28, 2026

Explainer

What is observability?

A practical definition of observability for engineering teams that need to understand production systems through logs, metrics, traces, alerts, incidents, and change context.

5 min read · Apr 27, 2026

Guide

Logs vs metrics vs traces

How logs, metrics, and traces differ, when to use each signal, and why production teams need all three for reliable incident detection and response.

6 min read · Apr 27, 2026

Explainer

What is system drift?

A plain-language guide to system drift, how it appears in production telemetry, and how deterministic drift detection can help teams find issues before they become incidents.

6 min read · Apr 27, 2026

Playbook

How to reduce MTTR

A practical incident management guide for reducing mean time to recovery by connecting telemetry, alerts, ownership, timelines, and change context.

6 min read · Apr 27, 2026

Guide

How engineering teams detect production incidents earlier

How engineering and SRE teams can detect production incidents earlier by combining observability, system drift detection, OpenTelemetry context, and incident management workflows.

7 min read · Apr 27, 2026

Executive evaluation

Review Driftdog against your enterprise AI control requirements.

Walk through deployment posture, baseline evaluation logic, audit evidence, drift detection, hallucination-risk controls, and the operating records required for regulated AI systems.
