The metrics behind AI-era engineering
Twenty-five signals that separate durable delivery from expensive motion.
Your team ships more with AI. The real question is whether that code survives. DevXOS analyzes your Git history and PR data to tell you.
Traditional velocity metrics were built for a world where humans wrote all the code. This deck walks you through what we built to replace them.
What DevXOS will never do
Ten product principles govern the analytics we build. These four matter most for this pitch.
DevXOS never ranks or scores developers. Every metric describes repositories, teams, and dynamics — never who wrote what fastest.
No IDE plugin, no proprietary telemetry, no vendor lock-in. We read your Git history and PR data — that's it.
Every metric must hold up in plain language. If an engineering leader can't understand why a score exists, the score doesn't exist.
Engineering analytics can easily become surveillance. DevXOS must be safe for teams to adopt. If a feature reduces trust, it doesn't ship.
Foundations — signal vs noise
Four metrics present in every DevXOS run. The baseline.
Does your code survive its first week?
The single most important number in DevXOS.
Of every file your team touched, the fraction that was NOT modified again within the churn window. Files touched once count as stabilized.
A stabilization ratio near 1.0 means changes persist — real work. Near 0.0 means rework dressed up as delivery. This is signal vs noise in one number.
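The computation behind the number is deliberately simple. A minimal Python sketch, assuming a flat list of (file, timestamp) touch events and a 168-hour default window; the function name and input shape are illustrative, not the DevXOS API:

```python
from collections import defaultdict

def stabilization_ratio(touches, churn_window_hours=168):
    """touches: list of (file_path, unix_timestamp) edit events.
    Returns the fraction of touched files NOT re-modified within the
    churn window. Files touched exactly once count as stabilized."""
    by_file = defaultdict(list)
    for path, ts in touches:
        by_file[path].append(ts)
    window = churn_window_hours * 3600
    stabilized = 0
    for times in by_file.values():
        times.sort()
        # a file churns if any consecutive pair of edits falls inside the window
        churned = any(b - a <= window for a, b in zip(times, times[1:]))
        stabilized += 0 if churned else 1
    return stabilized / len(by_file) if by_file else 1.0
```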
The cost of incomplete first tries
Count and weight of files that needed rapid re-editing.
Files modified 2+ times where a consecutive pair of edits falls inside the churn window. Plus the total lines touched across those re-edits.
Churn is the tax your team pays on shaky first implementations. Unlike velocity, it goes up when things go wrong — and it's visible per file.
The bluntest signal something broke
How often does your team un-ship what it shipped?
Commits matching revert patterns, as a fraction of total. Attribution credits the ORIGIN of the reverted code — not who wrote the revert.
Reverts are rare but unambiguous. Segmented by origin and AI tool, they answer: which tool's code gets rolled back? That comparison is hard to argue with.
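A hedged sketch of the detection half, assuming commit messages are available as plain strings; the exact patterns DevXOS matches may be broader:

```python
import re

# common revert signatures; real tooling may need more patterns
REVERT_RE = re.compile(r'^revert\b|this reverts commit [0-9a-f]{7,40}',
                       re.IGNORECASE | re.MULTILINE)

def revert_rate(commit_messages):
    """Fraction of commits whose message matches a revert pattern."""
    if not commit_messages:
        return 0.0
    reverts = sum(1 for m in commit_messages if REVERT_RE.search(m))
    return reverts / len(commit_messages)
```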
Classification — intent & origin
Before you can compare, you have to separate. We separate by what each commit was for, and who wrote it.
Feature, fix, refactor, config
Every commit classified. Deterministically. No ML.
Conventional Commit prefixes first, keywords second, file-type heuristic third. Every commit gets an intent: FEATURE, FIX, REFACTOR, CONFIG, or UNKNOWN.
"We're shipping fast" means nothing if 60% is FIX. Intent distribution turns a flat commit count into a picture of what the team is actually spending time on.
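The three-tier fallback could look like this in Python. The keyword lists and the `chore` mapping to CONFIG are illustrative assumptions, not the shipped rules:

```python
PREFIXES = {"feat": "FEATURE", "fix": "FIX", "refactor": "REFACTOR",
            "chore": "CONFIG", "build": "CONFIG", "ci": "CONFIG"}
KEYWORDS = {"FIX": ("bugfix", "hotfix", "resolve"),
            "REFACTOR": ("refactor", "cleanup", "restructure")}
CONFIG_EXTS = (".yml", ".yaml", ".toml", ".ini", ".json")

def classify_intent(message, files):
    # 1. Conventional Commit prefix, e.g. "fix(auth): ..."
    head = message.split(":", 1)[0].split("(", 1)[0].strip().lower()
    if head in PREFIXES:
        return PREFIXES[head]
    # 2. keyword scan over the full message
    lower = message.lower()
    for intent, words in KEYWORDS.items():
        if any(w in lower for w in words):
            return intent
    # 3. file-type heuristic: a commit touching only config files is CONFIG
    if files and all(f.endswith(CONFIG_EXTS) for f in files):
        return "CONFIG"
    return "UNKNOWN"
```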
Human, AI-assisted, or bot
No guessing. We read co-author tags and author patterns.
Co-author matches Copilot, Claude, Cursor, Codeium, Tabnine, Amazon Q, Gemini, or Windsurf → AI_ASSISTED. Known bot names → BOT. Everything else → HUMAN.
Every single other metric in DevXOS can be segmented by origin. This is the dimension that unlocks AI impact analysis — without surveys, without self-report.
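A sketch of the rule, assuming raw author names and full commit messages as input; the marker lists here are abbreviated, not exhaustive:

```python
AI_COAUTHORS = ("copilot", "claude", "cursor", "codeium",
                "tabnine", "amazon q", "gemini", "windsurf")
BOT_MARKERS = ("dependabot", "renovate", "[bot]")

def classify_origin(author, message):
    """Co-author trailer naming a known AI tool -> AI_ASSISTED;
    known bot author -> BOT; everything else -> HUMAN."""
    coauthors = [line for line in message.lower().splitlines()
                 if line.startswith("co-authored-by:")]
    if any(tool in line for line in coauthors for tool in AI_COAUTHORS):
        return "AI_ASSISTED"
    if any(b in author.lower() for b in BOT_MARKERS):
        return "BOT"
    return "HUMAN"
```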
How much AI work is already visible
The other half of this number is your attribution gap.
AI-attributed commits as a percentage of all non-bot commits. A proxy for how much of the actual AI usage is declared in the git metadata.
Compliance officers, AI governance leads, and skeptical CTOs all ask the same thing: how much AI is in our code? This is the answer you can defend in a meeting.
Quality & durability — does it last?
From commit shape to line survival. How code actually ages.
AI code has a shape. See it.
Focused, spread, bulk, or surgical — by origin.
Median files, lines per file, and directory spread per commit, grouped by origin. Each origin's typical shape emerges: deep, wide, thin, or broad.
AI-generated commits tend to be wide & shallow (spread) — scaffolding, boilerplate. Human commits lean surgical or focused. This pattern is measurable, not anecdotal.
Does AI code break faster?
Measured in hours — from first commit to rework.
Median time between consecutive modifications of the same file within the churn window. Attribution credits the ORIGINAL commit, not the fix.
Buckets: fast < 72h (probably obvious bugs), medium 72–168h (caught in review/prod), slow > 168h (subtle). Compare AI vs human fast-rework rates side by side.
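The gap computation and bucketing, sketched under the assumption that per-file edit timestamps are already collected; names are illustrative:

```python
from statistics import median

def median_rework_hours(edit_times_by_file, churn_window_hours=168):
    """Median gap (hours) between consecutive edits of the same file
    that fall inside the churn window."""
    gaps = []
    for times in edit_times_by_file.values():
        ts = sorted(times)
        gaps += [(b - a) / 3600 for a, b in zip(ts, ts[1:])
                 if (b - a) / 3600 <= churn_window_hours]
    return median(gaps) if gaps else None

def rework_bucket(hours_to_rework):
    """fast < 72h, medium 72-168h, slow > 168h."""
    if hours_to_rework < 72:
        return "fast"      # likely obvious bugs
    if hours_to_rework <= 168:
        return "medium"    # caught in review or early production
    return "slow"          # subtle issues
```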
One bad commit, three follow-ups
Blast radius of code that doesn't quite land.
A trigger commit followed by 1+ FIX commits on shared files within the churn window. Depth = number of follow-up fixes. Attribution credits the trigger's origin.
A 30% cascade rate means almost a third of your trigger commits break something. Segmented by AI tool, this tells you which tool's output carries the highest cleanup cost.
How much AI code survives the quarter
Git blame at HEAD. The ultimate survival test.
For each origin and each AI tool: lines introduced vs lines still present at HEAD. Survival rate. Median age of surviving lines in days.
Our internal benchmark found AI-attributed lines survive at 79% vs human 64% — on primed repos. Durability is the counter-intuitive headline: AI code may last longer when attributed properly.
Does AI code pass review?
Single-pass PRs vs rounds of changes-requested.
Per origin and per AI tool: fraction of commits that landed via a PR; of those, fraction merged with zero CHANGES_REQUESTED; median review rounds.
Two different AI tools can produce code that reviews very differently. Acceptance rate quantifies that — it's the missing link between "AI productivity" claims and peer-reviewed outcomes.
The full journey, per origin
Committed → In PR → Stabilized → Still alive.
Four-stage delivery funnel computed per origin, with conversion rates between each step. Composes origin distribution, acceptance, stabilization, and durability.
AI might crush commits and pass review — and still drop off at stabilization. The funnel reveals exactly where each origin wins and where it leaks. One chart, full story.
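The funnel math itself is stage counts plus pairwise conversion rates. A sketch with hypothetical counts:

```python
def delivery_funnel(committed, in_pr, stabilized, alive):
    """Four-stage funnel: (stage name, count, conversion from prior stage)."""
    stages = [("committed", committed), ("in_pr", in_pr),
              ("stabilized", stabilized), ("alive", alive)]
    out, prev = [], None
    for name, count in stages:
        conv = count / prev if prev else None  # first stage has no conversion
        out.append((name, count, conv))
        prev = count
    return out
```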
Detection — what's hiding in plain sight
Patterns you can't see until you look for them.
The AI work nobody tagged
Human-classified commits with AI-shaped velocity patterns.
Flags HUMAN commits hitting 2+ of: 3 commits in 2h, 100+ LOC, < 30min since prev, 5+ files. We never call it AI — we surface the gap for review.
If ai_detection_coverage says 40% and attribution gap flags another 30% of human commits as suspect, your real AI footprint is double what your governance dashboard shows.
Copy-paste is up 8× since AI
GitClear 2025. Measured. Now check yours.
Commits containing 5+ contiguous identical non-trivial lines across multiple files. Rate per commit, median block size, segmented by origin and by AI tool.
Copy-paste is the fast lane to entropy: the same bug, in five places, forever. A rising duplicate rate is the leading indicator of debt you haven't paid yet.
Real refactors look different
And we can tell the difference at the diff level.
Percentage of changed lines that were moved between files in the same commit. Refactoring ratio = moved / (moved + duplicated) — a code-health index.
Moved code dropped from 24% to 9.5% post-AI in industry data. When your refactoring ratio rises, the team is actually extracting and organizing — not just generating more.
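Both halves of the ratio can come from one pass over a commit's diff. A simplified sketch that matches unique non-trivial lines; real move detection is fuzzier, and the input shape is an assumption:

```python
from collections import defaultdict

def moved_and_duplicated(commit_diff, min_len=10):
    """commit_diff: {path: (added_lines, deleted_lines)} for one commit.
    A line counts as MOVED when deleted in one file and added in another
    in the same commit; as DUPLICATED when added verbatim to 2+ files.
    Short (trivial) lines are ignored."""
    added, deleted = defaultdict(set), set()
    for path, (adds, dels) in commit_diff.items():
        for line in adds:
            if len(line.strip()) >= min_len:
                added[line.strip()].add(path)
        for line in dels:
            if len(line.strip()) >= min_len:
                deleted.add(line.strip())
    moved = sum(1 for line in added if line in deleted)
    duplicated = sum(1 for line, paths in added.items()
                     if len(paths) >= 2 and line not in deleted)
    total = moved + duplicated
    # refactoring ratio = moved / (moved + duplicated)
    return moved, duplicated, (moved / total if total else None)
```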
Improving mature code, or churning this month's?
The age of the lines your team is rewriting.
Git-blame buckets the age of each line being modified: under 2 weeks, 2–4 weeks, 1–12 months, 1–2 years, 2+ years. Plus percentage revising new code vs mature code.
GitClear found 79% of revised lines in 2024 were less than a month old. If most of your team's effort is re-churning fresh code, you're not improving the codebase — you're spinning.
The 14-day canary
Code that gets re-edited inside two weeks.
Files that received new code and were modified again within 14 or 28 days. Segmented by origin and by AI tool, attributed to the INTRODUCING commit.
Fresh code that gets re-touched within two weeks usually means the first try missed. A 2-week rate trending up is the earliest quality alarm you can wire to a dashboard.
Whose code attracts the bugs?
Fair share vs disproportionate share.
For each FIX commit's target files, credit the origin of the last non-fix commit. Compute code share vs fix share vs disproportionality (fix/code).
If AI wrote 30% of commits but attracts 50% of fixes, disproportionality = 1.67 — the clearest signal that AI-written code costs more to maintain than it first appears.
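The arithmetic, with the slide's own example; the function name is illustrative:

```python
def fix_disproportionality(code_share, fix_share):
    """fix share / code share. 1.0 means a fair share of fixes; above
    1.0, that origin's code attracts more fixes than it should."""
    return fix_share / code_share if code_share else None

# AI wrote 30% of commits but attracts 50% of fixes -> ~1.67
```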
Time & structure — where and when
Your repo has zones. Your year has turning points. We map both.
Your repo has zones
Some stable. Some on fire. Name them.
Per-directory rollup (depth 2 by default) of files touched, stabilized, churn events, and stabilization ratio. Directories classified stable ≥ 0.80, volatile < 0.50.
"The backend is a mess" is a feeling. Stability map turns it into `src/payments/` at 0.41 vs `src/shared/` at 0.92. That's something you can fix, staff, or document.
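A sketch of the rollup and classification. The `mixed` label for the middle band is an illustrative addition; the deck defines only stable and volatile:

```python
from collections import defaultdict

def stability_map(file_stats, depth=2):
    """file_stats: {path: (files_touched, files_stabilized)}.
    Rolls up to directories at the given depth and classifies each:
    stable >= 0.80, volatile < 0.50, 'mixed' (assumed label) between."""
    dirs = defaultdict(lambda: [0, 0])
    for path, (touched, stabilized) in file_stats.items():
        key = "/".join(path.split("/")[:depth])
        dirs[key][0] += touched
        dirs[key][1] += stabilized
    out = {}
    for key, (touched, stabilized) in dirs.items():
        ratio = stabilized / touched if touched else 1.0
        label = ("stable" if ratio >= 0.80
                 else "volatile" if ratio < 0.50 else "mixed")
        out[key] = (round(ratio, 2), label)
    return out
```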
Name the files that cost you
Top churning files with their chain. And the pairs that move together.
Top 10 churning files with their full chain (e.g., feat → fix → fix → refactor). Plus file couplings: pairs that co-occur in commits with high coupling rate.
Aggregate numbers tell you something is wrong. Churn detail tells you which file, what pattern, and what else changes with it. Now you can fix the root cause, not the symptom.
Every week tells a story
Weekly breakdown + four pattern detectors.
ISO-week rollup: commits, LOC, intent mix, origin mix, stabilization, churn, PRs merged. Patterns auto-detected: burst_then_fix, quiet_period, ai_ramp, intent_shift.
When a metric jumped, you need to know why. The timeline + patterns layer gives you an annotated story — not just numbers, but the moments that made them.
Quantify review friction
Before it becomes a complaint in the retro.
Median time-to-merge, median PR size (files and lines), median review rounds, and single-pass rate — the fraction of PRs merged without a CHANGES_REQUESTED review.
Single-pass rate is the PR metric that correlates most with team satisfaction. Combined with time-to-merge, it tells you whether review is a gate or a bottleneck.
The mix of how your team writes
Added, deleted, updated, moved, duplicated.
Lightweight five-bucket taxonomy of line operations per commit, built from diff content plus duplicate and move detectors. Overall plus per-origin breakdown.
A team dominated by `added` is growing fast; by `updated`, iterating; by `moved`, refactoring; by `duplicated`, accumulating debt. Shape of work, in one chart.
Motion — velocity without blindness
How fast is the team — and is that speed paid for in durability?
Speed means nothing if durability drops
Commits/week, lines/week, and the correlation with quality.
14-day windows of commits/week and lines/week. Trend classified accelerating, stable, decelerating. Correlated with per-window stabilization — the durability connection.
Accelerating with durability steady = real progress. Accelerating while stabilization drops = you are shipping noise faster. Velocity alone lies. Velocity + durability tells the truth.
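The trend-classification half can be sketched as follows; the 15% band is an assumed threshold, not the documented one:

```python
from statistics import mean

def velocity_trend(window_rates, threshold=0.15):
    """window_rates: commits/week for consecutive 14-day windows.
    Compares the latest window to the mean of the earlier ones;
    the +/-15% band deciding the label is an illustrative assumption."""
    if len(window_rates) < 2:
        return "stable"
    baseline = mean(window_rates[:-1])
    delta = (window_rates[-1] - baseline) / baseline if baseline else 0
    if delta > threshold:
        return "accelerating"
    if delta < -threshold:
        return "decelerating"
    return "stable"
```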
The day AI changed your metrics
Detected automatically. Pre vs post, side by side.
Finds the inflection point where AI-attributed commits began appearing. Splits history into pre-adoption and post-adoption, each with a full ReportMetrics snapshot.
Before-and-after proof. Stabilization went from 0.71 to 0.84 since the Copilot rollout? That's a number you can put on a slide. Reversed? That's a number you need to look at fast.
See your own numbers
One CLI command. No servers. No SaaS telemetry on private code.
DevXOS runs locally against your Git history and GitHub PRs, produces a report in minutes, and optionally pushes the metrics to your tenant for cross-repo views.
Everything in this deck is open and auditable. Read the methodology, run it on your own repo, and decide for yourself whether the signal is real.