The metrics behind AI-era engineering
Twenty-five signals that separate durable delivery from expensive motion.
Your team ships more with AI. The real question is whether that code survives. DevXOS analyzes your Git history and PR data to tell you.
Traditional velocity metrics were built for a world where humans wrote all the code. This deck walks you through what we built to replace them.
What DevXOS will never do
Ten product principles govern the analytics we build. These four matter most for this pitch.
DevXOS never ranks or scores developers. Every metric describes repositories, teams, and dynamics — never who wrote what fastest.
No IDE plugin, no proprietary telemetry, no vendor lock-in. We read your Git history and PR data — that's it.
Every metric must hold up in plain language. If an engineering leader can't understand why a score exists, the score doesn't exist.
Engineering analytics can easily become surveillance. DevXOS must be safe for teams to adopt. If a feature reduces trust, it doesn't ship.
Foundations — signal vs noise
Four metrics present in every DevXOS run. The baseline.
Does your code survive its first week?
The single most important number in DevXOS.
Of every file your team touched, the fraction that was NOT modified again within the churn window. Files touched once count as stabilized.
A stabilization ratio near 1.0 means changes persist — real work. Near 0.0 means rework dressed up as delivery. This is signal vs noise in one number.
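The computation behind the number is deliberately simple. A minimal Python sketch, assuming a flat list of (file, timestamp) touch events and a 168-hour default window; the function name and input shape are illustrative, not the DevXOS API:

```python
from collections import defaultdict

def stabilization_ratio(touches, churn_window_hours=168):
    """touches: list of (file_path, unix_timestamp) edit events.
    Returns the fraction of touched files NOT re-modified within the
    churn window. Files touched exactly once count as stabilized."""
    by_file = defaultdict(list)
    for path, ts in touches:
        by_file[path].append(ts)
    window = churn_window_hours * 3600
    stabilized = 0
    for times in by_file.values():
        times.sort()
        # a file churns if any consecutive pair of edits falls inside the window
        churned = any(b - a <= window for a, b in zip(times, times[1:]))
        stabilized += 0 if churned else 1
    return stabilized / len(by_file) if by_file else 1.0
```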
The cost of incomplete first tries
Count and weight of files that needed rapid re-editing.
Files modified 2+ times where a consecutive pair of edits falls inside the churn window. Plus the total lines touched across those re-edits.
Churn is the tax your team pays on shaky first implementations. Unlike velocity, it goes up when things go wrong — and it's visible per file.
The bluntest signal something broke
How often does your team un-ship what it shipped?
Commits matching revert patterns, as a fraction of total. Attribution credits the ORIGIN of the reverted code — not who wrote the revert.
Reverts are rare but unambiguous. Segmented by origin and AI tool, they answer: which tool's code gets rolled back? That comparison is hard to argue with.
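A hedged sketch of the detection half, assuming commit messages are available as plain strings; the exact patterns DevXOS matches may be broader:

```python
import re

# common revert signatures; real tooling may need more patterns
REVERT_RE = re.compile(r'^revert\b|this reverts commit [0-9a-f]{7,40}',
                       re.IGNORECASE | re.MULTILINE)

def revert_rate(commit_messages):
    """Fraction of commits whose message matches a revert pattern."""
    if not commit_messages:
        return 0.0
    reverts = sum(1 for m in commit_messages if REVERT_RE.search(m))
    return reverts / len(commit_messages)
```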
Classification — intent & origin
Before you can compare, you have to separate. We separate by what each commit was for, and who wrote it.
Feature, fix, refactor, config
Every commit classified. Deterministically. No ML.
Conventional Commit prefixes first, keywords second, file-type heuristic third. Every commit gets an intent: FEATURE, FIX, REFACTOR, CONFIG, or UNKNOWN.
"We're shipping fast" means nothing if 60% is FIX. Intent distribution turns a flat commit count into a picture of what the team is actually spending time on.
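The three-tier fallback could look like this in Python. The keyword lists and the `chore` mapping to CONFIG are illustrative assumptions, not the shipped rules:

```python
PREFIXES = {"feat": "FEATURE", "fix": "FIX", "refactor": "REFACTOR",
            "chore": "CONFIG", "build": "CONFIG", "ci": "CONFIG"}
KEYWORDS = {"FIX": ("bugfix", "hotfix", "resolve"),
            "REFACTOR": ("refactor", "cleanup", "restructure")}
CONFIG_EXTS = (".yml", ".yaml", ".toml", ".ini", ".json")

def classify_intent(message, files):
    # 1. Conventional Commit prefix, e.g. "fix(auth): ..."
    head = message.split(":", 1)[0].split("(", 1)[0].strip().lower()
    if head in PREFIXES:
        return PREFIXES[head]
    # 2. keyword scan over the full message
    lower = message.lower()
    for intent, words in KEYWORDS.items():
        if any(w in lower for w in words):
            return intent
    # 3. file-type heuristic: a commit touching only config files is CONFIG
    if files and all(f.endswith(CONFIG_EXTS) for f in files):
        return "CONFIG"
    return "UNKNOWN"
```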
Human, AI-assisted, or bot
No guessing. We read co-author tags and author patterns.
Co-author matches Copilot, Claude, Cursor, Codeium, Tabnine, Amazon Q, Gemini, or Windsurf → AI_ASSISTED. Known bot names → BOT. Everything else → HUMAN.
Every single other metric in DevXOS can be segmented by origin. This is the dimension that unlocks AI impact analysis — without surveys, without self-report.
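A sketch of the rule, assuming raw author names and full commit messages as input; the marker lists here are abbreviated, not exhaustive:

```python
AI_COAUTHORS = ("copilot", "claude", "cursor", "codeium",
                "tabnine", "amazon q", "gemini", "windsurf")
BOT_MARKERS = ("dependabot", "renovate", "[bot]")

def classify_origin(author, message):
    """Co-author trailer naming a known AI tool -> AI_ASSISTED;
    known bot author -> BOT; everything else -> HUMAN."""
    coauthors = [line for line in message.lower().splitlines()
                 if line.startswith("co-authored-by:")]
    if any(tool in line for line in coauthors for tool in AI_COAUTHORS):
        return "AI_ASSISTED"
    if any(b in author.lower() for b in BOT_MARKERS):
        return "BOT"
    return "HUMAN"
```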
How much AI work is already visible
The other half of this number is your attribution gap.
AI-attributed commits as a percentage of all non-bot commits. A proxy for how much of the actual AI usage is declared in the git metadata.
Compliance officers, AI governance leads, and skeptical CTOs all ask the same thing: how much AI is in our code? This is the answer you can defend in a meeting.
Quality & durability — does it last?
From commit shape to line survival. How code actually ages.
AI code has a shape. See it.
Focused, spread, bulk, or surgical — by origin.
Median files, lines per file, and directory spread per commit, grouped by origin. Each origin's typical shape emerges: deep, wide, thin, or broad.
AI-generated commits tend to be wide & shallow (spread) — scaffolding, boilerplate. Human commits lean surgical or focused. This pattern is measurable, not anecdotal.
Does AI code break faster?
Measured in hours — from first commit to rework.
Median time between consecutive modifications of the same file within the churn window. Attribution credits the ORIGINAL commit, not the fix.
Buckets: fast < 72h (probably obvious bugs), medium 72–168h (caught in review/prod), slow > 168h (subtle). Compare AI vs human fast-rework rates side by side.
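The gap computation and bucketing, sketched under the assumption that per-file edit timestamps are already collected; names are illustrative:

```python
from statistics import median

def median_rework_hours(edit_times_by_file, churn_window_hours=168):
    """Median gap (hours) between consecutive edits of the same file
    that fall inside the churn window."""
    gaps = []
    for times in edit_times_by_file.values():
        ts = sorted(times)
        gaps += [(b - a) / 3600 for a, b in zip(ts, ts[1:])
                 if (b - a) / 3600 <= churn_window_hours]
    return median(gaps) if gaps else None

def rework_bucket(hours_to_rework):
    """fast < 72h, medium 72-168h, slow > 168h."""
    if hours_to_rework < 72:
        return "fast"      # likely obvious bugs
    if hours_to_rework <= 168:
        return "medium"    # caught in review or early production
    return "slow"          # subtle issues
```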
One bad commit, three follow-ups
Blast radius of code that doesn't quite land.
A trigger commit followed by 1+ FIX commits on shared files within the churn window. Depth = number of follow-up fixes. Attribution credits the trigger's origin.
A 30% cascade rate means almost a third of your trigger commits break something. Segmented by AI tool, this tells you which tool's output carries the highest cleanup cost.
How much AI code survives the quarter
Git blame at HEAD. The ultimate survival test.
For each origin and each AI tool: lines introduced vs lines still present at HEAD. Survival rate. Median age of surviving lines in days.
Our internal benchmark found AI-attributed lines survive at 79% vs human 64% — on primed repos. Durability is the counter-intuitive headline: AI code may last longer when attributed properly.
Does AI code pass review?
Single-pass PRs vs rounds of changes-requested.
Per origin and per AI tool: fraction of commits that landed via a PR; of those, fraction merged with zero CHANGES_REQUESTED; median review rounds.
Two different AI tools can produce code that reviews very differently. Acceptance rate quantifies that — it's the missing link between "AI productivity" claims and peer-reviewed outcomes.
The full journey, per origin
Committed → In PR → Stabilized → Still alive.
Four-stage delivery funnel computed per origin, with conversion rates between each step. Composes origin distribution, acceptance, stabilization, and durability.
AI might crush commits and pass review — and still drop off at stabilization. The funnel reveals exactly where each origin wins and where it leaks. One chart, full story.
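The funnel math itself is stage counts plus pairwise conversion rates. A sketch with hypothetical counts:

```python
def delivery_funnel(committed, in_pr, stabilized, alive):
    """Four-stage funnel: (stage name, count, conversion from prior stage)."""
    stages = [("committed", committed), ("in_pr", in_pr),
              ("stabilized", stabilized), ("alive", alive)]
    out, prev = [], None
    for name, count in stages:
        conv = count / prev if prev else None  # first stage has no conversion
        out.append((name, count, conv))
        prev = count
    return out
```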
Detection — what's hiding in plain sight
Patterns you can't see until you look for them.
The AI work nobody tagged
Human-classified commits with AI-shaped velocity patterns.
Flags HUMAN commits hitting 2+ of: 3 commits in 2h, 100+ LOC, < 30min since prev, 5+ files. We never call it AI — we surface the gap for review.
If ai_detection_coverage says 40% and attribution gap flags another 30% of human commits as suspect, your real AI footprint is double what your governance dashboard shows.
Copy-paste is up 8× since AI
GitClear 2025. Measured. Now check yours.
Commits containing 5+ contiguous identical non-trivial lines across multiple files. Rate per commit, median block size, segmented by origin and by AI tool.
Copy-paste is the fast lane to entropy: the same bug, in five places, forever. A rising duplicate rate is the leading indicator of debt you haven't paid yet.
Real refactors look different
And we can tell the difference at the diff level.
Percentage of changed lines that were moved between files in the same commit. Refactoring ratio = moved / (moved + duplicated) — a code-health index.
Moved code dropped from 24% to 9.5% post-AI in industry data. When your refactoring ratio rises, the team is actually extracting and organizing — not just generating more.
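Both halves of the ratio can come from one pass over a commit's diff. A simplified sketch that matches unique non-trivial lines; real move detection is fuzzier, and the input shape is an assumption:

```python
from collections import defaultdict

def moved_and_duplicated(commit_diff, min_len=10):
    """commit_diff: {path: (added_lines, deleted_lines)} for one commit.
    A line counts as MOVED when deleted in one file and added in another
    in the same commit; as DUPLICATED when added verbatim to 2+ files.
    Short (trivial) lines are ignored."""
    added, deleted = defaultdict(set), set()
    for path, (adds, dels) in commit_diff.items():
        for line in adds:
            if len(line.strip()) >= min_len:
                added[line.strip()].add(path)
        for line in dels:
            if len(line.strip()) >= min_len:
                deleted.add(line.strip())
    moved = sum(1 for line in added if line in deleted)
    duplicated = sum(1 for line, paths in added.items()
                     if len(paths) >= 2 and line not in deleted)
    total = moved + duplicated
    # refactoring ratio = moved / (moved + duplicated)
    return moved, duplicated, (moved / total if total else None)
```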
Improving mature code, or churning this month's?
The age of the lines your team is rewriting.
Git-blame buckets the age of each line being modified: under 2 weeks, 2–4 weeks, 1–12 months, 1–2 years, 2+ years. Plus percentage revising new code vs mature code.
GitClear found 79% of revised lines in 2024 were less than a month old. If most of your team's effort is re-churning fresh code, you're not improving the codebase — you're spinning.
The 14-day canary
Code that gets re-edited inside two weeks.
Files that received new code and were modified again within 14 or 28 days. Segmented by origin and by AI tool, attributed to the INTRODUCING commit.
Fresh code that gets re-touched within two weeks usually means the first try missed. A 2-week rate trending up is the earliest quality alarm you can wire to a dashboard.
Whose code attracts the bugs?
Fair share vs disproportionate share.
For each FIX commit's target files, credit the origin of the last non-fix commit. Compute code share vs fix share vs disproportionality (fix/code).
If AI wrote 30% of commits but attracts 50% of fixes, disproportionality = 1.67 — the clearest signal that AI-written code costs more to maintain than it first appears.
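The arithmetic, with the slide's own example; the function name is illustrative:

```python
def fix_disproportionality(code_share, fix_share):
    """fix share / code share. 1.0 means a fair share of fixes; above
    1.0, that origin's code attracts more fixes than it should."""
    return fix_share / code_share if code_share else None

# AI wrote 30% of commits but attracts 50% of fixes -> ~1.67
```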
Time & structure — where and when
Your repo has zones. Your year has turning points. We map both.
Your repo has zones
Some stable. Some on fire. Name them.
Per-directory rollup (depth 2 by default) of files touched, stabilized, churn events, and stabilization ratio. Directories classified stable ≥ 0.80, volatile < 0.50.
"The backend is a mess" is a feeling. Stability map turns it into `src/payments/` at 0.41 vs `src/shared/` at 0.92. That's something you can fix, staff, or document.
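A sketch of the rollup and classification. The `mixed` label for the middle band is an illustrative addition; the deck defines only stable and volatile:

```python
from collections import defaultdict

def stability_map(file_stats, depth=2):
    """file_stats: {path: (files_touched, files_stabilized)}.
    Rolls up to directories at the given depth and classifies each:
    stable >= 0.80, volatile < 0.50, 'mixed' (assumed label) between."""
    dirs = defaultdict(lambda: [0, 0])
    for path, (touched, stabilized) in file_stats.items():
        key = "/".join(path.split("/")[:depth])
        dirs[key][0] += touched
        dirs[key][1] += stabilized
    out = {}
    for key, (touched, stabilized) in dirs.items():
        ratio = stabilized / touched if touched else 1.0
        label = ("stable" if ratio >= 0.80
                 else "volatile" if ratio < 0.50 else "mixed")
        out[key] = (round(ratio, 2), label)
    return out
```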
Name the files that cost you
Top churning files with their chain. And the pairs that move together.
Top 10 churning files with their full chain (e.g., feat → fix → fix → refactor). Plus file couplings: pairs that co-occur in commits with high coupling rate.
Aggregate numbers tell you something is wrong. Churn detail tells you which file, what pattern, and what else changes with it. Now you can fix the root cause, not the symptom.
Every week tells a story
Weekly breakdown + four pattern detectors.
ISO-week rollup: commits, LOC, intent mix, origin mix, stabilization, churn, PRs merged. Patterns auto-detected: burst_then_fix, quiet_period, ai_ramp, intent_shift.
When a metric jumped, you need to know why. The timeline + patterns layer gives you an annotated story — not just numbers, but the moments that made them.
Quantify review friction
Before it becomes a complaint in the retro.
Median time-to-merge, median PR size (files and lines), median review rounds, and single-pass rate — the fraction of PRs merged without a CHANGES_REQUESTED review.
Single-pass rate is the PR metric that correlates most with team satisfaction. Combined with time-to-merge, it tells you whether review is a gate or a bottleneck.
The mix of how your team writes
Added, deleted, updated, moved, duplicated.
Lightweight five-bucket taxonomy of line operations per commit, built from diff content plus duplicate and move detectors. Overall plus per-origin breakdown.
A team dominated by `added` is growing fast; by `updated`, iterating; by `moved`, refactoring; by `duplicated`, accumulating debt. Shape of work, in one chart.
Motion — velocity without blindness
How fast is the team — and is that speed paid for in durability?
Speed means nothing if durability drops
Commits/week, lines/week, and the correlation with quality.
14-day windows of commits/week and lines/week. Trend classified accelerating, stable, decelerating. Correlated with per-window stabilization — the durability connection.
Accelerating with durability steady = real progress. Accelerating while stabilization drops = you are shipping noise faster. Velocity alone lies. Velocity + durability tells the truth.
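The trend-classification half can be sketched as follows; the 15% band is an assumed threshold, not the documented one:

```python
from statistics import mean

def velocity_trend(window_rates, threshold=0.15):
    """window_rates: commits/week for consecutive 14-day windows.
    Compares the latest window to the mean of the earlier ones;
    the +/-15% band deciding the label is an illustrative assumption."""
    if len(window_rates) < 2:
        return "stable"
    baseline = mean(window_rates[:-1])
    delta = (window_rates[-1] - baseline) / baseline if baseline else 0
    if delta > threshold:
        return "accelerating"
    if delta < -threshold:
        return "decelerating"
    return "stable"
```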
The day AI changed your metrics
Detected automatically. Pre vs post, side by side.
Finds the inflection point where AI-attributed commits began appearing. Splits history into pre-adoption and post-adoption, each with a full ReportMetrics snapshot.
Before-and-after proof. Stabilization went from 0.71 to 0.84 since the Copilot rollout? That's a number you can put on a slide. Reversed? That's a number you need to look at fast.
See your own numbers
One CLI command. No servers. No SaaS telemetry on private code.
DevXOS runs locally against your Git history and GitHub PRs, produces a report in minutes, and optionally pushes the metrics to your tenant for cross-repo views.
Everything in this deck is open and auditable. Read the methodology, run it on your own repo, and decide for yourself whether the signal is real.