How We Aggregate Running Shoe Reviews

The methodology behind every rating, review, and "best of" pick on The Shoe Nerds.

Most running shoe reviews tell you what one person thinks after a few weeks in a pair. We do something different. Every review, rating, and category pick on this site is built by aggregating what dozens of reviewers, testers, and real runners are actually saying — then synthesizing that into one clear, trustworthy verdict.

Here's exactly how it works.

Why Aggregation?

A single reviewer — even a great one — brings their own biomechanics, preferences, and blind spots to every shoe they test. A heel striker may underrate a shoe that shines for forefoot runners. A lightweight runner may miss durability issues that heavier runners encounter. A professional tester in controlled conditions may miss what happens at mile 400.

No single reviewer can cover all of this. But dozens of them, taken together, can.

The Shoe Nerds exists to do the work of reading all of them so you don't have to.

Our Sources

Before writing anything, we search across four categories of sources for every shoe:

Expert and lab reviewers — sources that test shoes under controlled conditions, measure energy return, stack height, torsional rigidity, and outsole grip. These include RunRepeat, Doctors of Running, WearTesters, and Sole Review.

Enthusiast reviewers — experienced runners who log real miles and write detailed, opinionated breakdowns. These include Believe in the Run, Road Trail Run, Running Shoes Guru, OutdoorGearLab, and others.

YouTube reviewers — video reviewers who often show wear patterns, flex tests, and on-foot footage that written reviews miss. We pull from Kofuzi, Ben Parkes, The Run Testers, EddBud, Fordy Runs, Seth James DeMoor, and more.

Real runners — user ratings and review threads from Running Warehouse, Fleet Feet, Dick's Sporting Goods, REI, Zappos, and Amazon. We also pull from Reddit communities including r/RunningShoeGeeks, r/running, and r/Ultramarathon, as well as LetsRun forums.

We do not write about a shoe until we have meaningful coverage across several of these source types. If a shoe only has early first-impression reviews, we wait.

How We Write Aggregated Reviews

Once we've gathered sources, we identify three things:

Trends — what do most reviewers agree on? If eight out of ten sources call a foam "lively and responsive," that's a trend. If two call it "too soft," that's a caveat worth noting.

Repeated praise — specific things that multiple reviewers love, often in similar language. These become the shoe's genuine strengths.

Recurring complaints — the most important signal. A complaint that appears across multiple reviewers, across different running styles and body types, is a real issue. A complaint that appears once may just reflect that reviewer's preferences.
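To make the distinction between a trend, a caveat, and a one-off comment concrete, here is a minimal Python sketch. It is an illustration only, not our production tooling: the function name, the 60% threshold, and the sample data are all assumptions made for this example. The only figure taken from the text above is the "eight out of ten" illustration of a trend.

```python
from collections import Counter

# Assumed threshold: a claim made by at least 60% of sources counts as a trend.
TREND_SHARE = 0.6


def classify_claims(source_claims, trend_share=TREND_SHARE):
    """Label each claim by how many distinct sources make it.

    source_claims maps a source name to the list of claims found in it.
    """
    total = len(source_claims)
    # Count each claim once per source, even if a source repeats it.
    counts = Counter(
        claim for claims in source_claims.values() for claim in set(claims)
    )
    labels = {}
    for claim, n in counts.items():
        if n / total >= trend_share:
            labels[claim] = "trend"
        elif n >= 2:
            labels[claim] = "caveat"   # repeated, but not a majority view
        else:
            labels[claim] = "one-off"  # may reflect one reviewer's preferences
    return labels


# Invented sample: ten sources, eight of which praise the foam.
sources = {f"reviewer_{i}": ["lively foam"] for i in range(8)}
sources["reviewer_8"] = ["too soft"]
sources["reviewer_9"] = ["too soft", "short laces"]
labels = classify_claims(sources)
```

In this sample, "lively foam" is labeled a trend, "too soft" a caveat, and "short laces" a one-off. Real claims would first need to be normalized so that different wordings of the same point are counted together, which this sketch does not attempt.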

We then write the review to reflect those trends honestly — including the caveats. A shoe that earns a 9.2/10 from us still has a "The caveat" section. That's not a bug. It's the point.

We do not claim firsthand testing. Every review on this site carries a disclaimer making this explicit. We are analysts, not testers. Our value is synthesis and objectivity — not personal experience.

How We Use AI — The Cycling Method

We use AI in our process. We want to be upfront about that — and more importantly, we want to explain exactly how, because the way we use it is what makes our content more reliable, not less.

We don't simply ask an AI to "write a review about shoe X." Anyone can do that. The output is generic, often inaccurate, and disconnected from what reviewers are actually saying. That's not what we do.

Our process is built around what we call the Cycling Method — a structured, multi-step loop that uses three different AI reasoning models, human verification checkpoints, and iterative refinement before anything is published. Here's how it works.

Phase 1 — Research (ChatGPT reasoning model)

A human analyst initiates the process with a detailed, structured research prompt that instructs the AI to search across specific source categories — lab reviewers, enthusiast reviewers, YouTube channels, Reddit communities, and retail user reviews — and return a structured research brief, not a draft article. The brief identifies consensus strengths, consensus weaknesses, reviewer disagreements, verified stats with source URLs, and an honest assessment of how much coverage exists for the shoe.

A human then verifies that brief against the original sources — confirming every stat against brand spec pages and retail listings, clicking through cited URLs to confirm they say what the brief claims, and flagging any fabricated quotes or unsupported claims. If corrections are needed, the analyst feeds them back into the AI with a correction prompt and the loop repeats until the research is clean.

Phase 2 — Independent Research Verification (Gemini reasoning model)

Once the ChatGPT research brief is verified, a second AI model — Gemini — conducts an independent research pass on the same shoe, searching across the same source categories without seeing the ChatGPT output first. This is deliberate: running two separate research models prevents the process from getting locked into one model's blind spots or biases. Where the two briefs agree, confidence is higher. Where they disagree, a human investigates. The Gemini brief goes through the same human verification checklist as the ChatGPT brief before either is passed to the writing phase.

Phase 3 — Writing (Claude reasoning model)

Once both research briefs are verified, Claude conducts its own independent research on the shoe before seeing either brief. Claude forms its own view first, then receives both verified research briefs and reconciles all three perspectives. Errors any single model makes are more likely to be caught by the others. Claude then writes the aggregated article using the combined, verified research.

A human reads the full draft, cross-checks it against both research briefs and the original sources, and flags anything that looks wrong or unsupported. Claude then runs a self-audit pass — explicitly asked to flag any claim it isn't confident is accurate — and corrections are fed back in. This cycle repeats until both the human verification checklist and the self-audit are clean.

After the article is verified, Claude runs a dedicated stat scoring step — using everything gathered across all three research phases to assign scores across eight characteristics, calibrated to lab data where available.

Nothing is published without human sign-off. The final step is always a human read against a pre-publish checklist — confirmed stats, verified sources, no firsthand testing language, no placeholder text.

Why three models? ChatGPT, Gemini, and Claude have different training data, different search approaches, and different blind spots. Running all three independently on the same shoe means each model's errors are far more likely to be caught by the others. A claim that appears across all three independent research passes has meaningfully higher confidence than one that only one model found. A discrepancy between models flags exactly where a human needs to look closely.
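The agreement check described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name and the sample claims are invented, and it assumes each brief has already been reduced to a set of normalized claim strings, which is the hard part in practice.

```python
def compare_briefs(briefs):
    """Split claims into those every model found and those needing a human look.

    briefs maps a model name to the set of normalized claims in its brief.
    """
    all_claims = set().union(*briefs.values())
    agreed, needs_review = [], []
    for claim in sorted(all_claims):
        found_by = sorted(m for m, claims in briefs.items() if claim in claims)
        if len(found_by) == len(briefs):
            agreed.append(claim)                    # present in every brief
        else:
            needs_review.append((claim, found_by))  # a discrepancy to investigate
    return agreed, needs_review


# Invented sample data for illustration only.
briefs = {
    "chatgpt": {"roomy toe box", "outsole wears quickly"},
    "gemini": {"roomy toe box"},
    "claude": {"roomy toe box", "outsole wears quickly"},
}
agreed, needs_review = compare_briefs(briefs)
```

Here "roomy toe box" lands in the agreed list, while "outsole wears quickly" is flagged along with the models that reported it, so the analyst knows where to start checking sources.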

Why this matters: The AI handles the work humans struggle to do at scale, such as reading hundreds of reviews, tracking which claims appear across multiple sources, and identifying patterns across large volumes of text. The human handles what AI is genuinely bad at, such as knowing when a stat doesn't add up, recognizing when a claim contradicts a source, and applying judgment that comes from actually knowing the running shoe world.

We currently run one to two cycles per piece of content. As the methodology matures, we plan to expand — more cycles, deeper verification steps, more granular source tracking — to push accuracy further than any single human reviewer or unstructured AI output could achieve alone. Every version improvement is documented internally. This is version 2.0.

How We Calculate Aggregated Ratings

Every shoe in our database has a single aggregated rating out of 10. Here's how we arrive at it.

After the article draft is complete and verified, we run a dedicated final step in Claude: using everything gathered across all three research phases, Claude is asked to calculate a weighted score based on how many sources cover the shoe, the consistency of sentiment across sources, the depth of testing (long-term reviews weighted higher than first impressions), and retail user ratings, which represent the largest sample of real-world runners.

We weight the score based on what the shoe is actually for. A race shoe is rated on how good it is as a race shoe — not on whether it's comfortable for recovery runs. A stability shoe is rated on how well it guides overpronators — not on how fast it feels at tempo pace.

Scores are rounded to one decimal place. We only publish a rating when we have enough source coverage to be confident in it. If coverage is thin, we note that.
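As a rough illustration of what a weighted score of this kind can look like, here is a minimal Python sketch. It is not the actual scoring prompt or formula: the weights, the five-source coverage floor, the 0 to 10 normalization, and the sample scores are all assumptions chosen for the example. It also leaves out sentiment consistency, which would need its own measure.

```python
from dataclasses import dataclass


@dataclass
class SourceScore:
    score: float  # source sentiment mapped to a 0-10 scale (assumed normalization)
    kind: str     # "long_term", "first_impression", or "retail"


# Hypothetical weights: the only stated rule is that long-term reviews
# count for more than first impressions.
WEIGHTS = {"long_term": 2.0, "first_impression": 1.0, "retail": 1.5}
MIN_SOURCES = 5  # assumed coverage floor below which no rating is produced


def aggregate_rating(sources, weights=WEIGHTS, min_sources=MIN_SOURCES):
    """Return a weighted average rounded to one decimal, or None if coverage is thin."""
    if len(sources) < min_sources:
        return None
    total_weight = sum(weights[s.kind] for s in sources)
    weighted_sum = sum(s.score * weights[s.kind] for s in sources)
    return round(weighted_sum / total_weight, 1)


# Invented sample scores for illustration only.
sources = [
    SourceScore(9.0, "long_term"),
    SourceScore(9.0, "long_term"),
    SourceScore(8.0, "first_impression"),
    SourceScore(8.0, "first_impression"),
    SourceScore(8.5, "retail"),
]
rating = aggregate_rating(sources)  # 64.75 / 7.5, rounded to 8.6
```

With only three of these sources, the same function returns None, which corresponds to withholding a rating when coverage is too thin.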

How We Score Individual Stats

Every shoe in our database carries scores across eight characteristics: the overall rating plus seven individual stats. All eight are derived during the same final step, in which Claude draws on everything gathered across all three research phases to assign each one.

Overall Rating (1.0–10.0) — A weighted aggregate of overall reviewer sentiment, retail user scores, and expert consensus, calibrated to the shoe's intended use case.

Fit (Narrow / Medium / Wide) — Based on reviewer descriptions of the toe box, midfoot, and overall last shape, plus retail user fit comments. If reviewers consistently note a roomy forefoot, it's Wide. If they flag tightness or a tapered toe box, it's Narrow.

Softness (Soft / Medium / Firm) — Based on lab softness measurements where available (RunRepeat AC scores), combined with the language reviewers use to describe the underfoot feel. "Plush," "marshmallow," and "sinky" map to Soft. "Snappy," "responsive," and "firm" map to Firm.

Breathability (Breathable / Medium / Warm) — Based on reviewer comments about upper ventilation, lab breathability assessments where available, and user complaints about heat buildup. A shoe that gets flagged for warmth in summer reviews or has a dense, unventilated upper is Warm.

Traction (High / Medium / Low) — Based on outsole rubber type and coverage, lab grip measurements where available, and reviewer comments on wet-road or trail performance. Vibram Megagrip and Continental rubber typically score High. Thin or exposed EVA outsoles score Low.

Shock Absorption (High / Medium / Low) — Based on lab shock absorption measurements where available (RunRepeat SA scores), stack height, midsole compound, and reviewer descriptions of impact protection. High-stack supercritical foams typically score High. Low-stack or firm-midsole shoes score Low.

Durability (Very Good / Good / Decent / Bad) — Based on reviewer reports of outsole wear rates, midsole compression over time, and upper durability. Shoes for which reviewers report significant wear before 200 miles score Bad. Shoes consistently praised for longevity score Very Good.

Energy Return (Very High / High / Medium / Low) — Based on lab energy return measurements where available (RunRepeat % energy return), midsole compound (PEBA/supercritical TPU score higher than EVA), and reviewer descriptions of propulsion and bounce.

If Claude does not have sufficient source information to confidently assign a score for a specific stat, it searches for targeted additional sources before assigning a value — or flags the stat as unreviewed rather than guessing. A dash in our database means we don't have enough coverage to be confident, not that the stat doesn't exist.
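The Softness mapping and the "dash instead of a guess" rule can be illustrated with a short Python sketch. Treat it as a simplification: the word lists come from the Softness description above, but the function name, the three-mention minimum, the tie-goes-to-Medium rule, and the plain hyphen used for the dash are assumptions. It also ignores lab measurements, which take priority where they exist.

```python
# Descriptor vocabularies taken from the Softness description above.
SOFT_WORDS = {"plush", "marshmallow", "sinky"}
FIRM_WORDS = {"snappy", "responsive", "firm"}

UNREVIEWED = "-"   # assumed representation of the database dash
MIN_MENTIONS = 3   # assumed minimum number of relevant descriptors


def softness_label(descriptors, min_mentions=MIN_MENTIONS):
    """Map reviewer descriptors to Soft / Medium / Firm, or a dash if coverage is thin."""
    soft = sum(1 for d in descriptors if d.lower() in SOFT_WORDS)
    firm = sum(1 for d in descriptors if d.lower() in FIRM_WORDS)
    if soft + firm < min_mentions:
        return UNREVIEWED  # not enough coverage to be confident
    if soft > firm:
        return "Soft"
    if firm > soft:
        return "Firm"
    return "Medium"        # assumed handling of an even split


label_soft = softness_label(["plush", "sinky", "Marshmallow", "firm"])
label_thin = softness_label(["snappy"])
label_split = softness_label(["plush", "firm", "snappy", "sinky"])
```

The three calls return "Soft", "-", and "Medium" respectively. The second case is the important one: a single descriptor is not treated as enough evidence to assign a value.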

How We Pick the "Best Of"

Our "Best Of" articles — Best Daily Trainers of 2026, Best Race Shoes, and so on — follow a separate research process. For each category, we search across the same source list with category-specific queries and identify which shoes appear most frequently across top-pick lists.

The winner is the shoe that:

Appears most frequently across expert top-pick lists
Has the highest aggregated rating across sources
Is currently available — not discontinued or superseded
Has strong user review consensus — not just press attention

We pick one shoe per category. If the consensus is genuinely split between two shoes, we pick the one with the higher aggregated rating and explain the tradeoff in the article.
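One way to read the selection rule above is sketched below in Python. The ordering of the criteria is an assumption made for this example: discontinued shoes are filtered out first, the remaining shoes are ranked by top-pick appearances, and the aggregated rating breaks ties. User review consensus is left out because it is a judgment call rather than a number. The class, function, and shoe names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    top_pick_mentions: int  # appearances across expert top-pick lists
    rating: float           # aggregated rating out of 10
    available: bool         # False if discontinued or superseded


def pick_winner(candidates):
    """Pick one shoe: most top-pick mentions, higher rating as the tie-break."""
    eligible = [c for c in candidates if c.available]
    if not eligible:
        return None
    # Tuples compare element by element, so rating only matters on a tie.
    return max(eligible, key=lambda c: (c.top_pick_mentions, c.rating))


# Invented candidates for illustration only.
candidates = [
    Candidate("Shoe A", 7, 8.9, True),
    Candidate("Shoe B", 7, 9.1, True),
    Candidate("Shoe C", 9, 9.3, False),  # strongest on paper, but unavailable
]
winner = pick_winner(candidates)
```

In this sample, Shoe C is excluded for being unavailable, Shoes A and B tie on mentions, and Shoe B wins on its higher rating.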

A note on guide articles: Our Best Of guides and deep-dive articles use an AI aggregation process similar to the one described above, adapted to the type of content. A "Best Race Shoes of 2026" guide involves the same broad source research and human verification, but applied across multiple shoes and category-level consensus rather than a single shoe review. The depth of the cycle varies based on the complexity and scope of the article.

What We Don't Do

We don't accept payments from brands to influence reviews or picks. We don't write about shoes we can't find meaningful independent coverage for. We don't fabricate sources or invent reviewer quotes. If we link to a source, it's a source we actually consulted.

Why This Matters

The running shoe market is noisy. New shoes launch every month. Brands spend heavily on marketing. Review sites face pressure from affiliate relationships and sponsored content. It's genuinely hard to know what to believe.

Our goal is to be the source you can trust precisely because we don't have a personal stake in any particular shoe winning — and because our process is designed from the ground up to catch errors, surface consensus, and represent what the broader running community actually thinks.

We just read what everyone else is saying, find the signal in the noise, and tell you what it means.

That's The Shoe Nerds.
