Methodology · updated Jun 26, 2026

How we test

Every verdict on this site comes from actually running the skill — a real install, a real task, and notes on what happened. Here's the process, so you can judge our judgment.

The process

Clean install. We follow the author's own instructions on a fresh setup. If we can't get it working from the README, neither will you — that's a finding, not an inconvenience.
Trigger check. Skills only matter if they activate. We probe the prompts the skill claims to handle and note misfires in both directions.
Real task, real data. Spreadsheets with broken headers, transcripts with crosstalk, codebases with legacy corners. Demo-data performance is marketing; we test the Tuesday-afternoon version of the job.
Comparison against baseline. The question is never "did it produce output"; it's "is this better than Claude without the skill?" If the delta isn't clear, it doesn't pass.
Blind review where it matters. Writing skills get judged by editors who don't know AI was involved. Sales emails get live-send tests. Numbers get verified row-by-row.
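The trigger check above counts misfires in both directions. A minimal sketch of that tally, assuming each probe is recorded as a pair of booleans (whether the skill should have activated, and whether it did); the function name and record shape are illustrative, not our actual tooling:

```python
from typing import Iterable, Tuple


def trigger_misfires(probes: Iterable[Tuple[bool, bool]]) -> Tuple[int, int]:
    """Count misfires in both directions.

    Each probe is (should_trigger, did_trigger). This record shape is a
    hypothetical one chosen for illustration.
    Returns (missed, spurious):
      missed   = prompts the skill claims to handle but did not activate on
      spurious = unrelated prompts the skill activated on anyway
    """
    missed = 0
    spurious = 0
    for should_trigger, did_trigger in probes:
        if should_trigger and not did_trigger:
            missed += 1
        elif did_trigger and not should_trigger:
            spurious += 1
    return missed, spurious


# Four probes: one correct activation, one miss, one spurious fire,
# one correct non-activation.
example = [(True, True), (True, False), (False, True), (False, False)]
missed, spurious = trigger_misfires(example)
```

Keeping the two counts separate matters: a skill that never fires and a skill that fires on everything both fail the check, but for opposite reasons.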

What the verdicts mean

Tested · Works: Installed from the author's instructions, triggered correctly, and beat the no-skill baseline on a real task. The test notes say exactly what we ran.
Works with setup: Delivers, but not out of the box — it needs configuration, a companion skill, or a connected integration first. The listing says precisely what.
In test queue: Listed because it looks promising, but we haven't finished testing. No verdict is implied. Some skill categories (productivity, social) take a week or more to test by design.
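Read as a decision rule, the three labels above might look like the sketch below. The function and its boolean inputs are hypothetical. It also assumes that "Works with setup" still requires a correct trigger and a baseline win, which the definitions imply but do not state outright.

```python
from typing import Optional


def verdict(
    testing_finished: bool,
    installed: bool,
    triggered: bool,
    beat_baseline: bool,
    needed_setup: bool,
) -> Optional[str]:
    """Map test outcomes to a listing label (illustrative sketch only)."""
    if not testing_finished:
        # Listed as promising; no verdict is implied.
        return "In test queue"
    if not (installed and triggered and beat_baseline):
        # No passing verdict; the test notes explain why.
        return None
    # ASSUMPTION: extra configuration, a companion skill, or a connected
    # integration downgrades an otherwise passing skill to "Works with setup".
    return "Works with setup" if needed_setup else "Tested · Works"
```

A return value of None stands for a finished test that earned no verdict.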

The SkillProof Score

Tested skills get a score out of 10, computed from four weighted criteria worth 25 raw points in total. The weights reflect what actually matters: output quality counts twice as much as any other single criterion.

Installs cleanly /5: A newcomer following the author's own instructions gets a working skill on a fresh setup.
Triggers reliably /5: Activates on the prompts it claims to handle; doesn't fire on unrelated work.
Output vs. baseline /10: The heart of the score: on a real task, is the result clearly better than Claude without the skill?
Docs & honesty /5: Clear documentation, no hidden network calls, no prompt-injection surprises.
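The four criteria above add up to 25 raw points, while the published score is out of 10. This page does not spell out the conversion, so the sketch below assumes a plain linear rescale (raw total × 10 / 25). The function name and dictionary keys are illustrative.

```python
# Maximum points per criterion, as listed above.
MAX_POINTS = {
    "installs_cleanly": 5,
    "triggers_reliably": 5,
    "output_vs_baseline": 10,
    "docs_and_honesty": 5,
}


def skillproof_score(points: dict) -> float:
    """Rescale raw criterion points to a 0-10 score.

    ASSUMPTION: a linear rescale of the raw total; the page does not state
    the exact formula.
    """
    for name, cap in MAX_POINTS.items():
        value = points[name]
        if not 0 <= value <= cap:
            raise ValueError(f"{name} must be between 0 and {cap}, got {value}")
    raw_total = sum(points[name] for name in MAX_POINTS)
    return round(raw_total * 10 / sum(MAX_POINTS.values()), 1)


# Hypothetical skill: clean install, one trigger miss, strong output,
# slightly thin docs. Raw total 21 of 25.
example_score = skillproof_score({
    "installs_cleanly": 5,
    "triggers_reliably": 4,
    "output_vs_baseline": 8,
    "docs_and_honesty": 4,
})  # 8.4 under the assumed rescale
```

Under this assumption, output quality accounts for 10 of the 25 raw points, so a skill that merely matches the baseline loses a large share of its score even with a flawless install and good docs.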

How skills get discovered

Crawl. Every 12 hours our crawler sweeps GitHub — the claude-skills topics, SKILL.md files, new repositories — and files anything new into the discovery queue with its stars and activity.
Triage. We prioritize the queue by traction and category gaps. Duplicates and abandoned repos are dropped; promising skills get listed as "in test queue."
Test. The skill goes through the full process above and either earns a verdict and a score, or the test notes say exactly why it didn't.

How collections are formed

Role collections (developers, marketers, sales…) hold the skills we'd install for a colleague in that role — pass-verdict skills first, ordered by score.
Stack collections (like The Website Stack) cover one job end-to-end, listed in install order, and may include queued skills only when a layer has no tested option yet.
Entry rule: a skill enters a collection when it either fills an empty layer, or scores ≥ 7.5 and beats the current holder of its slot. Featured placement never buys a collection spot.
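The entry rule can be written as a small predicate. This is a sketch under two assumptions: "beats the current holder" means a strictly higher score, and an empty layer can be filled without meeting the 7.5 threshold (consistent with stack collections admitting queued skills when no tested option exists). The names are illustrative.

```python
from typing import Optional

ENTRY_THRESHOLD = 7.5  # minimum score stated in the entry rule


def enters_collection(
    candidate_score: Optional[float],
    holder_score: Optional[float],
) -> bool:
    """Decide whether a skill takes a collection slot.

    candidate_score: the skill's score, or None if it is still queued.
    holder_score: score of the slot's current holder, or None if the
                  layer is empty.
    """
    if holder_score is None:
        # ASSUMPTION: an empty layer is filled regardless of score.
        return True
    if candidate_score is None:
        # An unscored skill cannot displace a tested holder.
        return False
    # ASSUMPTION: "beats" means a strictly higher score.
    return candidate_score >= ENTRY_THRESHOLD and candidate_score > holder_score
```

Featured status is deliberately absent from the inputs: it has no effect on the outcome.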

What we don't do

We don't list skills we haven't at least begun testing. 73 listed, 45 verified, and every unverified listing is labeled as such.
We don't sell verdicts. Featured placement is available only to skills that already passed, and a featured skill that starts failing loses both the placement and the verdict.
We don't hide failures. If a popular skill flunks, the test notes say so — that's the product.