Methodology · updated Jun 26, 2026
How we test
Every verdict on this site comes from actually running the skill — a real install, a real task, and notes on what happened. Here's the process, so you can judge our judgment.
The process
- Clean install. We follow the author's own instructions on a fresh setup. If we can't get it working from the README, neither will you — that's a finding, not an inconvenience.
- Trigger check. Skills only matter if they activate. We probe the prompts the skill claims to handle and note misfires in both directions: failing to activate when it should, and activating on unrelated work.
- Real task, real data. Spreadsheets with broken headers, transcripts with crosstalk, codebases with legacy corners. Demo-data performance is marketing; we test the Tuesday-afternoon version of the job.
- Comparison against baseline. The question is never "did it produce output?" but "is this better than Claude without the skill?" If the delta isn't clear, it doesn't pass.
- Blind review where it matters. Writing skills get judged by editors who don't know AI was involved. Sales emails get live-send tests. Numbers get verified row-by-row.
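The trigger check above can be pictured as a small tally of misfires in both directions. This is only an illustrative sketch, not our actual tooling: the `Probe` record, its field names, and `find_misfires` are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One prompt sent during a trigger check (hypothetical record)."""
    prompt: str
    should_trigger: bool  # what the skill's own docs claim
    did_trigger: bool     # what we observed in the run


def find_misfires(probes: list[Probe]) -> tuple[list[str], list[str]]:
    """Return (missed, spurious) prompts.

    missed:   the skill should have activated but did not.
    spurious: the skill activated on work it does not claim to handle.
    """
    missed = [p.prompt for p in probes if p.should_trigger and not p.did_trigger]
    spurious = [p.prompt for p in probes if not p.should_trigger and p.did_trigger]
    return missed, spurious
```

Keeping the two lists separate matters because they point to different problems: a missed activation makes the skill useless, while a spurious one makes it intrusive.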
What the verdicts mean
- Tested · Works. Installed from the author's instructions, triggered correctly, and beat the no-skill baseline on a real task. The test notes say exactly what we ran.
- Works with setup. Delivers, but not out of the box: it needs configuration, a companion skill, or a connected integration first. The listing says precisely what.
- In test queue. Listed because it looks promising, but we haven't finished testing. No verdict is implied. Some skills (productivity, social) take a week or more by design.
The SkillProof Score
Tested skills get a score out of 10, computed from four weighted criteria. The weights reflect what actually matters: output quality carries the most weight, twice that of any other single criterion.
- Installs cleanly (out of 5). A newcomer following the author's own instructions gets a working skill on a fresh setup.
- Triggers reliably (out of 5). Activates on the prompts it claims to handle; doesn't fire on unrelated work.
- Output vs. baseline (out of 10). The heart of the score: on a real task, is the result clearly better than Claude without the skill?
- Docs & honesty (out of 5). Clear documentation, no hidden network calls, no prompt-injection surprises.
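To make the arithmetic concrete, here is a minimal sketch of how the four criteria could combine into a score out of 10. It assumes the 25 raw points are rescaled linearly and rounded to one decimal; that formula is an assumption for illustration, not a statement of the exact computation. The function and key names are hypothetical.

```python
# Maximum points per criterion, as listed above.
MAX_POINTS = {
    "installs_cleanly": 5,
    "triggers_reliably": 5,
    "output_vs_baseline": 10,
    "docs_and_honesty": 5,
}


def skillproof_score(points: dict[str, float]) -> float:
    """Rescale raw criterion points (25 max) to a score out of 10.

    Assumption: simple linear rescaling, rounded to one decimal place.
    """
    if set(points) != set(MAX_POINTS):
        raise ValueError("expected exactly the four criteria")
    for name, value in points.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    raw = sum(points.values())
    return round(raw / sum(MAX_POINTS.values()) * 10, 1)


# Example: a clean install, one trigger misfire, and a solid win over baseline.
example = {
    "installs_cleanly": 5,
    "triggers_reliably": 4,
    "output_vs_baseline": 8,
    "docs_and_honesty": 5,
}
print(skillproof_score(example))  # 22 of 25 raw points, prints 8.8
```

Under this assumed scaling, a skill that aces everything except output (15 of 25 raw points) tops out at 6.0, so it cannot score well without beating the baseline.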
How skills get discovered
- Crawl. Every 12 hours our crawler sweeps GitHub — the claude-skills topics, SKILL.md files, new repositories — and files anything new into the discovery queue with its stars and activity.
- Triage. We prioritize the queue by traction and category gaps. Duplicates and abandoned repos are dropped; promising skills get listed as "in test queue."
- Test. The skill goes through the full process above and either earns a verdict and a score, or the test notes say exactly why it didn't.
How collections are formed
- Role collections (developers, marketers, sales…) hold the skills we'd install for a colleague in that role, with passing-verdict skills first, ordered by score.
- Stack collections (like The Website Stack) cover one job end-to-end, listed in install order, and may include queued skills only when a layer has no tested option yet.
- Entry rule: a skill enters a collection when it scores ≥ 7.5 and beats the current holder of its slot, or fills an empty layer. Featured placement never buys a collection spot.
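The entry rule can be read as a small predicate. The sketch below takes one reading of it: the 7.5 threshold applies when displacing a current holder, while an empty layer can be filled without it (consistent with stacks admitting queued skills when no tested option exists). "Beats" is assumed to mean a strictly higher score. The function name and parameters are hypothetical.

```python
from typing import Optional

ENTRY_THRESHOLD = 7.5  # minimum score to displace a current holder


def enters_collection(candidate_score: Optional[float],
                      holder_score: Optional[float]) -> bool:
    """Decide whether a skill takes a slot in a collection.

    candidate_score: the skill's score, or None if it is still queued.
    holder_score:    score of the slot's current holder, or None if the
                     slot (layer) is empty.
    """
    if holder_score is None:
        # Empty layer: any candidate may fill it (assumed reading).
        return True
    if candidate_score is None:
        # An unscored, queued skill cannot displace a tested holder.
        return False
    return candidate_score >= ENTRY_THRESHOLD and candidate_score > holder_score
```

Note that featured placement does not appear as an input: nothing about payment or promotion affects the outcome.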
What we don't do
- We don't list skills we haven't at least begun testing. Of the 73 skills listed, 45 are verified, and the rest are labeled as in the test queue.
- We don't sell verdicts. Featured placement is available only to skills that already passed, and a featured skill that starts failing loses both its passing verdict and its placement.
- We don't hide failures. If a popular skill flunks, the test notes say so — that's the product.