How to Test Skills Before Publishing — The Full Test Pyramid
Published 21 April 2026 · 8 min read
Level 1 — state machine unit tests
Your state machine is a deterministic object. Unit tests exercise every transition: what does the skill do when invoked? When paused mid-action? When the input is an empty list? When an item is not findable? When the user says “stop”?
Target: 80%+ line coverage on the state-machine module, 100% branch coverage on safety transitions, explicit tests for every refusal condition.
Unit tests run in < 1 second and are the feedback loop you use during development. If the unit-test suite is slow, the habit of running it dies.
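The transition tests above can be sketched in a few lines. Everything here is hypothetical: `SkillStateMachine`, its states, and its methods stand in for whatever your own state-machine module exposes; the point is that each transition (including "stop" from every state) gets an explicit, sub-second test.

```python
# Minimal sketch of state-machine unit tests. SkillStateMachine, its
# states, and its methods are hypothetical stand-ins for your module.
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    ACTING = auto()
    PAUSED = auto()
    REFUSED = auto()


class SkillStateMachine:
    def __init__(self):
        self.state = State.IDLE

    def invoke(self, items):
        # Refuse on empty input rather than acting on nothing.
        self.state = State.REFUSED if not items else State.ACTING

    def pause(self):
        if self.state is State.ACTING:
            self.state = State.PAUSED

    def stop(self):
        # "Stop" is a safety transition: reachable from every state.
        self.state = State.IDLE


def test_empty_input_refuses():
    sm = SkillStateMachine()
    sm.invoke([])
    assert sm.state is State.REFUSED


def test_stop_from_every_state():
    for setup in (lambda sm: None,
                  lambda sm: sm.invoke(["cup"]),
                  lambda sm: (sm.invoke(["cup"]), sm.pause())):
        sm = SkillStateMachine()
        setup(sm)
        sm.stop()
        assert sm.state is State.IDLE
```

Because the machine is a plain deterministic object with no hardware dependencies, the whole suite stays fast enough to run on every save.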
Level 2 — simulation-harness integration tests
The GeraSkills simulator provides:
- Realistic sensor feeds (camera, LiDAR, touch).
- Bounded physics (the robot cannot teleport; gravity exists).
- Scripted scenarios (kitchen with 12 items in known positions, living room with cluttered floor).
- Fault injection (sensor dropout, motor stall, network blip).
Write a fixture per scenario. Aim for 10-20 fixtures covering: happy path, partial-completion, refusal conditions, recovery from interruptions, fault-injection events.
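One fixture per scenario keeps the suite table-driven: each entry names the scene, the items placed, the faults injected, and the terminal outcome you expect. The shape below is a sketch; `Scenario`, `run_scenario`, and the fault names are assumptions, not the simulator's real API.

```python
# Sketch of per-scenario fixtures for a simulation harness. Scenario,
# run_scenario, and the fault names are hypothetical; the real
# simulator API may differ.
from dataclasses import dataclass, field


@dataclass
class Scenario:
    name: str
    items: list                                  # placed at known positions
    faults: list = field(default_factory=list)   # injected mid-run
    expect: str = "completed"                    # expected terminal outcome


FIXTURES = [
    Scenario("kitchen_happy_path", items=["mug", "plate"]),
    Scenario("empty_table_refusal", items=[], expect="refused"),
    Scenario("sensor_dropout_recovery",
             items=["mug"], faults=["camera_dropout"]),
    Scenario("motor_stall_abort",
             items=["mug"], faults=["motor_stall"], expect="aborted_safely"),
]


def run_scenario(scenario):
    # Stand-in for the simulator call; only refusal and stall are
    # modelled here so the sketch stays runnable.
    if not scenario.items:
        return "refused"
    if "motor_stall" in scenario.faults:
        return "aborted_safely"
    return "completed"


def test_all_fixtures():
    for s in FIXTURES:
        assert run_scenario(s) == s.expect, s.name
```

Adding a new edge case then means adding one `Scenario` line, not a new test function.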
Level 3 — real-hardware dry runs
No amount of simulation replaces touching real hardware. Rules:
- A human stays in the room with the e-stop in hand.
- Start with the lowest-risk scenarios (known-good fixtures).
- Escalate to edge cases only after happy-path passes three times consecutively.
- Log every run. Unusual behaviour gets a ticket.
- Never dry-run alone for anything involving kitchen implements or heavy items, and never for runs near other humans.
Aim for 30-50 hardware dry runs before inviting beta users. Most creators underinvest here.
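"Log every run" is easier to sustain when the log is structured from run one. A minimal sketch, with illustrative field names, that flags any anomalous or failed run for a ticket:

```python
# Sketch of a structured dry-run log: one entry per run, anomalies
# flagged so they become tickets. Field names are illustrative.
import json
from dataclasses import dataclass, asdict


@dataclass
class DryRun:
    run_id: int
    scenario: str
    operator: str          # the human holding the e-stop
    outcome: str           # "pass", "fail", or "e-stop"
    anomaly: str = ""      # non-empty => open a ticket


def needs_ticket(run: DryRun) -> bool:
    return bool(run.anomaly) or run.outcome != "pass"


log = [
    DryRun(1, "kitchen_happy_path", "alice", "pass"),
    DryRun(2, "kitchen_happy_path", "alice", "pass",
           anomaly="gripper hesitated near plate edge"),
]
tickets = [r for r in log if needs_ticket(r)]
print(json.dumps([asdict(r) for r in tickets], indent=2))
```

Note that run 2 passed but still gets a ticket: "unusual behaviour gets a ticket" includes behaviour on passing runs.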
Level 4 — closed beta
Recruit 5-20 real users (consent required, NDA optional, demo units where possible). Give them the skill with an explicit “this is beta” label and a feedback form. Ask for:
- Completion rate across a real week.
- Failures and near-misses with photos where possible.
- Usability observations (the prompt text, the speed, the noise).
- Suggested refusals you did not anticipate.
Run beta for at least 2 weeks. The long tail of edge cases only surfaces with real use.
Level 5 — post-publish monitoring
Publishing does not end the pyramid; monitoring stays active after launch:
- Refusal rate per install — trending up is a signal.
- Failure rate per action — spikes trigger alerts.
- User pause frequency — a high pause rate means the skill is doing something users do not trust.
- Refund rate — trending up means something is wrong with the listing or the skill.
The platform runs auto-rollback if failure rate exceeds 5% across a significant population. Your dashboard shows the metrics; you can request manual rollback at any time.
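The 5% threshold above comes from the platform policy; the population floor, window handling, and function names below are assumptions, sketched to show what the monitoring checks amount to:

```python
# Sketch of the monitoring checks described above. The 5% rollback
# threshold is from the platform policy; MIN_POPULATION and the
# trend heuristic are assumed values.

FAILURE_ROLLBACK_THRESHOLD = 0.05   # platform auto-rollback trigger
MIN_POPULATION = 100                # assumed "significant population"


def should_auto_rollback(failures: int, actions: int, installs: int) -> bool:
    # Only roll back once enough installs have actually run the skill.
    if installs < MIN_POPULATION or actions == 0:
        return False
    return failures / actions > FAILURE_ROLLBACK_THRESHOLD


def trending_up(samples: list) -> bool:
    # Crude trend check: mean of the later half of the window vs the
    # earlier half. A real dashboard would smooth and test properly.
    half = len(samples) // 2
    if half == 0:
        return False
    later = sum(samples[half:]) / (len(samples) - half)
    earlier = sum(samples[:half]) / half
    return later > earlier
```

The same `trending_up` shape applies to the refusal-rate and refund-rate signals: the absolute value matters less than the direction.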
Specific hazard classes
Some hazard classes require specialist testing beyond the standard pyramid:
- Kitchen-adjacent skills — testing with real knives, heat sources, open food. Extra e-stop drills.
- Child-present skills — never test with actual children in the room during development; simulate child presence with props, then use adult beta testers with children present only after clean dry runs.
- Water-adjacent skills — electrical safety drill, spill handling.
- Medication-adjacent skills — expect the GeraWitness review layer to demand extra evidence.
What the safety review checks for
When you publish, the review checks that:
- Declared refusal conditions are plausible.
- E-stop behaviour is safe (no hot items held).
- Maximum continuous action time is bounded and short.
- Listing matches declared capabilities.
- Simulation test coverage is sufficient.
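Several of these checks can be run against your own submission before the reviewers see it. The manifest keys and the 30-second bound below are assumptions for illustration, not the platform's real schema:

```python
# Sketch of a pre-submission self-check over a skill manifest.
# Manifest keys and limits are assumptions, not the real schema.
MAX_CONTINUOUS_ACTION_S = 30  # assumed "bounded and short" limit

manifest = {
    "refusal_conditions": ["empty input", "item not findable",
                           "user says stop"],
    "estop_behaviour": "lower held items; never drop hot items; halt",
    "max_continuous_action_s": 20,
    "declared_capabilities": ["pick", "place"],
    "listing_capabilities": ["pick", "place"],
}


def self_check(m: dict) -> list:
    problems = []
    if not m["refusal_conditions"]:
        problems.append("no refusal conditions declared")
    if m["max_continuous_action_s"] > MAX_CONTINUOUS_ACTION_S:
        problems.append("continuous action time too long")
    if set(m["listing_capabilities"]) - set(m["declared_capabilities"]):
        problems.append("listing claims undeclared capabilities")
    return problems
```

Running this in CI catches the mechanical failures early, leaving the human reviewers to judge the parts a script cannot, such as whether the refusal conditions are plausible.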
Cross-links
See also: the skill API deep-dive and the full creator playbook.