How to Test Skills Before Publishing — The Full Test Pyramid

Published 21 April 2026 · 8 min read

Quick answer. Test at five levels: unit tests on the state machine (fast, cheap, lots of them), simulation-harness tests (realistic sensors, bounded physics), real-hardware dry runs with a human holding the e-stop, a closed beta with 5-20 real users, and post-publish monitoring with auto-rollback triggers. Completing every level, with none skipped, is what separates a safe skill from a recalled one.

Level 1 — state machine unit tests

Your state machine is a deterministic object. Unit tests exercise every transition: what does the skill do when invoked? When paused mid-action? When the input is an empty list? When an item is not findable? When the user says “stop”?

Target: 80%+ line coverage on the state-machine module, 100% branch coverage on safety transitions, explicit tests for every refusal condition.

The unit-test suite should run in under a second, because it is the feedback loop you use during development. If the suite is slow, the habit of running it dies.
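To make the questions above concrete, here is a minimal sketch of a state machine with one test per transition. The `FetchSkill` class, its states, and its method names are hypothetical and invented for illustration; they are not the GeraSkills API. The tests are plain functions with bare assertions, so a runner such as pytest can collect them.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    PAUSED = auto()
    REFUSED = auto()
    STOPPED = auto()
    DONE = auto()


class FetchSkill:
    """Hypothetical skill: fetch each item in a list, one step at a time."""

    def __init__(self, findable):
        self.findable = set(findable)
        self.state = State.IDLE
        self.pending = []

    def invoke(self, items):
        if self.state is not State.IDLE:
            return self.state              # ignore re-invocation
        if not items:
            self.state = State.REFUSED     # empty input is a refusal condition
            return self.state
        self.pending = list(items)
        self.state = State.RUNNING
        return self.state

    def step(self):
        if self.state is not State.RUNNING:
            return self.state              # paused or terminal: do nothing
        item = self.pending.pop(0)
        if item not in self.findable:
            self.state = State.REFUSED     # item cannot be found
        elif not self.pending:
            self.state = State.DONE
        return self.state

    def pause(self):
        if self.state is State.RUNNING:
            self.state = State.PAUSED
        return self.state

    def stop(self):
        self.pending.clear()               # "stop" wins from any state
        self.state = State.STOPPED
        return self.state


def test_happy_path_completes():
    skill = FetchSkill({"cup"})
    skill.invoke(["cup"])
    assert skill.step() is State.DONE


def test_empty_list_is_refused():
    assert FetchSkill({"cup"}).invoke([]) is State.REFUSED


def test_unfindable_item_is_refused():
    skill = FetchSkill({"cup"})
    skill.invoke(["spoon"])
    assert skill.step() is State.REFUSED


def test_pause_mid_action_makes_no_progress():
    skill = FetchSkill({"cup", "plate"})
    skill.invoke(["cup", "plate"])
    skill.step()
    skill.pause()
    assert skill.step() is State.PAUSED
    assert skill.pending == ["plate"]


def test_stop_wins_from_any_state():
    running = FetchSkill({"cup"})
    running.invoke(["cup"])
    paused = FetchSkill({"cup"})
    paused.invoke(["cup"])
    paused.pause()
    for skill in (FetchSkill({"cup"}), running, paused):
        assert skill.stop() is State.STOPPED
        assert skill.step() is State.STOPPED
```

Because nothing here touches sensors or hardware, a suite like this runs in milliseconds, which is what keeps it usable as a development feedback loop.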

Level 2 — simulation-harness integration tests

The GeraSkills simulator provides:

  • Realistic sensor feeds (camera, LiDAR, touch).
  • Bounded physics (the robot cannot teleport; gravity exists).
  • Scripted scenarios (kitchen with 12 items in known positions, living room with cluttered floor).
  • Fault injection (sensor dropout, motor stall, network blip).

Write a fixture per scenario. Aim for 10-20 fixtures covering: happy path, partial-completion, refusal conditions, recovery from interruptions, fault-injection events.
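One way to keep the fixture set honest is to describe each fixture as data and check that every category in the list above has at least one. The sketch below assumes nothing about the real simulator: the `Fixture` fields, the category names, and the fault names are all hypothetical labels chosen for illustration.

```python
from dataclasses import dataclass

# The five coverage categories named in the text, as hypothetical identifiers.
CATEGORIES = {
    "happy_path",
    "partial_completion",
    "refusal",
    "interruption_recovery",
    "fault_injection",
}


@dataclass(frozen=True)
class Fixture:
    name: str
    scene: str            # e.g. "kitchen_12_items" (illustrative label)
    category: str
    faults: tuple = ()    # e.g. ("sensor_dropout",) (illustrative label)


FIXTURES = [
    Fixture("fetch_all_items", "kitchen_12_items", "happy_path"),
    Fixture("two_items_missing", "kitchen_12_items", "partial_completion"),
    Fixture("empty_request", "kitchen_12_items", "refusal"),
    Fixture("paused_then_resumed", "living_room_cluttered", "interruption_recovery"),
    Fixture("lidar_drops_out", "living_room_cluttered", "fault_injection",
            faults=("sensor_dropout",)),
]


def uncovered_categories(fixtures):
    """Return the categories that no fixture covers, sorted for stable output."""
    return sorted(CATEGORIES - {f.category for f in fixtures})
```

Running `uncovered_categories(FIXTURES)` as a check in CI would flag a fixture set that has quietly lost, say, its only fault-injection scenario.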

Level 3 — real-hardware dry runs

No amount of simulation replaces touching real hardware. Rules:

  • A human stays in the room with the e-stop in hand.
  • Start with the lowest-risk scenarios (known-good fixtures).
  • Escalate to edge cases only after happy-path passes three times consecutively.
  • Log every run. Unusual behaviour gets a ticket.
  • Never dry-run alone when the skill involves kitchen implements or heavy items, or operates near other people.

Aim for 30-50 hardware dry runs before inviting beta users. Most creators underinvest here.
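The escalation rule above is easy to check mechanically from a run log. This is a minimal sketch that reads "three times consecutively" as "the three most recent happy-path runs all passed", which is one interpretation; the function names and the list-of-booleans log format are assumptions made for illustration.

```python
def consecutive_passes(results):
    """Count how many of the most recent runs passed in an unbroken streak."""
    streak = 0
    for passed in reversed(results):
        if not passed:
            break
        streak += 1
    return streak


def ready_for_edge_cases(happy_path_results, required=3):
    """True once the latest `required` happy-path dry runs have all passed."""
    return consecutive_passes(happy_path_results) >= required
```

For example, a log of `[True, True, False, True, True]` is not yet ready, because the failure in the middle resets the streak to two.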

Level 4 — closed beta

Recruit 5-20 real users (consent, NDA optional, demo units where possible). Give them the skill with an explicit “this is beta” label and a feedback form. Ask for:

  • Completion rate across a real week.
  • Failures and near-misses with photos where possible.
  • Usability observations (the prompt text, the speed, the noise).
  • Suggested refusals you did not anticipate.

Run beta for at least 2 weeks. The long tail of edge cases only surfaces with real use.

Level 5 — post-publish monitoring

Monitoring continues after launch. Watch these metrics:

  • Refusal rate per install — trending up is a signal.
  • Failure rate per action — spikes trigger alerts.
  • User pause frequency — high pause means the skill is doing something users do not trust.
  • Refund rate — trending up means something is wrong with the listing or the skill.

The platform triggers an automatic rollback if the failure rate exceeds 5% across a significant population. Your dashboard shows the metrics, and you can request a manual rollback at any time.
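If you want to mirror the rollback trigger in your own alerting, the rule can be sketched as follows. The 5% limit comes from the text above; the text does not define what counts as a "significant population", so the `MIN_ACTIONS` threshold of 1,000 is purely an assumption, as are the `Window` type and the function names. This is not the platform's implementation.

```python
from dataclasses import dataclass

FAILURE_RATE_LIMIT = 0.05   # the 5% threshold stated in the text
MIN_ACTIONS = 1000          # assumption: stand-in for "significant population"


@dataclass
class Window:
    """Action counts observed over one monitoring window."""
    actions: int
    failures: int


def failure_rate(window):
    """Failures per action; 0.0 when no actions were observed."""
    return window.failures / window.actions if window.actions else 0.0


def should_roll_back(window, limit=FAILURE_RATE_LIMIT, min_actions=MIN_ACTIONS):
    """True when the sample is large enough and the failure rate exceeds the limit."""
    if window.actions < min_actions:
        return False        # too few actions to judge
    return failure_rate(window) > limit
```

The minimum-sample guard is the design choice worth noting: without it, a single failure in the first handful of actions after launch would look like a catastrophic failure rate.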

Specific hazard classes

Some hazard classes require specialist testing beyond the standard pyramid:

  • Kitchen-adjacent skills — testing with real knives, heat sources, open food. Extra e-stop drills.
  • Child-present skills — never test with actual children in the room during development; simulate child presence with props, then use adult beta testers with children present only after clean dry runs.
  • Water-adjacent skills — electrical safety drill, spill handling.
  • Medication-adjacent skills — expect the GeraWitness review layer to demand extra evidence.

What the safety review checks for

When you publish, the review checks that:

  • Declared refusal conditions are plausible.
  • E-stop behaviour is safe (for example, the robot is not left holding a hot item).
  • Maximum continuous action time is bounded and short.
  • Listing matches declared capabilities.
  • Simulation test coverage is sufficient.

Cross-links

See also: the skill API deep-dive and the full creator playbook.
