How to Test Skills Before Publishing — The Full Test Pyramid
Published 21 April 2026 · 8 min read
Level 1 — state machine unit tests
Your state machine is a deterministic object. Unit tests exercise every transition: what does the skill do when invoked? When paused mid-action? When the input is an empty list? When an item is not findable? When the user says “stop”?
Target: 80%+ line coverage on the state-machine module, 100% branch coverage on safety transitions, explicit tests for every refusal condition.
Unit tests run in < 1 second and are the feedback loop you use during development. If the unit-test suite is slow, the habit of running it dies.
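The transition tests above can be sketched in a few lines. Everything here is hypothetical: `SkillStateMachine`, its states, and its methods stand in for whatever your own state-machine module exposes; the point is that each transition (including "stop" from every state) gets an explicit, sub-second test.

```python
# Minimal sketch of state-machine unit tests. SkillStateMachine, its
# states, and its methods are hypothetical stand-ins for your module.
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    ACTING = auto()
    PAUSED = auto()
    REFUSED = auto()


class SkillStateMachine:
    def __init__(self):
        self.state = State.IDLE

    def invoke(self, items):
        # Refuse on empty input rather than acting on nothing.
        self.state = State.REFUSED if not items else State.ACTING

    def pause(self):
        if self.state is State.ACTING:
            self.state = State.PAUSED

    def stop(self):
        # "Stop" is a safety transition: reachable from every state.
        self.state = State.IDLE


def test_empty_input_refuses():
    sm = SkillStateMachine()
    sm.invoke([])
    assert sm.state is State.REFUSED


def test_stop_from_every_state():
    for setup in (lambda sm: None,
                  lambda sm: sm.invoke(["cup"]),
                  lambda sm: (sm.invoke(["cup"]), sm.pause())):
        sm = SkillStateMachine()
        setup(sm)
        sm.stop()
        assert sm.state is State.IDLE
```

Because the machine is a plain deterministic object with no hardware dependencies, the whole suite stays fast enough to run on every save.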
Level 2 — simulation-harness integration tests
The GeraSkills simulator provides:
- Realistic sensor feeds (camera, LiDAR, touch).
- Bounded physics (the robot cannot teleport; gravity exists).
- Scripted scenarios (kitchen with 12 items in known positions, living room with cluttered floor).
- Fault injection (sensor dropout, motor stall, network blip).
Write a fixture per scenario. Aim for 10-20 fixtures covering: happy path, partial-completion, refusal conditions, recovery from interruptions, fault-injection events.
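One fixture per scenario keeps the suite table-driven: each entry names the scene, the items placed, the faults injected, and the terminal outcome you expect. The shape below is a sketch; `Scenario`, `run_scenario`, and the fault names are assumptions, not the simulator's real API.

```python
# Sketch of per-scenario fixtures for a simulation harness. Scenario,
# run_scenario, and the fault names are hypothetical; the real
# simulator API may differ.
from dataclasses import dataclass, field


@dataclass
class Scenario:
    name: str
    items: list                                  # placed at known positions
    faults: list = field(default_factory=list)   # injected mid-run
    expect: str = "completed"                    # expected terminal outcome


FIXTURES = [
    Scenario("kitchen_happy_path", items=["mug", "plate"]),
    Scenario("empty_table_refusal", items=[], expect="refused"),
    Scenario("sensor_dropout_recovery",
             items=["mug"], faults=["camera_dropout"]),
    Scenario("motor_stall_abort",
             items=["mug"], faults=["motor_stall"], expect="aborted_safely"),
]


def run_scenario(scenario):
    # Stand-in for the simulator call; only refusal and stall are
    # modelled here so the sketch stays runnable.
    if not scenario.items:
        return "refused"
    if "motor_stall" in scenario.faults:
        return "aborted_safely"
    return "completed"


def test_all_fixtures():
    for s in FIXTURES:
        assert run_scenario(s) == s.expect, s.name
```

Adding a new edge case then means adding one `Scenario` line, not a new test function.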
Level 3 — real-hardware dry runs
No amount of simulation replaces touching real hardware. Rules:
- A human stays in the room with the e-stop in hand.
- Start with the lowest-risk scenarios (known-good fixtures).
- Escalate to edge cases only after happy-path passes three times consecutively.
- Log every run. Unusual behaviour gets a ticket.
- Never dry-run alone for anything involving kitchen implements or heavy items, and never for runs near other humans.
Aim for 30-50 hardware dry runs before inviting beta users. Most creators underinvest here.
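"Log every run" is easier to sustain when the log is structured from run one. A minimal sketch, with illustrative field names, that flags any anomalous or failed run for a ticket:

```python
# Sketch of a structured dry-run log: one entry per run, anomalies
# flagged so they become tickets. Field names are illustrative.
import json
from dataclasses import dataclass, asdict


@dataclass
class DryRun:
    run_id: int
    scenario: str
    operator: str          # the human holding the e-stop
    outcome: str           # "pass", "fail", or "e-stop"
    anomaly: str = ""      # non-empty => open a ticket


def needs_ticket(run: DryRun) -> bool:
    return bool(run.anomaly) or run.outcome != "pass"


log = [
    DryRun(1, "kitchen_happy_path", "alice", "pass"),
    DryRun(2, "kitchen_happy_path", "alice", "pass",
           anomaly="gripper hesitated near plate edge"),
]
tickets = [r for r in log if needs_ticket(r)]
print(json.dumps([asdict(r) for r in tickets], indent=2))
```

Note that run 2 passed but still gets a ticket: "unusual behaviour gets a ticket" includes behaviour on passing runs.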
Level 4 — closed beta
Recruit 5-20 real users (consent required, NDA optional, demo units where possible). Give them the skill with an explicit “this is beta” label and a feedback form. Ask for:
- Completion rate across a real week.
- Failures and near-misses with photos where possible.
- Usability observations (the prompt text, the speed, the noise).
- Suggested refusals you did not anticipate.
Run beta for at least 2 weeks. The long tail of edge cases only surfaces with real use.
Level 5 — post-publish monitoring
Publishing does not end the pyramid; monitoring stays active after launch:
- Refusal rate per install — trending up is a signal.
- Failure rate per action — spikes trigger alerts.
- User pause frequency — a high pause rate means the skill is doing something users do not trust.
- Refund rate — trending up means something is wrong with the listing or the skill.
The platform runs auto-rollback if failure rate exceeds 5% across a significant population. Your dashboard shows the metrics; you can request manual rollback at any time.
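The 5% threshold above comes from the platform policy; the population floor, window handling, and function names below are assumptions, sketched to show what the monitoring checks amount to:

```python
# Sketch of the monitoring checks described above. The 5% rollback
# threshold is from the platform policy; MIN_POPULATION and the
# trend heuristic are assumed values.

FAILURE_ROLLBACK_THRESHOLD = 0.05   # platform auto-rollback trigger
MIN_POPULATION = 100                # assumed "significant population"


def should_auto_rollback(failures: int, actions: int, installs: int) -> bool:
    # Only roll back once enough installs have actually run the skill.
    if installs < MIN_POPULATION or actions == 0:
        return False
    return failures / actions > FAILURE_ROLLBACK_THRESHOLD


def trending_up(samples: list) -> bool:
    # Crude trend check: mean of the later half of the window vs the
    # earlier half. A real dashboard would smooth and test properly.
    half = len(samples) // 2
    if half == 0:
        return False
    later = sum(samples[half:]) / (len(samples) - half)
    earlier = sum(samples[:half]) / half
    return later > earlier
```

The same `trending_up` shape applies to the refusal-rate and refund-rate signals: the absolute value matters less than the direction.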
Specific hazard classes
Some hazard classes require specialist testing beyond the standard pyramid:
- Kitchen-adjacent skills — testing with real knives, heat sources, open food. Extra e-stop drills.
- Child-present skills — never test with actual children in the room during development; simulate child presence with props, then use adult beta testers with children present only after clean dry runs.
- Water-adjacent skills — electrical safety drill, spill handling.
- Medication-adjacent skills — expect the GeraWitness review layer to demand extra evidence.
What the safety review checks for
When you publish, the review checks that:
- Declared refusal conditions are plausible.
- E-stop behaviour is safe (no hot items held).
- Maximum continuous action time is bounded and short.
- Listing matches declared capabilities.
- Simulation test coverage is sufficient.
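Several of these checks can be run against your own submission before the reviewers see it. The manifest keys and the 30-second bound below are assumptions for illustration, not the platform's real schema:

```python
# Sketch of a pre-submission self-check over a skill manifest.
# Manifest keys and limits are assumptions, not the real schema.
MAX_CONTINUOUS_ACTION_S = 30  # assumed "bounded and short" limit

manifest = {
    "refusal_conditions": ["empty input", "item not findable",
                           "user says stop"],
    "estop_behaviour": "lower held items; never drop hot items; halt",
    "max_continuous_action_s": 20,
    "declared_capabilities": ["pick", "place"],
    "listing_capabilities": ["pick", "place"],
}


def self_check(m: dict) -> list:
    problems = []
    if not m["refusal_conditions"]:
        problems.append("no refusal conditions declared")
    if m["max_continuous_action_s"] > MAX_CONTINUOUS_ACTION_S:
        problems.append("continuous action time too long")
    if set(m["listing_capabilities"]) - set(m["declared_capabilities"]):
        problems.append("listing claims undeclared capabilities")
    return problems
```

Running this in CI catches the mechanical failures early, leaving the human reviewers to judge the parts a script cannot, such as whether the refusal conditions are plausible.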
Cross-links
See also: the skill API deep-dive and the full creator playbook.