the beginner's guide to making small models usable
it's time to split work across tiny models so your agents stay reliable and cheap
small local models are not dumb. they are just easier to break.
and please stop asking 3b models to do 30b jobs!
they usually fail in three predictable ways:
they lose the plot on multi-step tasks
they choose tools badly (files, web, function calling, memory)
they hallucinate when they are unsure
if you are building in openclaw, these weaknesses show up faster because openclaw workflows tend to load a lot of context, tool schemas, state, and logs. ollama’s openclaw docs recommend using at least 64k tokens of context for local models so the agent has room to keep track of what it is doing.
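to make the 64k recommendation concrete: ollama's chat api accepts an `options` object, and `num_ctx` is the field that sets the context window. a minimal sketch of the request payload (the model name is just an example, and this builds the payload without sending it):

```python
# sketch: an ollama /api/chat payload with the context window raised to ~64k.
# "qwen3-coder" is an example model tag; swap in whatever you actually run.
payload = {
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "hello"}],
    # num_ctx controls how many tokens the model keeps in working memory.
    # 65536 matches the "at least 64k" guidance for agent workflows.
    "options": {"num_ctx": 65536},
}
```

without this, most local runtimes default to a much smaller window and the agent silently forgets earlier steps.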
what “small” means
when you see 1b, 3b, 7b, the “b” means billions of parameters.
a beginner-friendly rule of thumb:
tiny (0.5b to ~2b): super fast, cheap. great for simple jobs.
small (~3b to ~8b): can write and code a bit. still fragile under pressure.
agent-capable (20b+): more stable for planning and tools, but heavier and slower.
parameters are not everything, but they correlate strongly with “how long can this model think before it starts roleplaying competence.”
what small model failure looks like in openclaw-style agent workflows
1) tool calling gets weird
small models commonly:
call tools when they should just answer
refuse tools when a tool is required to be correct
output “tool json” as text instead of behaving like a tool-using agent
you will see people complain about exactly this in local model communities, especially when trying to run agent loops and structured outputs.
2) it follows one rule, then forgets the other three
ask for strict json plus a schema plus brevity plus “do not guess.” the model often drops one constraint silently.
3) more context can make it worse
it is tempting to paste everything. small models can degrade when overloaded. openclaw needs long context, but the context still has to be structured and relevant. that is part of why the 64k recommendation exists.
which models work well with openclaw
if you want the cleanest “start here” list, ollama’s openclaw blog names models that work well with openclaw, including:
qwen3-coder
glm-4.7
glm-4.7-flash
gpt-oss:20b
gpt-oss:120b
that list matters because it is grounded in “people actually running openclaw,” not just general model popularity.
fun thought experiment: can 3 to 5 tiny models beat one big model?
the fantasy
take 3 to 5 tiny models. give each one a different role and a strict system prompt. chain them together.
router: decides the next action
extractor: turns messy input into clean json
worker: writes or codes the output
verifier: checks for missing info and confident nonsense
optional: run worker twice and vote
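the chain above can be sketched in a few lines. `call_model` is a hypothetical stand-in for your local model client (an ollama http call, say); it's stubbed here with canned replies so the flow itself is visible:

```python
import json

def call_model(role: str, user_input: str) -> str:
    # stub: in real use, send the role's system prompt plus user_input
    # to a small local model and return its text completion.
    stubs = {
        "router": '{"route": "extract_json"}',
        "extractor": '{"requirements": ["summarize the report"]}',
        "worker": "draft: summary of the report",
        "verifier": '{"ok": true, "output": "draft: summary of the report"}',
    }
    return stubs[role]

def run_pipeline(user_input: str) -> str:
    # router decides, extractor structures, worker produces, verifier checks
    route = json.loads(call_model("router", user_input))["route"]
    if route == "extract_json":
        extracted = call_model("extractor", user_input)
        draft = call_model("worker", extracted)
        checked = json.loads(call_model("verifier", draft))
        return checked["output"]
    return call_model("worker", user_input)

result = run_pipeline("summarize this report ...")
print(result)
```

each role only sees what it needs, which is the whole point: no single tiny model carries the full job.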
the reality check
this is not true mixture-of-experts (moe) in the training sense. real moe has a trained router that learns routing and expert usage. a “multi-model prompt pipeline” is closer to an ensemble protocol.
but here is the cool part: protocols can still improve reliability.
two research-backed reasons:
self-consistency: generate multiple reasoning attempts and pick the most consistent answer. it improved results on reasoning benchmarks in the original paper.
vote beats debate (often): research finds that majority voting can account for most gains typically attributed to agent “debate.”
translation: you can get meaningful gains from “multiple tries + selection,” especially when you can verify outputs.
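the "multiple tries + selection" mechanic is tiny in code. here is the simplest form, a majority vote over several sampled answers (the sample answers are made up for illustration):

```python
from collections import Counter

def majority_vote(answers):
    # self-consistency in its simplest form: sample several attempts,
    # keep the final answer that appears most often.
    return Counter(answers).most_common(1)[0][0]

samples = ["42", "42", "41"]  # e.g. three sampled runs of the same prompt
print(majority_vote(samples))  # "42"
```

this only works when answers are comparable (a number, a label, a json field), which is another argument for forcing structured outputs.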
reality check: how far are local small models from frontier paid models?
frontier paid models have two big advantages that show up immediately in real work:
1) long context
claude opus 4.6 supports very large context windows, with 1m tokens available in beta on the claude developer platform.
long context changes what tasks the model can keep in working memory without dropping threads.
2) overall capability
a measurable proxy is lmsys's chatbot arena, which provides an elo-style leaderboard built from head-to-head human votes (elo ratings imply a win-probability formula). numbers drift over time, but the general pattern is stable: frontier models win most head-to-head comparisons, especially on messy multi-step work.
simple takeaway for beginners: frontier models are usually better at staying coherent, planning longer, and using tools correctly. local small models can still win on cost, privacy, and control.
the thing you can steal: tiny-moe prompt pack (beginner proof)
this is the part you can copy right now. it makes small local models feel dramatically less chaotic in openclaw-style systems.
what this pipeline does
instead of one model doing everything, you run a simple assembly line:
router decides what kind of task this is
extractor pulls structure so the system stops guessing
worker produces the result (write, code, summarize)
verifier catches lies, missing info, and broken json
optional: run worker twice and vote for higher reliability
this works because most “small model dumbness” is really “wrong job assignment.”
step 1: router prompt (the traffic cop)
use this on a tiny or small model. it only chooses the next action.
system:
you are the router for a multi-model workflow.
return strict json only. no extra text.
routes:
- "answer_directly" (simple question, no tools, no long planning)
- "extract_json" (turn input into structured fields)
- "call_tool" (files, web, code execution, database)
- "escalate" (send to stronger model for multi-step planning)
- "needs_more_info" (ask 1 to 3 questions)
rules:
- do not guess.
- if a critical input is missing, use "needs_more_info".
- choose "call_tool" only when a tool is required to be correct.
- choose "escalate" when multi-step planning or high accuracy reasoning is required.
schema:
{
"route": "answer_directly|extract_json|call_tool|escalate|needs_more_info",
"why": "one sentence",
"tool_name": "string or null",
"escalate_to": "string or null",
"questions": ["only if needs_more_info"]
}

how beginners should use it:
run router first on every request
obey the route
stop forcing the cheap model to do hard jobs
how advanced users should use it:
log router decisions
track failure modes
improve routing rules over time
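one practical habit: never trust the router's output blindly. small models often wrap their json in prose, so parse defensively and fail closed instead of guessing. a minimal sketch (the route names come from the schema above):

```python
import json

VALID_ROUTES = {"answer_directly", "extract_json", "call_tool",
                "escalate", "needs_more_info"}

def parse_route(raw: str):
    # fail closed: anything that isn't strict json with a known route
    # returns None, which your loop treats as "retry or escalate".
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if decision.get("route") not in VALID_ROUTES:
        return None
    return decision

print(parse_route('{"route": "call_tool", "tool_name": "web_search"}'))
print(parse_route("sure! here is the json: {...}"))  # None
```

logging every `None` here gives you exactly the failure-mode data the advanced-user tips above ask for.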
step 2: extractor prompt (turn chaos into clean inputs)
use this when the router returns extract_json.
system:
you are the extractor. extract only what is explicitly present.
do not infer.
return strict json only. no extra text.
if a field is missing, set it to null.
schema:
{
"entities": [],
"requirements": [],
"constraints": [],
"data_points": [],
"unknowns": []
}

why it helps:
it makes missing info obvious
it reduces hallucinations because the next model is not forced to “fill the gaps”
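to cash in on "missing info is obvious," scan the extractor's output for null or empty fields and turn them into questions instead of letting the worker guess. a sketch, using a made-up extraction as the sample:

```python
import json

def find_gaps(extracted_json: str):
    # any field the extractor left null or empty is a gap the pipeline
    # should ask about, not something the worker should invent.
    data = json.loads(extracted_json)
    return [field for field, value in data.items() if value in (None, [])]

sample = ('{"entities": ["invoice #123"], "requirements": [], '
          '"constraints": null, "data_points": ["total: $90"], "unknowns": []}')
print(find_gaps(sample))  # ['requirements', 'constraints', 'unknowns']
```

feeding this list into the router's "needs_more_info" questions closes the loop.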
step 3: worker prompt (do the job)
this is where you write the post, draft the email, generate the code, etc. use your stronger local model here, or escalate to a paid model for hard tasks.
system:
you are the worker.
follow the extracted requirements and constraints exactly.
do not add new facts.
if you need missing info, ask concise questions.
when producing json, output strict valid json only.

step 4: verifier prompt (catch confident nonsense)
run this after the worker when accuracy matters.
system:
you are the verifier. your job is to catch confident nonsense.
steps:
1) list factual claims in the draft
2) mark each claim supported or unsupported using only provided context
3) detect missing assumptions or missing inputs
4) validate json schemas if present
5) return a corrected version if possible, otherwise ask questions
output format:
claims:
issues:
corrected_output:
questions:

advanced upgrade:
add deterministic validators (json schema checks, unit tests, linting)
verification becomes objective, not vibes
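a deterministic validator can be a dozen lines of stdlib code: check that the verifier's output is valid json and that the required fields exist with the right types. this sketch uses the field names from the output format above:

```python
import json

# required fields and their expected types, matching the verifier format
REQUIRED = {"claims": list, "issues": list,
            "corrected_output": str, "questions": list}

def validate(raw: str):
    # returns a list of problems; an empty list means the output passes.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid json"]
    problems = []
    for field, expected_type in REQUIRED.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems

good = '{"claims": [], "issues": [], "corrected_output": "ok", "questions": []}'
print(validate(good))  # []
print(validate("not json"))
```

because this check is deterministic, a failure means "regenerate," not "argue with the model."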
optional: the beginner voting trick
if the task is important:
run the worker twice with different randomness settings
feed both outputs to the verifier and have it pick the better one and explain why
this is the practical version of “multiple attempts + selection,” which is the same core idea behind self-consistency style gains.
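the voting trick as code: run the worker at two temperatures and let a judge pick. `generate` is a hypothetical model call, stubbed here so the shape of the loop is clear; the judge is where you'd plug in the verifier:

```python
def generate(prompt: str, temperature: float) -> str:
    # stub for a real model call; temperature is the "randomness setting"
    return f"draft@{temperature}: answer to {prompt}"

def best_of_two(prompt: str, judge) -> str:
    # one conservative attempt, one creative one; the judge decides.
    a = generate(prompt, temperature=0.2)
    b = generate(prompt, temperature=0.9)
    return judge(a, b)

# stub judge: in real use this is your verifier prompt picking the
# better draft and explaining why. here it just prefers the first.
picked = best_of_two("write the summary", judge=lambda a, b: a)
print(picked)
```

the cost is one extra worker call, which on a tiny local model is usually cheap enough to be worth it for anything important.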
bottom line
small local models look insanely dumb when you force them to be planner, tool-user, writer, and verifier all at once.
give them a clean pipeline, keep openclaw context large (64k+), and add verification. you will not fully match frontier models like claude opus 4.6 on long messy tasks, but you can close a meaningful chunk of the gap for real business automation while keeping cost and privacy under your control.
before you go.
if you plan to use openclaw seriously, the real skill is not finding the newest model. it is learning how to spend fewer tokens while getting the same or better results.
most builders do the opposite. they keep sending bigger prompts to bigger models and watch their usage explode.
the smarter approach is small model arbitrage. let tiny models handle the cheap work like routing, filtering, and preparation. only bring in stronger models when the task actually deserves it.
i’m putting together a deep breakdown on this approach specifically for openclaw. it covers the routing patterns, model roles, and token-saving habits that quietly reduce your bill without hurting reliability.
if you plan to build seriously with openclaw, you will want that guide when it drops.
paid subscribers get it.
p.s. paid or free, i’m really just glad you’re here. if you’re experimenting with openclaw right now, you’re super early and literally on the edge of where this is all going. everyone you know will operate on the agentic layer in some way in the very near future and it’s only at sub-1% realization right now.
josh /openclaw



