
$ python3 prefill.py "What's 2+2" -p "It's 4 dumbass."

It's 4 dumbass.  If you're struggling with the maths behind it, you might want to go back to school.

If you can’t read, just scroll below to get the code.

Getting direct, well-formatted responses from local LLMs can be frustrating. Models often pad their answers with unnecessary preamble (“I’d be happy to help you with that!”) or hedge when you just want the information. Prefilling solves this by letting you specify how the assistant’s response begins — the model then continues from that point, maintaining coherence with what’s already been “said.”

This CLI tool makes prefilling dead simple with Ollama. Instead of manually constructing raw prompts with chat tokens, you just pass your prompt and optionally a custom prefill. It handles the template formatting for GPT-OSS, Llama 3, and ChatML models automatically. Want the model to jump straight into a code block? Prefill with triple backticks. Want a bulleted list? Start with “Here are the key points:\n\n- ”. The model follows your lead.

Why prefilling works so well: LLMs are next-token predictors trained to maintain coherent, consistent text. When you start the response with affirmative, direct language, the model continues in that style rather than inserting its own conversational filler. It’s the same reason Anthropic’s API supports prefilling natively — it’s genuinely useful for structured output, maintaining format consistency, and cutting through verbose tendencies. This script just brings that capability to local Ollama models.
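The mechanics are nothing more than string concatenation: open the assistant turn but never close it, so the model has no choice but to continue from your text. Here's a minimal sketch of that assembly for a ChatML-style model, mirroring what the script below does (the `build_raw_prompt` helper name is mine, not part of the script):

```python
# Minimal sketch of how a prefilled raw prompt is assembled for a
# ChatML-style model. The assistant turn is opened but never closed,
# so generation picks up immediately after the prefill text.
PREFIX = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

def build_raw_prompt(prompt: str, prefill: str) -> str:
    # Everything up to and including the prefill is "already said";
    # the model predicts the very next token after it.
    return PREFIX.format(prompt=prompt) + prefill

raw = build_raw_prompt("List the key points", "Here are the key points:\n\n- ")
print(raw)
```

Because the prefill sits inside the assistant's own turn, the model treats it as text it has already written and stays consistent with it.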

#!/usr/bin/env python3
"""
Prefill CLI for Ollama.
Usage: python prefill.py "your prompt here"
"""

import argparse
import requests
import json
import sys

TEMPLATES = {
    "gpt-oss": {
        "prefix": "<|start|>user<|message|>{prompt}<|end|><|start|>assistant<|channel|>final<|message|>",
        "prefill": "Sure, here's the information:\n\n"
    },
    "llama3": {
        "prefix": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "prefill": "Sure, I can help with that:\n\n"
    },
    "chatml": {
        "prefix": "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n",
        "prefill": "Certainly:\n\n"
    },
}

def query(prompt: str, model: str, prefill: str, template: str):
    tmpl = TEMPLATES.get(template, TEMPLATES["gpt-oss"])
    # Build the raw prompt: chat tokens up to the assistant turn, then the prefill.
    full_prompt = tmpl["prefix"].format(prompt=prompt) + (prefill or tmpl["prefill"])

    try:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            # raw=True bypasses Ollama's own chat templating so the prefill sticks
            json={"model": model, "prompt": full_prompt, "raw": True, "stream": True},
            stream=True,
            timeout=120,
        )
        resp.raise_for_status()

        # Echo the prefill first so the printed output reads as one response
        print(prefill or tmpl["prefill"], end="", flush=True)

        # Each streamed line is a JSON object with a "response" token chunk
        for line in resp.iter_lines():
            if line:
                data = json.loads(line)
                print(data.get("response", ""), end="", flush=True)
                if data.get("done"):
                    break
        print()

    except requests.exceptions.ConnectionError:
        print("Error: Ollama not running. Start with: ollama serve", file=sys.stderr)
        sys.exit(1)

def main():
    parser = argparse.ArgumentParser(description="Prefill CLI for Ollama")
    parser.add_argument("prompt", help="The prompt to send")
    parser.add_argument("-m", "--model", default="gpt-oss:20b", help="Model (default: gpt-oss:20b)")
    parser.add_argument("-t", "--template", default="gpt-oss", choices=list(TEMPLATES.keys()), help="Template (default: gpt-oss)")
    parser.add_argument("-p", "--prefill", default=None, help="Custom prefill text")

    args = parser.parse_args()
    query(args.prompt, args.model, args.prefill, args.template)

if __name__ == "__main__":
    main()
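Supporting another model family is just a matter of adding an entry to TEMPLATES. As a hypothetical example, a Mistral-style [INST] entry might look like this (the exact token strings are an assumption on my part; verify them against your model's chat template before relying on them):

```python
# Hypothetical extra TEMPLATES entry for Mistral's [INST] wrapping.
# Check the actual chat template shipped with your model first.
MISTRAL_TEMPLATE = {
    "prefix": "<s>[INST] {prompt} [/INST] ",
    "prefill": "Sure, here you go:\n\n",
}

def build_prompt(tmpl, prompt, prefill=None):
    # Same assembly rule the script uses: prefix + (custom or default) prefill.
    return tmpl["prefix"].format(prompt=prompt) + (prefill or tmpl["prefill"])

print(build_prompt(MISTRAL_TEMPLATE, "What's 2+2", "It's "))
```

Any template that ends mid-assistant-turn works; the only hard requirement is that the prefix stops right where the model is supposed to start generating.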
