Local LLM

orangu is designed to talk directly to a local llama.cpp server using its OpenAI-compatible API.

Example configuration

[orangu]
server = main-server
model = ggml-org/gemma-4-E4B-it-GGUF
timeout = 1800
max_tool_rounds = 10

[main-server]
provider = llama.cpp
endpoint = http://localhost:8100/v1
model = ggml-org/gemma-4-E4B-it-GGUF
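
In the example above, the server key in [orangu] appears to name the section that holds the connection details. As a minimal sketch of that lookup (not orangu's own parser), the endpoint can be resolved from the file with a small shell helper. The name conf_get is hypothetical, and the parsing assumes simple "key = value" lines with no inline comments.

```shell
#!/bin/sh
# conf_get FILE SECTION KEY
# Hypothetical helper: print the value of KEY inside [SECTION] of an
# INI-style file. Assumes plain "key = value" lines; not orangu's parser.
conf_get() {
  awk -v section="$2" -v key="$3" '
    /^\[/ { in_section = ($0 == ("[" section "]")); next }
    in_section {
      eq = index($0, "=")
      if (eq == 0) next                      # skip blank or malformed lines
      k = substr($0, 1, eq - 1)
      v = substr($0, eq + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)         # trim whitespace around the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)         # trim whitespace around the value
      if (k == key) { print v; exit }
    }
  ' "$1"
}

# Usage (assumes ./orangu.conf has the layout shown above):
#   server=$(conf_get ./orangu.conf orangu server)
#   endpoint=$(conf_get ./orangu.conf "$server" endpoint)
#   echo "$endpoint"
```

Resolving the endpoint this way keeps ad-hoc checks pointed at the same URL the client is configured to use.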

Quick verification

Check that the server is reachable:

curl http://localhost:8100/v1/models
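
A single request can fail if the server has only just been started and is still loading the model. The sketch below retries the same check for a limited number of attempts. The function name wait_for_server and the one-second retry interval are illustrative choices, not part of orangu or llama.cpp.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers successfully or the
# attempt budget runs out. Returns 0 on success, 1 on giving up.
wait_for_server() {
  url="$1"
  attempts="${2:-30}"          # number of tries, roughly one second apart
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failures, -s: no progress output
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for_server http://localhost:8100/v1/models 60 \
#     && orangu --config ./orangu.conf
```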

Then run the client:

orangu --config ./orangu.conf

Notes

For best results with tool-calling workloads, start llama-server with a large context window and Jinja chat templating enabled:

llama-server \
  --model /path/to/model.gguf \
  --port 8100 \
  --ctx-size 65536 \
  --jinja \
  --chat-template chatml \
  -sm layer \
  -t 4

Context size matters: tool-calling conversations accumulate tokens quickly. A context size of at least 32768 is recommended; 65536 or more is ideal for long sessions.
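
To make the context-size recommendation concrete, here is a rough back-of-the-envelope estimate of how many tool rounds fit into a given window. The token figures (4000 for the system prompt plus tool schemas, 2000 per tool round) are purely illustrative assumptions; real values depend on the model, the tools, and the size of tool output.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many tool rounds fit in a context window.
# All token counts passed in are assumptions, not measurements.
rounds_that_fit() {
  ctx_size="$1"       # value passed to --ctx-size
  fixed_tokens="$2"   # assumed: system prompt + tool schemas
  per_round="$3"      # assumed: tool call + tool result + model reply
  echo $(( (ctx_size - fixed_tokens) / per_round ))
}

rounds_that_fit 32768 4000 2000   # prints 14
rounds_that_fit 65536 4000 2000   # prints 30
```

Under these assumed numbers, a 32768-token window leaves only modest headroom above max_tool_rounds = 10, which is one way to see why the larger setting suits long sessions.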

Coding model

For coding-focused sessions, launch llama-server with a coding model:

llama-server -hf yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF \
             --port 8100 \
             --ctx-size 131072 \
             -t 4 \
             --webui-mcp-proxy \
             --fit on \
             --image-min-tokens 1024 \
             --tools all

or

llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
             --port 8100 \
             --ctx-size 262144 \
             -sm layer \
             -t 4 \
             --webui-mcp-proxy \
             --fit on