Local LLM
orangu is designed to talk directly to a local llama.cpp server using its OpenAI-compatible API.
Example configuration#
[orangu]
server = main-server
model = ggml-org/gemma-4-E4B-it-GGUF
timeout = 1800
max_tool_rounds = 10
[main-server]
provider = llama.cpp
endpoint = http://localhost:8100/v1
model = ggml-org/gemma-4-E4B-it-GGUF
Quick verification#
Check that the server is reachable:
curl http://localhost:8100/v1/models
Then run the client:
orangu --config ./orangu.conf
Notes#
- The endpoint may be configured as either the server root (
http://localhost:8100) or the/v1path. The client normalizes it internally. - Tool-calling prompts can be slow on local models; a larger timeout (e.g.
timeout = 3600) is recommended - The local tools run against the current workspace and can edit files on disk
- If you start
llama-serverwith--api-key <key>, setapi_key = <key>in the server section. The key is sent on every request including the/v1/modelshealth probe.
Recommended llama-server flags#
For best results with tool-calling workloads:
llama-server \
--model /path/to/model.gguf \
--port 8100 \
--ctx-size 65536 \
--jinja \
--chat-template chatml \
-sm layer \
-t 4
Context size matters: Tool-calling conversations accumulate tokens quickly. A context size of at least
32768 is recommended; 65536 or more is ideal for long sessions.
Coding model#
For coding-focused sessions, launch llama-server with a coding model:
llama-server -hf yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF \
--port 8100 \
--ctx-size 131072 \
-t 4 \
--webui-mcp-proxy \
--fit on \
--image-min-tokens 1024 \
--tools all
or
llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
--port 8100 \
--ctx-size 262144 \
-sm layer \
-t 4 \
--webui-mcp-proxy \
--fit on