Docs/Advanced

How the Agent Works

Nephele's Agent isn't a chatbot — it's an assistant that actually gets work done. Once it understands what you want, it plans the steps on its own, calls tools, observes the results, adjusts its approach, and keeps going until the job is finished.

At the heart of this is the ReAct loop (Reasoning + Acting): think first, then act, observe the result, then think about the next step. But before any of that, Nephele runs a fast rule check — if your instruction is clearly simple, it just executes it directly, without burning LLM compute.

Two Execution Paths

When the Agent handles your request, it tries the first path first, and only falls through to the second when the first doesn't apply:

Path One: Rule Engine (zero latency)

For clear, simple commands like "open Chrome" or "list the files on my desktop," the Agent parses them directly with its built-in rule engine, and runs the matching tool the moment it finds one. No LLM is involved at any point — the response is instant.

Path Two: ReAct Loop (LLM reasoning)

Complex or ambiguous requests go into the ReAct loop. The Agent and the LLM go back and forth over multiple turns: each turn, the LLM decides which tool to call, the Agent runs it and feeds the result back, and this continues until the task is done or the limit is reached.

Before entering ReAct, there's also a free guide-channel gate that first absorbs lightweight "how do I use the software" questions (see "Routing & Billing" below). Only requests that genuinely need hands-on work proceed to full Cloud MAX.

	Rule Engine	ReAct Loop
Latency	Milliseconds	Seconds (depends on model and network)
Uses LLM	No	Yes
Best for	Clear open/close/query commands	Analysis, creation, multi-step tasks
Tool calls	1	Up to 20

Routing & Billing

The main input box is a single unified entry point. After you type, your request is handled in three tiers, from lowest cost to highest:

Tier	Handled by	Uses LLM	Cost	Best for
Rule Engine	Local, zero latency	No	None	Clear open/close/query commands
Guide-channel gate	Lightweight model Axioma Zephyr	Yes (lightweight)	Free, no stamina charged	Software usage, navigation, one-line facts
Cloud MAX	Axioma Breeze	Yes	Billed in cloud credits	Analysis, creation, multi-step tasks

提示

The routing is transparent to you — you never have to pick a mode by hand. If the rule engine can't handle it, the guide channel absorbs one free round; only if that still can't handle it does the request pass through to Cloud MAX. If the guide channel fails, the request is always let through — it will never block your input. Cloud MAX reasons over the network: it first draws from your Stamina, and only once Stamina is exhausted does it draw on cloud credits (Nepheline).

The one "special tool" difference is in Cloud MAX: once in Cloud MAX mode, two extra tools are registered — cloud_search (live cloud search) and delegate_tasks (subtask delegation).

The ReAct Loop: One Step at a Time

Once in Cloud MAX, the Agent runs a ReAct loop. One full interaction looks like this:

Build the request — combine the system prompt (tool descriptions, your memories, the current time) + the conversation history + your new message, and send it to Axioma Breeze as a streaming SSE request
Stream the response — the model returns a chain of thought (thinking) and the main body (content), which the Agent forwards to the UI in real time
Tool calls — if the model decides to call tools, the Agent runs them, formats the results, and appends them to the conversation history; it supports collecting multiple tool calls in one turn and executing them in a batch (which fixed an early bug where multiple tool calls were dropped)
Loop decision — if the task isn't finished, it returns to step 1 and continues the next round

Iteration Limit

The hard cap is 20 tool calls. Once it's reached, the Agent stops and tells you "the task may not be fully complete."

Stop Mechanism

You can hit the stop button at any time. The Agent checks for the cancel signal at these points:

Before the start of each iteration
During SSE stream reception
Between serial tool executions

After you stop, the results of tools that already ran are not undone, but any subsequent tool calls are cancelled.

The Tool System

Intent-Aware Tool Loading

The number of Agent tools keeps changing, but it does not dump the full parameter details of every tool into the model at once — it loads them on demand. Initially only the Tier 1 core tools are loaded (about twenty common tools, including tool_search); the remaining ~forty tools are listed by name only, grouped by category, inside an <available-tools> block. When the model needs one, it calls tool_search to dynamically load its concrete schema. This cuts the per-turn input tokens from around 3,000 down to around 500, saving a lot of context.

Cloud tools billed in cloud credits (such as cloud_search and find_artist_works) are never even exposed on the purely local quick-command path — an extra layer of protection.

Parallel Execution

If the model requests multiple tools in a single turn, the Agent decides which ones can run in parallel based on each tool's own concurrency_safe attribute — the ones that can are run simultaneously via a ThreadPoolExecutor (up to 4 threads), and the rest run serially.

run_python is never parallelized, and before it runs you must confirm the code in a dialog (which you can edit).

The Memory System

The Agent's memory is card-based: each memory is its own Markdown file, stored locally under ~/.nephele_workshop/memory/, with a short frontmatter (title, one-line summary, type).

The Five Types of Memory

Through the memory_write tool, the Agent saves information worth keeping long-term as a memory, each filed under one of these five types:

user — who you are (your creative style, personal habits)
workflow — how you work (your usual processes, preference settings)
project — what you're working on (the context of the current project)
reference — reference info such as your accounts and frequently used paths
correction — corrective preferences for the Agent's behavior

Only content worth keeping long-term across sessions gets written; intermediate state like temporary preferences from the current session or tool-call details is not stored.

Index and On-Demand Reading

The Agent doesn't stuff every memory's full text into the context. Instead it injects only an automatically maintained index (MEMORY.md) — each memory takes just one line in the index (title + one-line summary), grouped by type and sorted by most recently updated within each group. From the index, the Agent knows which memories are available, and only when it actually needs the details of one does it call memory_load to read the full text.

Memories are created, updated, and deleted by three tools: memory_write (create or overwrite by the same name), memory_update, and memory_delete. The injected index is capped at around 4,000 characters; each memory's summary is capped at 200 characters and its body at 6,000 characters, with anything beyond that truncated. When there are no memories at all, the index is empty and uses no tokens.

技巧

All memories live only on your local device and are never uploaded to a server.

Sub-Agent Delegation

For complex tasks, the Agent can call delegate_tasks to break the work into multiple subtasks. Each subtask is handled by a sub-agent.

Limits:

At most 6 subtasks
A 120-second timeout each
Genuinely parallel execution (ThreadPoolExecutor, up to 4 concurrent)
Available only in Cloud MAX mode (delegate_tasks is a Cloud MAX–exclusive tool)

Sub-agents have a few extra constraints:

They cannot delegate further sub-agents (to prevent infinite recursion)
They cannot run operations that require user confirmation (such as run_python)

There's also internal anti-pattern detection: if you ask for reference images, the main Agent should not use delegate_tasks to split the work into several image-finding subtasks, because find_references already searches multiple platforms in parallel internally.

Safety and Authenticity Constraints

run_python Confirmation

Before running Python code, a confirmation dialog pops up showing the full code and its description. You can cancel, confirm directly, or edit it before confirming.

Image Compression

When the Agent receives an image attachment, it automatically compresses it to a longest edge of 1536px at JPEG quality 85%, reducing the base64 transfer size and token consumption. Animated formats like GIF are kept as-is.

Hard Authenticity Constraints (Cloud MAX)

Cloud MAX's system prompt embeds top-priority constraints:

Never fabricate any ID, URL, file path, or image information
Only use data that genuinely appears in tool return values
When a tool returns 0 results, honestly say "not found" — don't make up data to fill the gap
A reverse-search URL must be resolved through resolve_source_url before it can be reported as the "original author"

Tool Error Self-Correction

After a tool fails, the Agent prepends an [ERROR] tag to the result and adds an actionable hint:

"File not found" → suggests using search_files to confirm the path
"Permission denied" → suggests trying a path under the user's home directory
"Timeout" → suggests retrying or trying a different approach

Event-Stream Architecture

The Agent doesn't "finish a whole paragraph and then show it" — it streams output in real time:

The chain of thought (thinking) is shown live
The main body (content) appears character by character
Tool calls notify you on both start and finish
Quota and stamina consumption update in real time

The Agent core produces events (subclasses of AgentEvent) through a pure generator, which AgentWorker then converts into Qt Signals and sends to the QML interface. This design lets the Agent run on a background thread without freezing the UI.

The main event types:

Event	Description
`StreamChunk`	A fragment of the main body text
`ThinkingChunk`	A fragment of the chain of thought
`ToolStart`	A tool has started running
`ToolFinish`	A tool has finished (with success/failure status and the result)
`NavigateView`	The interface needs to switch views
`RateLimitUpdate`	Stamina quota update
`StaminaCost`	Stamina consumed by a single request
`CreditsUpdate`	Cloud-credit balance update
`LoopError`	The loop hit an error (network / auth / service unavailable)
`LoopDone`	One conversation round completed

Operation Journal and Undo

Every Agent conversation round is recorded to the OperationJournal, generating a unique round ID. On completion, a  marker is injected at the end of the message. This means some operations may support undo in the future (the marker is already embedded; the UI layer is yet to be implemented).

Last updated Jun 21, 2026·Applies to v0.5.2-beta