Kernel for AI Agents
Introduction
Building sophisticated AI agents today feels like having to create a grand cathedral rose window for every task, no matter how simple.
We, as developers, become artisans, painstakingly cutting, shaping, and fitting together countless unique shards of "prompt-glass," framing them with bespoke scripts that must account for the shape of each and every fragment. The results can be genuinely awe-inspiring and immensely satisfying - when everything works just right.
However, while such an artisanal effort may be justifiable, even needed, for a monumental cathedral or a lord's palace, it is simply unsustainable when we need not one, but hundreds or thousands of reliable "windows" for diverse, everyday applications. Like that medieval masterpiece, each agent built this way is a one-off: immensely labor-intensive, incredibly complex, and terrifyingly fragile. A subtle flaw in one hand-crafted component, an unexpected shift in the underlying data, or a tremor from an LLM update can crack the entire magnificent design, sending us scrambling to repair a unique, opaque structure.
Of course, as builders, we are artists. However, we are also engineers, and we intend to build systems that are robust, scalable, and appropriate for their task. It simply does not make sense to go through the process of building artisanal glass when it’s only meant to keep the cold out of your house.
This problem, however, is nothing new to computing. After all, early programs were written in raw machine code, with the computer processing a human-defined task, only for the human to take the result and painstakingly design the next task, fitting them all together using their own domain expertise. Every program was a rose window, and every programmer was as much an artisan as an engineer. Such a process not only created a massive barrier to entry, but also meant that early programs could not simply be plugged together, let alone become a constant presence in every home.
Thankfully, those early days also provided a blueprint for the path forward. The journey from human-led, bespoke computing to the ubiquity of reliable, glass-covered computers in every pocket wasn't just about faster and better processing (be it hardware or, in our case, LLMs); it was fundamentally enabled by a new layer of abstraction and control: the Operating System. This crucial software innovation brought order, structure, and abstraction to raw machine capabilities. It provided the stability, resource management, and standardized interfaces that allowed developers to stop hand-crafting every interaction with the bare metal and start building complex applications with unprecedented efficiency and reliability.
Today, as we build with Large Language Models, we face a strikingly similar set of systemic challenges, albeit expressed in a new vernacular. We manually wrestle with an LLM's finite context window, a digital echo of early computing's scarce memory, forcing us to constantly curate what our agents 'see' and 'remember.' We combat hallucinations and inconsistent states, where the agent's grasp of reality can drift or become corrupted, much like early, unprotected programs could leak memory and trample each other's data before kernels enforced process isolation and memory integrity. The immense reprocessing burden and its associated costs arise because there's no systemic way to capture, validate, and reuse knowledge efficiently, forcing our agents to re-learn or re-extract information repeatedly, akin to every early application re-implementing its own disk I/O routines. Agents falter in complex tool selection or misuse APIs, mirroring the chaos of applications directly addressing diverse hardware before standardized drivers and system calls. And the critical dependencies between tasks are often left to fragile chains of prompts, hoping the LLM intuits the correct sequence, a complex orchestration task that operating system schedulers mastered decades ago.
When every single process has to be done de novo for every application, it’s not at all surprising that every LLM application feels so bespoke: Every application is in truth a collection of handcrafted shards, molded together by the vision of the builder. There is no kernel-like layer for AI, and without it, each agent remains that magnificent, standalone 'rose window' – impressive, but isolated and vulnerable. The very problems that operating systems solve for traditional computing—managing volatile memory, ensuring process and data integrity, standardizing interactions with diverse resources, and orchestrating complex, dependent tasks—are now manifesting with compelling urgency in the realm of AI agents.
Here, we lay out our vision for the Praxos Kernel. Our goal is to provide builders with the control, stability, and abstraction they need to fully focus on what THEIR application needs to do, instead of reinventing the wheel each and every time.
The Original Problem
The idea for a kernel for AI agents was born from the pain of building agents that deal with complex real-world data. Praxos began as an insurtech startup. Our first product was supposed to be exceedingly simple: extract all the data from insurance documents, and let insurance professionals manipulate it and move it into other formats as needed using LLMs. It quickly became evident that we had underestimated the scope and difficulty of this problem. For one, there is no clear boundary between insurance data and non-insurance data. In fact, any information can be insurance data if viewed through the right lens. A building appraisal, a ship’s technical specification, and an income statement may all be as bona fide a source of insurance data as a policy issued by an insurance company. Insurance data proved to be intricate, rich in relationships, and often unstructured. Conventional methods – relying on LLMs to directly interpret raw documents, flat data structures, or simple vector stores for every task – proved unreliable and unscalable.
Key issues became immediately apparent:
Lost Relationships: Critical connections within data (e.g., linking a specific insurance Claim to its Policy, covered Vehicle, involved User, and applicable Coverage terms) were frequently missed or misinterpreted by models processing raw input, crippling effective reasoning. Understanding procedural dependencies (Task A must precede Task B) was equally fragile. In reality, we’ve found that LLMs fail at even more basic tasks, such as simply extracting all policy-level data for every automobile on an insurance policy.
Inconsistent State: Allowing direct, unvalidated updates to shared knowledge led to fragmented and unreliable state, especially with multiple agents or processes involved. When related data had to be generated together, there was very little, if any, visibility into the process, forcing us into significant prompt-engineering work or repeated calls just to get good results. Even then, results were unreliable, changing between runs.
Context Overload & Cost: Feeding massive amounts of potentially relevant raw text or unstructured data into LLM prompts for every decision is inefficient, costly, and often exceeds token limits, necessitating complex and brittle RAG implementations per use case. Our data points clustered extremely close together when using RAG, meaning that vector-based searching alone was rife with false positives and false negatives.
Reprocessing Burden: The same information needed constant re-extraction, re-validation, and re-formatting for different tasks because it wasn't captured initially in a structured, reusable format with its relationships intact.
Category Errors: Agents frequently misused data or tools due to a lack of understanding of the semantic meaning of the information (e.g., confusing PolicyID with ClaimID, misapplying financial limits, dropping limits that were based on time as opposed to a set dollar amount).
Extreme Data Heterogeneity: The need to handle and relate vastly different data types (text, numbers, codes, dates, locations, etc.) demanded a more structured approach than simple key-value or document stores.
Once we actually solved this problem, however, further problems simply arose: even with this data engine, our service became a web of disconnected agentic and non-agentic AI workflows. Chat-based functionality and static workflows were all disconnected, and relying on the AI to make the correct tool choices against our own APIs was a headache, leading to an ever-increasing maintenance cost. It was simply not worth spending the time to connect them, even though that meant a worse client experience and a higher barrier to entry, because we would have to constantly change them to account for updates.
These challenges highlighted that reliable AI operation required moving beyond output schemas, sources, and prompts, wherein every single step was just another LLM call. In short, we were tired of always knowing that what we were building could simply, randomly, not work.
Foundation: An Automatically Generated, Typed Knowledge Graph
Simply put, much of the information needed by an LLM does not exist explicitly in the source data. In particular, most data is not a simple key-value pair, but is in fact defined by relationships, which are either represented visually (a table) or put into objects that demonstrate the relationship structurally. While humans and some LLMs are able to parse such information, the relationships themselves are not explicitly present in the source data and, as such, are not searchable.
This means that we rely on the LLM to re-extract the data again and again, forcing us to send more context than necessary to ground each extraction. What’s more, since there is no real guarantee on the veracity of the extracted data, the result can simply end up being semantically incorrect. And since the process is inherently sequential, a single mistake can invalidate everything downstream, requiring the opaque extraction to be done all over again.
The need for such explicit structuring extends crucially to the type of data itself, but in a way that transcends conventional notions of 'integer', 'string', or 'float'. For an AI agent to operate with precision and avoid costly misinterpretations, the system must also understand the semantic meaning of each piece of information. Is a given string a 'CompanyName', a 'PolicyNumber', a 'VehicleModel', or a 'CountryName'? This extends to higher and lower levels of abstraction as well: a Company Name and a Personal Name are both, in the end, proper names. A car and a boat are different and thus have different properties, yet they are both vehicles, and thus share properties and behaviours when viewed through different lenses. Each of these semantic types carries distinct implications, validation rules, and potential relationships that are invisible if everything is treated as generic text. This granular, semantic typing is therefore not just a label but a foundation. It enables the system—and by extension, the AI agents—to disambiguate concepts, enforce domain-specific logic, guide tool selection with appropriate inputs, drastically reduce category errors (like mistaking a Claim ID for a Policy ID), and ultimately ensure that LLM interactions are grounded in a shared, unambiguous understanding of the data's meaning and purpose. Without it, we are stuck shifting the burden of semantic interpretation entirely onto the probabilistic shoulders of the LLM for every single operation.
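To make this concrete, here is a minimal sketch of such a semantic type hierarchy in Python. The type names (ProperName, CompanyName, Vehicle, Car, Boat, PolicyNumber) mirror the examples above; the classes themselves are purely illustrative assumptions, not the actual Praxos type system.

    # Hypothetical semantic types: far finer-grained than "string" or "float".
    class SemanticType:
        def __init__(self, value):
            self.value = value
            self.validate()
        def validate(self):
            pass                                   # subclasses enforce their own rules

    class ProperName(SemanticType): pass           # shared parent type
    class CompanyName(ProperName): pass            # a company's legal name
    class PersonalName(ProperName): pass           # a person's name

    class Vehicle(SemanticType): pass              # shared vehicle-level properties
    class Car(Vehicle): pass                       # road-specific properties
    class Boat(Vehicle): pass                      # marine-specific properties

    class PolicyNumber(SemanticType):
        def validate(self):
            if not str(self.value).strip():
                raise ValueError("PolicyNumber cannot be empty")

    # The same raw string means different things under different types; treating
    # a PolicyNumber as a CompanyName becomes a detectable type error instead of
    # a silent category error left to the LLM.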
To address these data-centric problems, our foundational step was developing an automated process capable of analyzing source documents and transforming them into a Graph-Based State & Knowledge Management system. This structured graph, generated upfront, serves as the initial "source of truth" for the kernel and for the agents operating on it. It features the following (a brief illustrative sketch in code follows this list):
Atomic, Typed Nodes & Relationships: The process identifies core entities (like PolicyHolder, Claim, Vehicle, Task) and represents them as nodes with defined semantic types. Specific data points are extracted as distinct Literal Nodes, also carrying specific semantic types (e.g., CoverageLimit expects CurrencyAmount, ClaimStatus expects ClaimStatusEnum, VIN expects VehicleIdentificationNumber). Explicit, typed relationships (insures, filed_under, depends_on) capture the vital connections.
Built-in Consistency & Validation: Because the graph is generated with explicit semantic types, subsequent operations managed by the kernel can rigorously enforce these types. Attempts to store incompatible data or link unrelated entities are flagged, preventing category errors at the data interaction layer.
Relationship Traversal: The explicitly mapped relationships allow kernel services and agents to reliably navigate the knowledge graph to gather context ("Find all Claims related to Policy X involving Vehicles also listed on Policy Y").
Controlled Concurrency: The graph structure supports locking mechanisms (managed via kernel services) for safe concurrent updates to shared state.
State Propagation: The explicit relationships enable defining rules for automatic updates across connected nodes.
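As a rough illustration of what such a typed graph might look like (not the actual Praxos schema), the sketch below models typed nodes and relationships; the class names and the ALLOWED schema table are assumptions made for this example.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        node_id: str
        semantic_type: str                              # e.g. "Policy", "Claim", "Vehicle"
        literals: dict = field(default_factory=dict)    # typed literal values, e.g. {"VIN": ...}

    @dataclass
    class Relationship:
        rel_type: str                                   # e.g. "insures", "filed_under", "depends_on"
        source: Node
        target: Node

    # Which relationships are legal between which semantic types; writes that
    # violate this schema are rejected, catching category errors early.
    ALLOWED = {
        ("Policy", "insures", "Vehicle"),
        ("Claim", "filed_under", "Policy"),
        ("Task", "depends_on", "Task"),
    }

    def add_relationship(graph: list, rel: Relationship) -> None:
        key = (rel.source.semantic_type, rel.rel_type, rel.target.semantic_type)
        if key not in ALLOWED:
            raise TypeError(f"relationship {key} violates the graph schema")
        graph.append(rel)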
Transition to Tasks & The Need for a Kernel:
Having developed the capability to reliably model the intricate state of the world in this structured, typed Knowledge Graph, we tackled the next layer: the processes and actions agents perform. We mentally model agents not as conversational interfaces, but as data transformation operations: an agent takes some input state (X1) from the graph, performs computation/reasoning/external calls, and produces an output state (Y1) reflected back into the graph.
This perspective highlighted critical dependencies and potential failures. Consider a user asking an agent, "What's the deductible for my Mercedes-Benz?" A naive agent might query for deductibles immediately. But what if the Mercedes isn't actually covered under the active policy in the graph? The agent might fetch data for a different vehicle, hallucinate, or simply fail. The correct operation requires a dependency: Task A ('Verify Vehicle Coverage') must successfully complete before Task B ('Fetch Deductible') can run. Relying on elaborate prompt engineering to handle every such dependency case made for brittle agents.
Furthermore, these agent operations (Tasks) displayed inherent relationships that needed to be managed: dependencies, priorities, resource requirements. What we needed was better scaffolding: if the graph provides the structured "memory" and "world model," what provides the reliable "process management," "dependency resolution," "validated I/O control," and "intelligent scheduling"?
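The deductible example can be expressed as an explicit dependency rather than a prompt-engineering convention. The sketch below is illustrative only; the Task structure and its depends_on field are hypothetical stand-ins for whatever representation the kernel ultimately uses.

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        name: str
        depends_on: list = field(default_factory=list)
        status: str = "PENDING"                    # PENDING, RUNNING, DONE, FAILED

    verify = Task("verify_vehicle_coverage")
    fetch = Task("fetch_deductible", depends_on=[verify])

    def ready(task: Task) -> bool:
        # A task may run only once every upstream dependency has completed.
        return all(dep.status == "DONE" for dep in task.depends_on)

    # If verify_vehicle_coverage fails (the Mercedes is not on the policy),
    # fetch_deductible never becomes ready, so the agent cannot hallucinate a
    # deductible for an uncovered vehicle.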
Solution: A Kernel for Agents
A Kernel is the necessary scaffolding built around the foundational Knowledge Graph. The kernel consists of essential services providing a complete, stable, and efficient runtime environment for agents (viewed as managed processes) interacting with the structured graph data:
Abstraction: Encapsulates the complexity of direct graph interaction, context assembly (loading relevant subgraphs into working memory), LLM/tool invocation, and error handling patterns.
Resource Management: Intelligently allocates, monitors, and enforces limits on compute, memory, LLM tokens/API calls, and registered tools across agents and tasks.
Stability & Security: Implements and enforces mechanisms like state locking, agent sandboxing (process/permission isolation), input/output validation against semantic types, and structured error handling/recovery.
Orchestration: Manages the execution lifecycle of agents and the flow of potentially complex tasks, including dependency resolution and decomposition into executable instructions.
Standardization: Offers consistent APIs (the Kernel Service Boundary) and interfaces for agent development, interaction with kernel services, communication, and tool usage.
Optimally, when implemented, this results in an architecture characterized by:
Kernel Service Boundary (The "Validated Gateway"): The critical API layer enforcing validation, permissions, and abstraction between agent logic and kernel services operating on the typed graph and external resources.
Modularity & Interchangeability: Design components (LLM backends, tools, potentially even scheduling algorithms) to be pluggable.
Stability & Reliability by Design: Embed mechanisms like locking, sandboxing, rigorous type validation, idempotency where possible, and robust error handling throughout.
Key Kernel Layers & Integration with the Typed Graph:
Our development leverages our graph foundation and builds kernel modules outwards incrementally:
Foundation: Graph-Based State & Knowledge Management, aka "Memory Space"
Kernel Role: Serves as the persistent "source of truth" knowledge base. Kernel services manage loading relevant subgraphs into an agent's transient Working Memory (the "stack") for active processing. Node IDs effectively act as stable pointers ("variable pointers") within this space.
Key Features: Supports relationship traversal, state propagation, and kernel-managed locking ("check-out") for concurrency control.
Working Memory Sync: Kernel services handle the synchronization between the agent's working memory subset and the persistent graph, ensuring updates are validated against semantic types before being committed.
Operationalized Actions: Common read/query patterns or computations on graph data can be "operationalized" into basic instructions that either don’t need an LLM or need only minimal intelligence, reducing redundant LLM calls or agent logic (akin to caching or memoization at the kernel level). Other actions are deconstructed as far as possible, prioritizing minimal, atomic-level actions over loading context into a slow and expensive LLM. The LLM agent itself can use these operations directly, reducing the need for constant tool definitions, or express them in natural language and rely on the kernel to perform the conversion.
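As a minimal sketch of the check-out/validate/commit cycle described above, assuming a toy in-memory graph and approximating semantic types with plain Python types for brevity:

    GRAPH = {"policy:123": {"CoverageLimit": 100000.0}}      # persistent graph state
    SCHEMA = {"policy:123": {"CoverageLimit": float}}        # expected type per field

    def check_out(node_id: str) -> dict:
        return dict(GRAPH[node_id])                # agent's transient working copy

    def commit(node_id: str, working_copy: dict) -> None:
        for field, value in working_copy.items():
            expected = SCHEMA[node_id][field]
            if not isinstance(value, expected):    # validate before write-back
                raise TypeError(f"{node_id}.{field} expects {expected.__name__}")
        GRAPH[node_id].update(working_copy)        # commit only validated changes

    wm = check_out("policy:123")
    wm["CoverageLimit"] = 250000.0                 # valid update: commits cleanly
    commit("policy:123", wm)
    wm["CoverageLimit"] = "a quarter million"      # category error
    # commit("policy:123", wm)                     # would raise TypeError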
Agent Lifecycle Management (Process Management)
Function: Responsible for instantiating agents (based on templates defining logic, permissions, resource needs), starting/stopping/pausing their execution (e.g., as async tasks or isolated processes), monitoring health, and managing identity/status.
Integration: Creates the "processes" that interact with kernel services. Handles dependencies: Checks if required input data (nodes/relationships) exists in the graph before starting an agent/task. If not, flags status (WAITING_ON_DEPENDENCY) or potentially triggers resolver agents/workflows (future enhancement).
Stability Link: Monitors agent health via interactions with kernel services (e.g., validation failures, resource exhaustion, excessive errors). Can terminate ("kill") misbehaving agents based on rules. Prevents agents from committing invalid state changes via validation at the Service Boundary.
Standardized Flags/Events: Abstracts low-level success/failure. Kernel emits standardized events (e.g., TASK_COMPLETED, VALIDATION_ERROR, API_UNAVAILABLE, DEPENDENCY_MISSING) allowing agents or control logic to react consistently, re-plan, or escalate.
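A small illustrative sketch of such standardized events, using the event names mentioned above; the handler logic is a hypothetical example of how control logic might react, not prescribed kernel behavior.

    from enum import Enum, auto

    class KernelEvent(Enum):
        TASK_COMPLETED = auto()
        VALIDATION_ERROR = auto()
        API_UNAVAILABLE = auto()
        DEPENDENCY_MISSING = auto()

    def on_event(event: KernelEvent, task_id: str) -> str:
        # Control logic reacts to standardized flags instead of parsing raw errors.
        if event is KernelEvent.TASK_COMPLETED:
            return f"advance workflow past {task_id}"
        if event is KernelEvent.VALIDATION_ERROR:
            return f"re-plan {task_id} with corrected inputs"
        if event is KernelEvent.API_UNAVAILABLE:
            return f"queue {task_id} for retry with backoff"
        return f"mark {task_id} as WAITING_ON_DEPENDENCY"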
Streamlined and Reliable Tool Integration
The current paradigm of exposing numerous raw tools or API endpoints directly to LLM agents often leads to degraded reasoning performance and a complex 'M clients × N servers' authentication and permissioning web, as seen with the emerging Model Context Protocol (MCP) ecosystem.
The Praxos Kernel directly mitigates these issues. Firstly, its Tool Registry and contextual understanding (derived from the typed Knowledge Graph) allow the Kernel to present agents with a dynamically scoped and relevant subset of available actions, significantly reducing the LLM's decision space.
Secondly, all Kernel services and instructions possess explicit, typed signatures. This transforms tool selection from a fuzzy match against descriptions into a more structured reasoning process based on data type compatibility between the agent's current context and the tool's requirements. This dramatically improves reliability and reduces category errors.
Finally, the Kernel Service Boundary centralizes permissioning; agents interact with validated kernel operations, and the kernel enforces access control based on agent identity and the nature of the typed data and operations involved, massively reducing the search space for tool selection. Reasoning about the nature of each operation, as discussed in the next section, further helps break down the task while aiding tool selection. In practice, this allows agents to reason effectively over a well-defined action space, facilitating the use of scoped toolsets and the construction of more robust 'microagents' without sacrificing the power of LLM-driven choice.
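A minimal sketch of type-scoped tool selection, assuming a hypothetical tool registry with typed signatures; the tool names and types are illustrative.

    from dataclasses import dataclass

    @dataclass
    class ToolSignature:
        name: str
        input_types: set                           # semantic types the tool consumes
        output_type: str                           # semantic type the tool produces

    REGISTRY = [
        ToolSignature("fetch_policy", {"PolicyNumber"}, "Policy"),
        ToolSignature("fetch_claim", {"ClaimID"}, "Claim"),
        ToolSignature("geocode_address", {"Address"}, "Location"),
    ]

    def candidate_tools(available_types: set, permitted: set) -> list:
        # The kernel scopes the tool list by permission and type compatibility
        # before the LLM ever sees it, shrinking the decision space.
        return [t for t in REGISTRY
                if t.name in permitted
                and t.input_types <= available_types]

    # An agent holding only a PolicyNumber, permitted to use fetch_policy and
    # fetch_claim, is offered exactly one candidate: fetch_policy.
    print(candidate_tools({"PolicyNumber"}, {"fetch_policy", "fetch_claim"}))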
Task Execution & Orchestration Engine (Instruction Set & Control Flow)
Function: Translates agent goals or high-level tasks into executable, typed Instruction Sets (e.g., GRAPH_READ_NODE, API_CALL, COMPUTE).
Components: Includes a Task Interpreter (using LLMs, planners, or rules) for decomposition, an Instruction Dispatcher for routing based on Instant (non-AI, type-safe), Fast (small/specialized AI), or Slow (large LLM) categories, and a Workflow Manager for state/control flow.
Optimization: Allows kernel/user to configure specific LLM backends per call-type (Fast/Slow), enabling cost/performance tuning. Standardized instructions facilitate optimization.
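A rough sketch of the Instant/Fast/Slow routing idea. The instruction names GRAPH_READ_NODE, API_CALL and COMPUTE come from above, while CLASSIFY_FIELD, FREEFORM_REASON, the routing table, the backend names, and the helper stubs are assumptions for illustration.

    # Illustrative only: route each typed instruction to an Instant (no AI),
    # Fast (small model) or Slow (large LLM) lane; backends are configurable.
    ROUTING = {
        "GRAPH_READ_NODE": "INSTANT",              # deterministic graph lookup
        "API_CALL": "INSTANT",                     # validated call via a registered wrapper
        "COMPUTE": "INSTANT",
        "CLASSIFY_FIELD": "FAST",                  # hypothetical small-model task
        "FREEFORM_REASON": "SLOW",                 # full LLM call, used sparingly
    }
    BACKENDS = {"INSTANT": None, "FAST": "small-model", "SLOW": "frontier-llm"}

    def run_deterministic(instruction: dict) -> dict:
        return {"status": "OK", "op": instruction["op"]}       # stub, no AI involved

    def call_model(backend: str, instruction: dict) -> dict:
        return {"status": "OK", "op": instruction["op"], "backend": backend}   # stub

    def dispatch(instruction: dict) -> dict:
        lane = ROUTING[instruction["op"]]
        if lane == "INSTANT":
            return run_deterministic(instruction)
        return call_model(BACKENDS[lane], instruction)

    print(dispatch({"op": "GRAPH_READ_NODE", "node_id": "policy:123"}))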
Scheduler (Resource & Dependency Aware)
Function: Manages execution order/timing based on priorities, dependencies (represented as DEPENDS_ON relationships in the graph or task definitions), and resource availability (especially Fast/Slow units, API rate limits).
Advanced Features: Performs readiness checks (are input nodes available? types compatible?). Can potentially queue tasks needing missing API configurations/wrappers (WAITING_ON_CONFIGURATION) and provide interfaces for users to add them. May attempt background generation of simple API wrappers ("syscalls") if feasible. Includes queues for tasks blocked on resolvable issues (missing data, transient errors) for potential human-in-the-loop review.
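A simplified readiness check along these lines, assuming a task object with hypothetical depends_on, required_inputs, and required_apis fields; the statuses mirror those named above.

    def schedule(task, graph: dict, configured_apis: set) -> str:
        if any(dep.status != "DONE" for dep in task.depends_on):
            return "WAITING_ON_DEPENDENCY"         # upstream work not finished
        for node_id, expected_type in task.required_inputs.items():
            node = graph.get(node_id)
            if node is None or node["semantic_type"] != expected_type:
                return "WAITING_ON_DEPENDENCY"     # input missing or wrong type
        if not set(task.required_apis) <= configured_apis:
            return "WAITING_ON_CONFIGURATION"      # queue for a user-supplied wrapper
        return "READY"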
Communication Bus (Asynchronous Coordination)
Function: Enables inter-agent and agent-kernel asynchronous communication (e.g., via message queues or pub/sub). Uses schemas potentially incorporating semantic types.
Parallelism: Allows agents (running within a "Praxos executor" context) to initiate long-running kernel operations (like complex graph queries or external API calls) and continue processing, receiving results asynchronously via the bus or state updates in the graph. Example: A voice agent initiates a database lookup via a kernel service, continues interacting, and the kernel injects the query result back into the agent's working memory/context when ready.
This also creates the potential to easily send data to external services, such as databases or storage services, without the need for bespoke engineering.
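The voice-agent pattern above can be sketched with standard Python asyncio; the function names are illustrative, and the sleep stands in for a long-running kernel operation.

    import asyncio

    async def kernel_graph_query(query: str) -> str:
        await asyncio.sleep(2)                     # stands in for a slow kernel operation
        return f"result of: {query}"

    async def voice_agent():
        # Fire off the lookup without blocking the conversation.
        lookup = asyncio.create_task(kernel_graph_query("claims for policy 123"))
        print("Agent: let me check that while we keep talking...")
        # ...the agent keeps handling the dialogue here...
        result = await lookup                      # result injected when ready
        print(f"Agent: found it - {result}")

    asyncio.run(voice_agent())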
Supporting Modules:
Resource Management: Detailed tracking/enforcement of quotas (API calls, tokens, compute).
Security & Sandboxing: Agent process/data isolation.
Tool Registry: Discoverable catalog of tools, schemas (with semantic types), access policies.
Monitoring & Logging: Centralized, structured observability.
Fault Tolerance: Retry mechanisms, error handling strategies.
Benefits for Builders:
Foundation of Trust: Agents operate on validated, typed data, drastically reducing logical errors.
Massively Reduced Boilerplate: Kernel handles graph interaction, context loading, execution loops, error handling, resource management.
Radically Improved Stability: Semantic type validation, dependency checks, and sandboxing prevent common failure modes.
Efficient Resource Use: Structured graph enables targeted queries; Instant/Fast/Slow call distinction optimizes AI costs/latency.
Tackles Context Limits: The kernel manages loading relevant subgraphs ("Working Memory") into agent context.
Enables Complex Workflows: The graph backbone + kernel services facilitate managing task dependencies and multi-agent coordination.
Easier Debugging & Maintenance: Centralized services, structured data, and standardized events improve observability.
Road Map
Solidify Graph State Module & Service Interface: Ensure robust APIs for CRUD, query, locking, propagation operating on the typed graph. Recognize this layer acts as the agent's primary state space ("the stack"), where node IDs are stable pointers. Define mechanisms for efficient working memory sync.
The graph system can ingest conversational memory as well as files into a unified layer, with low-latency search across four different modalities: embeddings, WordNet-based search, LLM query construction, and type-based search.
The graph is strongly typed, and search supports both primitive types and newly defined types. Both the user and the query manager can construct queries against these types, which makes subsequent similar queries very fast.
Give the user access to type definitions and tags, growing the graph as well as increasing speed over time.
The system generates and maintains its own lemmas for entity deduplication and type resolution; these are fed forward directly to the user and are monitorable.
This search system allows easy fetching of any kind of structured data type, with LLM calls being either unnecessary or needed only to generate new derived data or to resolve decisions about what to include.
The combination of these modalities allows Praxos to achieve state-of-the-art results on both LongMemEval and LoCoMo.
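As a rough sketch of how results from these modalities could be combined (the merge logic and per-modality weights here are assumptions, not the Praxos implementation):

    # Illustrative merge of candidates from the four modalities; each retriever
    # returns (node_id, score) pairs.
    def unified_search(query: str, retrievers: dict, weights: dict) -> list:
        scores = {}
        for name, retrieve in retrievers.items():   # "embedding", "wordnet",
            for node_id, score in retrieve(query):  # "llm_query", "type_filter"
                scores[node_id] = scores.get(node_id, 0.0) + weights[name] * score
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # A type-based retriever can act as a hard filter: run it first and intersect,
    # so false positives from embeddings alone are pruned before any LLM is used.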
Develop Initial Agent Lifecycle Management: Implement agent templates, instantiation (with dependency checks), async start/stop, basic status tracking.
Use the typing and lifecycle management as scaffolding for MCP. Types can be used to create signatures and to track changes made through external and internal tools. This massively reduces the search space in tool selection while catching erroneous calls and failures. Combined with the history of each operation and transformation, it also enables error tracking and automatic retries when needed.
Integrate Core Loop: Connect Lifecycle, I/O, and Graph services for basic agent operation.
Develop Basic Task Execution Layer: Define initial typed Instruction Set, build basic Interpreter/Dispatcher.
Implement Basic Scheduler: Simple priority/dependency handling.
Iteratively Enhance: Add sophisticated scheduling, resource management, communication, security, fault tolerance.
We believe that building a robust data representation that allows fast access to the right data is the correct first step towards a semantic runtime. The Praxos Kernel will then grow around this foundation, adding the scaffolding for greater stability step by step.
Does this approach – structure the data first, then build the kernel around it – make sense?
How do you see yourself using the type hierarchy when building with AI?
What other kernel services are essential that we didn’t mention? What makes your AI workflows less stable?
Let's discuss!