Benchmarks: The Praxos Advantage
For developers building next-generation AI agents and workflows, choosing the right components is everything. As we build out more of our Kernel, we will benchmark each component and update this page. We believe that knowing how well we perform helps builders decide whether what we're building is useful to them.
LongMemEval: The Definitive Test of Conversational Memory
What It Is: LongMemEval is a benchmark for evaluating an agent's ability to remember and evolve through conversation. It measures how well a model maintains consistency and personalization over time.
What Good Performance Means for You: This is the difference between building a generic chatbot and a true AI assistant. A high score means your agent can build lasting relationships with users, delivering experiences that feel intelligent and personal. It also means the agent retains temporal information, such as when something was said, alongside the facts a user would expect it to know.
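Praxos outperforms the full-context gpt-4o baseline in every category except knowledge-update, where it trails by 1.3 percentage points. The largest gains are in single-session-preference, temporal-reasoning, and multi-session: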
| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) | Builder Benefit |
| --- | --- | --- | --- | --- |
| single-session-preference | 80.00% | 20.00% | +300% | Build agents that truly learn user preferences. If a user says they prefer concise summaries during a conversation, your agent remembers and adapts its behavior instantly and consistently. |
| temporal-reasoning | 62.40% | 45.10% | +38.36% | Create time-aware agents. Your agent can understand "what did we discuss last Tuesday?" or "remind me two weeks from now," so you can do more sophisticated schedule-based actions, or even just improve the conversation experience. |
| multi-session | 59.30% | 44.30% | +33.86% | Eliminate repetitive interactions. Build agents that remember context across conversations days or weeks apart, providing a seamless and intelligent user experience without asking the same questions over and over. |
| single-session-user | 88.60% | 81.40% | +8.83% | Ensure high-fidelity recall. Your agent reliably remembers specific facts and details a user has provided within a conversation, building trust and demonstrating attentiveness. |
| single-session-assistant | 96.40% | 94.60% | +1.90% | Maintain conversational consistency. The agent remembers what it has said, preventing self-contradiction and ensuring logical, coherent dialogue. |
| knowledge-update | 76.90% | 78.20% | -1.66% | Praxos performs on par with the best models at updating its knowledge with new facts, ensuring your agent can be corrected and stay up-to-date. |
| Weighted Aggregate | 71.80% | ~60.6% | +17.49% | Build fundamentally better memory-powered applications. Across the board, Praxos delivers a superior, more reliable memory capability than simply using a large context window with a leading model. |
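The Delta (%) column reads as a relative change over the baseline, not a difference in percentage points: 80.00% against 20.00% is a 60-point gap but a +300% relative gain. The short Python sketch below recomputes both views from the rounded scores in the table. The function and variable names are illustrative assumptions, not part of any Praxos API.

```python
# Sketch: recompute relative change and percentage-point difference
# from the rounded per-category scores shown in the table above.
# All names here are illustrative; nothing below is a Praxos API.

def relative_change_pct(score: float, baseline: float) -> float:
    """Relative change of `score` over `baseline`, expressed in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (score - baseline) / baseline * 100.0


def point_difference(score: float, baseline: float) -> float:
    """Absolute difference in percentage points."""
    return score - baseline


# (Praxos score, full-context gpt-4o score), both in percent, copied from the table.
SCORES = {
    "single-session-preference": (80.00, 20.00),
    "temporal-reasoning": (62.40, 45.10),
    "multi-session": (59.30, 44.30),
    "single-session-user": (88.60, 81.40),
    "single-session-assistant": (96.40, 94.60),
    "knowledge-update": (76.90, 78.20),
}

for name, (praxos, baseline) in SCORES.items():
    rel = relative_change_pct(praxos, baseline)
    pts = point_difference(praxos, baseline)
    print(f"{name:<26} {rel:+8.2f}% relative  {pts:+6.2f} points")
```

Values recomputed this way can differ in the last digits from the published Delta column, for instance if that column was derived from unrounded scores.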
Superior accuracy, however, is only half the story. The other half of the Praxos advantage lies in efficiency gains.
Our advanced memory architecture achieves these benchmark-beating results while reducing context window usage by more than 90%.
What This Means for You as a Builder:
Massive Cost Savings: Sending far fewer context tokens with every interaction drastically lowers your token costs, making it feasible to deploy sophisticated agents at scale.
Lower Latency: A shorter context gives the model less to process on every call. You can build agents that respond more quickly, leading to a much better user experience.
Greater Complexity: By offloading memory management from the context window, you free up valuable space for more complex instructions, tools, and real-time data, allowing you to build more powerful and capable agents.
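To get a feel for the cost effect, the Python sketch below estimates input-token spend with and without the context reduction. It is a back-of-the-envelope model, not a pricing tool: the token count, price, and interaction volume are hypothetical placeholders, the 0.90 reduction is the lower bound of the "more than 90%" figure above, and only context (input) tokens are modeled.

```python
# Sketch: estimate input-token cost before and after a context reduction.
# Every number passed in the example call is a hypothetical placeholder.
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextCostEstimate:
    full_context_cost: float      # cost when the whole history is sent
    reduced_context_cost: float   # cost after the context reduction

    @property
    def savings_fraction(self) -> float:
        """Fraction of input-token cost saved (0.0 to 1.0)."""
        if self.full_context_cost == 0:
            return 0.0
        return 1.0 - self.reduced_context_cost / self.full_context_cost


def estimate_context_cost(
    full_context_tokens: int,
    context_reduction: float,
    price_per_million_tokens: float,
    interactions: int = 1,
) -> ContextCostEstimate:
    """Compare input-token cost for full context against a reduced context."""
    if not 0.0 <= context_reduction <= 1.0:
        raise ValueError("context_reduction must be between 0 and 1")
    per_token = price_per_million_tokens / 1_000_000
    reduced_tokens = full_context_tokens * (1.0 - context_reduction)
    return ContextCostEstimate(
        full_context_cost=full_context_tokens * per_token * interactions,
        reduced_context_cost=reduced_tokens * per_token * interactions,
    )


# Hypothetical workload: 100k context tokens per call, 1,000 calls,
# and an assumed price of 2.50 per million input tokens.
estimate = estimate_context_cost(
    full_context_tokens=100_000,
    context_reduction=0.90,
    price_per_million_tokens=2.50,
    interactions=1_000,
)
print(f"full context:    {estimate.full_context_cost:.2f}")
print(f"reduced context: {estimate.reduced_context_cost:.2f}")
print(f"savings:         {estimate.savings_fraction:.0%}")
```

Because the price cancels out of the ratio, the saved fraction of input-token cost equals the context reduction itself, whatever rate your model provider charges. Output tokens are unaffected and would need to be added separately for a full cost picture.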