Memory is part of the API, not a tool
Every response is mined for facts automatically. The agent cannot forget to save, because the agent is not involved in saving. Extraction runs off the request path — zero added latency.
Memory IS the API. Not a tool.
An OpenAI-compatible proxy that gives any AI coding agent persistent long-term memory — without code changes, without an MCP server, without a framework.
The Problem
Open a new chat in Cursor and your assistant starts from scratch. Switch to Claude Code or opencode and you re-explain why the cache TTL is 10 seconds, not 60 — your architecture, your conventions, every decision you already made.
The model is stateless. The tool is replaceable. The memory should not be.
| Without SMOS | With SMOS |
|---|---|
| Every session: re-explain architecture | Bob knows your architecture from day one |
| Switch Cursor → Claude: context lost | Switch tools: Bob stays Bob |
| Agent must decide what to save | Every response mined automatically |
| Memory = a notebook the agent keeps | Memory = what actually happened |
The Solution
SMOS sits between your AI client and the upstream LLM. Every response is mined for facts automatically — the agent does nothing, the agent forgets nothing. Point any OpenAI-compatible client at SMOS and your assistant remembers across sessions, across tools, across model swaps.
Works with local llama.cpp, OpenAI, OpenRouter, vLLM — any OpenAI-compatible upstream. Run fully local for privacy, or point it at your existing cloud provider.
Inject relevant facts from memory into the request
Stream response back at full LLM speed
Mine the response for facts (after delivery)
DeBERTa NLI resolves merges and conflicts
Quick Start
Point Cursor, Claude Code, opencode, Cline, or Aider at http://localhost:8888/v1 and use "bob" as the model name.
One prerequisite: llama-server on your PATH. SMOS uses it to run three tiny models locally — no GPU, no API keys, no cloud bills. Prefer cloud? Skip llama-server and configure any OpenAI-compatible provider.
Why SMOS
Every response is mined for facts automatically. The agent cannot forget to save, because the agent is not involved in saving. Extraction runs off the request path — zero added latency.
Embedded SurrealDB (RocksDB + HNSW vector index). No Postgres, no Neo4j, no Qdrant, no Docker. One binary, one directory.
A DeBERTa-v3 NLI model evaluates each merge candidate. Both sides of a contradiction are preserved and surfaced to the LLM — not silently overwritten.
Bob for Rust, Alice for ML, Charlie for DevOps — each a separate memory namespace. One SMOS instance, N isolated assistants.
Three tiny local models (4 GB total) handle extraction, embeddings, and reranking on CPU. No GPU, no API keys, no cloud bills. Your conversations never leave your machine.
Comparison
A mem0 alternative with a different architecture — proxy vs. tool.
| Feature | SMOS | mem0 | Letta | Zep | Cognee |
|---|---|---|---|---|---|
| Architecture | Proxy (transparent) | Tool (agent calls) | Framework (runtime) | Tool + SaaS | Tool (pipeline) |
| External DB | None (embedded) | Qdrant / Postgres | Postgres + Redis | Neo4j (mandatory) | Neo4j + Postgres + vector |
| Code changes needed | None (change base URL) | Yes (SDK calls) | Yes (adopt runtime) | Yes (SDK calls) | Yes (pipeline API) |
| Self-hosted | Fully | Docker | Docker | Partial (SaaS-first) | Complex multi-DB |
| Multi-agent isolation | Built-in (personas) | user_id scoping | Per-agent identity | Per-user graphs | Namespaces |
| Contradiction handling | NLI detection + preserve | Picks winner | Agent self-edits | Temporal invalidation | None explicit |
| Language | Rust | Python | Python | Python | Python |
| License | MIT | Apache-2.0 | Apache-2.0 | Apache-2.0 | Apache-2.0 |
Star counts and feature sets verified as of June 2026. Each project has different strengths — this table highlights architectural differences, not superiority.
Self-Hosted
SMOS runs three tiny local models — a 4B extraction LLM, an embedding model, and a reranker. The largest is 4B parameters. These run on a laptop CPU with integrated graphics.
No GPU required. No OpenAI API key. No monthly subscription. Your code, your conversations, your decisions — all stay local.
Prefer cloud? SMOS works with OpenAI, OpenRouter, vLLM, and any OpenAI-compatible provider. The choice is yours.
Research
SMOS is grounded in two recent papers on AI agent memory.
EMNLP 2025 Oral · Kang et al.
Hierarchical memory management for AI agents. SMOS adopts a similar lifecycle (pending → accepted → conflict-flagged) driven by NLI rather than hand-tuned heuristics.
MemoryOS paperRay Barman et al. · 2026
Proves that vector-only retrieval mathematically degrades through semantic interference. External verification is necessary. SMOS's DeBERTa-v3 NLI layer is that verification.
The Price of MeaningFAQ
Three commands. Five minutes. Your assistant remembers.