OpenAI-Compatible Memory Proxy in Rust

SMOS — Semantic Memory Operating System

Memory IS the API. Not a tool.

An OpenAI-compatible proxy that gives any AI coding agent persistent long-term memory — without code changes, without an MCP server, without a framework.

MIT LicenseRust 1.96 / Edition 2024Self-hosted~5 GB disk, no GPU required
$npm install -g @yurvon_screamo/smos
$smos init # downloads ~4 GB of tiny local models
$smos serve # starts on http://localhost:8888
 
# Point Cursor at http://localhost:8888/v1
# Use "bob" as the model name.
# Your assistant now remembers across sessions.

The Problem

Every new chat starts from scratch

Open a new chat in Cursor and your assistant starts from scratch. Switch to Claude Code or opencode and you re-explain why the cache TTL is 10 seconds, not 60 — your architecture, your conventions, every decision you already made.

The model is stateless. The tool is replaceable. The memory should not be.

Without SMOSWith SMOS
Every session: re-explain architectureBob knows your architecture from day one
Switch Cursor → Claude: context lostSwitch tools: Bob stays Bob
Agent must decide what to saveEvery response mined automatically
Memory = a notebook the agent keepsMemory = what actually happened

The Solution

A transparent proxy. Point your base URL at it. Done.

SMOS sits between your AI client and the upstream LLM. Every response is mined for facts automatically — the agent does nothing, the agent forgets nothing. Point any OpenAI-compatible client at SMOS and your assistant remembers across sessions, across tools, across model swaps.

Works with local llama.cpp, OpenAI, OpenRouter, vLLM — any OpenAI-compatible upstream. Run fully local for privacy, or point it at your existing cloud provider.

Client
SMOS
upstream LLM (GPT-4o, Claude, local, …)
1ENRICH

Inject relevant facts from memory into the request

2FORWARD

Stream response back at full LLM speed

3EXTRACT

Mine the response for facts (after delivery)

4FINALIZE

DeBERTa NLI resolves merges and conflicts

Quick Start

Running in 3 commands

step 1
$npm install -g @yurvon_screamo/smos
or: cargo binstall smos
step 2
$smos init
# one-time: downloads ~4 GB
step 3
$smos serve
# starts on http://localhost:8888

Point Cursor, Claude Code, opencode, Cline, or Aider at http://localhost:8888/v1 and use "bob" as the model name.

One prerequisite: llama-server on your PATH. SMOS uses it to run three tiny models locally — no GPU, no API keys, no cloud bills. Prefer cloud? Skip llama-server and configure any OpenAI-compatible provider.

Why SMOS

Five things SMOS does differently

Memory is part of the API, not a tool

Every response is mined for facts automatically. The agent cannot forget to save, because the agent is not involved in saving. Extraction runs off the request path — zero added latency.

No external database

Embedded SurrealDB (RocksDB + HNSW vector index). No Postgres, no Neo4j, no Qdrant, no Docker. One binary, one directory.

Contradictions detected, not overwritten

A DeBERTa-v3 NLI model evaluates each merge candidate. Both sides of a contradiction are preserved and surfaced to the LLM — not silently overwritten.

Multi-persona isolation

Bob for Rust, Alice for ML, Charlie for DevOps — each a separate memory namespace. One SMOS instance, N isolated assistants.

Runs on any laptop

Three tiny local models (4 GB total) handle extraction, embeddings, and reranking on CPU. No GPU, no API keys, no cloud bills. Your conversations never leave your machine.

Comparison

How SMOS compares to other memory systems

A mem0 alternative with a different architecture — proxy vs. tool.

FeatureSMOSmem0LettaZepCognee
ArchitectureProxy (transparent)Tool (agent calls)Framework (runtime)Tool + SaaSTool (pipeline)
External DBNone (embedded)Qdrant / PostgresPostgres + RedisNeo4j (mandatory)Neo4j + Postgres + vector
Code changes neededNone (change base URL)Yes (SDK calls)Yes (adopt runtime)Yes (SDK calls)Yes (pipeline API)
Self-hostedFullyDockerDockerPartial (SaaS-first)Complex multi-DB
Multi-agent isolationBuilt-in (personas)user_id scopingPer-agent identityPer-user graphsNamespaces
Contradiction handlingNLI detection + preservePicks winnerAgent self-editsTemporal invalidationNone explicit
LanguageRustPythonPythonPythonPython
LicenseMITApache-2.0Apache-2.0Apache-2.0Apache-2.0

Star counts and feature sets verified as of June 2026. Each project has different strengths — this table highlights architectural differences, not superiority.

Self-Hosted

No API keys. No cloud bills. No data leaving your machine.

SMOS runs three tiny local models — a 4B extraction LLM, an embedding model, and a reranker. The largest is 4B parameters. These run on a laptop CPU with integrated graphics.

No GPU required. No OpenAI API key. No monthly subscription. Your code, your conversations, your decisions — all stay local.

Prefer cloud? SMOS works with OpenAI, OpenRouter, vLLM, and any OpenAI-compatible provider. The choice is yours.

4B params
Largest local model. Runs on CPU.
No GPU
Tested on Intel integrated graphics.
No API keys
All inference stays on-device.
~4 GB total
Three models: extraction, embedding, reranker.

Research

Built on peer-reviewed research

SMOS is grounded in two recent papers on AI agent memory.

arxiv

MemoryOS

EMNLP 2025 Oral · Kang et al.

Hierarchical memory management for AI agents. SMOS adopts a similar lifecycle (pending → accepted → conflict-flagged) driven by NLI rather than hand-tuned heuristics.

MemoryOS paper
arxiv

The Price of Meaning

Ray Barman et al. · 2026

Proves that vector-only retrieval mathematically degrades through semantic interference. External verification is necessary. SMOS's DeBERTa-v3 NLI layer is that verification.

The Price of Meaning

FAQ

Common questions

Give your AI agent a memory.

Three commands. Five minutes. Your assistant remembers.

$npm install -g @yurvon_screamo/smos
Built with v0