Does SMOS work with cloud LLMs like OpenAI and Claude?

Yes. SMOS is an OpenAI-compatible proxy. Point it at any upstream — local llama.cpp, OpenAI, OpenRouter, vLLM. Run fully local for privacy, or use your existing cloud provider.

No. The three local models (extraction, embeddings, reranking) are tiny — the largest is 4B parameters. They run on a laptop CPU with integrated graphics.

How is this different from mem0 or Letta?

mem0 and Letta are tools the agent must decide to call. SMOS is a transparent proxy: every response is mined for facts automatically. The agent is not involved in saving. Additionally, SMOS requires zero external databases — it is a single binary with an embedded database.

What about MCP servers?

SMOS doesn't need one. Because it's a proxy (not a tool), there's no MCP server to configure. Point your AI client's base URL at SMOS and you're done.

Is it production-ready?

SMOS is a young project (v0.1.7). The architecture is production-oriented — hexagonal DDD, compile-enforced layering, fail-open contract, 665+ tests. But it has not been battle-tested at scale yet.

Can I use it with multiple agents?

Yes. Each "person" in SMOS is a memory namespace. Create Bob for Rust, Alice for ML, Charlie for DevOps — each isolated. One SMOS instance serves N assistants.

SMOSv0.1.7

OpenAI-Compatible Memory Proxy in Rust

SMOS — Semantic Memory Operating System

Memory IS the API. Not a tool.

An OpenAI-compatible proxy that gives any AI coding agent persistent long-term memory — without code changes, without an MCP server, without a framework.

Get Started GitHub

MIT LicenseRust 1.96 / Edition 2024Self-hosted~5 GB disk, no GPU required

$npm install -g @yurvon_screamo/smos

$smos init # downloads ~4 GB of tiny local models

$smos serve # starts on http://localhost:8888

# Point Cursor at http://localhost:8888/v1

# Use "bob" as the model name.

# Your assistant now remembers across sessions.

The Problem

Every new chat starts from scratch

Open a new chat in Cursor and your assistant starts from scratch. Switch to Claude Code or opencode and you re-explain why the cache TTL is 10 seconds, not 60 — your architecture, your conventions, every decision you already made.

The model is stateless. The tool is replaceable. The memory should not be.

Without SMOS	With SMOS
Every session: re-explain architecture	Bob knows your architecture from day one
Switch Cursor → Claude: context lost	Switch tools: Bob stays Bob
Agent must decide what to save	Every response mined automatically
Memory = a notebook the agent keeps	Memory = what actually happened

The Solution

A transparent proxy. Point your base URL at it. Done.

SMOS sits between your AI client and the upstream LLM. Every response is mined for facts automatically — the agent does nothing, the agent forgets nothing. Point any OpenAI-compatible client at SMOS and your assistant remembers across sessions, across tools, across model swaps.

Works with local llama.cpp, OpenAI, OpenRouter, vLLM — any OpenAI-compatible upstream. Run fully local for privacy, or point it at your existing cloud provider.

Client

→

SMOS

→

upstream LLM (GPT-4o, Claude, local, …)

1ENRICH

Inject relevant facts from memory into the request

2FORWARD

Stream response back at full LLM speed

3EXTRACT

Mine the response for facts (after delivery)

4FINALIZE

DeBERTa NLI resolves merges and conflicts

Quick Start

Running in 3 commands

step 1

$npm install -g @yurvon_screamo/smos

or: cargo binstall smos

step 2

$smos init

# one-time: downloads ~4 GB

step 3

$smos serve

# starts on http://localhost:8888

Point Cursor, Claude Code, opencode, Cline, or Aider at http://localhost:8888/v1 and use "bob" as the model name.

One prerequisite: llama-server on your PATH. SMOS uses it to run three tiny models locally — no GPU, no API keys, no cloud bills. Prefer cloud? Skip llama-server and configure any OpenAI-compatible provider.

Why SMOS

Five things SMOS does differently

Memory is part of the API, not a tool

Every response is mined for facts automatically. The agent cannot forget to save, because the agent is not involved in saving. Extraction runs off the request path — zero added latency.

No external database

Embedded SurrealDB (RocksDB + HNSW vector index). No Postgres, no Neo4j, no Qdrant, no Docker. One binary, one directory.

Contradictions detected, not overwritten

A DeBERTa-v3 NLI model evaluates each merge candidate. Both sides of a contradiction are preserved and surfaced to the LLM — not silently overwritten.

Multi-persona isolation

Bob for Rust, Alice for ML, Charlie for DevOps — each a separate memory namespace. One SMOS instance, N isolated assistants.

Runs on any laptop

Three tiny local models (4 GB total) handle extraction, embeddings, and reranking on CPU. No GPU, no API keys, no cloud bills. Your conversations never leave your machine.

Comparison

How SMOS compares to other memory systems

A mem0 alternative with a different architecture — proxy vs. tool.

Feature	SMOS	mem0	Letta	Zep	Cognee
Architecture	Proxy (transparent)	Tool (agent calls)	Framework (runtime)	Tool + SaaS	Tool (pipeline)
External DB	None (embedded)	Qdrant / Postgres	Postgres + Redis	Neo4j (mandatory)	Neo4j + Postgres + vector
Code changes needed	None (change base URL)	Yes (SDK calls)	Yes (adopt runtime)	Yes (SDK calls)	Yes (pipeline API)
Self-hosted	Fully	Docker	Docker	Partial (SaaS-first)	Complex multi-DB
Multi-agent isolation	Built-in (personas)	user_id scoping	Per-agent identity	Per-user graphs	Namespaces
Contradiction handling	NLI detection + preserve	Picks winner	Agent self-edits	Temporal invalidation	None explicit
Language	Rust	Python	Python	Python	Python
License	MIT	Apache-2.0	Apache-2.0	Apache-2.0	Apache-2.0

Star counts and feature sets verified as of June 2026. Each project has different strengths — this table highlights architectural differences, not superiority.

Self-Hosted

No API keys. No cloud bills. No data leaving your machine.

SMOS runs three tiny local models — a 4B extraction LLM, an embedding model, and a reranker. The largest is 4B parameters. These run on a laptop CPU with integrated graphics.

No GPU required. No OpenAI API key. No monthly subscription. Your code, your conversations, your decisions — all stay local.

Prefer cloud? SMOS works with OpenAI, OpenRouter, vLLM, and any OpenAI-compatible provider. The choice is yours.

4B params

Largest local model. Runs on CPU.

No GPU

Tested on Intel integrated graphics.

No API keys

All inference stays on-device.

~4 GB total

Three models: extraction, embedding, reranker.

Research

Built on peer-reviewed research

SMOS is grounded in two recent papers on AI agent memory.

arxiv

MemoryOS

EMNLP 2025 Oral · Kang et al.

Hierarchical memory management for AI agents. SMOS adopts a similar lifecycle (pending → accepted → conflict-flagged) driven by NLI rather than hand-tuned heuristics.

MemoryOS paper

arxiv

The Price of Meaning

Ray Barman et al. · 2026

Proves that vector-only retrieval mathematically degrades through semantic interference. External verification is necessary. SMOS's DeBERTa-v3 NLI layer is that verification.

The Price of Meaning

FAQ

Common questions

Give your AI agent a memory.

Three commands. Five minutes. Your assistant remembers.

$npm install -g @yurvon_screamo/smos

Star on GitHub Read the docs