CodebaseLM
← Back to home
Security

Security & Data Handling

Last updated: 2026-05-18

CodebaseLM sends your spoken questions and parts of your repository to AI providers in order to produce a tour. This page states honestly what each provider sees, what CodebaseLM itself stores, and what we don’t. Where a provider retains data by default and we have not yet opted out, we say so plainly.

1. What each provider sees and retains

ProviderWhat it seesDefault retentionOur setting
Deepgram (STT)
Speech-to-text
Your voice and the question you spoke aloudUp to 30 daysZero retention
Set via mip_opt_out=true on every STT connection.
Deepgram (TTS)
Text-to-speech (alternate)
The narration text the LLM produced (may quote code), when Deepgram is in the TTS chain (TTS_PROVIDERS env includes deepgram)Up to 30 daysZero retention
Set via mip_opt_out=true on every TTS connection (both the persistent and ephemeral call paths).
Azure OpenAI
Language model + embeddings
As LLM (when active per llm_provider): your repository content and the question prompt.

As embeddings provider (whenever RETRIEVAL_ENABLED=true, regardless of which LLM is active): the raw text of every indexed code chunk, and the text of your question, sent to Azure’s embeddings endpoint to power retrieval search. This path runs even when Claude is the active language model.
30 days for abuse monitoring (Microsoft default). Plus a separate 24-hour provider-side prompt cache that we enable via prompt_cache_retention: "24h" only on public-repo LLM calls, so the repo-context prefix can be re-used cheaply across requests. For private repos the cache fields are omitted — private source code in the prompt is never written to Azure’s 24h prompt cache. Embedding calls do not use the prompt cache.MIM opt-out pending
Application submitted to Microsoft; approval timeline is outside our control. The 24h prompt cache stays on for public-repo LLM calls (cost reasons; it lives in Microsoft’s infrastructure under their default abuse-logging policy) but is disabled on private-repo calls so private source code is never cached on Azure’s side. When MIM approval lands, it applies to both the LLM and embeddings endpoints.
Azure HD TTS
Text-to-speech
The narration text the LLM produced (may quote code)Zero, per Microsoft real-time synthesis contractZero retention
Default behavior of the real-time synthesis API; no configuration required.
Anthropic Claude
Language model (alternate)
Your repository content and the question prompt when the operator selects Claude as the active language model (per the llm_provider system config or LLM_PROVIDER env). When Claude is active, Anthropic sees 100% of repo prompts and chat questions; otherwise Anthropic sees nothing. The two LLMs are mutually exclusive at chat time. Note that Azure embeddings still run independently when retrieval is enabled, regardless of which LLM is active — see the Azure OpenAI row above.30 days for API customers. Plus a separate 1-hour ephemeral prompt cache that we explicitly enable via cache_control markers (TTL 1h) on system, repo-context, and tool blocks, so the prefix can be re-used cheaply across requests.Default
No Zero-Data-Retention tier is available on CodebaseLM’s Anthropic tier today. The 1h prompt cache stays on for cost reasons; it lives in Anthropic’s infrastructure under their default policy.

2. What CodebaseLM stores

In our Postgres database

On our server's disk

In our TTS audio cache

3. What CodebaseLM does NOT store

4. What we do NOT do

5. In-flight privacy improvements

6. Questions

For privacy questions email privacy@codebaselm.ai. See also the full Privacy Policy and Terms of Service.