Security & Data Handling
Last updated: 2026-05-18
CodebaseLM sends your spoken questions and parts of your repository to AI providers in order to produce a tour. This page states honestly what each provider sees, what CodebaseLM itself stores, and what it does not store. Where a provider retains data by default and we have not yet opted out, we say so plainly.
1. What each provider sees and retains
| Provider | What it sees | Default retention | Our setting |
|---|---|---|---|
| Deepgram (STT): speech-to-text | Your voice and the question you spoke aloud | Up to 30 days | Zero retention. Set via `mip_opt_out=true` on every STT connection. |
| Deepgram (TTS): text-to-speech (alternate) | The narration text the LLM produced (may quote code), when Deepgram is in the TTS chain (the `TTS_PROVIDERS` env includes `deepgram`) | Up to 30 days | Zero retention. Set via `mip_opt_out=true` on every TTS connection (both the persistent and ephemeral call paths). |
| Azure OpenAI: language model + embeddings | As LLM (when active per `llm_provider`): your repository content and the question prompt. As embeddings provider (whenever `RETRIEVAL_ENABLED=true`, regardless of which LLM is active): the raw text of every indexed code chunk, and the text of your question, sent to Azure’s embeddings endpoint to power retrieval search. This path runs even when Claude is the active language model. | 30 days for abuse monitoring (Microsoft default). Plus a separate 24-hour provider-side prompt cache that we enable via `prompt_cache_retention: "24h"` only on public-repo LLM calls, so the repo-context prefix can be re-used cheaply across requests. For private repos the cache fields are omitted, so private source code in the prompt is never written to Azure’s 24h prompt cache. Embedding calls do not use the prompt cache. | MIM opt-out pending. Application submitted to Microsoft; approval timeline is outside our control. The 24h prompt cache stays on for public-repo LLM calls (cost reasons; it lives in Microsoft’s infrastructure under their default abuse-logging policy) but is disabled on private-repo calls so private source code is never cached on Azure’s side. When MIM approval lands, it applies to both the LLM and embeddings endpoints. |
| Azure HD TTS: text-to-speech | The narration text the LLM produced (may quote code) | Zero, per Microsoft real-time synthesis contract | Zero retention. Default behavior of the real-time synthesis API; no configuration required. |
| Anthropic Claude: language model (alternate) | Your repository content and the question prompt when the operator selects Claude as the active language model (per the `llm_provider` system config or `LLM_PROVIDER` env). When Claude is active, Anthropic sees 100% of repo prompts and chat questions; otherwise Anthropic sees nothing. The two LLMs are mutually exclusive at chat time. Note that Azure embeddings still run independently when retrieval is enabled, regardless of which LLM is active. See the Azure OpenAI row above. | 30 days for API customers. Plus a separate 1-hour ephemeral prompt cache that we explicitly enable via `cache_control` markers (TTL 1h) on system, repo-context, and tool blocks, so the prefix can be re-used cheaply across requests. | Default. No Zero-Data-Retention tier is available on CodebaseLM’s Anthropic tier today. The 1h prompt cache stays on for cost reasons; it lives in Anthropic’s infrastructure under their default policy. |
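
The per-provider settings in the table can be pictured as three small request-building helpers. This is a minimal sketch, not CodebaseLM's actual code: the function names, the Deepgram endpoint constant, and the way options are merged are assumptions made for illustration. Only `mip_opt_out=true`, `prompt_cache_retention: "24h"`, and the 1-hour `cache_control` marker come from the table above.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; the page does not state which URL is used.
DEEPGRAM_STT_URL = "wss://api.deepgram.com/v1/listen"


def deepgram_stt_url(params: dict[str, str]) -> str:
    """Return an STT connection URL that always carries the opt-out flag."""
    # The flag is merged last, so it wins even if the caller passed another value.
    merged = {**params, "mip_opt_out": "true"}
    return f"{DEEPGRAM_STT_URL}?{urlencode(merged)}"


def azure_cache_fields(is_private_repo: bool) -> dict[str, str]:
    """Cache fields for an Azure OpenAI LLM call; empty for private repos."""
    if is_private_repo:
        # Fields are omitted entirely, so private code never enters the 24h cache.
        return {}
    return {"prompt_cache_retention": "24h"}


def anthropic_cached_block(text: str) -> dict:
    """A text block marked for Anthropic's 1-hour ephemeral prompt cache."""
    return {
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }
```

Forcing the opt-out flag inside the URL builder, rather than at each call site, is one way to make "every connection" hard to get wrong when a new call path is added.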
2. What CodebaseLM stores
In our Postgres database
- Account record: email, name, the OAuth provider you signed in with (Google or GitHub), and the provider-side account id
- OAuth tokens from your identity provider: `access_token`, `refresh_token`, `id_token`, granted `scope`, `expires_at`, and related fields. Stored on the `Account` row that NextAuth/Auth.js creates when you sign in. Used to make authenticated requests on your behalf (e.g. reading your private repos when that ships).
- Tour records: a per-tour row with an id, your user id, the GitHub owner and repo, a timestamp, and a count of questions asked. The question-and-answer conversation itself is not persisted in this table.
- Billing records, for paid subscribers
- Code chunks for retrieval (when retrieval indexing is enabled on a deployment): stored in the `repo_chunks` table. These are byte ranges of files considered relevant for LLM context, keyed by repository and commit SHA. They are not linked to any individual user. They persist indefinitely once indexed; we currently have no automatic eviction, and prune them manually when the chunker or embedding version changes.
- `GitHubInstallation`: per-user GitHub App installation metadata (`installationId`, `installedAt`, and optional `deletedAt`/`suspendedAt`). No tokens are stored on this row; short-lived installation access tokens are minted on demand from our App’s private key and held in server memory only. (Layer 2a: GitHub App install support.)
- NextAuth `Account` rows for `provider="github"`: created when a user links their GitHub identity during the Connect flow. These rows carry the user’s GitHub OAuth token in `access_token`, inheriting the same plaintext-at-rest exposure called out under “In-flight privacy improvements” below for Google OAuth tokens. Cleanup is tracked in a separate security-hardening spec. (Layer 2a: Connect-GitHub identity verification.)
- `PrivateTourToken`: one row per private tour. Columns: `token` (a 22-character opaque URL identifier), `tourId` (FK to `Tour`), `userId` (FK to `User`, retained as creator metadata only, not used as an ACL), `createdAt`, and `lastViewedAt`. Lets a private-tour URL like `/t/abc123` resolve to (owner, repo) on the server without exposing the repository name in the URL bar, browser history, PostHog `$pageview` events, or referrer headers. Cascade-deletes with the `Tour` row. (Layer 2b-i: private-repo URL hardening.)
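
To illustrate what a 22-character opaque URL identifier can look like, here is a minimal sketch. It assumes the token is 16 random bytes encoded as unpadded base64url; the page does not say how `PrivateTourToken.token` is actually generated, and both function names are hypothetical.

```python
import secrets
import string

URL_SAFE_ALPHABET = set(string.ascii_letters + string.digits + "-_")


def new_private_tour_token() -> str:
    """Return a 22-character opaque, URL-safe token (128 random bits)."""
    # 16 random bytes encode to 22 base64url characters once padding is stripped.
    return secrets.token_urlsafe(16)


def looks_like_tour_token(value: str) -> bool:
    """Cheap shape check before a database lookup; this is not an access check."""
    return len(value) == 22 and set(value) <= URL_SAFE_ALPHABET
```

Because the token carries no owner or repository name, nothing about the repository appears in the URL even if the link is logged or shared.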
On our server's disk
- L1 repo cache at `${CACHE_DIR}/l1`: cached file lists and the file contents we read while generating your tour. The in-memory portion holds the 100 most-recently-used repositories (LRU). The on-disk portion is keyed by commit SHA, so a new commit produces a fresh file alongside the old one; on-disk entries have no automatic eviction and persist until we manually prune them.
- L2 LLM response cache at `${CACHE_DIR}/l2-${L2_CACHE_VERSION}`: cached LLM responses. The response payload includes the narration text and any code snippets the model quoted back. Any first turn of a fresh session that satisfies all of the following triggers a write: no diagram open, not asking about a specific diagram node, retrieval state is stable, the stream completed without truncation, and at least one segment was generated. So custom first questions can be cached too, not only the canonical auto-intro. Up to 200 entries in memory (LRU); on disk, entries are versioned by `L2_CACHE_VERSION` and sharded by prompt hash. Bumping the version orphans the prior generation but does not delete files; a manual disk sweep is a separate operator action.
- Node-explain cache at `${CACHE_DIR}/node-explain-v2`: when you click a node in the architecture diagram, CodebaseLM generates a per-node explanation and persists it so the next visitor who clicks the same node on the same commit gets an instant reply. Keyed by commit SHA, node id, and the prompt-input hash; entries are not linked to any individual user. Up to 200 in memory (LRU); on disk, entries persist until the cache version is bumped (a manual operator action), with no automatic eviction.
- BM25 index cache at `${CACHE_DIR}/bm25/${buildId}.bm25.json`: when retrieval indexing is enabled, the indexer also writes a BM25 keyword-search index alongside the embeddings. The serialized JSON includes the raw text of every indexed code chunk, so this is a second on-disk copy of the chunk text outside the `repo_chunks` Postgres table. Keyed by build id; one file per index build, no automatic eviction.
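
The L2 write conditions and the versioned, hash-sharded layout can be sketched as follows. This is an illustration under stated assumptions: the `FirstTurn` fields, the SHA-256 hash, the two-character shard directory, and the `.json` suffix are all hypothetical; the page only says entries are versioned by `L2_CACHE_VERSION` and sharded by prompt hash.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FirstTurn:
    """Hypothetical summary of the first turn of a fresh session."""
    diagram_open: bool
    targets_diagram_node: bool
    retrieval_stable: bool
    truncated: bool
    segment_count: int


def should_write_l2(turn: FirstTurn) -> bool:
    """True only when every condition listed for an L2 cache write holds."""
    return (
        not turn.diagram_open
        and not turn.targets_diagram_node
        and turn.retrieval_stable
        and not turn.truncated
        and turn.segment_count >= 1
    )


def l2_entry_path(cache_dir: str, version: str, prompt: str) -> Path:
    """Versioned, hash-sharded location for one cached response."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    # Assumed layout: <cache_dir>/l2-<version>/<first 2 hex chars>/<digest>.json
    return Path(cache_dir) / f"l2-{version}" / digest[:2] / f"{digest}.json"
```

Under this layout, a version bump changes the top-level directory name, which is why the previous generation is orphaned on disk rather than overwritten.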
In our TTS audio cache
- Synthesized narration audio (PCM) is kept in memory (LRU) for the duration of the server process so repeated playback of the same line does not re-synthesize. When the `AZURE_STORAGE_ACCOUNT` env var is configured, the same audio is also persisted to the Azure Blob container `tts-cache` (configurable via `AZURE_STORAGE_TTS_CONTAINER`). Entries are keyed by provider, voice, format, and the normalized narration text. Cache-version bumps are operator-driven (see `AZURE_TTS_MODEL_REV`); old blobs are not auto-evicted.
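
A cache key built from provider, voice, format, and normalized narration text might look like the sketch below. The normalization rule (Unicode NFC plus whitespace collapsing), the separator, and the SHA-256 digest are assumptions for illustration; the page does not specify how the text is normalized or how the key is encoded.

```python
import hashlib
import unicodedata


def normalize_narration(text: str) -> str:
    """Assumed normalization: NFC, collapsed whitespace, trimmed ends."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def tts_cache_key(provider: str, voice: str, audio_format: str, text: str) -> str:
    """Derive a stable key from provider, voice, format, and normalized text."""
    parts = [provider, voice, audio_format, normalize_narration(text)]
    # A unit-separator join keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Note that a key of this kind is derived from the narration text alone, not from the user who triggered synthesis, which is consistent with the audio being shared across repeated playback of the same line.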
3. What CodebaseLM does NOT store
- Your spoken-question audio: Deepgram processes it to text and does not retain audio or transcript (`mip_opt_out=true`)
- Your spoken-form questions as text: accepted by the WebSocket and processed in memory only; not written to any persistent store on our side
4. What we do NOT do
- Train models on your code, your questions, or your tour transcripts
- Sell or share your data with parties beyond the processors listed above
- Access your repositories for any purpose other than generating the tour you requested
5. In-flight privacy improvements
- Azure OpenAI Modified Abuse Monitoring (MIM) opt-out — application submitted; pending Microsoft approval. When approved, the “Our setting” cell for Azure OpenAI above will change from “pending” to “Zero retention.”
- Automatic eviction across all six stores (L1 repo cache, L2 LLM response cache, node-explain cache, BM25 index cache, TTS audio cache, and `repo_chunks`): planned. Until that ships, all six are pruned manually when versions change.
- Bring-your-own LLM key: planned for a future version. Lets customers route their tours through their own Azure OpenAI deployment.
- SOC 2 Type II — on our roadmap; not yet started.
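
Until automatic eviction ships, a manual prune of the L2 response cache could start from a dry-run listing of orphaned generations. This is a hedged sketch only: the function name is hypothetical, it assumes prior generations sit next to the current one as `l2-<version>` directories under `CACHE_DIR`, and it deliberately deletes nothing.

```python
from pathlib import Path


def orphaned_l2_generations(cache_dir: str, current_version: str) -> list[Path]:
    """List l2-<version> directories that do not match the current version."""
    keep = f"l2-{current_version}"
    root = Path(cache_dir)
    # Dry run: callers decide what, if anything, to remove.
    return sorted(
        p for p in root.glob("l2-*") if p.is_dir() and p.name != keep
    )
```

Keeping the listing separate from deletion lets an operator review what would be removed before anything is touched.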
6. Questions
For privacy questions, email privacy@codebaselm.ai. See also the full Privacy Policy and Terms of Service.