Security & Data Handling

Last updated: 2026-06-03

CodebaseLM sends your spoken questions and parts of your repository to AI providers in order to produce a tour. This page states what each provider sees, what CodebaseLM itself stores, and what it does not store. Where a provider retains data by default and we have not yet opted out, we say so plainly.

1. What each provider sees and retains

Provider	What it sees	Default retention	Our setting
Deepgram (STT): speech-to-text	Your voice and the question you spoke aloud	Up to 30 days	Zero retention. Set via `mip_opt_out=true` on every STT connection.
Deepgram (TTS): text-to-speech (alternate)	The narration text the LLM produced (may quote code), when Deepgram is in the TTS chain (the `TTS_PROVIDERS` env var includes `deepgram`)	Up to 30 days	Zero retention. Set via `mip_opt_out=true` on every TTS connection (both the persistent and ephemeral call paths).
Azure OpenAI: language model + embeddings	As LLM (when active per `llm_provider`): your repository content and the question prompt. As embeddings provider (whenever `RETRIEVAL_ENABLED=true`, regardless of which LLM is active): the raw text of every indexed code chunk and the text of your question, sent to Azure’s embeddings endpoint to power retrieval search. This path runs even when Claude is the active language model.	30 days for abuse monitoring (Microsoft default). Separately, a 24-hour provider-side prompt cache, which we enable via `prompt_cache_retention: "24h"` on public-repo LLM calls only, so the repo-context prefix can be reused cheaply across requests. For private repos the cache fields are omitted, so private source code in the prompt is never written to Azure’s 24h prompt cache. Embedding calls do not use the prompt cache.	MIM opt-out pending. Application submitted to Microsoft; the approval timeline is outside our control. The 24h prompt cache stays on for public-repo LLM calls (for cost reasons; it lives in Microsoft’s infrastructure under their default abuse-logging policy) but is disabled on private-repo calls, so private source code is never cached on Azure’s side. When MIM approval lands, it applies to both the LLM and embeddings endpoints.
Azure HD TTS: text-to-speech	The narration text the LLM produced (may quote code)	Zero, per Microsoft’s real-time synthesis contract	Zero retention. Default behavior of the real-time synthesis API; no configuration required.
Anthropic Claude: language model (alternate)	Your repository content and the question prompt when the operator selects Claude as the active language model (per the `llm_provider` system config or the `LLM_PROVIDER` env var). When Claude is active, Anthropic sees all repo prompts and chat questions; otherwise Anthropic sees nothing. The two LLMs are mutually exclusive at chat time. Azure embeddings still run independently when retrieval is enabled, regardless of which LLM is active (see the Azure OpenAI row above).	30 days for API customers. Separately, a 1-hour ephemeral prompt cache, which we explicitly enable via `cache_control` markers (TTL `1h`) on system, repo-context, and tool blocks, so the prefix can be reused cheaply across requests.	Default. No Zero-Data-Retention option is available on CodebaseLM’s Anthropic tier today. The 1h prompt cache stays on for cost reasons; it lives in Anthropic’s infrastructure under their default policy.
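
As an illustration of two settings in the table above, here is a minimal TypeScript sketch. The helper names (`withMipOptOut`, `promptCacheFields`) and the example URL are hypothetical and are not taken from CodebaseLM's code. Only the `mip_opt_out=true` flag, the `prompt_cache_retention: "24h"` field, and the public/private split come from the table.

```typescript
// Hypothetical helpers illustrating the "Our setting" column.
// Function names and the example URL are assumptions.

/** Adds the Deepgram opt-out flag to a connection URL (STT or TTS). */
function withMipOptOut(connectionUrl: string): string {
  const url = new URL(connectionUrl);
  // Existing query parameters are preserved; the flag is added or overwritten.
  url.searchParams.set("mip_opt_out", "true");
  return url.toString();
}

/** Extra request fields for an Azure OpenAI LLM call. */
function promptCacheFields(
  isPrivateRepo: boolean
): { prompt_cache_retention?: "24h" } {
  // Private repos: omit the cache fields entirely, so the prompt is not
  // written to the provider-side 24h cache. Public repos: opt in.
  return isPrivateRepo ? {} : { prompt_cache_retention: "24h" };
}

// Usage (placeholder host, not a real endpoint):
const sttUrl = withMipOptOut("wss://example.test/v1/listen?model=x");
const llmRequest = { messages: [], ...promptCacheFields(true) };
```

Spreading an empty object for private repos keeps the field out of the request body altogether, which matches the "cache fields are omitted" wording more closely than sending a disabled value would.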

2. What CodebaseLM stores

In our Postgres database

Account record: email, name, the OAuth provider you signed in with (Google or GitHub), and the provider-side account id
Linked OAuth providers (no token retention): CodebaseLM’s Account table tracks which OAuth providers a user has linked (Google, GitHub) but does NOT retain provider OAuth tokens. The token from each sign-in is used inline to verify the user’s email (Google: via the OIDC email_verified claim; GitHub: via a re-fetch of /user/emails requiring primary && verified), then discarded. Any subsequent GitHub repo API call uses an installation token minted via the GitHub App, never the user OAuth token. (Layer 2b-ii: data-at-rest hardening.)
Tour records: a per-tour row with an id, your user id, the GitHub owner and repo, a timestamp, and a count of questions asked. The question-and-answer conversation itself is not persisted in this table.
Billing records, for paid subscribers
Code chunks for retrieval (when retrieval indexing is enabled on a deployment) — stored in the repo_chunks table. These are byte ranges of files considered relevant for LLM context, keyed by repository and commit SHA. They are not linked to any individual user. They persist indefinitely once indexed; we currently have no automatic eviction, and prune them manually when the chunker or embedding version changes.
GitHubInstallation — per-user GitHub App installation metadata: installationId, installedAt, and optional deletedAt/suspendedAt. No tokens are stored on this row — short-lived installation access tokens are minted on demand from our App’s private key and held in server memory only. (Layer 2a — GitHub App install support.)
NextAuth Account rows for provider="github", created when a user links their GitHub identity during the Connect flow. These rows record the linkage (provider, provider account id, granted scope, token expiry) but do NOT persist the user’s GitHub OAuth token. The token is consumed inline during sign-in to re-fetch /user/emails for primary && verified confirmation, then discarded. Subsequent GitHub repo API calls use a short-lived installation token from the GitHub App, never the user OAuth token. (Layer 2a: Connect-GitHub identity verification; Layer 2b-ii: data-at-rest hardening.)
PrivateTourToken — one row per private tour. Columns: token (a 22-character opaque URL identifier), tourId (FK to Tour), userId (FK to User, retained as creator-metadata only — not used as an ACL), createdAt, and lastViewedAt. Lets a private-tour URL like /t/abc123 resolve to (owner, repo) on the server without exposing the repository name in the URL bar, browser history, PostHog $pageview events, or referrer headers. Cascade-deletes with the Tour row. (Layer 2b-i — private-repo URL hardening.)
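
Two details from the list above can be sketched in TypeScript: the "primary && verified" check on GitHub's /user/emails response, and a 22-character opaque token. This is an illustration under assumptions, not CodebaseLM's implementation. The function names are hypothetical, and the token encoding (16 random bytes as unpadded base64url) is one way to get exactly 22 characters; the source only states the length.

```typescript
import { randomBytes } from "node:crypto";

// One entry from GitHub's GET /user/emails response (fields used here only).
interface GitHubEmail {
  email: string;
  primary: boolean;
  verified: boolean;
}

/** Returns the address that is both primary and verified, or null if none is. */
function primaryVerifiedEmail(emails: GitHubEmail[]): string | null {
  const match = emails.find((e) => e.primary && e.verified);
  return match ? match.email : null;
}

/**
 * Assumed encoding: 16 random bytes (128 bits) as base64url without padding,
 * which always yields 22 URL-safe characters.
 */
function newPrivateTourToken(): string {
  return randomBytes(16).toString("base64url");
}
```

Because the token is random rather than derived from the owner or repo name, a URL such as /t/abc123 reveals nothing about the repository it resolves to.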

Private repositories: cached durably, but isolated and yours alone.

For a private repo, code and code-derived content — file contents, LLM narration, per-node explanations, audit results, retrieval chunks, and synthesized narration audio — is cached so repeat visits are fast. It is stored under per-owner hashed cache paths on encrypted infrastructure, accessible only via your GitHub App install-grant, never used to train AI, and deleted when you remove the repo from the install, uninstall the app, or delete your account. The cache locations described below are the same stores used for public repositories.
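
A per-owner hashed cache path could look like the following TypeScript sketch. The function name, the choice of SHA-256, and the directory layout are all assumptions made for illustration; the text above only says that paths are hashed per owner.

```typescript
import { createHash } from "node:crypto";
import { join } from "node:path";

/**
 * Hypothetical helper: builds a cache path whose first segment is a hash of
 * the owner identifier, so the raw identifier never appears on disk and two
 * owners never share a directory.
 */
function ownerScopedPath(
  cacheDir: string,
  ownerId: string,
  ...segments: string[]
): string {
  const ownerHash = createHash("sha256").update(ownerId).digest("hex");
  return join(cacheDir, ownerHash, ...segments);
}

// Usage with placeholder values:
const examplePath = ownerScopedPath("/var/cache/app", "owner-123", "l1", "abc123.json");
```

Scoping everything an owner caches under one hashed directory also makes deletion simple in this sketch: removing that directory removes the owner's cached content.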

On our server's disk

L1 repo cache at ${CACHE_DIR}/l1: cached file lists and the file contents we read while generating your tour. The in-memory portion holds the 100 most-recently-used repositories (LRU). The on-disk portion is keyed by commit SHA, so a new commit produces a fresh file alongside the old one; on-disk entries have no automatic eviction and persist until we manually prune them.
L2 LLM response cache at ${CACHE_DIR}/l2-${L2_CACHE_VERSION}: cached LLM responses. The response payload includes the narration text and any code snippets the model quoted back. Any first turn of a fresh session that satisfies all of the following triggers a write: no diagram open, not asking about a specific diagram node, retrieval state is stable, the stream completed without truncation, and at least one segment was generated. So custom first questions can be cached too, not only the canonical auto-intro. Up to 200 entries in memory (LRU); on disk, entries are versioned by L2_CACHE_VERSION and sharded by prompt hash. Bumping the version orphans the prior generation but does not delete files — a manual disk sweep is a separate operator action.
Node-explain cache at ${CACHE_DIR}/node-explain-v2: when you click a node in the architecture diagram, CodebaseLM generates a per-node explanation and persists it so the next visitor who clicks the same node on the same commit gets an instant reply. Keyed by commit SHA, node id, and the prompt-input hash; entries are not linked to any individual user. Up to 200 in memory (LRU); on disk, entries persist until the cache version is bumped (a manual operator action), with no automatic eviction.
BM25 index cache at ${CACHE_DIR}/bm25/${buildId}.bm25.json: when retrieval indexing is enabled, the indexer also writes a BM25 keyword-search index alongside the embeddings. The serialized JSON includes the raw text of every indexed code chunk, so this is a second on-disk copy of the chunk text outside the repo_chunks Postgres table. Keyed by build id; one file per index build, no automatic eviction.
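
The in-memory LRU tiers and the L2 write conditions described above can be sketched as follows. This is a minimal TypeScript illustration, not CodebaseLM's code: the class, interface, and field names are hypothetical, and only the conditions themselves and the capacity limits (100 and 200 entries) come from the text.

```typescript
/** Small LRU built on Map, which iterates keys in insertion order. */
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}

// Hypothetical shape for the state checked before an L2 write.
interface TurnState {
  isFirstTurnOfFreshSession: boolean;
  diagramOpen: boolean;
  asksAboutDiagramNode: boolean;
  retrievalStateStable: boolean;
  streamTruncated: boolean;
  segmentCount: number;
}

/** True only when every L2 write condition listed above holds. */
function shouldWriteL2(t: TurnState): boolean {
  return (
    t.isFirstTurnOfFreshSession &&
    !t.diagramOpen &&
    !t.asksAboutDiagramNode &&
    t.retrievalStateStable &&
    !t.streamTruncated &&
    t.segmentCount >= 1
  );
}

// Usage: an in-memory tier sized like the L2 cache (200 entries).
const l2Memory = new LruCache<string>(200);
```

Note that eviction here applies to the in-memory tier only; as stated above, the on-disk copies are not evicted automatically.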

In our TTS audio cache

Synthesized narration audio (PCM) is kept in memory (LRU) for the duration of the server process so repeated playback of the same line does not re-synthesize. When the AZURE_STORAGE_ACCOUNT env var is configured, the same audio is also persisted to the Azure Blob container tts-cache (configurable via AZURE_STORAGE_TTS_CONTAINER). Entries are keyed by provider, voice, format, and the normalized narration text. Cache-version bumps are operator-driven (see AZURE_TTS_MODEL_REV); old blobs are not auto-evicted.
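
A cache key built from provider, voice, format, and normalized narration text might look like the TypeScript sketch below. The normalization rule (trim and collapse whitespace) and the use of SHA-256 are assumptions made for illustration; the text above does not specify either, and the function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

interface TtsKeyParts {
  provider: string;
  voice: string;
  format: string;
  text: string;
}

/** Assumed normalization: trim and collapse runs of whitespace to one space. */
function normalizeNarration(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}

/** Derives a fixed-length hex key from the four parts named above. */
function ttsCacheKey(parts: TtsKeyParts): string {
  // JSON-encoding the tuple avoids ambiguity between adjacent fields.
  const material = JSON.stringify([
    parts.provider,
    parts.voice,
    parts.format,
    normalizeNarration(parts.text),
  ]);
  return createHash("sha256").update(material).digest("hex");
}
```

With a key of this shape, the same line synthesized with a different voice or format gets its own entry, while cosmetic whitespace differences in the narration text map to the same cached audio.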

3. What CodebaseLM does NOT store

Your spoken-question audio — Deepgram processes it to text and does not retain audio or transcript (mip_opt_out=true)
Your spoken-form questions as text — accepted by the WebSocket and processed in memory only; not written to any persistent store on our side

4. What we do NOT do

Train models on your code, your questions, or your tour transcripts
Sell or share your data with parties beyond the processors listed above
Access your repositories for any purpose other than generating the tour you requested

5. In-flight privacy improvements

Azure OpenAI Modified Abuse Monitoring (MIM) opt-out — application submitted; pending Microsoft approval. When approved, the “Our setting” cell for Azure OpenAI above will change from “pending” to “Zero retention.”
Automatic eviction across all six stores (L1 repo cache, L2 LLM response cache, node-explain cache, BM25 index cache, TTS audio cache, and repo_chunks) — planned. Until that ships, all six are pruned manually when versions change.
Bring-your-own LLM key — planned for a future version. Lets customers route their tours through their own Azure OpenAI deployment.
SOC 2 Type II — on our roadmap; not yet started.

6. Questions

For privacy questions, email privacy@codebaselm.ai. See also the full Privacy Policy and Terms of Service.