
Code Intelligence Roadmap

coding-ethos should grow from policy enforcement into a local code intelligence substrate for agents. The target is not a generic RAG database. The target is ETHOS-grounded code and remediation memory that helps agents find the right code, understand prior failures, and choose the enforced repair path before they run broad shell commands or repeat failed edits.

Design Position

Use one repo-local storage substrate with separate logical responsibilities:

Vector indexes are derived artifacts. They must be rebuildable from the canonical DuckDB store and the checked-out repository contents.

Goals

Non-Goals

Architecture

repository
  files
  traces
  SARIF
  policy bundle
      |
      v
Tree-sitter AST extraction
      |
      +--> DuckDB canonical store
      |      files
      |      symbols
      |      AST nodes
      |      graph edges
      |      decisions
      |      remediations
      |      attempts
      |      DuckDB term indexes
      |
      +--> duckdb-vss vector backend
             code/remediation vectors
             metadata-filtered similarity search
             pure-Go github.com/duckdb/duckdb-go/v2

Foundation Already In Place

The first enabling layer is the normalized evidence model in go/internal/evidence. It gives CEL, SARIF, traces, lint diagnostics, hook decisions, and future code intelligence one shared contract:

SARIF result properties now carry the normalized finding, source span, search text, remediation payload, and remediation events. Hook and lint traces carry a schema version, trace ID, normalized findings, remediation summaries, and remediation events. This makes code intelligence an ingestion problem instead of a second policy interpretation layer.

CEL inputs also expose code-intelligence fields under source, finding, and edit/diff facts. proposed_symbol_changes compares current and proposed Tree-sitter symbols for Edit/Write/MultiEdit actions, while changed_symbols maps staged diff hunks to the affected Tree-sitter symbols. Both surfaces report the file, language, node kind, symbol kind/name/path, line spans, content hashes, action (added, deleted, or modified), and line-count delta. This lets principle-owned CEL block growth of oversized functions, classes/types, shell functions, and YAML config entries while still allowing refactors that shrink large files.
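The growth rule described above can be sketched as a plain predicate. This is only an illustration of the logic in Python, not the project's CEL policy: the SymbolChange type, its field names, and the 80-line limit are hypothetical, and the new line count is assumed to be derived from the reported line spans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolChange:
    """Hypothetical, simplified view of one symbol-change fact."""
    symbol_path: str
    action: str          # "added", "deleted", or "modified"
    new_line_count: int  # assumed to come from the proposed line span
    line_delta: int      # proposed line count minus current line count


def blocks_growth(change, max_lines=80):
    """Return True when an oversized symbol would grow further.

    Shrinking refactors pass even if the symbol stays over the limit,
    which mirrors the "allow refactors that shrink" behavior.
    max_lines is an illustrative value, not a project default.
    """
    if change.action == "deleted":
        return False
    return change.new_line_count > max_lines and change.line_delta > 0
```

With this shape, a 120-line function that gains 6 lines is blocked, while the same function shrinking from 120 to 110 lines is allowed.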

Source-aware policy follows AST_CEL_SARIF_ARCHITECTURE.md: Go collects Tree-sitter facts, CEL evaluates configurable predicates, and SARIF reports stable AST-backed findings. Code-intel storage and MCP retrieval build on those same facts; they must not become a second parsing or policy interpretation path.

The AST layer follows the resolver pattern proven in ~/Active/pyqa_lint: language detection, parser binding, parser reuse, tree traversal, and line-to-nearest-context lookup live behind one Go entrypoint. The active resolver supports Go, Python, JavaScript/TypeScript, shell, YAML, JSON, and TOML. JSON and TOML are treated as first-class config-policy surfaces, so agents can retrieve precise config entries instead of reading whole config files. Markdown remains intentionally deferred until the project selects a maintained Go binding or a first-class adapter for its parser layout.

Tree-sitter-backed policy diagnostics carry AST metadata into SARIF result properties and partial fingerprints. Code scanning can therefore track the symbol-level finding across unrelated line movement instead of treating every nearby edit as a new whole-file violation.

Python policy enforcement now uses the same AST foundation for import and functional-idiom rules. The python.conditional_imports evaluator blocks write-time attempts to introduce nested imports, TYPE_CHECKING import branches, module __getattr__ shims, __import__, and importlib.import_module. The python.functional_idioms evaluator records assigned lambdas and closure factories so agents get principle-grounded advice to use functools, operator, or itertools helpers instead of ad-hoc closures.

The storage layer lives in go/internal/codeintel. It creates the canonical .coding-ethos/code-intel.duckdb DuckDB store, ingests retained lint and hook traces, stores normalized findings/remediations/remediation events, indexes Tree-sitter chunks, records SARIF-to-AST links, and builds DuckDB search-term tables over policy IDs, skill IDs, paths, messages, code chunks, and remediation text. duckdb-vss is active for derived vector rows, but DuckDB facts remain the auditable source of truth.

Graph facts expose provenance classes wherever repo maps, graph reports, and MCP graph surfaces show those facts. EXTRACTED marks parser/static-analysis facts, while GIT_DERIVED, POLICY_DERIVED, TRACE_DERIVED, and DOC_DERIVED mark deterministic derived evidence. INFERRED and AMBIGUOUS are advisory only and must not become enforcement inputs without a deterministic source fact behind them. Existing records without explicit provenance default to EXTRACTED so migrated indexes stay conservative.
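A minimal Python sketch of the provenance rules above, assuming provenance is stored as a plain string and that an empty or missing value means "no explicit provenance". The function names are hypothetical; only the class names come from this document.

```python
from enum import Enum


class Provenance(str, Enum):
    EXTRACTED = "EXTRACTED"
    GIT_DERIVED = "GIT_DERIVED"
    POLICY_DERIVED = "POLICY_DERIVED"
    TRACE_DERIVED = "TRACE_DERIVED"
    DOC_DERIVED = "DOC_DERIVED"
    INFERRED = "INFERRED"
    AMBIGUOUS = "AMBIGUOUS"


# Classes that are advisory only and must not feed enforcement directly.
ADVISORY_ONLY = frozenset({Provenance.INFERRED, Provenance.AMBIGUOUS})


def normalize_provenance(raw):
    """Map a stored value to a class; missing values default to EXTRACTED."""
    if raw is None or raw == "":
        return Provenance.EXTRACTED
    return Provenance(raw)  # raises ValueError for unknown classes


def may_feed_enforcement(raw):
    """True when the fact is parser-extracted or deterministically derived."""
    return normalize_provenance(raw) not in ADVISORY_ONLY
```

Unknown class names raise instead of being silently accepted, which keeps the check conservative.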

Code health scoring is another derived DuckDB surface. code-intel health and MCP code_intel_health compute deterministic refactoring rankings from indexed files/chunks, large files/functions, complex functions, complex conditionals, exact structural clones, git hotspots, co-change coupling, ownership risk, repeated lint/hook failures, and LCOV line coverage. Snapshots, targets, evidence rows, and bounded trend history are persisted so every score is explainable by biomarker and evidence ID. Refreshes persist repo-wide snapshots and apply path filters only when reading the ranked targets, so targeted review queries cannot replace the repository health baseline. Consumer repos can reweight or disable biomarkers by glob under code_intel.health in repo_config.yaml for generated, vendor, legacy, or test paths.

Session snapshots are exposed as code-intel session-snapshot and MCP code_intel_session_snapshot. They derive the stable coding_ethos.session.v1 contract from existing hook traces, proxy sessions/events/transforms, memory trace activity, and code-intel freshness metadata. Provider-specific details stay nested under provider.adapters, so agents can inspect current blockers and linked trace IDs without coupling to provider-specific event files or doing broad source reads.

Automatic output pruning applies the configured code_intel_db row-retention policy after high-volume code-intel writes. The default keeps 90 days of trace and proxy-event rows, leaves current AST/code chunks to the index refresh path, checkpoints the DuckDB store, and leaves DuckDB WAL files to checkpoint-managed cleanup instead of retention deletion. Use explicit output-prune maintenance with --vacuum when database compaction is needed.

Hook traces are also normalized into analytics tables. Each hook event stores provider, tool, status, tracking ID, operation kind, target kind, risk category, command and target-set fingerprints, runtime, rewrite state, and target paths. Each decision stores policy/skill IDs, implementation, severity, principle IDs, diagnostic counts, and message/suggestion variant hashes. This lets later analysis answer which hook checks create the most friction, which advice text reduces repeat violations, which operations are rewritten versus blocked, and which targets are frequently involved without reparsing raw provider payloads. Operator review rows can label a hook event as a correct block, false positive, unclear message, over-broad policy, or missing allow-list case.

Agent Proxy foundation data uses the same store. Proxy events are provider-neutral records for outbound provider calls, tool calls, file reads, file listings, payload injections, payload truncations, cache hits, and edit proposals. The session ledger stores request counts, file-read/listing counts, edit counts, cache hits, injection/truncation/denial counts, payload hashes, cache keys, trace/tracking IDs, direction, payload kind, DLP facts, policy evidence, token usage, payload byte counts, and ordered transform records. This is the storage substrate for future context-economy controls; it is not a separate shadow database. The trust boundary and event contract are documented in AGENT_PROXY.md.

Canonical DuckDB Store

The first implementation should create .coding-ethos/code-intel.duckdb with tables for:

Use the DuckDB term index for text, symbol, file path, policy, skill, command, and advice search. Text search is not a fallback; it is part of hybrid retrieval.

Initial command surface:

bin/coding-ethos-run code-intel ingest-traces
bin/coding-ethos-run code-intel stats
bin/coding-ethos-run code-intel hook-usage --risk-category bypass
bin/coding-ethos-run code-intel record-hook-review --trace-id hook-1 --disposition false_positive
bin/coding-ethos-run code-intel hook-reviews --disposition false_positive
bin/coding-ethos-run code-intel repeated-failures --policy-id python.unused_imports
bin/coding-ethos-run code-intel index-code pkg scripts config.yml
bin/coding-ethos-run code-intel git-signals --path pkg/app.go --paths pkg/app.go,pkg/store.go
bin/coding-ethos-run code-intel anatomy-map --path pkg --format toon
ls pkg | bin/coding-ethos-run code-intel enrich-listing --command 'ls pkg'
bin/coding-ethos-run code-intel code-chunks --path pkg/app.go --symbol-name BuildMessage
bin/coding-ethos-run code-intel repo-map --path pkg/app.go
bin/coding-ethos-run code-intel centrality --path pkg --format toon
bin/coding-ethos-run code-intel surprises --path pkg --format toon
bin/coding-ethos-run code-intel decisions add --title 'Use explicit startup' --rationale 'Startup should be inspectable.' --path pkg/app.go
bin/coding-ethos-run code-intel decisions import docs/decisions
bin/coding-ethos-run code-intel decisions list --path pkg/app.go --query startup
bin/coding-ethos-run code-intel decisions health --path pkg/app.go
bin/coding-ethos-run code-intel compact-context --path pkg/app.go
bin/coding-ethos-run code-intel ingest-sarif --file policy.sarif
bin/coding-ethos-run code-intel sarif-results --policy-id python.unused_imports
bin/coding-ethos-run code-intel proxy-file-read --session-id sess-1 --path pkg/app.go
bin/coding-ethos-run code-intel record-proxy-event --event-id evt-1 --session-id sess-1 --kind file_read --provider codex --target-path pkg/app.go
bin/coding-ethos-run code-intel proxy-sessions --provider codex
bin/coding-ethos-run code-intel proxy-events --session-id sess-1
bin/coding-ethos-run code-intel session-snapshot --session-id sess-1 --format toon
bin/coding-ethos-run code-intel remediation-outcomes --outcome repeated
bin/coding-ethos-run code-intel remediation-effectiveness --policy-id python.unused_imports
bin/coding-ethos-run code-intel embedding-candidates --record-kind remediation_outcome
bin/coding-ethos-run code-intel upsert-vector --id rem-1 --model-id voyage-code-3 --vector '0.1,0.2,0.3'
bin/coding-ethos-run code-intel embedding-records --backend duckdb-vss
bin/coding-ethos-run code-intel hybrid-search --text 'unused import' --model-id voyage-code-3 --vector '0.1,0.2,0.3'
bin/coding-ethos-run code-intel index-status --model-id voyage-code-3
bin/coding-ethos-run code-intel search --text 'unused import'

These commands read retained .coding-ethos traces and write only the repo-local .coding-ethos/code-intel.duckdb store.

Workspace commands are the exception to repo-local state because they model a parent directory or worktree family. They store only registry and derived workspace facts under .coding-ethos-workspace/ at the workspace root. Each registered repository keeps its own .coding-ethos/code-intel.duckdb store; workspace queries federate across those stores and annotate results with repo aliases instead of merging databases. The scan, add, remove, list, status, and refresh workspace commands manage the registry, stale HEAD/store metadata, conservative git-history co-change candidates, and contract links derived from existing AST graph edges.

git-signals records the HEAD commit and indexed timestamp before exposing stored signals. A refresh is skipped when the current HEAD is already indexed; --force rebuilds the bounded default window, and --commits raises or lowers that window for large repositories. Reviewer suggestions are deterministic: direct authorship percentage, co-change ownership, and recency relative to the indexed history are reported as score inputs. Co-change rows also mark hidden coupling when no direct static edge explains the pair.

anatomy-map is inspired by Aider’s repo map, which gives agents a compact symbol-level view before they spend context on full file reads. coding-ethos keeps the same core idea but uses the repo-local Go AST/code-intel store rather than Aider’s prompt-time Python parser/cache and global PageRank ranking. The first implementation is intentionally directory-local so proxy listing enrichment can append concise file anatomy without replacing the original tool output.

The directory-anatomy-map proxy transform preserves the raw directory listing and appends a compact TOON block with in-memory transform hashes and token counts returned by the shared agent proxy pipeline. The live Bash PostToolUse path uses the shared shell parser plus the conservative ls/tree classifier, refreshes the listed source files, and emits the enriched listing as context with proxy.directory_anatomy evidence. ls stays direct-child only, while tree includes nested files and honors tree -L N as an anatomy depth cap. enrich-listing accepts raw listing output from stdin or --listing-file and can infer the target directory and recursive depth from a static, single-target ls or tree command for replay and debugging.

repo-map and the startup coding_ethos_repo_map hook context provide the global variant. The hook refreshes the repo-local AST index at SessionStart, ranks indexed files by compact symbol/chunk signals, and emits the most useful file/symbol signatures as TOON. The same renderer backs MCP code_intel_repo_map and the read-only coding-ethos://code-intel/repo-map resource so agents can request the current map explicitly before broad exploration.

MCP workspace support adds code_intel_workspace_status plus an optional repo argument on code_intel_search, code_intel_answer, and code_intel_repo_map. Omitting repo preserves repo-local behavior. repo="<alias>" opens that registered repo’s independent code-intel store and reports provenance in the result metadata. repo="all" is bounded to read-only aggregate queries and returns per-repo provenance rather than a synthetic merged repository.

graph-report composes the repo map, code-intel store counts, and the latest stored health snapshot into a human, JSON, or TOON orientation report. It is read-only with respect to indexing and tells agents when the AST index or health snapshot needs an explicit refresh.

graph-report also includes deterministic topology communities derived from existing file-level code_edges and git_cochanges. The v1 implementation uses weighted connected components with stable ordering rather than Leiden or another external graph dependency. Community IDs are advisory orientation hints exposed in graph-report, repo-map, and context-card output; enforcement decisions must not depend on community detection.
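One way to picture connected components with stable ordering is a union-find pass followed by sorting. The Python sketch below is an illustration only: the edge tuple shape, the min_weight cutoff (one possible reading of "weighted"), and the community-N ID format are assumptions, not the v1 implementation.

```python
def communities(edges, min_weight=1.0):
    """Group files into components using edges at or above min_weight.

    edges: iterable of (file_a, file_b, weight) tuples (assumed shape).
    Returns a dict of community ID -> sorted member list. The result does
    not depend on edge order because roots and output are both sorted.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for file_a, file_b, weight in edges:
        root_a, root_b = find(file_a), find(file_b)
        if weight >= min_weight and root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low  # smaller name always wins as root

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)

    ordered = sorted(sorted(members) for members in groups.values())
    return {f"community-{index}": members
            for index, members in enumerate(ordered)}
```

Files connected only by edges below the cutoff still appear, each in its own single-member community.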

centrality and surprises expose deterministic graph-orientation views over the same DuckDB facts. Central nodes rank files by explainable structural degree, git co-change, health-priority, finding, and remediation-outcome signals. Surprise edges highlight deterministic cross-boundary relationships such as cross-directory, cross-language, policy/config-to-code, documentation-to-code, test-to-production, and hidden git co-change links. These views are included in graph-report and available as standalone code-intel centrality and code-intel surprises commands. They carry provenance and explanation fields and remain advisory: they help agents choose what to inspect before editing, but they do not block or permit policy decisions.

Markdown documentation is also indexed into the same graph when it contains explicit repo path references. Headings and code blocks remain ordinary Markdown chunks; path references such as go/internal/codeintel/store.go create documents edges, and path#symbol references create mentions edges. Repo-root paths stay repo-relative; Markdown ./ and ../ references resolve relative to the referencing Markdown file’s directory before being stored. Rationale-like headings such as “Rationale”, “Decision”, “Reasoning”, or “Why” classify explicit references as advisory rationale_for links. Documentation links use DOC_DERIVED provenance for extracted references and INFERRED provenance for rationale classification, and graph-report exposes them in a document_links section. This is deterministic Markdown-only graph context: it does not parse PDFs or images, does not call an LLM, and does not turn examples or prose into policy decisions.
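The path-resolution rule above can be sketched with Python's posixpath module. The function names are hypothetical, and dropping references that resolve outside the repository root is an assumption this document does not state.

```python
import posixpath


def resolve_reference(markdown_path, reference):
    """Return a repo-relative path for a reference in a Markdown file.

    ./ and ../ references resolve against the Markdown file's directory;
    anything else is treated as already repo-relative. Returns None when
    the result would escape the repository root (an assumed safeguard).
    """
    if reference.startswith(("./", "../")):
        base = posixpath.dirname(markdown_path)
        resolved = posixpath.normpath(posixpath.join(base, reference))
    else:
        resolved = posixpath.normpath(reference)
    if resolved == ".." or resolved.startswith("../") or posixpath.isabs(resolved):
        return None
    return resolved


def link_for(markdown_path, reference):
    """Classify a reference as a documents or mentions edge.

    path#symbol references become mentions edges; bare paths become
    documents edges. Returns (edge_kind, path, symbol) or None.
    """
    path_part, _, symbol = reference.partition("#")
    resolved = resolve_reference(markdown_path, path_part)
    if resolved is None:
        return None
    if symbol:
        return ("mentions", resolved, symbol)
    return ("documents", resolved, None)
```

For example, a reference to ../go/internal/codeintel/store.go inside docs/CODE_INTEL.md (a hypothetical file name) resolves to go/internal/codeintel/store.go.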

Decision intelligence stores explicit architectural rationale in the same DuckDB index. The decisions CLI group records manual decisions, links them to paths or symbols, indexes inline WHY:, DECISION:, and TRADEOFF: markers found during code indexing, and reports stale, conflicting, overlapping, or ungoverned decision areas through decisions health.

decisions import reads explicit Markdown decision records from a supplied file or directory. Full-repo code indexing also imports the default decision locations: adr/, docs/adr/, docs/decisions/, and .coding-ethos/decisions/. Imported Markdown must opt in with YAML front matter, either coding_ethos_decision: true or a decision / architecture-decision tag. Generic headings such as “Decision”, “Rationale”, or README examples are not imported. Front matter may include title, status, rationale, alternatives, author, recorded_at_utc, updated_at_utc, affected_paths, affected_files, affected_modules, and affected_symbols entries with path plus symbol_path.
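The opt-in check can be sketched as follows, assuming the YAML front matter has already been parsed into a dict and that tags live under a tags key. The key name and the function name are assumptions for illustration.

```python
DECISION_TAGS = frozenset({"decision", "architecture-decision"})


def is_decision_record(front_matter):
    """Return True only when the Markdown file explicitly opts in.

    front_matter: dict parsed from the file's YAML front matter, or None
    when the file has no front matter. The "tags" key is an assumed name.
    """
    if not front_matter:
        return False
    if front_matter.get("coding_ethos_decision") is True:
        return True
    tags = front_matter.get("tags") or []
    if isinstance(tags, str):
        tags = [tags]
    return any(tag in DECISION_TAGS for tag in tags)
```

A README with a "Decision" heading but no front matter is rejected, because headings are never consulted.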

proxy-file-read is the current bridge for read deduplication. It reads a repo-relative file, computes the current content hash, records the first read as a file_read proxy event, and records later unchanged reads in the same session as cache_hit events with a file-read-cache transform. Changed content is a cache miss and returns the full file body again. The command gives transparent proxy work a tested cache primitive without requiring provider/API interception to exist first.
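A minimal in-memory sketch of that cache behavior in Python. It assumes SHA-256 content hashes, that a changed file is recorded as a fresh file_read event, and that a cache hit returns no body; none of these details are specified here, and the class is hypothetical.

```python
import hashlib


class FileReadCache:
    """Per-session read deduplication keyed by content hash."""

    def __init__(self):
        self._hashes = {}  # (session_id, path) -> last content hash seen
        self.events = []   # recorded proxy events, oldest first

    def read(self, session_id, path, content):
        """Record a read and return the body, or None on a cache hit.

        content: current file bytes, read by the caller.
        """
        digest = hashlib.sha256(content).hexdigest()
        key = (session_id, path)
        event = {"session_id": session_id, "path": path,
                 "content_hash": digest}
        if self._hashes.get(key) == digest:
            # Unchanged since the last read in this session.
            event.update(kind="cache_hit", transform="file-read-cache")
            self.events.append(event)
            return None
        # First read, or content changed: cache miss with the full body.
        self._hashes[key] = digest
        event.update(kind="file_read")
        self.events.append(event)
        return content
```

Because the key includes the session ID, a second session reading the same unchanged file still gets the full body once.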

Implemented Storage Foundation

The local store is the durable evidence ledger for CEL, SARIF, AST facts, hook analytics, remediation advice, vector metadata, and remediation outcomes. The target is a complete storage foundation, not a minimal placeholder.

DuckDB remains canonical and should gain:

The first query/CLI surface should answer:

Vector work uses always-built duckdb-vss tables. Metadata stays in normal DuckDB tables for auditability and filtering, while vectors live in dimension-specific vec0 virtual tables. This preserves the ETHOS one-path build contract and keeps vector indexes rebuildable from DuckDB facts.

Vector Backends

Define a narrow vector backend interface before binding to any implementation:

The initial backend priority:

  1. duckdb-vss search for AST-aware code and remediation vectors.
  2. Future service backends only if team-scale deployment needs them and they can be kept outside local enforcement.

Backend records must include enough metadata to filter before or during vector search:

AST Chunking

Use Tree-sitter to produce semantic chunks instead of line windows:

Every chunk should have:

Incremental indexing is required. A changed file should only re-embed changed chunks and dependent summary chunks when their content or structural hash changes.
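The incremental rule can be sketched as a diff over chunk hashes. In this Python illustration the function name and the dict-of-hashes shape are assumptions, and a single hash per chunk stands in for whichever content or structural hash the store keeps.

```python
def plan_reembedding(stored_hashes, current_hashes):
    """Decide which chunks need new vectors after a file is reindexed.

    stored_hashes:  chunk_id -> hash recorded at the last embedding.
    current_hashes: chunk_id -> hash from the fresh AST extraction.
    Returns (to_embed, to_delete) as sorted chunk ID lists.
    """
    to_embed = sorted(
        chunk_id for chunk_id, digest in current_hashes.items()
        if stored_hashes.get(chunk_id) != digest
    )
    to_delete = sorted(
        chunk_id for chunk_id in stored_hashes
        if chunk_id not in current_hashes
    )
    return to_embed, to_delete
```

Unchanged chunks appear in neither list, so their existing vectors are left alone.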

The active implementation indexes Go, Python, JavaScript/TypeScript, shell, and YAML through Tree-sitter. Each indexed file is written to the canonical DuckDB store with a content hash, line count, parser metadata, and index timestamp. Each extracted symbol or configuration entry is written as a stable code_chunk, mirrored into the term index, and exposed as an embedding candidate with record_kind=code_chunk. Parent chunk IDs, containment edges, import edges, and same-file reference edges are stored in DuckDB. SARIF/CEL results that carry AST identity are linked back to matching chunks through ast_finding_links. Markdown remains planned until the selected parser exposes a maintained Go binding or the project adds a first-class adapter for its split parser layout.

Code indexing always ignores repository metadata, runtime cache directories, and nested coding-ethos/ tool checkouts. Exclusions for consumer-specific generated output belong in repo-local configuration rather than in global code. For example, a static-site repo can exclude generated output in repo_config.yaml:

code_intel:
  exclude_paths:
    - "**/dist/**"

Embedding Strategy

Embedding providers must be pluggable and recorded per vector row.

Recommended initial model classes:

Queries should use the matching query/document input mode when the provider supports it. The system must refuse to compare vectors across incompatible model IDs or dimensions.
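A small Python sketch of the compatibility guard. The record type and exception are hypothetical, and cosine similarity is used only as an example metric; no distance function is specified here.

```python
import math
from dataclasses import dataclass


class IncompatibleVectors(ValueError):
    """Raised when two vectors must not be compared."""


@dataclass(frozen=True)
class VectorRecord:
    record_id: str
    model_id: str
    values: tuple


def cosine_similarity(query, document):
    """Compare two vectors only if model ID and dimension both match."""
    if query.model_id != document.model_id:
        raise IncompatibleVectors(
            f"model mismatch: {query.model_id} vs {document.model_id}")
    if len(query.values) != len(document.values):
        raise IncompatibleVectors(
            f"dimension mismatch: {len(query.values)} vs {len(document.values)}")
    dot = sum(q * d for q, d in zip(query.values, document.values))
    norm = (math.sqrt(sum(q * q for q in query.values))
            * math.sqrt(sum(d * d for d in document.values)))
    if norm == 0.0:
        raise IncompatibleVectors("zero-length vector")
    return dot / norm
```

Failing loudly here is deliberate: a silent comparison across models would return a number that looks valid but means nothing.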

Hybrid Retrieval

Search should combine exact filters, the term index, vectors, and reranking:

  1. Apply hard filters: repo, path prefix, language, policy, skill, symbol kind, provider, or time range.
  2. Retrieve candidates from the DuckDB term index.
  3. Retrieve candidates from the vector backend.
  4. Fuse ranks with reciprocal rank fusion.
  5. Boost exact symbol/path/policy matches.
  6. Return traceable results with file spans and policy/remediation context.
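Steps 2 through 5 can be sketched in a few lines of Python. The constant k=60 is a commonly used reciprocal rank fusion default rather than a value taken from this project, and the additive exact_boost is an assumed way to model step 5.

```python
def fuse(term_ranking, vector_ranking, exact_matches=(), k=60, exact_boost=1.0):
    """Fuse two ranked ID lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per record, with rank starting
    at 1. Records in exact_matches (exact symbol/path/policy hits) get an
    extra additive boost. Ties break on record ID for stable output.
    """
    scores = {}
    for ranking in (term_ranking, vector_ranking):
        for rank, record_id in enumerate(ranking, start=1):
            scores[record_id] = scores.get(record_id, 0.0) + 1.0 / (k + rank)
    for record_id in exact_matches:
        if record_id in scores:
            scores[record_id] += exact_boost
    return sorted(scores, key=lambda record_id: (-scores[record_id], record_id))
```

Hard filters from step 1 are assumed to have been applied before either ranking is produced, so the fusion step never sees out-of-scope records.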

Vector-only retrieval should be available for diagnostics, but not the default agent path.

MCP Tool Surface

Add tools only after the store has a stable schema:

All tools are advisory. They must not bypass hooks or edit files.

Phases

Phase 1 - Schema and Trace Ingestion

Acceptance criteria:

Phase 2 - AST Indexing

Acceptance criteria:

Phase 3 - duckdb-vss Vector Backend

Acceptance criteria:

Phase 4 - Hybrid Retrieval Hardening

Acceptance criteria:

Phase 5 - MCP Search Tools

Acceptance criteria:

Phase 6 - Outcome Measurement

Acceptance criteria:

Code Similarity Detection

The code-intelligence store supports real-time structural clone detection through MinHash LSH (Locality-Sensitive Hashing). This enables CEL policies to warn agents about duplicate or near-duplicate code at write time.

Storage Schema

Three fields in code_chunks support similarity:

A separate lsh_bands table stores band hashes for sub-linear candidate retrieval:

| Column | Purpose |
| --- | --- |
| chunk_id | FK to code_chunks.chunk_id |
| band_index | Band number (0–15) |
| band_hash | Hash of the 8-row band slice |
| path | File path (for filtering self-matches) |
| symbol_name | Symbol name (for diagnostics) |

Detection Pipeline

  1. AST indexing normalizes each chunk’s tokens and computes normalized_hash and minhash_sig.
  2. LSH bands are computed (16 bands × 8 rows per band) and stored in lsh_bands.
  3. At policy evaluation time, similarity_facts queries:
    • Exact matches via normalized_hash equality (100% similarity).
    • LSH candidates via band hash collisions, refined with full Jaccard estimation. The default config uses ≥0.7 for candidates and ≥0.8 for policy activation.
  4. Results are exposed to CEL as similarity_facts and reported through SARIF relatedLocations.
  5. Agents can call MCP code_similarity_check with proposed code to inspect matches before writing a duplicate implementation.
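Steps 1 through 3 can be sketched in Python with the standard library. This is an illustration under assumptions: tokens arrive already normalized, shingles are joined with spaces, and salted BLAKE2b stands in for whatever hash family the real indexer uses. All function names are hypothetical.

```python
import hashlib


def shingles(tokens, size=5):
    """Return the set of overlapping token windows of the given size."""
    if len(tokens) < size:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def _hash64(seed, data):
    """One member of a seeded 64-bit hash family (assumed construction)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=128, shingle_size=5):
    """Keep the minimum hash of all shingles for each seeded function."""
    encoded = [" ".join(s).encode("utf-8") for s in shingles(tokens, shingle_size)]
    return [min(_hash64(seed, item) for item in encoded)
            for seed in range(num_hashes)]


def lsh_bands(signature, bands=16, rows=8):
    """Hash each slice of `rows` signature values into one band hash."""
    result = []
    for band_index in range(bands):
        band = signature[band_index * rows:(band_index + 1) * rows]
        payload = b"".join(value.to_bytes(8, "little") for value in band)
        band_hash = hashlib.blake2b(payload, digest_size=8).hexdigest()
        result.append((band_index, band_hash))
    return result


def shares_band(sig_a, sig_b):
    """Two chunks are LSH candidates when any (index, hash) pair collides."""
    return bool(set(lsh_bands(sig_a)) & set(lsh_bands(sig_b)))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates shingle-set Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Candidates found through shares_band would then be refined with estimated_jaccard against the configured thresholds.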

Reconciliation

LSH bands are transactionally reconciled with code chunks. When a file is reindexed, bands for the old path are deleted before new chunks and bands are inserted within the same transaction. This prevents stale band entries from producing false-positive candidates.
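The delete-then-insert pattern can be illustrated with Python's built-in sqlite3 module standing in for DuckDB. The table shapes are simplified, and deleting the old code_chunks rows alongside the bands is an assumption; the point is only that a failed reindex leaves the previous rows intact.

```python
import sqlite3


def open_store():
    """Create a throwaway in-memory store with simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE code_chunks (chunk_id TEXT PRIMARY KEY, path TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE lsh_bands (chunk_id TEXT NOT NULL, band_index INTEGER NOT NULL,"
        " band_hash TEXT NOT NULL, path TEXT NOT NULL)")
    conn.commit()
    return conn


def reindex_file(conn, path, chunks):
    """Replace all chunks and bands for one path atomically.

    chunks: list of (chunk_id, [(band_index, band_hash), ...]).
    The connection context manager commits on success and rolls back
    if any statement raises, so stale and new rows never mix.
    """
    with conn:
        conn.execute("DELETE FROM lsh_bands WHERE path = ?", (path,))
        conn.execute("DELETE FROM code_chunks WHERE path = ?", (path,))
        for chunk_id, bands in chunks:
            conn.execute(
                "INSERT INTO code_chunks (chunk_id, path) VALUES (?, ?)",
                (chunk_id, path))
            conn.executemany(
                "INSERT INTO lsh_bands (chunk_id, band_index, band_hash, path)"
                " VALUES (?, ?, ?, ?)",
                [(chunk_id, index, digest, path) for index, digest in bands])
```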

Configuration

The default similarity config uses 128 hash functions, 16 bands, and 8 rows per band. Under the standard banding approximation (1/b)^(1/r), this gives an LSH candidate threshold of about (1/16)^(1/8) ≈ 0.71 Jaccard similarity, close to the S-curve inflection point: most pairs above roughly 71% similarity share at least one band hash, while pairs well below it rarely do. The runtime candidate threshold defaults to 0.7, and the structural threshold defaults to 0.8.

similarity:
  enabled: true
  minhash_size: 128
  shingle_size: 5
  lsh_bands: 16
  lsh_rows: 8
  min_symbol_lines: 5
  exact_normalized: true
  candidate_threshold: 0.7
  structural_threshold: 0.8
  max_matches: 10

Risks and Mitigations

Open Questions