AST, CEL, and SARIF Architecture
coding-ethos policy work follows one architecture:
- Go collects facts.
- CEL decides policy when the rule can be expressed over those facts.
- SARIF reports exact, stable, remediation-ready findings.
- The code-intelligence store retains the evidence for later agent search.
This is the first path to use for new source-aware enforcement. Do not add a new ad hoc text scanner, one-off AST traversal, or policy-specific parser unless the shared fact path cannot represent the required input yet. In that case, extend the fact collector first, then expose the new fact through CEL, SARIF, MCP, and code-intelligence storage from the same normalized record.
The goal is not “AST everywhere” for its own sake. The goal is one inspectable pipeline for source-aware policy:
    file/edit/diff/command
            |
            v
    Go parser and context collectors
            |
            +--> normalized facts
            |      python_ast
            |      shell_commands
            |      proposed_symbol_changes
            |      changed_symbols
            |      similarity_facts
            |      tool_capabilities
            |
            +--> CEL policy decisions
            |
            +--> diagnostics and agent_remediation
            |
            +--> SARIF with AST identity and fingerprints
            |
            +--> DuckDB code-intel store, FTS5, and duckdb-vss metadata
            |
            +--> MCP search, policy explanation, and remediation tools
Responsibilities
Go Fact Collection
Go owns parsing and host inspection. It is responsible for:
- Tree-sitter parsing and parser lifecycle.
- File, command, shell, Git, diff, and hook context collection.
- Normalized fact records with stable field names.
- Syntax recovery and fail-fast errors where policy cannot run safely.
- Diagnostic location metadata: file, line, column, end line, node kind, symbol kind, symbol path, and parent symbol path.
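The location metadata listed above can be pictured as one small record that every fact carries. The following is a minimal sketch, assuming hypothetical field names, JSON keys, and 1-based line and column numbering; none of these are taken from the project's actual schema.

```go
package main

import (
	"errors"
	"fmt"
)

// Location mirrors the diagnostic location metadata listed above.
// Field names and JSON keys are illustrative assumptions.
type Location struct {
	File             string `json:"file"`
	Line             int    `json:"line"`
	Column           int    `json:"column"`
	EndLine          int    `json:"end_line"`
	NodeKind         string `json:"node_kind"`
	SymbolKind       string `json:"symbol_kind"`
	SymbolPath       string `json:"symbol_path"`
	ParentSymbolPath string `json:"parent_symbol_path"`
}

// Validate fails fast when a location cannot ground a finding.
// It assumes 1-based lines and columns, which is an assumption of this sketch.
func (l Location) Validate() error {
	switch {
	case l.File == "":
		return errors.New("location has no file")
	case l.Line < 1 || l.Column < 1:
		return errors.New("line and column must be 1 or greater")
	case l.EndLine < l.Line:
		return errors.New("end line precedes start line")
	case l.NodeKind == "":
		return errors.New("location has no node kind")
	}
	return nil
}

func main() {
	loc := Location{
		File: "pkg/app.py", Line: 12, Column: 1, EndLine: 20,
		NodeKind: "function_definition", SymbolKind: "function",
		SymbolPath: "app.load", ParentSymbolPath: "app",
	}
	if err := loc.Validate(); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("location accepted for", loc.SymbolPath)
}
```

Validating at collection time keeps ungrounded spans out of every downstream consumer at once, rather than re-checking in CEL, SARIF, and storage separately.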
For Python, the reusable fact surface is python_ast. It is populated from the
same Tree-sitter fact collector used by the compiled Python evaluators. It
includes imports, calls, functions, classes, assignments, lambdas, exception
handlers, symbol context, ancestry flags, and initial function signature facts.
For shell, the reusable fact surface is shell_commands. It is populated from
the full shell parser, not substring matching. It exposes command names, argv,
assignments, redirects, write targets, heredocs, command substitutions, process
substitutions, subshells, dynamic expansion, background execution, and
line/column metadata. Git routing, lint capture, malformed shell checks, and
command-safety CEL rules should consume these parsed facts instead of scanning
raw command text.
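The difference between parsed shell facts and raw-text scanning is easiest to see on a quoted string. The sketch below assumes a reduced, hypothetical `ShellCommand` record and an illustrative force-push rule; neither is the project's actual type or policy.

```go
package main

import (
	"fmt"
	"strings"
)

// ShellCommand is a hypothetical, reduced view of one shell_commands fact.
type ShellCommand struct {
	Name string   // command name, e.g. "git"
	Argv []string // arguments after the command name
	Line int
}

// isGitForcePush is an illustrative rule, not one of the project's policies.
// It inspects parsed argv, so quoted text inside other commands cannot match.
// Simplification: global git options before the subcommand are not handled.
func isGitForcePush(c ShellCommand) bool {
	if c.Name != "git" || len(c.Argv) == 0 || c.Argv[0] != "push" {
		return false
	}
	for _, arg := range c.Argv[1:] {
		if arg == "--force" || arg == "-f" {
			return true
		}
	}
	return false
}

func main() {
	raw := `echo "git push --force is blocked"`
	// What a shell parser would plausibly report for raw: a single echo command
	// whose only argument is the quoted string.
	parsed := ShellCommand{Name: "echo", Argv: []string{"git push --force is blocked"}, Line: 1}

	fmt.Println("substring match:", strings.Contains(raw, "git push --force"))
	fmt.Println("parsed-fact match:", isGitForcePush(parsed))
}
```

The substring check flags the harmless `echo`, while the argv-based check does not, which is the kind of false positive the parsed fact surface is meant to avoid.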
For changed source, proposed_symbol_changes and changed_symbols map
Edit/Write/MultiEdit payloads and staged diff hunks to Tree-sitter symbols.
They include the file, language, node kind, symbol kind/name/path, line spans,
content hashes, action, and line-count delta. Large-file and large-symbol
policies use these symbol deltas so shrinking refactors remain allowed while
growth is blocked before commit.
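A size policy built on symbol deltas might look like the sketch below. The `SymbolChange` fields and the 80-line limit are assumptions chosen for illustration; only the idea of combining the resulting size with the line-count delta comes from the text above.

```go
package main

import "fmt"

// SymbolChange is a hypothetical subset of a changed_symbols record.
type SymbolChange struct {
	SymbolPath string
	LineCount  int // lines in the symbol after the change
	LineDelta  int // lines after minus lines before; negative means it shrank
}

// blocksCommit reports whether a size policy should block the change.
// An oversized symbol is blocked only when the change made it grow.
func blocksCommit(c SymbolChange, maxLines int) bool {
	return c.LineCount > maxLines && c.LineDelta > 0
}

func main() {
	const maxLines = 80 // illustrative limit, not a project default
	changes := []SymbolChange{
		{SymbolPath: "app.load", LineCount: 120, LineDelta: -30}, // shrinking refactor
		{SymbolPath: "app.save", LineCount: 120, LineDelta: 5},   // growth past the limit
		{SymbolPath: "app.ping", LineCount: 40, LineDelta: 10},   // growth under the limit
	}
	for _, c := range changes {
		fmt.Printf("%s blocked=%v\n", c.SymbolPath, blocksCommit(c, maxLines))
	}
}
```

Keying the decision on the delta rather than the absolute size is what lets a still-oversized symbol pass while it is being reduced.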
For hook command implementations, hook_commands maps Go command functions to
ordered call facts. Hook-stage commands that run path-sensitive checks in
pre-commit or pre-push must apply the hook-provided changed-file list before
invoking whole-surface or configured-path runners. CEL owns the policy decision;
Go owns the ordered call facts and diagnostic location for the offending command
function.
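The ordering requirement can be checked directly against ordered call facts. This is a sketch under stated assumptions: the `HookCommand` shape and the call names `applyChangedFiles` and `runConfiguredPaths` are hypothetical, and in the real pipeline the decision would live in CEL rather than in Go.

```go
package main

import "fmt"

// HookCommand is a hypothetical hook_commands record: one Go command function
// and the calls it makes, in source order.
type HookCommand struct {
	Function string
	Calls    []string
}

// firstUnfilteredRunner returns the first runner call that appears before the
// changed-file filter is applied, or "" when the ordering is safe.
// A runner with no filter call anywhere is also reported.
func firstUnfilteredRunner(cmd HookCommand, filter string, runners map[string]bool) string {
	for _, call := range cmd.Calls {
		if call == filter {
			return ""
		}
		if runners[call] {
			return call
		}
	}
	return ""
}

func main() {
	runners := map[string]bool{"runConfiguredPaths": true} // hypothetical runner name
	cmd := HookCommand{
		Function: "preCommitLint",
		Calls:    []string{"loadConfig", "runConfiguredPaths", "applyChangedFiles"},
	}
	if bad := firstUnfilteredRunner(cmd, "applyChangedFiles", runners); bad != "" {
		fmt.Printf("%s calls %s before the changed-file filter\n", cmd.Function, bad)
	}
}
```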
For persisted context, code-intelligence indexing uses the same parser
foundation to store code chunks, config entries, parent/child relationships,
graph edges, parser metadata, and AST-to-finding links in
.coding-ethos/code-intel.duckdb. This keeps MCP retrieval and future embedding
search on the same source identity used by CEL and SARIF.
CEL Policy Decisions
CEL owns configurable policy predicates. If a decision can be expressed as a
boolean over normalized facts, put that decision beside its ETHOS principle in
coding_ethos.yml or the compiled policy configuration.
Examples:
    python_ast.exists(fact, fact.is_dynamic_import)

    python_ast.exists(
      fact,
      fact.node_kind == "function_definition" &&
        fact.parameter_count > 5
    )
Do not put parsing logic, path probing, Git execution, or semantic extraction in CEL. Add or extend Go facts instead, then keep the policy expression small and auditable.
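For readers less familiar with CEL's `exists` macro, the two example expressions can be read as the plain-Go predicates below. This is only an illustration of their semantics, assuming a hypothetical `PythonFact` struct; it is not how coding-ethos evaluates CEL.

```go
package main

import "fmt"

// PythonFact is a hypothetical slice of the python_ast record shape; only the
// fields used by the two example expressions are shown.
type PythonFact struct {
	NodeKind        string
	IsDynamicImport bool
	ParameterCount  int
}

// exists mirrors CEL's exists macro: true when any element satisfies pred.
func exists[T any](items []T, pred func(T) bool) bool {
	for _, item := range items {
		if pred(item) {
			return true
		}
	}
	return false
}

// hasDynamicImport corresponds to the first example expression.
func hasDynamicImport(facts []PythonFact) bool {
	return exists(facts, func(f PythonFact) bool { return f.IsDynamicImport })
}

// hasWideFunction corresponds to the second example expression.
func hasWideFunction(facts []PythonFact) bool {
	return exists(facts, func(f PythonFact) bool {
		return f.NodeKind == "function_definition" && f.ParameterCount > 5
	})
}

func main() {
	facts := []PythonFact{
		{NodeKind: "import_statement"},
		{NodeKind: "function_definition", ParameterCount: 7},
	}
	fmt.Println("dynamic import:", hasDynamicImport(facts))
	fmt.Println("wide function:", hasWideFunction(facts))
}
```

Both predicates only read fields that the collector already computed, which is what keeps the policy expression small and auditable.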
SARIF Reporting
SARIF owns durable machine-readable output and is the superset of CEL evidence. Everything coding-ethos observes and can safely retain belongs in SARIF: normalized diagnostics, pathless tool-level failures, parser state, raw stdout/stderr payloads, exit status, sandbox evidence, remediation metadata, and source identity. CEL receives only the understood subset: stable facts and diagnostics that can support deterministic policy decisions. Do not discard observed evidence merely because it is not yet expressible in CEL.
AST-backed findings must preserve:
- ruleId from the policy ID.
- Exact artifact location and region.
- AST properties such as ast_node_kind, ast_symbol_kind, ast_symbol_path, and ast_parent_symbol_path.
- Partial fingerprints that remain stable across unrelated line movement.
- Principle and skill metadata so GitHub code scanning and MCP remediation advice point to the same explanation.
The same normalized finding should also be ingestible by the code-intelligence store. SARIF should carry enough stable metadata for later joins to code chunks, remediation outcomes, hook usage analytics, and vector metadata without reinterpreting the original policy.
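One way to satisfy the fingerprint requirement is to hash symbol identity instead of line numbers. The sketch below builds a SARIF-shaped result with standard SARIF 2.1.0 field names (`ruleId`, `locations`, `partialFingerprints`, `properties`); the `Finding` type, the rule ID, and the fingerprint key `astSymbolIdentity/v1` are assumptions made for illustration, not the project's actual scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"strings"
)

type Message struct{ Text string `json:"text"` }
type ArtifactLocation struct{ URI string `json:"uri"` }
type Region struct {
	StartLine   int `json:"startLine"`
	StartColumn int `json:"startColumn"`
	EndLine     int `json:"endLine"`
}
type PhysicalLocation struct {
	ArtifactLocation ArtifactLocation `json:"artifactLocation"`
	Region           Region           `json:"region"`
}
type Location struct{ PhysicalLocation PhysicalLocation `json:"physicalLocation"` }

// Result is a reduced SARIF result object.
type Result struct {
	RuleID              string            `json:"ruleId"`
	Message             Message           `json:"message"`
	Locations           []Location        `json:"locations"`
	PartialFingerprints map[string]string `json:"partialFingerprints"`
	Properties          map[string]string `json:"properties"`
}

// Finding is a hypothetical normalized finding produced by the policy path.
type Finding struct {
	RuleID, Message, URI                               string
	Region                                             Region
	NodeKind, SymbolKind, SymbolPath, ParentSymbolPath string
}

// symbolFingerprint hashes rule and symbol identity but no line numbers, so
// the value survives edits that only move the symbol within the file.
func symbolFingerprint(f Finding) string {
	key := strings.Join([]string{f.RuleID, f.URI, f.NodeKind, f.SymbolPath}, "\x00")
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:16])
}

func newResult(f Finding) Result {
	return Result{
		RuleID:  f.RuleID,
		Message: Message{Text: f.Message},
		Locations: []Location{{PhysicalLocation: PhysicalLocation{
			ArtifactLocation: ArtifactLocation{URI: f.URI},
			Region:           f.Region,
		}}},
		// The key name is an assumption of this sketch.
		PartialFingerprints: map[string]string{"astSymbolIdentity/v1": symbolFingerprint(f)},
		Properties: map[string]string{
			"ast_node_kind":          f.NodeKind,
			"ast_symbol_kind":        f.SymbolKind,
			"ast_symbol_path":        f.SymbolPath,
			"ast_parent_symbol_path": f.ParentSymbolPath,
		},
	}
}

func main() {
	out, err := json.MarshalIndent(newResult(Finding{
		RuleID: "python.dynamic-import", Message: "dynamic import in app.load",
		URI: "pkg/app.py", Region: Region{StartLine: 10, StartColumn: 1, EndLine: 12},
		NodeKind: "call", SymbolKind: "function", SymbolPath: "app.load", ParentSymbolPath: "app",
	}), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A trade-off of this particular key: two findings of the same rule inside the same symbol would share a fingerprint, so a real scheme would likely add a further discriminator such as a normalized snippet hash.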
Extension Workflow
When porting guidance from pyqa_lint or adding another source-aware policy:
- Identify the fact shape needed by the rule.
- Extend the shared Go fact collector if the fact does not exist.
- Add CEL schema coverage for the fact when the decision should be configurable.
- Express the policy in CEL when possible.
- Emit diagnostics through the existing policy path so SARIF receives the AST metadata automatically.
- Persist or link the evidence in code-intelligence storage when it should be searchable later.
- Add tests at the fact, CEL, evaluator, SARIF, and storage layers proportional to the new behavior.
Anti-Patterns
- Do not use substring matching for shell or source policy when a parsed fact surface exists.
- Do not add a policy-specific Tree-sitter traversal when the shared chunk/fact collector can be extended.
- Do not put parsing, path probing, or Git execution inside CEL.
- Do not emit SARIF locations that are not grounded in real files and spans.
- Do not create a second MCP-only interpretation of a policy. MCP should query the compiled bundle, retained traces, SARIF, and code-intel records.
- Do not hide current gaps behind optional build tags or degraded paths. If a fact is required for enforcement, it must be in the normal build and test path.
Current Python Fact Uses
The current Python AST policies already use this path for:
- Conditional import enforcement:
  - nested imports
  - TYPE_CHECKING imports
  - except ImportError/ModuleNotFoundError
  - module __getattr__
  - __import__
  - importlib.import_module
- Functional idiom guidance:
  - assigned lambdas
  - returned or assigned nested closure factories
Code Similarity Facts
The similarity_facts CEL variable exposes MinHash LSH-based code clone
detection results. The pipeline:
- Token normalization: Tree-sitter chunks are normalized (identifiers → $ID, literals → $STR/$NUM) and hashed for Type-2 clone detection.
- MinHash signatures: each chunk gets a 128-value MinHash signature stored in code_chunks.minhash_sig.
- LSH band storage: signatures are split into 16 bands × 8 rows and stored in the lsh_bands table for sub-linear candidate retrieval.
- Exact match: normalized_hash equality finds structurally identical chunks across files (Type-2 clones at 100% similarity).
- LSH candidate retrieval + Jaccard refinement: band hash collisions produce candidates; full Jaccard estimation then filters out candidates below the configured candidate threshold.
CEL receives similarity_facts as a list of match records, each carrying:
- file, symbol_name, symbol_kind, symbol_path, language (source chunk)
- match_path, match_symbol_name, match_symbol_kind, match_start_line (target)
- similarity (float64, 0.0–1.0)
- exact_normalized (bool, true when normalized_hash matches exactly)
The default policy expression:
similarity_facts.exists(fact, fact.similarity >= 0.8)
When the policy fires, applySimilarityDiagnostic enriches the diagnostic with:
- A human-readable message listing each match with path, line, and similarity percentage.
- SARIF relatedLocations entries pointing at each matching symbol so IDE integrations can navigate directly to the existing code.
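The enrichment step can be approximated as follows. This is a sketch of the described behavior, not the body of applySimilarityDiagnostic: the `Match` and `RelatedLocation` types, the `describeMatches` name, and the message wording are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Match is a reduced, hypothetical view of one similarity_facts record.
type Match struct {
	SymbolPath     string
	MatchPath      string
	MatchStartLine int
	Similarity     float64
}

// RelatedLocation carries what a SARIF relatedLocations entry needs here.
type RelatedLocation struct {
	URI       string
	StartLine int
	Message   string
}

// describeMatches keeps matches at or above threshold and renders both the
// human-readable message and one related location per retained match.
func describeMatches(matches []Match, threshold float64) (string, []RelatedLocation) {
	var lines []string
	var related []RelatedLocation
	for _, m := range matches {
		if m.Similarity < threshold {
			continue
		}
		entry := fmt.Sprintf("%s:%d (%.0f%% similar)", m.MatchPath, m.MatchStartLine, m.Similarity*100)
		lines = append(lines, entry)
		related = append(related, RelatedLocation{URI: m.MatchPath, StartLine: m.MatchStartLine, Message: entry})
	}
	if len(lines) == 0 {
		return "", nil
	}
	return "similar code already exists:\n  " + strings.Join(lines, "\n  "), related
}

func main() {
	msg, related := describeMatches([]Match{
		{SymbolPath: "app.load", MatchPath: "util/io.py", MatchStartLine: 10, Similarity: 1.0},
		{SymbolPath: "app.load", MatchPath: "util/net.py", MatchStartLine: 40, Similarity: 0.5},
	}, 0.8)
	fmt.Println(msg)
	fmt.Println("related locations:", len(related))
}
```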
The default config.yaml settings use a 0.7 LSH candidate threshold and 0.8
structural threshold. Repos can tune similarity.candidate_threshold,
similarity.structural_threshold, similarity.min_symbol_lines, and MinHash
shape parameters without changing the compiled code.
Planned Python Fact Uses
Future ports from pyqa_lint should add facts or CEL predicates for strict
typing, signature width, docstring sections, value-type dunder inference,
interface boundaries, DI composition roots, cache wrappers, Python hygiene, and
package documentation conventions.