Types

Data structures shared across the codebase. Parsers produce these, the indexer consumes them.

Shared data types for the SQL indexer.

These dataclasses define the contract between parsers and the indexer orchestrator. Every language parser returns a ParseResult. The orchestrator consumes ParseResults and writes to DuckDB. Parsers never touch the database. The orchestrator never does language-specific parsing.

NodeResult dataclass

NodeResult(
    kind,
    name,
    line_start=None,
    line_end=None,
    metadata=None,
)

A nameable entity found in a file.

Nodes are the universal unit of the knowledge graph. A node is anything a parser identifies as structurally meaningful: a table, view, CTE, function, class, module, API endpoint, Terraform resource, etc.

The kind field is parser-defined and unconstrained -- each language emits whatever kinds are meaningful for it.

Attributes:

kind (str): Entity type (e.g. "table", "view", "cte").
name (str): Unqualified entity name (e.g. "orders").
line_start (int | None): First line in the source file, or None if unknown.
line_end (int | None): Last line in the source file, or None if unknown.
metadata (dict | None): Arbitrary parser-supplied metadata (schema, dialect, filters, etc.).
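
As a rough illustration of how a parser might build nodes, the sketch below uses a local stand-in dataclass that mirrors the documented NodeResult signature, so it runs without the package installed. The kind "terraform_resource" and the metadata keys are illustrative values, not ones the library is known to define.

```python
from __future__ import annotations

from dataclasses import dataclass


# Local stand-in mirroring the documented NodeResult signature (assumption:
# the real class is a plain dataclass with these fields and defaults).
@dataclass
class NodeResult:
    kind: str
    name: str
    line_start: int | None = None
    line_end: int | None = None
    metadata: dict | None = None


# A table found by a SQL parser, with a known line span and some metadata.
orders = NodeResult(
    kind="table",
    name="orders",
    line_start=1,
    line_end=12,
    metadata={"schema": "analytics", "dialect": "duckdb"},
)

# kind is unconstrained, so a non-SQL parser can emit its own kinds.
# "terraform_resource" is an illustrative value only.
bucket = NodeResult(kind="terraform_resource", name="aws_s3_bucket.logs")

print(orders)
print(bucket.line_start)  # position was not supplied, so this is None
```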

EdgeResult dataclass

EdgeResult(
    source_name,
    source_kind,
    target_name,
    target_kind,
    relationship,
    context=None,
    metadata=None,
)

A relationship between two entities.

Edges reference nodes by (name, kind) pairs, not database IDs. The indexer orchestrator resolves these to node IDs during insertion. This means parsers don't need to know about the database and parse order doesn't matter.

The target may be in another file or even another repo. If unresolved at insert time, the orchestrator creates a phantom node.
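
The resolution step described above can be sketched as follows. This is not the orchestrator's actual code: resolve_edges, the in-memory ID map, and the integer ID scheme are hypothetical, and the sketch treats both endpoints the same way for simplicity. It only shows how (name, kind) pairs could be mapped to IDs, with a phantom entry created for anything not yet known.

```python
from __future__ import annotations

from dataclasses import dataclass


# Local stand-in mirroring the documented EdgeResult signature.
@dataclass
class EdgeResult:
    source_name: str
    source_kind: str
    target_name: str
    target_kind: str
    relationship: str
    context: str | None = None
    metadata: dict | None = None


def resolve_edges(known, edges):
    """Map (name, kind) pairs to integer IDs (hypothetical helper).

    `known` maps (name, kind) -> node ID. Any endpoint missing from it gets
    a fresh ID and is recorded as a phantom. Returns (resolved, phantoms).
    """
    ids = dict(known)
    next_id = max(ids.values(), default=0) + 1
    phantoms = []
    resolved = []
    for edge in edges:
        endpoint_ids = []
        for key in (
            (edge.source_name, edge.source_kind),
            (edge.target_name, edge.target_kind),
        ):
            if key not in ids:
                ids[key] = next_id  # placeholder for a node defined elsewhere
                phantoms.append(key)
                next_id += 1
            endpoint_ids.append(ids[key])
        resolved.append((endpoint_ids[0], endpoint_ids[1], edge.relationship))
    return resolved, phantoms


known_nodes = {("daily_sales", "query"): 1}
edges = [
    EdgeResult("daily_sales", "query", "orders", "table", "references",
               context="FROM clause"),
    EdgeResult("daily_sales", "query", "customers", "table", "references",
               context="JOIN clause"),
]
resolved, phantoms = resolve_edges(known_nodes, edges)
print(resolved)
print(phantoms)
```

Because edges carry names rather than IDs, the same edge list resolves identically whichever file was parsed first; only the moment at which a phantom is replaced by a real node changes.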

Attributes:

source_name (str): Name of the source node.
source_kind (str): Kind of the source node (e.g. "query").
target_name (str): Name of the target node.
target_kind (str): Kind of the target node (e.g. "table").
relationship (str): Edge label (e.g. "references", "defines", "inserts_into", "cte_references").
context (str | None): Human-readable context (e.g. "FROM clause", "JOIN clause").
metadata (dict | None): Arbitrary edge metadata (source_schema, target_schema, etc.).

ColumnUsageResult dataclass

ColumnUsageResult(
    node_name,
    node_kind,
    table_name,
    column_name,
    usage_type,
    alias=None,
    transform=None,
)

SQL-specific: column-level lineage from sqlglot.

Records which columns are used where and how. Only the SQL parser populates these -- all other parsers return an empty list.

This data is stored in a separate table from edges because column usage is high-volume with its own query patterns (flat scans, not graph traversals).

Attributes:

node_name (str): Name of the query/CTE/view that uses this column.
node_kind (str): Kind of the owning node (e.g. "query", "cte").
table_name (str): Source table the column belongs to.
column_name (str): Column name ("*" for SELECT *).
usage_type (str): How the column is used. One of "select", "where", "join_on", "group_by", "order_by", "having", "insert", "update", "partition_by", "window_order", "qualify".
alias (str | None): Output alias if the column is aliased (AS name).
transform (str | None): Wrapping expression, e.g. "CAST(a.updated AS DATETIME)".
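
To make the "flat scan" access pattern concrete, here is a small sketch using a local stand-in for ColumnUsageResult. The records are what a SQL parser might plausibly emit for the query in the comment; the exact records the real parser produces, and the node name "open_orders", are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


# Local stand-in mirroring the documented ColumnUsageResult signature.
@dataclass
class ColumnUsageResult:
    node_name: str
    node_kind: str
    table_name: str
    column_name: str
    usage_type: str
    alias: str | None = None
    transform: str | None = None


# Illustrative records for:
#   SELECT CAST(o.updated AS DATETIME) AS updated_at
#   FROM orders o
#   WHERE o.status = 'open'
usage = [
    ColumnUsageResult(
        "open_orders", "query", "orders", "updated", "select",
        alias="updated_at",
        transform="CAST(o.updated AS DATETIME)",
    ),
    ColumnUsageResult("open_orders", "query", "orders", "status", "where"),
]

# A flat scan: which columns of "orders" are filtered on anywhere?
filtered_on = [
    u.column_name
    for u in usage
    if u.table_name == "orders" and u.usage_type == "where"
]
print(filtered_on)
```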

LineageHop dataclass

LineageHop(column, table, expression=None)

One hop in a column lineage chain.

Attributes:

column (str): Column name at this hop.
table (str): Table, CTE, or subquery name at this hop.
expression (str | None): Transform applied at this hop (e.g. "CAST(amount AS DECIMAL)"), or None if the column passes through unchanged.

ColumnLineageResult dataclass

ColumnLineageResult(
    output_column, output_node, chain=list()
)

End-to-end column lineage through CTEs and subqueries.

Traces an output column back to its source table column(s), recording each intermediate hop (CTE, subquery, transform).

Attributes:

output_column (str): Column name in the final output.
output_node (str): The query, table, or view that produces this column.
chain (list[LineageHop]): Ordered hops from output back to source.
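
A lineage chain for a column that passes through one CTE might look like the sketch below. The classes are local stand-ins; the node name "report", the choice of which hop carries the expression, and the reading of chain=list() as a per-instance default list are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Local stand-ins mirroring the documented signatures.
@dataclass
class LineageHop:
    column: str
    table: str
    expression: str | None = None


@dataclass
class ColumnLineageResult:
    output_column: str
    output_node: str
    # Assumption: the documented default `list()` is a fresh list per instance.
    chain: list[LineageHop] = field(default_factory=list)


# Illustrative lineage for:
#   WITH cleaned AS (SELECT CAST(amount AS DECIMAL) AS amount FROM orders)
#   SELECT amount AS total FROM cleaned
lineage = ColumnLineageResult(
    output_column="total",
    output_node="report",
    chain=[
        LineageHop("amount", "cleaned", expression="CAST(amount AS DECIMAL)"),
        LineageHop("amount", "orders"),  # passes through unchanged
    ],
)

# The chain runs from output back to source, so the last hop is the origin.
source = lineage.chain[-1]
transforms = [hop.expression for hop in lineage.chain if hop.expression]
print(f"{lineage.output_column} <- {source.table}.{source.column}")
print(transforms)
```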

ColumnDefResult dataclass

ColumnDefResult(
    node_name,
    column_name,
    data_type=None,
    position=None,
    source="definition",
    description=None,
)

Column definition metadata extracted from SQL or schema files.

Records column-level metadata for tables and views, including the column's data type, ordinal position, provenance, and optional description. Parsers emit these alongside nodes and edges so the indexer can build a column-level catalogue.

Attributes:

node_name (str): The table or view this column belongs to.
column_name (str): Column name as declared.
data_type (str | None): SQL data type (e.g. "VARCHAR", "INT"), or None if unknown.
position (int | None): Ordinal position in the column list (0-based), or None if unavailable.
source (Literal['definition', 'inferred', 'schema_yml', 'sqlmesh_schema']): How this column was discovered. One of "definition" (from CREATE/ALTER DDL), "inferred" (from SELECT output), "schema_yml" (from dbt schema.yml), "sqlmesh_schema" (from sqlmesh model schema).
description (str | None): Human-readable column description, or None.
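
The sketch below shows how column definitions from different provenances could sit side by side. It uses a local stand-in for ColumnDefResult; the sample columns and the description text are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


# Local stand-in mirroring the documented ColumnDefResult signature.
@dataclass
class ColumnDefResult:
    node_name: str
    column_name: str
    data_type: str | None = None
    position: int | None = None
    source: Literal[
        "definition", "inferred", "schema_yml", "sqlmesh_schema"
    ] = "definition"
    description: str | None = None


columns = [
    # From DDL: CREATE TABLE orders (id INT, status VARCHAR)
    ColumnDefResult("orders", "status", data_type="VARCHAR", position=1),
    ColumnDefResult("orders", "id", data_type="INT", position=0),
    # From a dbt schema.yml entry: a description but no type or position.
    ColumnDefResult(
        "orders", "status",
        source="schema_yml",
        description="Order lifecycle state.",
    ),
]

# Recover the declared column order using the 0-based ordinal position.
ddl_columns = [
    c.column_name
    for c in sorted(
        (c for c in columns if c.source == "definition"),
        key=lambda c: c.position,
    )
]
print(ddl_columns)
```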

ParseResult dataclass

ParseResult(
    language,
    nodes=list(),
    edges=list(),
    column_usage=list(),
    column_lineage=list(),
    columns=list(),
    errors=list(),
)

Everything a parser returns for one file.

This is the complete interface contract. A parser receives a file path and its content, and returns one of these. The orchestrator handles everything from here -- ID assignment, edge resolution, database writes.

Mutation contract

ParseResult is intentionally mutable (not frozen=True). Renderers and post-processing steps mutate nodes, edges, and other lists in-place -- e.g. appending synthetic nodes, deduplicating edges, or rewriting names during normalisation. This is by design: allocating a new ParseResult for every transform would add complexity with no practical benefit, since a ParseResult is owned by a single file-processing pipeline and is never shared across threads.

Attributes:

language (str): Parser language identifier (e.g. "sql").
nodes (list[NodeResult]): Entities discovered in the file.
edges (list[EdgeResult]): Relationships between entities.
column_usage (list[ColumnUsageResult]): Column-level usage records (SQL only).
column_lineage (list[ColumnLineageResult]): End-to-end column lineage chains (SQL only).
columns (list[ColumnDefResult]): Column definitions extracted from DDL or schema files.
errors (list[str]): Non-fatal parse errors encountered during processing.
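
The contract and the mutation rule can be illustrated with a toy parser and one post-processing step. Everything here is a sketch under stated assumptions: the dataclasses are trimmed local stand-ins, and parse, dedupe_edges, and the "file" kind are hypothetical names that do not come from the library.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Trimmed local stand-ins mirroring the documented signatures.
@dataclass
class NodeResult:
    kind: str
    name: str
    line_start: int | None = None
    line_end: int | None = None
    metadata: dict | None = None


@dataclass
class EdgeResult:
    source_name: str
    source_kind: str
    target_name: str
    target_kind: str
    relationship: str
    context: str | None = None
    metadata: dict | None = None


@dataclass
class ParseResult:
    language: str
    nodes: list[NodeResult] = field(default_factory=list)
    edges: list[EdgeResult] = field(default_factory=list)
    column_usage: list = field(default_factory=list)
    column_lineage: list = field(default_factory=list)
    columns: list = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


def parse(path: str, content: str) -> ParseResult:
    """Toy parser (hypothetical): one node per `CREATE TABLE <name>` line."""
    result = ParseResult(language="sql")
    for lineno, line in enumerate(content.splitlines(), start=1):
        words = line.split()
        if [w.upper() for w in words[:2]] != ["CREATE", "TABLE"]:
            continue
        if len(words) < 3:
            # Non-fatal: record the problem and keep going.
            result.errors.append(f"{path}:{lineno}: CREATE TABLE without a name")
            continue
        name = words[2].strip("(;")
        result.nodes.append(NodeResult("table", name, line_start=lineno))
    return result


def dedupe_edges(result: ParseResult) -> None:
    """Post-processing step that rewrites result.edges in place."""
    seen = set()
    unique = []
    for e in result.edges:
        key = (e.source_name, e.source_kind, e.target_name, e.target_kind,
               e.relationship)
        if key not in seen:
            seen.add(key)
            unique.append(e)
    result.edges[:] = unique  # same list object, new contents


result = parse("schema.sql", "CREATE TABLE orders (\n  id INT\n);\nCREATE TABLE\n")
result.edges.extend([
    EdgeResult("schema.sql", "file", "orders", "table", "defines"),
    EdgeResult("schema.sql", "file", "orders", "table", "defines"),
])
dedupe_edges(result)
print(result.nodes, result.edges, result.errors, sep="\n")
```

Note that the parser never touches a database and the post-processing step returns nothing: it relies on the ParseResult being mutable and owned by a single pipeline.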

parse_repo_config

parse_repo_config(cfg, global_dialect=None)

Parse a repo config value into (path, dialect, dialect_overrides).

Supports both simple string paths and full config dicts:

"my-repo": "/path/to/repo"
"my-repo": {"path": "/path", "dialect": "starrocks",
            "dialect_overrides": {"athena/": "athena"}}
Source code in src/sqlprism/types.py
def parse_repo_config(
    cfg: str | dict,
    global_dialect: str | None = None,
) -> tuple[str, str | None, dict[str, str] | None]:
    """Parse a repo config value into (path, dialect, dialect_overrides).

    Supports both simple string paths and full config dicts::

        "my-repo": "/path/to/repo"
        "my-repo": {"path": "/path", "dialect": "starrocks",
                    "dialect_overrides": {"athena/": "athena"}}
    """
    if isinstance(cfg, str):
        return cfg, global_dialect, None
    return (
        cfg["path"],
        cfg.get("dialect", global_dialect),
        cfg.get("dialect_overrides"),
    )
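
A short usage sketch follows. The function body is copied from the listing above so the example runs on its own; the repo names, the "/other" path, and the "duckdb" global dialect are illustrative values.

```python
from __future__ import annotations


def parse_repo_config(
    cfg: str | dict,
    global_dialect: str | None = None,
) -> tuple[str, str | None, dict[str, str] | None]:
    """Copy of the function shown above, repeated to keep this runnable."""
    if isinstance(cfg, str):
        return cfg, global_dialect, None
    return (
        cfg["path"],
        cfg.get("dialect", global_dialect),
        cfg.get("dialect_overrides"),
    )


repos = {
    # Simple form: just a path, so the global dialect applies.
    "simple-repo": "/path/to/repo",
    # Full form: per-repo dialect plus per-directory overrides.
    "full-repo": {
        "path": "/path",
        "dialect": "starrocks",
        "dialect_overrides": {"athena/": "athena"},
    },
    # Full form without "dialect": falls back to the global dialect.
    "no-dialect": {"path": "/other"},
}

parsed = {
    name: parse_repo_config(cfg, global_dialect="duckdb")
    for name, cfg in repos.items()
}
for name, (path, dialect, overrides) in parsed.items():
    print(name, path, dialect, overrides)
```

One behaviour worth knowing: a dict config without a "path" key raises KeyError, since the function indexes cfg["path"] directly rather than using .get.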