Config File

Skippr uses a single project config file, skippr.yml. Both skippr and skipprd read it with the same full engine schema.

Example

```yaml
skippr:
  workspace: mssql_migration

pipelines:
  mssql_to_snowflake:
    # References use section-qualified names: <section>.<entry key>
    data_source: data_sources.mssql
    data_sink: data_sinks.snowflake
    cdc:
      business_key_columns: [id]

data_sources:
  mssql:
    Mssql:
      # ${VAR} values are read from the environment
      connection_string: ${MSSQL_CONNECTION_STRING}
      tables: ["dbo.customers", "dbo.orders"]
  # Not referenced by the pipeline above; shown as a CDC-capable source entry
  postgres_cdc:
    Postgres:
      connection_string: ${POSTGRES_CONNECTION_STRING}
      tables: ["public.orders"]
      cdc_mode: snapshot_then_cdc

data_sinks:
  snowflake:
    Snowflake:
      account: ${SNOWFLAKE_ACCOUNT}
      user: ${SNOWFLAKE_USER}
      database: ANALYTICS
      schema: RAW
      warehouse: COMPUTE_WH
      role: ACCOUNTADMIN
      private_key_path: ${SNOWFLAKE_PRIVATE_KEY_PATH}

schema_sinks: {}
runtime_plugins: {}
```

Top-Level Sections

| Section | Purpose |
| --- | --- |
| project | Top-level project id used for hosted React scope (project_id) and related paths (including Lance for doc vectors). |
| skippr | Workspace and internal skipprd extract/load defaults |
| pipelines | Named pipelines: ELT (data_source / data_sink, …) and doc-vector ingest (vector_source pointing at vector_sources, optional chunk fields) |
| data_sources | Source plugin configuration keyed by name |
| data_sinks | Destination plugin configuration keyed by name |
| schema_sinks | Catalog/schema plugin configuration keyed by name |
| runtime_plugins | Optional explicit runtime plugin manifest paths |
| vector_sources | Named file trees for skippr vector ingest-docs (declarative root, include, optional exclude / extensions / chunk overrides). No warehouse required. |

Storage Settings

skippr.skipprd_el_storage_mode is an internal development/testing setting that controls where skipprd extract/load state is stored (local or s3). It does not control dbt project storage, React thread logs, or vector storage for skippr model; authenticated modeling runs use the storage credentials returned by the Skippr API.

The equivalent environment variable for direct skipprd runs is SKIPPRD_EL_STORAGE_MODE.
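
As a minimal sketch, the setting would sit under the skippr section. The nesting is inferred from the dotted name skippr.skipprd_el_storage_mode, and the workspace value is reused from the example at the top of this page; treat the exact placement as an assumption rather than a confirmed layout.

```yaml
skippr:
  workspace: mssql_migration
  # Internal development/testing setting (assumed placement, inferred from
  # the dotted name). Values named on this page: local or s3.
  skipprd_el_storage_mode: local
```

For direct skipprd runs, setting SKIPPRD_EL_STORAGE_MODE in the environment is the documented equivalent.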

Pipelines

Each pipeline references registry entries by section-qualified name:

```yaml
pipelines:
  ingest_orders:
    data_source: data_sources.postgres
    data_sink: data_sinks.iceberg
```

Use skippr discover --pipeline ingest_orders to persist metadata, then skippr sync --pipeline ingest_orders --once to load data. If metadata is missing, sync runs discovery automatically before loading. Run skippr model --pipeline ingest_orders after sync when you are ready to generate and validate dbt assets.

Documentation vectors (vector_sources)

For marketing docs, internal knowledge bases, or other text-first corpora, define vector_sources plus a pipelines entry (conventionally named vector_ingest) whose vector_source is set to the vector_sources key you want to ingest. Then run skippr vector ingest-docs, adding --pipeline if your pipeline uses a different name. This path uses the same Skippr authentication and tenant S3 layout as modeling, but does not require data_sinks or a warehouse. See the vector CLI reference for flags and examples.

```yaml
pipelines:
  vector_ingest:
    vector_source: web_docs
    chunk_chars: 1200
    chunk_overlap: 120

vector_sources:
  web_docs:
    root: docs
    include: ["**/*.md"]
```
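
The optional fields listed for vector_sources can narrow the file tree further. The sketch below assumes that exclude takes glob patterns in the same form as include and that extensions takes a list of file suffixes. The key names come from the Top-Level Sections table, but their value shapes are not confirmed on this page. The names internal_kb and kb_ingest and the paths are hypothetical.

```yaml
vector_sources:
  internal_kb:              # hypothetical source name
    root: knowledge-base    # hypothetical directory
    include: ["**/*.md"]
    # Assumed shape: glob patterns, mirroring include.
    exclude: ["drafts/**"]
    # Assumed shape: list of suffixes; a leading dot may or may not be expected.
    extensions: ["md"]

pipelines:
  kb_ingest:                # not the conventional vector_ingest name
    vector_source: internal_kb
```

Because this pipeline is not named vector_ingest, the ingest command would need --pipeline to select it.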

Plugin Entries

Plugin sections use the plugin name as the single key under each named entry:

```yaml
data_sources:
  postgres:
    Postgres:
      connection_string: ${POSTGRES_CONNECTION_STRING}
      tables: ["public.orders"]
```

Destination entries follow the same shape:

```yaml
data_sinks:
  iceberg:
    Iceberg:
      table_namespace: analytics
      table_location_prefix: s3://my-bucket/warehouse
      catalog:
        type: glue
        warehouse: s3://my-bucket/warehouse
        database: analytics
        region: us-east-1
```

CDC-capable source plugins use cdc_mode to choose how source reads begin:

| Value | Behavior |
| --- | --- |
| snapshot | Bounded snapshot only. |
| snapshot_then_cdc | Full initial snapshot, then native CDC stream. |
| cdc_only | Native CDC stream only, with no initial snapshot. |
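
For instance, a source that should stream changes without an initial snapshot could set cdc_mode to cdc_only. This sketch reuses the Postgres source shape from the example at the top of this page and assumes data_sinks.snowflake is defined as it is there. The pipeline name orders_cdc is hypothetical, and pairing cdc.business_key_columns with this mode is an assumption carried over from that example.

```yaml
pipelines:
  orders_cdc:                       # hypothetical pipeline name
    data_source: data_sources.postgres_cdc
    data_sink: data_sinks.snowflake # assumed to be defined as in the top example
    cdc:
      business_key_columns: [id]    # assumed to apply, as in the top example

data_sources:
  postgres_cdc:
    Postgres:
      connection_string: ${POSTGRES_CONNECTION_STRING}
      tables: ["public.orders"]
      # Native CDC stream only; no initial snapshot is taken.
      cdc_mode: cdc_only
```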

Modeling Settings

skippr model --pipeline <name> derives modeling provider settings from the selected pipeline's data_sink entry and built-in defaults for catalog, dbt, vector storage, and LLM settings. Do not add react:, top-level providers:, top-level dbt:, or skippr.tenant to CLI-owned skippr.yml files. Tenant identity comes from authenticated Skippr credentials. By default, model resumes the latest modeling thread for the pipeline; use skippr model --pipeline <name> --no-resume to start a fresh thread.

Environment Variables

Use ${ENV_VAR} syntax for secrets and deployment-specific values:

```yaml
data_sources:
  mssql:
    Mssql:
      connection_string: ${MSSQL_CONNECTION_STRING}
```

Keep secure values in the environment or your secret manager, not in skippr.yml.