
Core Concepts

A deeper look at what Skippr does under the hood and why it works the way it does.

Bronze / Silver / Gold

Skippr organises data into three tiers inside your warehouse, following the medallion architecture pattern used by most modern data teams:

Tier   | What lives here                                          | Who creates it
Bronze | Raw extracted data, exactly as it appeared in the source | skippr extract-and-load
Silver | Cleaned, typed, and renamed staging models               | dbt (AI-assisted)
Gold   | Business-ready marts and aggregations                    | dbt (AI-assisted, then yours to extend)

Each tier gets its own schema (e.g. RAW, project_silver, project_gold), keeping raw ingestion cleanly separated from transformed layers. You can query any tier directly.
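The tier-to-schema naming above can be sketched as a simple mapping. This is illustrative only: the exact schema names (RAW, project_silver, project_gold) follow the examples in this section and may be configurable in practice.

```python
# Sketch of the tier -> schema naming convention shown above.
# The names here mirror the examples (RAW, project_silver, project_gold)
# and are assumptions, not a guaranteed contract.
def tier_schema(project, tier):
    return {
        "bronze": "RAW",                 # raw ingestion, shared across projects
        "silver": f"{project}_silver",   # cleaned staging models
        "gold": f"{project}_gold",       # business-ready marts
    }[tier]

print([tier_schema("shop", t) for t in ("bronze", "silver", "gold")])
# ['RAW', 'shop_silver', 'shop_gold']
```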

Schema Discovery and Evolution

During the Discover phase, skippr reads source metadata to learn table names, column names, and data types. No manual DDL, schema registry, or YAML mapping is required for the first run.

How discovery works per source type:

  • Databases -- read catalog metadata such as table and column definitions.
  • Object stores -- sample structured files and infer column names and types.
  • Streams and messaging -- infer structure from incoming records.
  • HTTP and network inputs -- infer structure from payload shape.
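For the sampling-based cases (object stores, streams, payloads), inference boils down to unioning keys across sampled records and picking the narrowest type that fits every observed value. A minimal sketch, with hypothetical helper names and a deliberately small type lattice:

```python
# Minimal sketch of schema inference from sampled records.
# Function names and the type lattice are illustrative assumptions,
# not Skippr's actual discovery API.
def infer_type(values):
    """Pick the narrowest type that fits every observed non-null value."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "string"  # nothing observed: fall back to string
    if all(isinstance(v, bool) for v in non_null):
        return "boolean"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in non_null):
        return "bigint"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "double"  # mixed int/float widens to double
    return "string"

def discover_schema(records):
    """Union of keys across sampled records, each mapped to an inferred type."""
    columns = {}
    for rec in records:
        for key, value in rec.items():
            columns.setdefault(key, []).append(value)
    return {key: infer_type(vals) for key, vals in columns.items()}

sample = [
    {"id": 1, "amount": 9.99, "note": None},
    {"id": 2, "amount": 5, "active": True},
]
print(discover_schema(sample))
# {'id': 'bigint', 'amount': 'double', 'note': 'string', 'active': 'boolean'}
```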

Schema handling is deterministic. When source data changes shape, Skippr prefers additive evolution over silent destructive rewrites:

  • compatible new fields are added
  • nested structures are preserved as structured fields where possible
  • incompatible type shifts are surfaced as explicit schema evolution instead of pretending the old column still means the same thing
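The additive-evolution rules above can be sketched as a merge of two schemas. This is a simplified model, assuming schemas as flat name-to-type dicts; `evolve()` is a hypothetical helper, not Skippr's interface:

```python
# Sketch of additive schema evolution: new columns are added, while an
# existing column whose type changes is surfaced as an explicit conflict
# rather than silently rewritten. Simplified assumption: schemas are
# flat name -> type dicts.
def evolve(current, incoming):
    """Return (evolved schema, list of incompatible type shifts)."""
    evolved = dict(current)
    conflicts = []
    for column, new_type in incoming.items():
        old_type = evolved.get(column)
        if old_type is None:
            evolved[column] = new_type  # compatible new field: add it
        elif old_type != new_type:
            # surface the shift explicitly; keep the old column meaning intact
            conflicts.append((column, old_type, new_type))
    return evolved, conflicts

current = {"id": "bigint", "amount": "double"}
incoming = {"id": "string", "amount": "double", "status": "string"}
evolved, conflicts = evolve(current, incoming)
print(evolved)    # status added; id keeps its old type pending explicit evolution
print(conflicts)  # [('id', 'bigint', 'string')]
```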

Deterministic vs AI-Assisted Work

The important trust boundary is that ingestion correctness is deterministic and reviewable.

Deterministic responsibilities

  • schema discovery and destination mapping
  • type reconciliation and evolution handling
  • incremental checkpoints and replay behavior
  • CDC reconciliation logic such as business keys, order tokens, and tombstones

AI-assisted responsibilities

  • dbt model and test scaffolding
  • naming and staging-model structure
  • descriptive metadata and model documentation

By default, only schema metadata is sent to the model. Data samples are optional and off by default.
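That boundary is easy to picture as the shape of the payload handed to the model: column names and types by default, with row samples included only on explicit opt-in. A sketch with illustrative names (this is not Skippr's actual request format):

```python
# Sketch of the metadata-only trust boundary: what reaches the model is
# table and column metadata, never row values unless sampling is opted in.
# Function and field names are illustrative assumptions.
def build_model_context(table_name, schema, rows=None, include_samples=False):
    context = {
        "table": table_name,
        "columns": [{"name": n, "type": t} for n, t in schema.items()],
    }
    if include_samples and rows:
        context["samples"] = rows[:5]  # opt-in only; off by default
    return context

ctx = build_model_context(
    "orders", {"id": "bigint", "amount": "double"}, rows=[{"id": 1, "amount": 9.99}]
)
print("samples" in ctx)  # False: row data stays out unless explicitly enabled
```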

Incremental Sync and CDC Correctness

Skippr makes two related but distinct correctness guarantees:

  • Incremental sync -- the extract-and-load engine tracks source progress so reruns only process new or changed data.
  • CDC final-state reconciliation -- supported CDC source and destination pairs converge on the correct final table state using business keys, order tokens, and tombstones.

For batch and incremental sync, progress is only advanced when the corresponding load has been durably committed. For CDC, the committed change batch is the authority for resume and replay. See CDC Guarantees for the exact contract.
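The CDC side of this contract can be sketched as a last-writer-wins merge keyed on the business key: a change only applies if its order token is newer than what the state already holds, and deletes are recorded as tombstones. Because stale tokens are skipped, replaying a committed batch converges on the same final state. Names and event shapes here are illustrative assumptions, not Skippr's wire format:

```python
# Sketch of CDC final-state reconciliation with business keys, order tokens,
# and tombstones. Replay-safe: a change only wins if its order token is newer.
def apply_changes(state, changes):
    """state maps business key -> (order_token, row or None); None = tombstone."""
    for change in changes:
        key, token = change["key"], change["token"]
        current = state.get(key)
        if current is not None and current[0] >= token:
            continue  # stale or duplicate change: skipping makes replay safe
        if change.get("deleted"):
            state[key] = (token, None)  # tombstone remembers the delete
        else:
            state[key] = (token, change["row"])
    return state

batch = [
    {"key": 1, "token": 10, "row": {"id": 1, "v": "a"}},
    {"key": 1, "token": 12, "row": {"id": 1, "v": "b"}},
    {"key": 2, "token": 11, "row": {"id": 2, "v": "x"}},
    {"key": 2, "token": 13, "deleted": True},
]
state = apply_changes({}, batch)
state = apply_changes(state, batch)  # replaying the committed batch: no change
final = {k: row for k, (_, row) in state.items() if row is not None}
print(final)  # {1: {'id': 1, 'v': 'b'}}; key 2 ends as a tombstone
```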

Data Privacy and Trust Boundary

Row-level data only ever exists in two places: the machine running skippr, and your destination.

  • Source data is read locally and written directly to the destination.
  • AI modeling uses metadata by default. Data samples are optional and off by default.
  • Skippr's cloud path handles authentication and control-plane services, not row-level source or warehouse data.
  • Credentials live in environment variables, never in config files.
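Keeping credentials in environment variables means config files can be committed without secrets. A trivial sketch of the pattern; the variable name is illustrative, not one Skippr documents:

```python
# Sketch of env-var credential loading, keeping secrets out of config files.
# The variable name is an illustrative assumption, not a documented one.
import os

def load_credential(name):
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"missing required environment variable {name}")
    return value

os.environ["EXAMPLE_WAREHOUSE_PASSWORD"] = "s3cret"  # normally set by your shell or CI
print(load_credential("EXAMPLE_WAREHOUSE_PASSWORD"))
```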

dbt Integration

Skippr generates a standard, fully functional dbt project:

  • dbt_project.yml and profiles.yml -- auto-configured for your warehouse
  • models/schema.yml -- source definitions pointing at bronze tables
  • models/staging/stg_*.sql -- silver models with type casting and renaming
  • packages.yml -- any required dbt packages

After the pipeline runs, the project is yours. Add tests, snapshots, custom gold models, or plug it into your existing dbt CI/CD -- it's standard dbt, nothing proprietary.

Supported Connectors

Sources

Category      | Source                 | Kind identifier
Databases     | Microsoft SQL Server   | mssql
              | MySQL                  | mysql
              | PostgreSQL             | postgres
              | Amazon Redshift        | redshift
              | MongoDB                | mongodb
              | Amazon DynamoDB        | dynamodb
              | ClickHouse             | clickhouse_source
              | MotherDuck             | motherduck_source
Object Stores | Amazon S3              | s3
              | SFTP                   | sftp
              | Delta Lake             | delta_lake
Streaming     | Kafka                  | kafka
              | Amazon SQS             | sqs
              | Amazon Kinesis         | kinesis
              | AMQP (RabbitMQ)        | amqp
              | Amazon SNS             | sns
              | Amazon EventBridge     | eventbridge
              | MQTT                   | mqtt
              | WebSocket              | websocket
HTTP          | HTTP Client            | http_client
              | HTTP Server            | http_server
Other         | Socket (TCP/UDP/Unix)  | socket
              | StatsD                 | statsd
              | Local File             | file
              | Stdin                  | stdin

Destinations

Category      | Destination            | Kind identifier
Warehouses    | Snowflake              | snowflake
              | Google BigQuery        | bigquery
              | PostgreSQL             | postgres
              | AWS Athena (S3 + Glue) | athena
              | Databricks             | databricks
              | Azure Synapse          | synapse
              | Amazon Redshift        | redshift
              | ClickHouse             | clickhouse
              | MotherDuck             | motherduck
Cloud Storage | Google Cloud Storage   | gcs
              | Azure Blob Storage     | azure_blob
              | SFTP                   | sftp
Messaging     | AMQP (RabbitMQ)        | amqp
Other         | Local File             | file
              | Stdout                 | stdout