Source landing semantics
Some sources—especially API and SaaS connectors—declare how each namespace’s data must land in the warehouse. Skippr calls this a namespace contract. Contracts are separate from column types discovered by skippr discover.
Namespace contracts
For every table (namespace) a source can emit, the source plugin publishes a contract that includes:
| Field | Meaning |
|---|---|
| Primary key | Logical row identity (business dimensions plus identifiers such as property_id and date) |
| Partition key | Columns that define a physical slice for partition-scoped writes (often date for daily reports) |
| Write policy | How the configured data sink must apply each batch |
| Refresh window | Optional number of days to re-fetch before the checkpoint (for APIs that revise past days) |
| Semantics | Descriptor such as mutable_report (informational; does not change engine behavior by itself) |
The host validates contracts when a source starts and checks that your pipeline’s data sink supports every declared write policy.
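The startup check described above can be pictured with a small Python sketch. Everything here is hypothetical and for illustration only: `NamespaceContract`, `validate_contract`, and the field names are not Skippr's plugin API, and the rule that `replace_partition` needs a non-empty partition key is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

# Policy names as listed on this page.
WRITE_POLICIES = {"append", "merge_by_key", "replace_partition", "replace_table"}


@dataclass(frozen=True)
class NamespaceContract:
    """Hypothetical model of a namespace contract (not Skippr's real types)."""
    namespace: str
    primary_key: tuple
    partition_key: tuple
    write_policy: str
    refresh_window_days: Optional[int] = None
    semantics: Optional[str] = None  # informational only


def validate_contract(contract, sink_policies):
    """Return a list of problems; an empty list means the pairing is valid."""
    problems = []
    if contract.write_policy not in WRITE_POLICIES:
        problems.append(f"unknown write policy: {contract.write_policy}")
    elif contract.write_policy not in sink_policies:
        problems.append(f"sink does not support {contract.write_policy}")
    # Assumption: a partition-scoped write needs at least one partition column.
    if contract.write_policy == "replace_partition" and not contract.partition_key:
        problems.append("replace_partition requires a partition key")
    if contract.refresh_window_days is not None and contract.refresh_window_days < 0:
        problems.append("refresh window must be non-negative")
    return problems
```

Under these assumptions, a GA4-style contract with `write_policy="replace_partition"` passes against a sink that lists `replace_partition` and fails against an append-only sink.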
Write policies
| Policy | Use when | Typical sources |
|---|---|---|
| append | Rows are immutable or append-only | Event streams, webhooks |
| merge_by_key | You need current state by business key | Entity snapshots (sink must support merge) |
| replace_partition | A partition can be fully rewritten when numbers change | GA4 daily reports, other mutable report APIs |
| replace_table | Small, bounded full snapshots | Reference dumps |
Mutable reports: APIs like GA4 can change metrics for dates you already synced. Appending the revised rows would leave duplicates or conflicting values alongside the old ones. replace_partition tells the sink to drop the partition for each of the batch's partition key values (for example date=2024-01-15) and rewrite it with the new Parquet files.
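A toy in-memory sketch of the difference between appending and replacing a partition follows. The `replace_partition` function below is a hypothetical helper that works on lists of dicts; a real sink operates on files or table partitions, not Python lists.

```python
def replace_partition(table, batch, partition_key):
    """Return a new row list in which every partition present in `batch`
    is dropped from `table` and rewritten with the batch's rows."""
    def part(row):
        return tuple(row[k] for k in partition_key)

    touched = {part(row) for row in batch}
    kept = [row for row in table if part(row) not in touched]
    return kept + list(batch)


table = [
    {"date": "2024-01-14", "sessions": 10},
    {"date": "2024-01-15", "sessions": 20},
]
# The API later revises the figure for 2024-01-15.
batch = [{"date": "2024-01-15", "sessions": 25}]

appended = table + batch  # two conflicting rows for 2024-01-15
replaced = replace_partition(table, batch, ("date",))  # one row per date
```

Rows in partitions the batch does not touch (here 2024-01-14) are left alone, which is what makes the write partition-scoped rather than a full table replace.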
Lookback / refresh window: The source re-pulls the last N days on each run so corrected API values overwrite the same partitions. Checkpoints record progress (for example last completed date). A checkpoint alone does not guarantee that the warehouse is correct; that also requires the matching write policy and lookback.
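One way to picture how a checkpoint and a refresh window combine is the date arithmetic below. This is a sketch under stated assumptions: `dates_to_fetch` is a hypothetical helper, the checkpoint is taken to be the last completed date, and the window is assumed to cover the N most recent completed days including the checkpoint day itself. The boundary handling in a real source may differ.

```python
from datetime import date, timedelta


def dates_to_fetch(checkpoint, today, lookback_days):
    """Dates to pull on this run: the last `lookback_days` completed days
    (ending at `checkpoint`), then every later day through `today`."""
    start = checkpoint + timedelta(days=1) - timedelta(days=lookback_days)
    span = (today - start).days + 1
    return [start + timedelta(days=i) for i in range(max(span, 0))]


# Checkpoint at 2024-01-15 with a 3-day window, run on 2024-01-17:
# re-fetch 2024-01-13..15, then fetch the new days 2024-01-16..17.
run = dates_to_fetch(date(2024, 1, 15), date(2024, 1, 17), 3)
```

With a window of 0 the run starts the day after the checkpoint, so revised values for earlier dates would never be picked up.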
Destination pairing
| Data sink | replace_partition | Notes |
|---|---|---|
| Athena (S3 + Glue) | Supported | Deletes the contract partition prefix in S3, then writes new files; Glue partition keys follow the contract |
| Iceberg | Supported | Native partition replace |
| Append-only sinks | Not suitable for mutable report sources | Pipeline validation fails if the sink cannot honor the contract |
Pair mutable report sources with Athena or Iceberg and run skippr discover after the first sync so Glue/Iceberg schemas include contract partition columns.
Discover vs contracts
| Layer | Controls |
|---|---|
| Namespace contract | How batches land (append vs replace partition, partition columns) |
| Arrow / discovered schema | Column names and types in bronze |
Contract partition keys do not add columns by themselves; dimensions such as date must appear in the ingested records. Discovery and schema sinks align catalog DDL with both the contract and observed data.
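Because the contract does not add columns, it can help to confirm that records actually carry the partition columns before relying on partition-scoped writes. The check below is a minimal sketch; `missing_partition_columns` is a hypothetical helper and not part of Skippr.

```python
def missing_partition_columns(records, partition_key):
    """Return the partition columns that are absent (or None) in any record."""
    missing = set()
    for record in records:
        for column in partition_key:
            if record.get(column) is None:
                missing.add(column)
    return sorted(missing)


records = [
    {"property_id": "123", "date": "2024-01-15", "sessions": 25},
    {"property_id": "123", "sessions": 7},  # no date dimension requested
]
problems = missing_partition_columns(records, ("date",))
```

A non-empty result means the source query needs to include that dimension; declaring it in the contract alone would not be enough.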
Example: Google Analytics 4
GA4 curated daily namespaces use replace_partition on date, with lookback_days driving the refresh window. See Google Analytics (GA4) and Athena.
For plugin authors, see the maintainer guide API / SaaS source plugins in the skipprd repository.
