ClickHouse for Advertising Analytics: Lessons from High-Cardinality Data
Advertising analytics is a perfect stress test for databases. The data looks simple at first: date, campaign, ad group, keyword, spend, clicks, sales. Then you add thousands of advertisers, multiple marketplaces, placement breakdowns, search terms, hourly metrics, attribution windows, and suddenly every dashboard query has high-cardinality dimensions.
ClickHouse is very good at this workload, but only when the table design respects how ClickHouse reads data. The first version of a schema can feel fast in development and become painful once the real cardinality arrives.
Model Around Query Patterns
For analytics tables, I start from the dashboards and API endpoints. Which dimensions are always filtered? Which dimensions are grouped? Which time ranges are common? The answers drive partitioning, ordering, and materialized views.
CREATE TABLE campaign_daily_metrics
(
event_date Date,
company_id UInt64,
profile_id UInt64,
campaign_id UInt64,
marketplace LowCardinality(String),
campaign_type LowCardinality(String),
impressions UInt64,
clicks UInt64,
cost Decimal(18, 6),
sales Decimal(18, 6)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (company_id, profile_id, event_date, campaign_id);
The order key matters. If most queries filter by company and profile before date, putting those columns first lets ClickHouse skip large parts of the dataset.
High Cardinality Is Not Always Bad
A common misconception is that high-cardinality columns are automatically a problem. They are a problem when they appear in the wrong part of the query path. A `search_term` column with millions of values is fine if it lives in a table designed for search-term reports. It is not fine if every dashboard query scans it unnecessarily.
The pattern that worked best for us is separating broad dashboard tables from deep analysis tables. Dashboards use compact, pre-aggregated daily tables. Drill-down pages use more detailed fact tables.
Pre-Aggregation Is A Product Feature
Pre-aggregation is not only a database optimization. It changes product latency. A dashboard that loads in 400ms feels like a different product from one that takes six seconds.
CREATE MATERIALIZED VIEW campaign_monthly_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(month)
ORDER BY (company_id, profile_id, month, campaign_id)
AS
SELECT
toStartOfMonth(event_date) AS month,
company_id,
profile_id,
campaign_id,
sum(impressions) AS impressions,
sum(clicks) AS clicks,
sum(cost) AS cost,
sum(sales) AS sales
FROM campaign_daily_metrics
GROUP BY month, company_id, profile_id, campaign_id;
Operational Lessons
- Keep raw ingestion tables separate from serving tables.
- Use `LowCardinality(String)` for repeated dimensions like marketplace, campaign type, and status.
- Design materialized views for the product screens people actually open every day.
- Measure query plans after realistic data volume, not after a local seed file.
- Make backfills idempotent because advertising APIs will force you to rerun imports.
ClickHouse rewards clarity. If the schema reflects access patterns, it feels almost unfairly fast. If the schema tries to be a generic warehouse for every possible question, the product pays for that ambiguity on every request.
Comments
Post a Comment