Methodology: from raw orders to lift

This page explains how Dema computes an incrementality result, end to end. It answers three questions customers ask most often: at what level is inference applied, how is regional data aggregated, and why can numbers differ depending on the metric or the chart you’re looking at.

From raw orders to a result

Every incrementality result is produced by the same pipeline. Understanding the steps makes the outputs — and any apparent discrepancies — much easier to reason about.

Orders are located

Each order is mapped to a geographic location from its delivery or billing postal code. Postal codes are normalized first (for example, UK postcodes are reduced to the outward code) so that every order lands in a consistent geographic unit. Orders can optionally be mapped by the location where they were placed instead, which in some cases improves model quality.
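As an illustration of the normalization step, here is a minimal sketch of reducing a UK postcode to its outward code. The function name, the country-code argument, and the regular expression are assumptions made for this example, not Dema's actual implementation.

```python
import re

# Hypothetical helper: outward code (2-4 characters), optionally followed
# by the 3-character inward code (one digit plus two letters).
_UK_POSTCODE = re.compile(r"([A-Z]{1,2}\d[A-Z\d]?)(\d[A-Z]{2})?")

def normalize_postal_code(raw: str, country: str) -> str:
    """Return a coarse, consistent geographic key for a postal code."""
    code = re.sub(r"\s+", "", raw).upper()  # strip spaces, unify case
    if country == "GB":
        match = _UK_POSTCODE.fullmatch(code)
        if match:
            return match.group(1)  # keep only the outward code
    return code  # other countries: assumed to need no further reduction
```

With this sketch, "sw1a 1aa" and "SW1A1AA" both map to "SW1A", so orders written in different styles land in the same unit.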

Locations are grouped into zones

Postal codes are clustered into zones — commute zones, DMAs, or the regional units appropriate for your market. Zones reflect real-world shopping behavior (people shop across city boundaries), so they make better treatment and control units than rigid administrative lines.

Data is aggregated by zone and period

Within each zone, orders are summed into periods for the metric being measured — daily by default, with a weekly option. The result is a balanced panel: one value per zone per period. Periods with no orders in a zone are filled with zero so the time series is continuous.
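The balanced panel described above can be sketched as follows. This is an illustrative example only: the function name and the (zone, day, value) order tuples are assumptions, and it covers the daily cadence.

```python
from datetime import date

def build_daily_panel(orders, zones, start, end):
    """Sum order values into one number per zone per day.

    orders: iterable of (zone, day, value) tuples (assumed shape).
    Returns {zone: [value for each day from start to end inclusive]}.
    """
    n_days = (end - start).days + 1
    # Start from zeros so days without orders are already zero-filled.
    panel = {zone: [0.0] * n_days for zone in zones}
    for zone, day, value in orders:
        if zone in panel and start <= day <= end:
            panel[zone][(day - start).days] += value
    return panel
```

Every zone ends up with a series of the same length, which is what makes the panel balanced.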

A synthetic control is built

The treatment zones are combined into a single treated series. Dema then builds a synthetic control: a weighted blend of the remaining (control) zones chosen to track the treated series as closely as possible during the pre-test period. See How incrementality testing works for the intuition behind synthetic control.
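To make the idea of a "weighted blend" concrete, the sketch below fits control-zone weights on pre-test data. It assumes the classic synthetic control constraints (weights are non-negative and sum to one) and uses a simple Frank-Wolfe loop to minimize squared pre-period error. This page does not state Dema's actual constraints or optimizer, so treat both as stand-ins chosen for illustration.

```python
def fit_synthetic_control(treated, controls, iterations=500):
    """Fit blend weights on pre-test data (illustrative sketch).

    treated:  list of pre-period values for the combined treated series.
    controls: {zone: list of pre-period values}, same length as treated.
    Returns {zone: weight} with weights >= 0 that sum to 1 (assumed constraints).
    """
    names = list(controls)
    n = len(treated)
    weights = {z: 1.0 / len(names) for z in names}  # start from an equal blend
    blend = [sum(weights[z] * controls[z][t] for z in names) for t in range(n)]
    for _ in range(iterations):
        residual = [b - y for b, y in zip(blend, treated)]
        # Gradient of the squared error per weight (up to a factor of 2).
        grad = {z: sum(r * c for r, c in zip(residual, controls[z])) for z in names}
        best = min(names, key=grad.get)  # zone that reduces the error fastest
        direction = [controls[best][t] - blend[t] for t in range(n)]
        denom = sum(d * d for d in direction)
        if denom == 0:
            break
        # Exact line search for a quadratic objective, clipped to [0, 1].
        step = -sum(r * d for r, d in zip(residual, direction)) / denom
        step = max(0.0, min(1.0, step))
        if step == 0.0:
            break  # no improving direction left
        weights = {z: (1 - step) * w for z, w in weights.items()}
        weights[best] += step
        blend = [b + step * d for b, d in zip(blend, direction)]
    return weights
```

The fitted weights are then applied to the control zones during the test period to produce the counterfactual series for the treated zones.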

Lift is measured

During the test, the lift for each period is the gap between what the treatment zones actually did and what the synthetic control predicts they would have done without the change. Summed over the test, that gap is your incremental value; set against spend, it yields your incremental ROAS, epROAS, or CAC.
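The arithmetic of this step can be sketched in a few lines. The function and argument names are hypothetical, and only incremental ROAS is shown; the exact definitions Dema uses for epROAS and CAC are not given on this page.

```python
def summarize_lift(actual, predicted, spend):
    """Per-period lift, total incremental value, and incremental ROAS.

    actual:    treated-series values for each test period.
    predicted: synthetic control values for the same periods.
    spend:     incremental spend attributed to the test (assumed input).
    """
    lift_per_period = [a - p for a, p in zip(actual, predicted)]
    incremental_value = sum(lift_per_period)
    # Incremental ROAS: incremental value per unit of spend.
    incremental_roas = incremental_value / spend if spend else None
    return lift_per_period, incremental_value, incremental_roas
```

For example, actual values of 120, 130, 125 against predictions of 100, 110, 115 give per-period lifts of 20, 20, 10 and an incremental value of 50; on a spend of 100 that is an incremental ROAS of 0.5.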

At what level inference runs

Inference always runs at the geographic-zone level, on data aggregated by period (daily by default, with a weekly option). It does not run at the individual-user or individual-campaign level. This is deliberate. A geo experiment compares whole regions, so it captures the total effect of a marketing change — including effects that user-level platform studies miss (people who were influenced but never clicked, cross-device journeys, and offline or word-of-mouth spillover). For why this is more trustworthy than platform-reported, user-level lift studies, see Platform lift studies.

Each metric is measured independently

When you switch the metric dropdown (Gross Sales, Net Sales, Net Gross Profit 2, New / Returning Customer profit, New Customer Count), Dema does not re-scale a single shared result. Each metric is measured by its own, independent synthetic control model, fitted on that metric’s own zone-period panel.

Because each metric is a separate model, the lift, confidence interval, and p-value for one metric do not mechanically follow from another. It is normal and expected for, say, Gross Sales and Net Gross Profit 2 to show different magnitudes — and occasionally different signs — for the same test. A profit metric can move differently from a revenue metric because returns, discounts, and margins behave differently across regions. This is a feature of measuring what you actually care about, not an inconsistency in the data.

How significance is established

A measured gap between treatment and control is only meaningful if it is unlikely to have happened by chance. Dema establishes this with a randomization (permutation) test rather than a textbook formula that assumes a particular data distribution:

Dema repeatedly reshuffles the timing of the observed differences to simulate what “no real effect” would look like for your specific data.
The p-value is the share of those random arrangements that look at least as extreme as your actual result. A low p-value means a gap this large rarely appears by chance.
The confidence interval shown on the charts expresses the same uncertainty as a range around the estimate. See Understand test results for how to read it.
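The reshuffling idea can be sketched as below. This is a simplified illustration under stated assumptions: the test statistic is the absolute summed gap over the test periods, each shuffle treats all periods as interchangeable, and the names are hypothetical. Dema's actual statistic and shuffling scheme may differ.

```python
import random

def permutation_p_value(gaps, n_test, n_permutations=5000, seed=0):
    """Share of random re-timings at least as extreme as the observed result.

    gaps:   treated minus synthetic control for every period, with the
            pre-test periods first and the n_test test periods last.
    n_test: number of periods in the test window.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(sum(gaps[-n_test:]))
    shuffled = list(gaps)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # reassign which gaps fall in the "test" slot
        if abs(sum(shuffled[-n_test:])) >= observed:
            extreme += 1
    return extreme / n_permutations
```

If the gaps during the test are much larger than anything seen before it, almost no shuffle reproduces the observed total and the p-value is close to zero. If the gaps look the same everywhere, most shuffles match it and the p-value is close to one.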

The quality of the underlying match is reported separately, so you can judge whether the synthetic control was a good fit before trusting a result. Those diagnostics are covered in Analyze a suggested experiment.

Why numbers can look different

It’s common to compare two views and notice they don’t tie out exactly. In almost every case this traces to one of the following — none of which means a result is wrong.

The chart is a focused view; the model learns from much more

The treatment and control lines on the results charts are a zoomed-in view around the test window: they show roughly the test period plus a short lead-in, because that’s the part you want to inspect. The synthetic control model itself is fitted on a far longer stretch of history (on the order of a year) so it can learn each zone’s seasonality and trend. If you sum the values visible on the chart, you are therefore summing the display window, not the full history the model used. The headline incremental value is computed by the model over the test window using that longer-trained baseline, so it is not meant to equal a hand-sum of the visible chart points.

Results settle for a few days after the test

Order data is not final the moment an order is placed. Refunds, cancellations, late-arriving conversions, and geographic enrichment all continue to settle for a few days. As a result:

Pre-test history is stable — it has long since finalized, so the model’s baseline and its fit quality barely move between runs.
The most recent test/post-test days keep adjusting as data finalizes, which is why the incremental value and ROAS can shift slightly if you re-open a result immediately after the test versus a few days later.

If you need to compare two numbers exactly, compare results computed at the same time and over the same window. Reading the headline metric a few days after the test period closes gives the most stable figure.

Aggregation cadence

Inference aggregates orders into periods (daily by default, or weekly). If you compare against a report built on a different cadence, totals can differ slightly even though they describe the same underlying orders.
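One plausible way a cadence mismatch shows up is week boundaries: a report built on whole calendar weeks covers more days than a test window that starts or ends mid-week. The sketch below illustrates that effect with made-up data; the function names and the Monday-to-Sunday week are assumptions for this example.

```python
from datetime import date, timedelta

def window_total(daily, start, end):
    """Sum daily values for the exact window start..end (inclusive)."""
    return sum(v for d, v in daily.items() if start <= d <= end)

def whole_week_total(daily, start, end):
    """Sum every Monday-Sunday week that overlaps the window."""
    week_start = start - timedelta(days=start.weekday())  # Monday on or before start
    week_end = end + timedelta(days=6 - end.weekday())    # Sunday on or after end
    return window_total(daily, week_start, week_end)

# Made-up data: 100 per day for three weeks starting Monday 2024-01-01.
daily = {date(2024, 1, 1) + timedelta(days=i): 100.0 for i in range(21)}
test_start, test_end = date(2024, 1, 3), date(2024, 1, 12)  # Wednesday to Friday
```

Here the exact 10-day window totals 1000, while the two calendar weeks that overlap it total 1400, even though both figures come from the same orders.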

Each metric is its own model

As covered above, switching the metric re-fits a separate model. Differences between metrics are expected, not a sign of instability.

For any result you’re unsure about, the fastest check is the model-quality diagnostics on the experiment — they tell you directly whether the synthetic control was a trustworthy match. See Analyze a suggested experiment.
