Distribution change

Overview

The Sifflet Distribution Change Monitor Template is a field-profiling monitor that detects changes in data distribution. It uses statistical models to automatically detect distribution changes at the field level, based on a rolling or a fixed reference. It can be used for numerical or categorical fields.

How to

How does it work

The Sifflet Distribution Change Monitor computes data distributions at two given points in time and uses Pearson's Chi-Square (χ²) test statistic to capture any statistically significant distribution change between them. When the difference meets the configured conditions, it creates an Incident and sends Notifications.
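To make the comparison concrete, here is a minimal sketch of a Pearson chi-square statistic computed between a reference distribution and a current one. It is an illustration only and not Sifflet's implementation: the function name `chi_square_statistic`, the use of the goodness-of-fit form (reference proportions scaled to the current total), and the handling of categories missing from the reference are all assumptions.

```python
def chi_square_statistic(reference, current):
    """Pearson chi-square statistic of `current` counts against the
    proportions observed in `reference` (hypothetical helper).

    Both arguments map a category to its row count.
    """
    ref_total = sum(reference.values())
    cur_total = sum(current.values())
    statistic = 0.0
    for category in set(reference) | set(current):
        observed = current.get(category, 0)
        # Expected count if the current data followed the reference proportions.
        expected = cur_total * reference.get(category, 0) / ref_total
        if expected == 0:
            if observed > 0:
                # Assumption: a category unseen in the reference is treated
                # as an unbounded deviation.
                return float("inf")
            continue
        statistic += (observed - expected) ** 2 / expected
    return statistic


# A 50/50 split that moves to 70/30 gives a large statistic.
print(chi_square_statistic({"A": 50, "B": 50}, {"A": 70, "B": 30}))
```

A larger statistic means the current counts sit further from what the reference proportions would predict. Turning the statistic into a pass or fail decision requires a significance level and the degrees of freedom, which this sketch leaves out.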

Exactly what data is used for each of those distributions depends on a set of parameters, most importantly the reference date mode (fixed or rolling), described under "How to configure".

How to read graphs

Results are presented in a graphical form. Depending on the type of analysed data, the graphs vary slightly.

Categorical values

For categorical values, all categories are displayed next to each other, sorted in descending order by their count in the current data.

Numerical values

Numerical values are divided into buckets. All buckets are displayed next to each other, sorted in descending order by their count in the current data.
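As an illustration of what bucketing means, the sketch below counts numerical values per bucket. How Sifflet actually chooses bucket boundaries is not stated here, so the equal-width strategy, the function name `bucket_counts`, and its parameters are assumptions.

```python
def bucket_counts(values, low, high, n_buckets):
    """Count values per equal-width bucket over [low, high] (hypothetical helper)."""
    width = (high - low) / n_buckets
    counts = [0] * n_buckets
    for value in values:
        index = int((value - low) / width)
        # Clamp so that `high` itself, and any out-of-range value,
        # lands in the first or last bucket.
        index = max(0, min(index, n_buckets - 1))
        counts[index] += 1
    return counts


# Five buckets of width 2 over the range 0 to 10.
print(bucket_counts([0, 1, 2, 5, 9, 10], low=0, high=10, n_buckets=5))
```

Once values are bucketed, each bucket can be treated like a category, so the same comparison applies to numerical and categorical fields.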

📘

Limit on the number of categories

Sifflet limits the number of categories to 1000. Above this threshold, the rule ends with a "technical error" status.

How to configure

The template "Distribution Change" can be found in the "Field Profiling" Category.

Monitoring Type - Dynamic vs Static

It's possible to choose between 2 monitoring types: dynamic or static.

Dynamic

The rule fails if the statistical test finds any anomaly based on the previous trends. There is no need to define a Threshold.

Static

The rule uses a predefined threshold as its point of reference. It fails if the distribution change of at least one category exceeds the set Threshold.
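A static check of this kind can be sketched as follows. The exact measure Sifflet compares against the Threshold is not specified here, so this example assumes it is the absolute difference in each category's share of the total. The names `shares` and `categories_over_threshold` are hypothetical.

```python
def shares(counts):
    """Convert raw counts into each category's share of the total."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def categories_over_threshold(reference, current, threshold):
    """Return the categories whose share moved by more than `threshold`.

    Assumption: "distribution change" means the absolute difference
    between a category's share in the current and reference data.
    """
    ref, cur = shares(reference), shares(current)
    return sorted(
        category
        for category in set(ref) | set(cur)
        if abs(cur.get(category, 0.0) - ref.get(category, 0.0)) > threshold
    )


reference = {"A": 50, "B": 50}
current = {"A": 70, "B": 30}
failing = categories_over_threshold(reference, current, threshold=0.1)
print("rule fails" if failing else "rule passes", failing)
```

With a threshold of 0.1, both categories shift by 0.2 and the check fails. Raising the threshold to 0.25 would let the same data pass.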

Reference date - Fixed vs Rolling

This setting defines what the computed distribution will be compared to.

Fixed reference date

The computed distribution will be compared to the fixed distribution computed at a chosen specific date. This allows you to compare your current data with the data you had at a specific point in time. This is particularly useful when you don't expect your data to evolve.

Example use case

Compare the weekly distribution for every week of the year with a predefined weekly distribution from week 12 of 2018.

Rolling reference date

The computed distribution will be compared to the distribution computed at a prior date, with a chosen constant time distance between the two dates. It is useful when you want to be alerted when there are sudden changes in distribution.

Example

Compare today's daily distribution with the daily distribution from one week earlier.

"Fixed reference date" vs "Rolling reference date"

📘

Fixed vs Rolling Reference Date

Using a Rolling Reference Date captures a quick change of distribution, whereas using a Fixed Reference Date better detects a slow drift.

The two options, however, aren't mutually exclusive and can be considered complementary. Consider configuring two monitors, one using each Reference Date Mode, if you think it will work for your scenario.
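The difference between the two modes comes down to which date the reference distribution is taken from. The sketch below illustrates that choice. The function `reference_date` and its parameters are hypothetical and do not correspond to a Sifflet API.

```python
from datetime import date, timedelta


def reference_date(run_date, mode, fixed_date=None, offset=None):
    """Pick the date whose distribution serves as the reference (hypothetical helper)."""
    if mode == "fixed":
        if fixed_date is None:
            raise ValueError("fixed mode needs a fixed_date")
        # The reference never moves, so slow drift accumulates against it.
        return fixed_date
    if mode == "rolling":
        if offset is None:
            raise ValueError("rolling mode needs an offset")
        # The reference trails the run date by a constant distance,
        # so only changes within that window stand out.
        return run_date - offset
    raise ValueError(f"unknown mode: {mode}")


run = date(2024, 3, 15)
# Rolling: compare with the distribution from one week earlier.
print(reference_date(run, "rolling", offset=timedelta(days=7)))
# Fixed: always compare with the Monday of ISO week 12, 2018.
print(reference_date(run, "fixed", fixed_date=date(2018, 3, 19)))
```

Running two monitors, one per mode, amounts to evaluating the same current distribution against both of these reference dates.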

Use cases

A typical case is when the data distribution changes gradually over time, leading to data drift. Another frequent case is when a new value is added to a categorical field, which can break downstream dependencies if not accounted for.

FAQ

I expect the data in my table to be growing in terms of rows, which means that the number for each category will increase over time. Does it mean that the rule will fail systematically?

The Distribution Change Rule monitors relative change, not absolute change.
An increase in the count of every category therefore does not affect the result of the rule.
For example:

  • At date T, you have a given distribution: (A, 1), (B, 3), (C, 5)
  • The day after, at T+1, the distribution is: (A, 2), (B, 6), (C, 10). The count for each category doubled, whereas the relative distribution remains the same.
    In this case the Distribution Change rule will not raise an alert.
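The example above can be checked by converting the counts into shares of the total. This is a small sketch using exact fractions. The helper name `relative_distribution` is hypothetical.

```python
from fractions import Fraction


def relative_distribution(counts):
    """Express each category's count as an exact share of the total."""
    total = sum(counts.values())
    return {category: Fraction(count, total) for category, count in counts.items()}


day_t = {"A": 1, "B": 3, "C": 5}
day_t_plus_1 = {"A": 2, "B": 6, "C": 10}

# Counts doubled, but every category keeps the same share (1/9, 1/3, 5/9).
print(relative_distribution(day_t) == relative_distribution(day_t_plus_1))
```

Because the shares are identical on both days, a comparison based on relative distribution sees no change, even though the table has grown.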