One common but frustrating pain point is when the distribution of your data changes. Even with regular checks, it remains challenging to notice, and its impact can be significant.
One typical case for data scientists is when the distribution of a feature slowly changes over time, leading to data drift.
Another frequent case for analytics engineers is when a new value is being added to a categorical field, which can break all the downstream dependencies if not accounted for.
Introducing distribution change monitoring:
Sifflet leverages advanced statistical models to automatically detect distribution changes at the field level, based on a rolling or a fixed reference.
With this feature, data engineers and consumers can stay ahead of any unexpected distribution change impacting numerical or categorical fields.
Limit on the number of categories
We currently limit the number of different categories to 1000.
Above this threshold, the rule will have a "technical error" status.
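If you want to check a field against this limit before configuring the rule, a minimal sketch could look like the following. The function name and the way distinct values are counted are assumptions made for illustration; only the limit of 1000 comes from this page.

```python
CATEGORY_LIMIT = 1000  # limit stated above


def exceeds_category_limit(values, limit=CATEGORY_LIMIT):
    """Return True when `values` holds more than `limit` distinct categories.

    Hypothetical helper: how the product itself counts distinct values
    (for example, how it treats nulls) is not specified here.
    """
    return len(set(values)) > limit
```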
Pearson's chi-square (χ²) test statistic is used to detect any statistically significant distribution change.
For categorical fields, the distribution is compared category by category.
For fields with numerical values, the values are first divided into buckets, and the buckets are then compared like categories.
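To make the idea concrete, here is a rough sketch of how a chi-square statistic can be computed between a reference distribution and a current one, and how numerical values can be turned into buckets first. This is an illustration only, not Sifflet's implementation: the function names, the bucket edges, and the choice to scale the reference proportions to the current total are all assumptions.

```python
from bisect import bisect_right
from collections import Counter


def bucketize(values, edges):
    """Count numerical values per bucket.

    Bucket i covers [edges[i], edges[i + 1]). Values outside the edges are
    clamped into the first or last bucket (an assumption of this sketch).
    """
    counts = Counter()
    for value in values:
        index = bisect_right(edges, value) - 1
        index = min(max(index, 0), len(edges) - 2)
        counts[index] += 1
    return dict(counts)


def chi_square_statistic(reference, current):
    """Pearson's chi-square statistic of `current` counts against the
    proportions observed in the `reference` counts."""
    ref_total = sum(reference.values())
    cur_total = sum(current.values())
    statistic = 0.0
    for category in set(reference) | set(current):
        # Expected count if the current data followed the reference proportions.
        expected = reference.get(category, 0) / ref_total * cur_total
        observed = current.get(category, 0)
        if expected == 0:
            # A category unseen in the reference has no expected count;
            # this sketch reports it as an infinite statistic.
            return float("inf")
        statistic += (observed - expected) ** 2 / expected
    return statistic
```

For example, `chi_square_statistic({"A": 50, "B": 50}, {"A": 80, "B": 20})` returns `36.0`. Turning the statistic into a pass or fail decision still requires a significance threshold, which is not covered by this sketch.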
The "Distribution Change" template can be found in the "Field Profiling" category.
Two options are available during the configuration:
Fixed reference date: in this case, the computed distribution will be compared to the fixed distribution computed at a chosen specific date. This allows you to compare your current data with the data you had at a specific point in time. This is particularly useful when you don't expect your data to evolve.
Rolling reference date: in this case, the computed distribution will be compared to the distribution computed at a prior date, with a chosen constant time delay between the two dates. For instance, this allows you to compare today's daily distribution with last week's. It is useful when you want to be alerted to sudden changes in distribution.
Two different modes
Using a rolling reference date lets you capture a sudden change in distribution, whereas using a fixed reference date better captures a slow drift.
The two options are therefore not mutually exclusive and can be used together as complements.
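The difference between the two modes comes down to which date the current distribution is compared against. A minimal sketch, assuming hypothetical parameter names (`mode`, `fixed_date`, `delay`) that do not correspond to actual configuration fields:

```python
from datetime import date, timedelta


def reference_date(run_date, mode, fixed_date=None, delay=None):
    """Return the date whose distribution serves as the reference.

    Hypothetical helper for illustration only.
    """
    if mode == "fixed":
        if fixed_date is None:
            raise ValueError("fixed mode needs a fixed_date")
        # Always the same reference, whatever the run date.
        return fixed_date
    if mode == "rolling":
        if delay is None:
            raise ValueError("rolling mode needs a delay")
        # The reference moves along with the run date.
        return run_date - delay
    raise ValueError(f"unknown mode: {mode!r}")


# Compare today's distribution with the one from a week earlier.
weekly = reference_date(date(2024, 3, 15), "rolling", delay=timedelta(days=7))
# Compare today's distribution with a chosen baseline.
baseline = reference_date(date(2024, 3, 15), "fixed", fixed_date=date(2024, 1, 1))
```

With a rolling reference, a slow drift can go unnoticed because each comparison only spans the chosen delay. With a fixed reference, the accumulated drift since the baseline date is what gets measured.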
- I expect the data in my table to be growing in terms of rows, which means that the number for each category will increase over time. Does it mean that the rule will fail systematically?
The Distribution Change rule monitors relative changes in the distribution, not absolute ones.
A proportional increase in the counts of every category therefore does not affect the result of the rule.
- At date T, you have a given distribution: (A, 1), (B, 3), (C, 5)
- The next day, at T+1, the distribution is: (A, 2), (B, 6), (C, 10). The count for each category has doubled, but the relative distribution remains the same.
In this case, the Distribution Change rule will not raise an alert.
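The example above can be checked by converting the counts into proportions. The helper below is a sketch written for this page, not part of the product; it uses exact fractions so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction


def relative_distribution(counts):
    """Convert absolute counts per category into exact proportions."""
    total = sum(counts.values())
    return {category: Fraction(n, total) for category, n in counts.items()}


day_t = relative_distribution({"A": 1, "B": 3, "C": 5})
day_t_plus_1 = relative_distribution({"A": 2, "B": 6, "C": 10})

# Both days give A = 1/9, B = 1/3, C = 5/9.
same_shape = day_t == day_t_plus_1
```

Here `same_shape` is `True`: the row count doubled, but each category keeps the same share of the total, so there is no relative change to detect.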