Referential Integrity violations in a Data model, what are those?

July 08, 2021 by Mihaly Kavasi 5 min read

This blog post is following on my previous posts using community tools to improve your data model. Previously we used Tabular Editor: here and here. Now let's define what RI (referential integrity) violations are? This is how it is described in the official documentation:

The blank virtual rows are effectively Unknown Members. Unknown members represent referential integrity violations where the "many" side value has no corresponding "one" side value. Ideally these blanks should not exist, and they can be eliminated by cleansing or repairing the source data.

https://docs.microsoft.com/en-us/power-bi/transform-model/desktop-relationships-understand#regular-relationships

There are two types of violations:

Type 1 - Fact table missing key value

One reason for referential integrity problems if you miss the id from the fact table.

Type 2 - Dimension missing key value

The other reason for referential integrity problems if you miss the id from the dimension table.

Result:

In both cases a blank row will appear in a visual, aggregating all the rows that cannot be matched to any of the dimension values.

Why this can be a problem?

In terms of performance for simple and small models this won`t really influence performance. But in large, complex, or direct query data models can cause problems.

It is the most obvious in DQ. If you have referential integrity violations than the joins in the query PBI generates will need to be outer join instead of inner join. Therefore, there is an option in relationships to assume referential integrity because that have substantial performance effect.

Even if you are in import mode these issues can have an effect on your performance, although less obvious.

DAX Studio - Data model metrics:

You can use DAX Studio to uncover RI violations.

Tables view - # Referential Integrity Violations

Relationship View - Sample Violations and affected row count

How to solve Referential Integrity violations?

The reason we separated the violations into two types is that the solution for these is different.

Type 1 - Fact table missing key value

First always check if the data is valid if the missing value is acceptable. If not fix at the source.
If missing values is valid, create an empty/undefined row in your dimension and map your missing fact rows to that.

Type 2 - Dimension missing key value

If you have values in your fact table that are absent in your dimension table than it most likely means that there is an issue with the logic you prepared your data or the data source itself.

There can be few explanations why you have this:

data value in the fact table was entered incorrectly and that id never existed
id got deleted and there was not foreign key constraint on the tables
only active ids were loaded in

So, the solution is to correct the data preparation to make sure all the ids are added to the dimension tables or invalid facts are removed or corrected.

There is an excellent blog post by Phil Seamark about this:

https://dax.tips/2019/11/28/clean-data-faster-reports/

Resources:

Dax Studio: https://daxstudio.org/
Model Relationships Documentation: https://docs.microsoft.com/en-us/power-bi/transform-model/desktop-relationships-understand

Summary

If your data is perfect, as it usually is ?, you do need to care about this. But if you are among those unlucky few who must work imperfect data than this can mean the difference between usable and unusable performance.

About the Author

Mihaly Kavasi

Data-driven Decision-Making Enthusiast. Power BI Expert and Trainer. Helping organizations and communities to use data effectively.

Working @Avanade

Speaker profile @Sessionize

View All Posts