Context: you have received an alert notifying you that one of your monitoring rules has failed. You want to start troubleshooting, but do not know where to begin.
You should first take a look at the associated incident.
A typical workflow for an open incident would be:
- Assign it to someone on your team
- Use the Compromised assets section to alert any team that could be impacted
- Use the Lineage tab to get a better understanding of the upstream assets
- Comment the resolution in the Timeline
- Close the report by providing a final status:
Depending on the type of rule that failed, or on the ownership of the corresponding tables, the incident should be assigned to the person on your team who is responsible for handling the issue.
You will find the incidents that have been assigned to you on your
On the Overview page of an incident report, you will find a list of all the compromised assets, meaning all the assets that depend on the one on which the rule failed. To prevent "bad" data from propagating, you should alert the teams working on those assets.
You should use the Lineage page to understand all the upstream dependencies of the asset. This should help you find the root cause and save a lot of troubleshooting time. Find more info here.
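To make the two lineage directions concrete, here is a minimal sketch on a made-up dependency graph. It is not the product's API: the asset names, the `DOWNSTREAM` mapping, and the `invert` and `reachable` helpers are all hypothetical, and the graph is assumed to be acyclic. Walking downstream from the failed asset gives the compromised assets; walking upstream gives the candidates to inspect for a root cause.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it directly.
DOWNSTREAM = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue"],
    "staging.customers": ["marts.revenue"],
    "marts.revenue": ["dashboards.finance"],
    "dashboards.finance": [],
}


def invert(edges):
    """Flip every edge so that each asset maps to its direct parents."""
    inverted = {node: [] for node in edges}
    for parent, children in edges.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return inverted


def reachable(edges, start):
    """Breadth-first walk returning every asset reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


UPSTREAM = invert(DOWNSTREAM)
failed_asset = "staging.orders"  # the asset on which the rule failed

# Assets that depend on the failed one: their owners should be alerted.
compromised = reachable(DOWNSTREAM, failed_asset)

# Assets the failed one depends on: where to look for the root cause.
root_cause_candidates = reachable(UPSTREAM, failed_asset)

print("Compromised:", compromised)
print("Upstream to inspect:", root_cause_candidates)
```

In this toy graph, a failure on `staging.orders` compromises `marts.revenue` and `dashboards.finance`, while `raw.orders` is the only upstream asset to check.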
To communicate easily within your team and keep a record of your troubleshooting actions, you should use the Timeline in the Overview section. Make sure each action is correctly updated and commented.
Once the incident is resolved, you should close the report by selecting a status from the dropdown list in the top-right corner of the page.