How to use an incident

Context: you have received an alert notifying you that one of your monitoring rules has failed. You want to start troubleshooting, but do not know where to begin.

You should first take a look at the associated incident.

Typical workflow

A typical workflow for an open incident is:

  • Assign it to someone in your team
  • Use the Compromised assets section to alert any team that could be impacted
  • Use the Lineage tab to get a better understanding of the upstream assets
  • Document the resolution in the Timeline
  • Close the report by providing a final status: Fixed, False Positive, Expected or Known issue

Assign the incident

Depending on the type of rule that failed, or on who owns the corresponding tables, assign the incident to the person on your team who is responsible for handling the issue.

You will find the incidents that have been assigned to you on your Dashboard.

Alert the impacted teams

On the Overview page of an incident report, you will find a list of all the compromised assets, that is, all the assets that depend on the one on which the rule failed. To prevent "bad" data from propagating, you should alert the teams working on those assets.
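To make the notion of a compromised asset concrete, here is a minimal sketch of how such a list could be derived from lineage. It is an illustration only, not the product's API: the `downstream_assets` function, the `lineage` mapping, and the asset names are all hypothetical.

```python
from collections import deque

def downstream_assets(lineage, failed_asset):
    """Return every asset that depends, directly or not, on failed_asset."""
    seen = set()
    queue = deque([failed_asset])
    while queue:
        current = queue.popleft()
        # lineage maps an asset to the assets that read from it.
        for child in lineage.get(current, []):
            if child not in seen and child != failed_asset:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical lineage: each key feeds the assets in its list.
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.customers"],
    "marts.revenue": ["dashboard.finance"],
}

compromised = downstream_assets(lineage, "staging.orders")
print(sorted(compromised))
# ['dashboard.finance', 'marts.customers', 'marts.revenue']
```

In this example, a rule failing on `staging.orders` would mean alerting the owners of the two marts and of the finance dashboard, but not the owner of `raw.orders`, which sits upstream.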

Look for the root cause

You should use the Lineage page to understand all the upstream dependencies of the asset. This should help you find the root cause and save a lot of time in troubleshooting. Find more info here.
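As a rough illustration of what walking the upstream dependencies means, the sketch below lists the ancestors of an asset, nearest first, so that the closest candidates for the root cause are inspected before the more distant ones. The `upstream_assets` function, the `lineage` mapping, and the asset names are assumptions made for the example, not part of the product.

```python
from collections import deque

def upstream_assets(lineage, asset):
    """List the upstream assets of `asset`, nearest first, with their distance."""
    # Invert the downstream map: child -> list of direct parents.
    parents = {}
    for parent, children in lineage.items():
        for child in children:
            parents.setdefault(child, []).append(parent)

    distance = {asset: 0}
    queue = deque([asset])
    ordered = []
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in distance:
                distance[parent] = distance[current] + 1
                ordered.append((parent, distance[parent]))
                queue.append(parent)
    return ordered

# Hypothetical lineage: each key feeds the assets in its list.
lineage = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.orders"],
    "staging.orders": ["marts.revenue"],
}

for name, hops in upstream_assets(lineage, "marts.revenue"):
    print(f"{hops} hop(s) upstream: {name}")
```

Checking the assets one hop upstream first, then two hops, and so on, mirrors how you would typically read the lineage graph when looking for the first asset where the data went wrong.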

Document the resolution

To communicate easily within your team and keep a record of your troubleshooting actions, you should use the Timeline in the Overview section. Make sure each action is kept up to date and commented.

Close the report

Once the incident is resolved, you should close the report by selecting a final status (Fixed, False Positive, Expected or Known issue) from the dropdown list in the top-right corner of the page.