Incident/Outage Dashboard

At my first engineering co-op, we occasionally had unplanned outages. After an outage, there was little communication about how long it had lasted, which systems were affected, and what was being done to prevent it from happening again. My manager felt we needed more transparency around these events, so I was tasked with creating a system to log incidents and to support a post-mortem afterwards.

After some discussion and trial and error, we settled on a submission form paired with a back-end script and a web dashboard. The dashboard would serve as a historical record of all incidents and outages, with the relevant details for each, and as a place to check on ongoing incidents. Below is a screenshot of one of the later prototypes, which uses the Google Charts JavaScript API. Unfortunately, the dashboard is hosted on an intranet site, so I can’t link to the final product.

This is a screenshot of the dashboard. (Identifiable company information redacted)

This dashboard is populated with near-real-time data (the screenshot is of a slightly older prototype). The full process flow is as follows:
A member of the help desk fills out a Google Form to create, update, or close an incident. (To update or close an incident, the ticket number must be given.) The form writes to a Google Sheet, which a Python script running on a server polls every few minutes.
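To make the polling step concrete, here is a minimal sketch of what such a loop could look like. It is not the original script: `fetch_rows` and `handle_row` are hypothetical callables standing in for "read every row from the sheet" and "process one submission", and the sketch assumes the response sheet is append-only, so anything past the last row seen is new.

```python
import time


def poll(fetch_rows, handle_row, interval_seconds=300, max_polls=None):
    """Call handle_row once for every row that appears in the sheet.

    fetch_rows and handle_row are hypothetical callables (assumptions, not
    the original code). max_polls exists only so the loop can be stopped
    in a demo; a real service would leave it as None and run forever.
    """
    seen = 0   # number of rows already processed
    polls = 0
    while max_polls is None or polls < max_polls:
        rows = fetch_rows()
        # Assumes rows are only ever appended, so new ones sit at the end.
        for row in rows[seen:]:
            handle_row(row)
        seen = len(rows)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
    return seen
```

Tracking a simple row count keeps the script from re-processing old submissions on every poll, at the cost of assuming nobody deletes or reorders rows in the sheet by hand.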

If it is a new incident, the Python script creates a new ticket with the relevant information and mass-emails those who are affected. It also writes this information to a file, with each incident labelled by its ticket number.
If it is an ongoing incident, the process is similar, but instead of creating a new ticket, the script updates or closes the existing ticket as specified in the Google Form.
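The create/update/close branching described above can be sketched roughly as follows. This is an illustration only, not the original script: the row fields (`action`, `ticket`, `description`, `systems`, `engineer`, `timestamp`), the JSON file format, and the function names are all assumptions, and the ticketing and email steps are left out.

```python
import json
from pathlib import Path


def apply_submission(incidents, row, next_ticket):
    """Apply one form submission to the incident store; return its ticket number.

    The row keys used here are hypothetical stand-ins for the form's fields.
    """
    action = row["action"]
    if action == "create":
        ticket = str(next_ticket)
        incidents[ticket] = {
            "description": row["description"],
            "systems": row["systems"],
            "engineer": row["engineer"],
            "start": row["timestamp"],
            "end": None,  # None marks the incident as still ongoing
        }
        return ticket

    # Updates and closes must reference an existing ticket.
    ticket = str(row["ticket"])
    if ticket not in incidents:
        raise KeyError(f"unknown ticket {ticket}")
    if action == "update":
        incidents[ticket]["description"] = row["description"]
    elif action == "close":
        incidents[ticket]["end"] = row["timestamp"]
    else:
        raise ValueError(f"unknown action {action!r}")
    return ticket


def save_incidents(incidents, path):
    """Write all incidents, keyed by ticket number, to a JSON file (assumed format)."""
    Path(path).write_text(json.dumps(incidents, indent=2))
```

Keying the stored data by ticket number mirrors the "each incident labelled by its ticket number" layout, and lets an update or close find its incident with a single lookup.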

When a user visits the intranet website, the JavaScript reads the incident data through a symlink to the file that the Python script writes. The data is then displayed for the user, filtered to the date range selected at the top of the page (the default view is one week).
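The date filtering itself happens in the dashboard's JavaScript, but the logic is easy to sketch in Python. The version below is an assumption about how it might work rather than the actual code: it treats `start` and `end` as ISO date strings (`YYYY-MM-DD`), uses `None` for an incident that has not ended, and shows any incident that overlaps the selected range.

```python
from datetime import date, timedelta


def default_range(today):
    """Default view: the week leading up to today."""
    return today - timedelta(days=7), today


def incidents_in_range(incidents, start, end):
    """Return ticket numbers of incidents overlapping the inclusive range [start, end]."""
    selected = []
    for ticket, info in incidents.items():
        began = date.fromisoformat(info["start"])
        # An ongoing incident has no end date yet; treat it as running through `end`.
        finished = date.fromisoformat(info["end"]) if info["end"] else end
        if began <= end and finished >= start:
            selected.append(ticket)
    return selected
```

Checking for overlap, rather than only the start date, means a long-running outage that began before the selected window still shows up while it is in progress.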

The final product is a dashboard that can be searched by date range, with each event showing more information in a tooltip: a description of the outage, the systems affected, the engineer responsible, the start and end dates, and a link to the ticket.

Through this project I learned about OAuth, JavaScript, and symlinks, and I picked up more about user interface design. Thanks for checking out this post!
