Total platform outage due to Domain Controller failure
Incident Report for 2sms LLC
Postmortem

Incident Report

 

Start Date: 03/07/2022 10:40 AM (CT) / 07 March 2022 16:40 (UTC)

 

Finish Date: 03/07/2022 12:00 PM (CT) / 07 March 2022 18:00 (UTC)

 

Description:

 

Total outage of all services.

 

Impacted Services:

 

  1. All

Impacted Customers:

 

  1. All

Cause:

 

All internal DNS resolution ceased to function as the result of a failure of the primary domain controller.

 

Detection:

 

Internal monitoring systems alerted staff that multiple services were failing concurrently. All staff were brought into an incident call to investigate the issue.
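
As an illustration of how concurrent failures can be told apart from a single-service fault, here is a minimal sketch. It is not our monitoring system's actual logic: the `CheckResult` type, the `classify_failures` function, and the 50% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    service: str   # hypothetical service name
    healthy: bool  # outcome of the most recent health check


def classify_failures(results, outage_fraction=0.5):
    """Return 'ok', 'isolated', or 'platform-wide' for one round of checks.

    outage_fraction is an assumed threshold, not a value from this report.
    """
    if not results:
        return "ok"
    failed = [r.service for r in results if not r.healthy]
    if not failed:
        return "ok"
    # Many services failing at the same moment points to a shared
    # dependency (DNS, authentication, network) rather than to the
    # individual services themselves.
    if len(failed) / len(results) >= outage_fraction:
        return "platform-wide"
    return "isolated"
```

A "platform-wide" result is a hint to check shared infrastructure first before working through services one by one.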

 

Scope of incident:

 

This affected all customer-facing services as well as support, communication, and administrative services.

Corrective Actions:

 

Individual services were checked first. Once it was determined that a full outage was occurring, the infrastructure team launched a full-scale investigation. The DNS issues were traced to the primary domain controller, which was acting as the default gateway. Attempts were made to resolve the issue; when this was not possible, disaster recovery procedures were executed to fail over to the secondary domain controller.

 

Preventative actions:

The primary domain controller is completely offline and appears to be corrupt. We are raising a task to build a new domain controller with the last known configuration. While this work is in progress, we will run on the single remaining domain controller with extra alerting and monitoring.
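
One form the extra monitoring could take is a periodic check that known internal names still resolve. The sketch below is an assumption about how such a check might look, not a description of our tooling; the function names are hypothetical, and the hostnames passed in would be whatever internal records matter most.

```python
import socket


def dns_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves through the given resolver.

    The resolver parameter defaults to the system resolver and exists
    only so the check can be exercised without network access.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except OSError:
        # socket.gaierror (name resolution failure) subclasses OSError.
        return False


def failed_lookups(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames that did not resolve, in the order given."""
    return [h for h in hostnames if not dns_resolves(h, resolver)]
```

An alert would fire whenever `failed_lookups` returns a non-empty list. Note that `socket.getaddrinfo` has no timeout parameter, so a production check would need to run it under an external time limit.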

Posted Mar 11, 2022 - 13:48 UTC

Resolved
Total outage of all services.
Posted Mar 07, 2022 - 18:00 UTC