Issue: TLS Negotiation failed, Certification Invalid for US subscriptions.
Incident Length: 6 hours and 43 minutes
Incident Date: 04/12/2024, 00:52 – 07:35 UTC
Incident Status: Resolved
Summary
Customers reported encountering the error ‘TLS Negotiation failed, Certification Invalid’ when routing messages through Exclaimer’s Server-side system. This issue only impacted subscriptions located within US regions.
Not all servers were affected by this issue, so this did not impact all messages being sent through Exclaimer during the incident.
Going forwards, a new process for reviewing and confirming certificate updates has been introduced to prevent similar issues in the future.
Root Cause
Due to an oversight, the renewed certificate for US relays was not applied to all routing servers. As a result, the previous certificate expired while still in use and was no longer valid for mail routing.
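As an illustration of how this kind of gap can be caught ahead of time, the sketch below flags servers whose certificate is expired or close to expiry. It is a minimal example and not Exclaimer's actual tooling: the server names, the dates, the 14-day threshold, and the assumption that each server's `notAfter` value has already been collected are all hypothetical.

```python
import ssl
from datetime import datetime, timezone


def seconds_until_expiry(not_after: str, now: datetime) -> float:
    """Seconds from `now` until the certificate expires (negative if expired).

    `not_after` uses the text format found in the 'notAfter' field of
    ssl.SSLSocket.getpeercert(), e.g. "Jan  9 00:00:00 2024 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return expires_epoch - now.timestamp()


def find_expiring(inventory: dict[str, str], now: datetime, warn_days: int = 14) -> list[str]:
    """Return the servers whose certificate expires within `warn_days` days,
    including those that have already expired."""
    threshold = warn_days * 86400
    return sorted(
        name
        for name, not_after in inventory.items()
        if seconds_until_expiry(not_after, now) < threshold
    )


if __name__ == "__main__":
    # Hypothetical inventory: server name -> certificate notAfter value.
    example_inventory = {
        "us1": "Jan  1 00:00:00 2025 GMT",
        "us2": "Jan  9 00:00:00 2024 GMT",
    }
    check_time = datetime(2024, 1, 10, tzinfo=timezone.utc)
    print(find_expiring(example_inventory, check_time))  # prints ['us2']
```

Checking every server individually, rather than one representative endpoint, is the point of the sketch: in this incident only some servers held the old certificate.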
Mitigation
Once the expired certificate had been identified as not having been correctly updated, it was replaced with the renewed certificate to resolve the issue. All other instances of the certificate were then reviewed and verified to have been correctly updated.
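The verification step described above can be sketched as a fingerprint comparison: every deployed certificate should hash to the same value as the renewed one. This is an illustrative example under stated assumptions, not the procedure Exclaimer followed; the server names are hypothetical and the certificate bytes are placeholders standing in for DER-encoded certificates.

```python
import hashlib


def fingerprint(der_bytes: bytes) -> str:
    """SHA-256 fingerprint (hex) of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def find_mismatched(deployed: dict[str, bytes], expected: str) -> list[str]:
    """Return the servers whose deployed certificate does not match the
    expected fingerprint of the renewed certificate."""
    return sorted(
        name for name, der in deployed.items() if fingerprint(der) != expected
    )


if __name__ == "__main__":
    # Placeholder bytes, not real certificates (hypothetical data).
    renewed = b"renewed-certificate-placeholder"
    previous = b"previous-certificate-placeholder"
    deployed = {"us1": renewed, "us2": previous, "us3": renewed}
    print(find_mismatched(deployed, fingerprint(renewed)))  # prints ['us2']
```

If certificates are held in PEM form, Python's `ssl.PEM_cert_to_DER_cert` can convert them to DER before hashing.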
Incident Timeline
00:52 – Alerting reported intermittent failures to obtain a message response within the US region.
01:29 – Initial investigation indicated that all endpoints were responding to requests and were accessible to engineering staff, suggesting full operation of the Exclaimer system.
01:45 – Continued investigation confirmed that traffic routing also showed no sign of an issue. However, traffic flow to other US servers remained higher.
02:13 – Another full review indicated that the system was operating as expected, with no errors reported in the run-up to the alert being generated.
02:25 – Alert was documented and prepared for pickup during main operational hours.
03:40 – Support alerted the team to customer-facing reports of messages being rejected with ‘Certification Invalid’.
03:52 – Identified that the active certificate on US2 was showing as having recently expired.
04:06 – Engineering confirmed that no other services were attempting to use the expired certificate.
04:21 – A full review of all servers was completed, and US2 was updated to ensure that only the latest certificate was applied.
04:31 – Engineering confirmed that the original alert had recovered and that traffic flow between all servers had improved. Incident moved into Monitoring status.
07:35 – Support confirmed that new reports of the issue had ceased, and customers with existing reports confirmed the issue was no longer occurring.