Investigating Incident

Update

08 Feb 2024 at 23:44 GMT+0

At approximately 3:00 AM PT on February 1, 2024, an automated AWS process intended to maintain data redundancy while replacing defective hardware placed a large amount of data on a single server, filling its storage volume to capacity. This prevented new user and attribute data from being recorded. Inkling Engineering increased the storage capacity of these servers, which restored the ability to save incoming changes to this data. All jobs that failed during the incident were re-run to restore the correct data state and ensure proper distribution of Inkdocs. In total, the issue persisted for approximately 5 hours and 36 minutes.

Monitoring for this issue did trigger alerts, but they were set to a low priority, which prevented them from notifying the on-call engineer. Engineering has raised the severity of some of these alarms to improve response time in the future. Inkling is also investigating multiple approaches to managing the size of this data, which will make this kind of routine automated operation more efficient.

Resolved

01 Feb 2024 at 17:44 GMT+0

The incident has been resolved.

Monitoring

01 Feb 2024 at 17:35 GMT+0

We have resolved the issue and are monitoring.

Identified

01 Feb 2024 at 17:15 GMT+0

We have identified the problem and are working towards resolution.

Investigating

01 Feb 2024 at 17:05 GMT+0

We are currently investigating issues with Inkdoc assignment and new users appearing in the People tab in Habitat.
