At approximately 3:00 AM PT on February 1, 2024, an automated AWS process intended to maintain data redundancy while replacing defective hardware placed a large amount of data on a single server, filling its storage volume to capacity. This prevented new user and attribute data from being recorded. Inkling Engineering increased the storage capacity of these servers, which restored the ability to save incoming changes to this data. All jobs that had failed during the incident were re-run to restore the correct data state and ensure proper distribution of Inkdocs. All told, the issue persisted for approximately 5 hours and 36 minutes.
Monitoring did trigger alerts for this issue, but they were set to a low priority, so they did not notify the on-call engineer. Engineering has raised the severity of some of these alarms to improve response time in the future. Inkling is also investigating multiple approaches to managing the size of this data, which will make this sort of routine automated operation more efficient.
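As a rough illustration of the alerting gap described above, the following minimal sketch shows how a storage alert's severity can decide whether the on-call engineer is notified. It does not describe Inkling's actual monitoring configuration: the thresholds, severity names, and the functions `classify` and `should_page` are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
WARN_PERCENT = 80.0
CRITICAL_PERCENT = 90.0


@dataclass
class Alert:
    volume: str
    used_percent: float
    severity: str  # "low" or "critical" (hypothetical severity names)


def classify(volume: str, used_bytes: int, capacity_bytes: int) -> Optional[Alert]:
    """Return an Alert when volume usage crosses a threshold, else None."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    used_percent = 100.0 * used_bytes / capacity_bytes
    if used_percent >= CRITICAL_PERCENT:
        severity = "critical"
    elif used_percent >= WARN_PERCENT:
        severity = "low"
    else:
        return None
    return Alert(volume=volume, used_percent=used_percent, severity=severity)


def should_page(alert: Optional[Alert]) -> bool:
    """Only critical alerts notify the on-call engineer in this sketch."""
    return alert is not None and alert.severity == "critical"
```

In this sketch a volume at 85% usage produces an alert that is recorded but does not notify anyone, while the same volume at 95% does. Raising the severity assigned to a given condition is what moves it from the first case to the second.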