On Monday, May 8, around 6:00 AM PDT, Inkling engineering was alerted that newly created users were not searchable. Engineers determined that two queues had backed up because of a long-running internal process. The process that led to the backup was stopped immediately, and the capacity of a backend component was upgraded to handle the extra load. The queues began to recover, but not without impact: user updates and automated assignments experienced some delay starting May 4.
The user data queue was cleared and the caches were repopulated, fixing the user search issue by 7:05 PM PDT on the day of the incident. The second queue, for Learning Pathways automated assignments, took until 4:29 PM PDT the following day to clear. Unfortunately, while that queue was temporarily full, some Learning Pathways events were missed, and code had to be written to replay them. In some cases, the replay created duplicate assignments, which Inkling engineers will delete once the replay is complete, upon customer approval. Duplicate assignments that some customers created as a workaround were deleted earlier this week. Affected customers have been contacted by their CSM.
In response to the incident, Inkling has added comprehensive monitoring and alerting for the queues in question. We have also increased the data retention on the queues to reduce the likelihood of overflow. Additional capacity was added to the affected cache and related systems.