Habitat Performance Degradation

Resolved
Major outage
Started over 1 year ago. Lasted about 3 hours.

Affected

Habitat
Updates
  • Resolved
    Update

    Around 9:03 AM PT on Tuesday, September 27, 2022, Inkling experienced an increase in traffic to our site, driven by increased customer usage of the Inkling platform. That traffic exceeded certain operating system limits on our external-facing load balancers, causing them to reject incoming requests even though the servers themselves were not overloaded. With the load balancers rejecting requests, external traffic could not enter the application environment, resulting in a site outage. During the outage, traffic would fail, connection counts would drop to acceptable levels, and then quickly fill up again, so services were intermittently and briefly available before going offline again. Inkling engineers diagnosed the problem and set about increasing the limits.

    The issue was compounded by the fact that our Analytics platform receives all traffic through these same load balancers. That system is designed to reprocess analytics events until they succeed, which created a backlog of events trying to work their way through the system. The constant retries further increased the load and delayed our ability to resolve the problem. Inkling personnel slowly drained these events to relieve that pressure, which brought the site back up. To restore system operations in a timely manner and minimize customer impact, the Inkling team had to make some difficult choices that caused a very limited amount of data loss (see details below). Altogether, these issues resulted in 1 hour, 13 minutes of intermittent availability of Inkling services.

    As part of the issue resolution, Inkling personnel increased the capacity of the load balancer servers to handle the additional traffic and configured the operating system's network stack accordingly (an illustrative sketch of this kind of tuning follows below). To prevent recurrence of similar issues, Inkling deployed new monitoring and metric visualizations to ensure the team is alerted well before the system approaches operating system limits. Following the event, Inkling personnel reviewed additional operating system parameters related to connection capacity across all load balancers, made the changes needed to allow for utilization growth, upgraded the server configurations to a larger capacity, and added monitoring alerts. As part of the root cause analysis, Inkling has created additional development tasks to reduce the compounding effect of retry logic (also sketched below) and to eliminate the possibility of data loss.

    Details about events lost as a result of the outage:

    * The vast majority of lost events are utilization metrics (e.g., page views or click-throughs) emitted during the outage period. Utilization metrics for the outage and recovery period (between 9:01 AM and 12:41 PM Pacific time) are therefore incomplete. Unfortunately, despite our best efforts, these events are not recoverable.
    * For Learning Pathways users, no data was permanently lost. A very small number of events (fewer than 10 across all customers) had not been recorded into the analytics database; the Inkling engineering team replayed these events on Thursday, 9/29/2022, and all Learning Pathways data is now synchronized across systems.
    * A subset of customers experienced a loss of assessment results events (submitted using the Assessment Widget); these customers will be contacted directly by Inkling support staff regarding the recovery timeline.
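    The update does not name the specific operating system limits that were exceeded. On a typical Linux load balancer, connection-capacity tuning of the kind described usually involves kernel parameters and file-descriptor limits like the following; this is a generic sketch with hypothetical values, not Inkling's actual configuration:

        # /etc/sysctl.d/99-loadbalancer.conf -- illustrative values only
        net.core.somaxconn = 65535                  # max connections waiting in the accept queue
        net.core.netdev_max_backlog = 65535         # packets queued when the NIC outpaces the kernel
        net.ipv4.ip_local_port_range = 1024 65535   # widen the ephemeral port range for upstream connections
        fs.file-max = 2097152                       # system-wide open-file (socket) limit

        # Apply without a reboot: sysctl --system

    Monitoring current usage against ceilings like these, as the new alerting described above is meant to do, is what gives advance warning before requests start being rejected.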
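    The retry behavior described above, in which events are reprocessed until they succeed, can amplify load during an outage because every failed delivery immediately becomes more traffic. A common mitigation, consistent with the development tasks mentioned, is capped exponential backoff with jitter plus a dead-letter queue, so that exhausted events are parked for replay rather than dropped. The following Python sketch is a generic illustration under those assumptions; send_event and dead_letter are hypothetical stand-ins, not Inkling APIs:

        import random
        import time

        MAX_ATTEMPTS = 8      # retry budget before parking the event
        BASE_DELAY_S = 0.5    # first backoff interval
        MAX_DELAY_S = 60.0    # cap so delays never grow unbounded

        def deliver_with_backoff(event, send_event, dead_letter):
            """Deliver an analytics event without amplifying load on failure."""
            for attempt in range(MAX_ATTEMPTS):
                try:
                    send_event(event)   # hypothetical delivery call; raises on failure
                    return True
                except Exception:
                    # Exponential backoff: 0.5s, 1s, 2s, ..., capped at MAX_DELAY_S.
                    delay = min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
                    # Full jitter spreads retries so senders don't hammer in lockstep.
                    time.sleep(random.uniform(0, delay))
            # Budget exhausted: park the event for later replay instead of
            # retrying forever (which compounds load) or silently losing it.
            dead_letter(event)
            return False

    Backoff with jitter limits the retry storm while the site is down, and the dead-letter queue addresses the data-loss side: parked events can be replayed once the system recovers, much as the Learning Pathways events were replayed on 9/29.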

  • Resolved
    Resolved

    This incident has been resolved.

  • Identified
    Identified

    The issue has been identified and we are working towards a fix.

  • Investigating
    Investigating

    Inkling has received reports of Habitat performance degradation and our team is currently investigating.