On Wednesday morning, September 28, 2022, users of our compliance platform began reporting problems logging into the app. We quickly determined that the issue originated at AWS, where the app is hosted. AWS is sharing updates with us, and we are posting them below unedited, in reverse-chronological order.

Update: At this point, AWS considers the issue resolved, and we are in the process of running system diagnostics in the Simplifya environments. If you are still experiencing an issue, please contact support at support@simplifya.com.

[02:05 PM PDT] As of 1:43 PM PDT, error rates and latencies for invokes on API Gateway endpoints in the US-WEST-2 Region are now at normal levels. The issue began at 9:20 AM PDT when error rates and latencies for API Gateway began to increase. Error rates began to improve at 10:38 AM PDT, when engineers took action to reduce contention within the subsystem that handles request processing for API Gateway. Error rates continued to improve until 1:10 PM PDT, when engineers applied a mitigation to resolve the contention within the affected subsystem. These actions accelerated recovery, and by 1:43 PM PDT, error rates and latencies had returned to normal levels. Affected AWS services have now recovered as well. The issue has been resolved and the service is operating normally.

[01:42 PM PDT] As of 1:31 PM PDT, error rates and latencies for invokes on API Gateway endpoints in the US-WEST-2 Region are close to pre-event levels, and we continue to work on the remaining hosts that are affected by the contention issue. Several AWS services, including Amazon Connect and Lambda, are seeing signs of strong recovery. We expect all services to recover as API Gateway error rates and latencies return to normal levels. Customers should be seeing recovery at these error levels as well. We will continue to provide updates until the error rates and latencies have returned to normal levels.

[01:22 PM PDT] Starting at 1:12 PM PDT, we saw a further reduction in error rates and latencies for invokes on API Gateway endpoints in the US-WEST-2 Region. This was a result of the latest mitigation, which addressed contention within a component of the subsystem responsible for request processing in API Gateway. Error rates are now at levels where some customers may begin to see recovery, and retries will begin to work more consistently. We will be applying the mitigation to the remaining hosts affected by the contention issue and expect further recovery from them in the next 30 minutes.
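The note above that "retries will begin to work more consistently" assumes clients retry transient API Gateway errors rather than failing immediately. A minimal sketch of such a client-side retry loop with exponential backoff and jitter (the function names and parameters here are illustrative assumptions on our part, not part of the AWS advisory):

```python
import random
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, base_delay=0.5):
    """Exponential backoff with full jitter: up to 0.5s, 1s, 2s, 4s, ..."""
    return random.uniform(0, base_delay * (2 ** attempt))

def call_with_backoff(url, max_attempts=5):
    """Call an HTTP endpoint, retrying transient (5xx / network) failures."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            # 4xx errors are caller mistakes and should not be retried.
            if exc.code < 500 or attempt == max_attempts - 1:
                raise
        except urllib.error.URLError:
            # Network-level failures (timeouts, resets) are also retryable.
            if attempt == max_attempts - 1:
                raise
        time.sleep(backoff_delay(attempt))
```

The jitter spreads retries out over time, which avoids synchronized retry storms against a service that is already under contention.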

[01:03 PM PDT] Error rates and latencies for invokes on API Gateway endpoints in the US-WEST-2 Region continue to hold steady. Engineers continue to work on resolving the contention affecting the subsystem responsible for request processing. We recently completed a mitigation that should help to reduce error rates and latencies to normal levels and will have further updates on the result of that change in the next update. Although Lambda function invocations are not affected by this issue, the Lambda console is experiencing elevated error rates which we are investigating. Other AWS services affected by this issue remain in much the same state, waiting on the recovery of API Gateway.

Simplifya Note: It was around this time, 1:03 PM PDT, that we began to notice logins to the compliance platform occasionally succeeding. However, the behavior was not consistent.
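Intermittent behavior like this is easiest to reason about when quantified. A small sketch of how one could measure the success rate of repeated probes against an endpoint (the helper below is our own illustration; `check` stands in for whatever request your monitoring would actually make, such as an HTTPS call to the login endpoint):

```python
import time

def success_rate(check, attempts=20, interval=1.0):
    """Run `check` repeatedly and return the fraction of calls that succeed.

    `check` is any zero-argument callable returning True on success; an
    exception or a falsy return value counts as a failure.
    """
    successes = 0
    for i in range(attempts):
        try:
            if check():
                successes += 1
        except Exception:
            pass  # treat errors (timeouts, 5xx responses, etc.) as failures
        if i < attempts - 1:
            time.sleep(interval)
    return successes / attempts
```

A success rate well above zero but well below one matches the "occasionally succeeding" pattern we observed during this window.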

[12:26 PM PDT] We continue to see an improvement in error rates and latencies for invokes on API Gateway endpoints in the US-WEST-2 Region, but have not fully resolved the issue. While our mitigations have improved error rates and latencies, we have also identified the root cause of the event: the subsystem responsible for request processing experienced increased load, which ultimately led to contention on a component within the affected subsystem. Engineers have been working to resolve the contention on the affected component, which has led to a reduction of error rates and latencies. The path to full recovery involves addressing the contention across the subsystem, which we are currently doing. As that progresses over the next two hours, we expect recovery to continue to improve. Customers with applications that use API Gateway will be experiencing elevated error rates and latencies as a result of this issue. Lambda is not affected by this event, but customers using API Gateway as an HTTP endpoint for Lambda will experience increased error rates and latencies. Other AWS services listed below are also experiencing elevated error rates as a result of this issue. For customers that have dependencies on API Gateway and are experiencing errors, we do not have any mitigations to recommend to address the issue on the customer side. We do expect error rates to continue to improve as contention within the affected subsystem subsides, and will provide further updates as recovery progresses.

[11:33 AM PDT] We continue to work on resolving the elevated error rates and latencies for invokes on API Gateway endpoints in the US-WEST-2 Region. We continue to see a significant improvement in error rates, starting at 10:40 AM PDT, but are not seeing full recovery yet. The issue is caused by contention within the subsystem that is responsible for request processing within the API Gateway service. Engineers are engaged and have applied traffic filters as a precautionary measure, while they work to identify the root cause and resolve the issue. Engineers continue to work to reduce contention within the affected subsystem, which we believe will resolve the elevated error rates and latencies. Customers with applications that use API Gateway, or customers invoking Lambda functions via API Gateway, will be experiencing elevated error rates and latencies as a result of this issue. The AWS services listed below are also experiencing elevated error rates as a result of this issue. While we have seen improvements in error rates since 10:40 AM PDT, recovery has stalled and we do not have a clear ETA on full recovery. For customers that have dependencies on API Gateway and are experiencing errors, we do not have any mitigations to recommend to address the issue on the customer side. We do expect error rates to continue to improve as contention within the affected subsystem subsides, and will provide further updates as recovery progresses.

[10:59 AM PDT] We continue to see elevated error rates and latencies for invokes on API Gateway endpoints in the US-WEST-2 Region. While engineers continue to work towards root cause, we have deployed traffic filters for sources with significant increases in traffic prior to the event. As a result of these traffic filters, we are seeing a reduction in error rates and latencies, but continue to work towards full recovery. Although error rates are improving, we do not yet have an ETA for full recovery. The issue is also affecting API requests to some AWS services, including those listed below. Amazon Connect is experiencing increased failures in handling new calls, chats, and tasks, as well as issues with user login in the US-WEST-2 Region. We will continue to provide updates as we progress.

[10:33 AM PDT] We are investigating increased error rates for invokes in the US-WEST-2 Region. We do not yet have a root cause, but are investigating multiple potential root causes in parallel. In addition, we are implementing filters on inbound traffic from a set of sources with recent significant traffic shifts, which may help mitigate the impact. We do not yet have a solid ETA, but will continue to provide updates as we progress.

[10:13 AM PDT] We are investigating increased error rates for invokes in the US-WEST-2 Region.