While the attack is running, open the website in a browser and try using it like a regular user. Do you see anything unusual?
Chances are that when you ran this experiment, you saw the following screen:
Customers would not be happy to see this. The page is clearly not working, but there are no notifications or error messages explaining why. Ideally, we would have some form of redundancy in place to prevent this from happening at all.
If you haven’t already, make sure to press the halt button in Gremlin to stop the experiment:
Our hypothesis predicted something like this, and the experiment has now confirmed it. We found and demonstrated the failure mode before it could become a real production outage. We can now work on fixes to prevent this problem from happening again, then repeat the experiment to make sure those fixes are effective.
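When you repeat the experiment, a small script can complement the manual browser check by recording how the site responds over the course of the attack. The following is a minimal sketch using only the Python standard library. The `SITE_URL` value, the function names, and the probe count and interval are all placeholders chosen for illustration, not part of Gremlin or AWS.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Request the URL once and return (ok, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400, f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx or 5xx status.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError, http.client.HTTPException) as err:
        # No usable answer: refused connection, timeout, DNS failure, etc.
        return False, f"error: {err}"


def availability(results):
    """Return the fraction of successful probes (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(1 for ok, _ in results if ok) / len(results)


def run(url, count=30, interval=2.0):
    """Probe the URL `count` times, printing one line per attempt."""
    results = []
    for _ in range(count):
        ok, detail = probe(url)
        print(time.strftime("%H:%M:%S"), "OK  " if ok else "FAIL", detail)
        results.append((ok, detail))
        time.sleep(interval)
    return results


if __name__ == "__main__":
    # Placeholder: replace with the address of the site under test.
    SITE_URL = "http://example.com/"
    outcome = run(SITE_URL)
    print(f"availability: {availability(outcome):.0%}")
```

Running the script once during the original experiment and again after applying a fix gives you two availability figures to compare, which is a rough but repeatable way to judge whether the fix helped.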
One last action item to consider: did you see any indication of the outage in CloudWatch? If not, find out which metrics would have detected an issue like this, and create an alarm on them so that your team is notified if it happens again.
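As one possible starting point, the sketch below builds the parameters for a CloudWatch alarm and passes them to boto3's `put_metric_alarm`. It assumes the site sits behind an Application Load Balancer and that a rise in `HTTPCode_Target_5XX_Count` is a useful signal. Depending on your setup and on the type of attack, a different metric (for example, one that tracks unhealthy hosts or errors returned by the load balancer itself) may be more appropriate. The alarm name, threshold, load balancer identifier, and SNS topic ARN are placeholders.

```python
def build_5xx_alarm(load_balancer, sns_topic_arn, threshold=10):
    """Build keyword arguments for CloudWatch's put_metric_alarm.

    Assumption: the site is served through an Application Load Balancer,
    identified by a dimension value of the form "app/<name>/<id>".
    """
    return {
        "AlarmName": "website-target-5xx-errors",  # placeholder name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,             # evaluate the metric in 1-minute buckets
        "EvaluationPeriods": 2,   # require 2 consecutive breaching buckets
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no data is not treated as an error
        "AlarmActions": [sns_topic_arn],     # notify the team via SNS
    }


if __name__ == "__main__":
    import boto3  # requires AWS credentials and a configured region

    params = build_5xx_alarm(
        # Placeholders: substitute your own load balancer and SNS topic.
        load_balancer="app/my-load-balancer/0123456789abcdef",
        sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",
    )
    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain function makes it easy to review the thresholds before anything is created in your account. After creating the alarm, rerun the experiment to confirm that it actually fires.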