So if all is good and well, you have had your changes approved and you can release (yay!). But is it really yay? What if something goes wrong and you need to fix it quickly? This is where monitoring comes in.
At the moment we are using Splunk, and it is a really great log aggregator. You can take all your log files and get some meaningful information about what your users are doing. How many transactions are succeeding or failing? What cards are customers using? When are the peak times? And in the case of errors, a graph can show you exactly which service is returning the error. What is great about seeing your logs return useful information live is that they can also show you where your logging is not good enough, so you can go and add better logging 🙂
A note about Splunk: its search mechanism is based on filters and field extractors. For example, let’s say you want to see transaction amounts against card type. You have to extract both of these fields from the logs, then write a search query based on those field extractors.
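To make the idea of field extraction a bit more concrete, here is a rough Python sketch of what is going on conceptually. It is not Splunk's search language and not our real setup: the log format, the field names (`card_type`, `amount`) and the sample lines are all made up for illustration. Each regex plays the role of a field extractor, and the loop plays the role of the search that groups amounts by card type.

```python
import re
from collections import defaultdict

# Hypothetical log lines. The format and values are assumptions for
# illustration only, not real output from any service.
LOG_LINES = [
    "2024-01-05T10:15:00Z INFO payment status=success card_type=visa amount=25.00",
    "2024-01-05T10:15:02Z INFO payment status=success card_type=mastercard amount=40.50",
    "2024-01-05T10:15:07Z ERROR payment status=failed card_type=visa amount=12.00",
    "2024-01-05T10:15:09Z INFO payment status=success card_type=visa amount=10.00",
]

# Each pattern acts like one field extractor: it pulls a single named
# field out of the raw text of a log line.
CARD_TYPE = re.compile(r"card_type=(\w+)")
AMOUNT = re.compile(r"amount=(\d+(?:\.\d+)?)")


def total_amount_by_card_type(lines):
    """Sum the transaction amounts per card type across the given log lines."""
    totals = defaultdict(float)
    for line in lines:
        card = CARD_TYPE.search(line)
        amount = AMOUNT.search(line)
        # Skip lines where either field could not be extracted.
        if card and amount:
            totals[card.group(1)] += float(amount.group(1))
    return dict(totals)


if __name__ == "__main__":
    for card_type, total in sorted(total_amount_by_card_type(LOG_LINES).items()):
        print(f"{card_type}: {total:.2f}")
```

Notice that a line missing either field just drops out of the result silently, which is exactly the kind of gap that tells you your logging needs work.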
The key to using these kinds of tools successfully is to have meaningful logs in the first place. You have to have done that work. Monitoring is the place where you want to know instantly whether everything is okay or not.
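One way to do that work up front is to log events as consistent `key=value` pairs, which are much easier to extract fields from than free-form sentences. The sketch below uses Python's standard `logging` module; the logger name, the `format_event` and `log_payment` helpers and the field names are all hypothetical, so treat it as one possible shape rather than how our services actually log.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("payments")  # hypothetical logger name


def format_event(event, **fields):
    """Render an event as 'event=<name> key=value ...' with keys in sorted order.

    Values are assumed to contain no spaces; quote them if yours might.
    """
    pairs = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    return f"event={event} {pairs}".strip()


def log_payment(status, card_type, amount, service):
    """Log one payment attempt with the same fields every time."""
    line = format_event(
        "payment",
        status=status,
        card_type=card_type,
        amount=f"{amount:.2f}",
        service=service,
    )
    # Failures go out at ERROR level so they stand out in searches.
    if status == "success":
        logger.info(line)
    else:
        logger.error(line)
    return line


if __name__ == "__main__":
    log_payment("success", "visa", 25.0, "card-gateway")
    log_payment("failed", "mastercard", 40.5, "card-gateway")
```

Because every payment line carries the same fields in the same order, questions like "which service is failing?" or "which card types fail most?" become a simple filter instead of a guessing game.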