Log files can be a great place to find out what went wrong in our applications. They should contain all the information necessary to understand the state of the application before and while the error happened. Instead, most log files are giant collections of data without much structure and even less meaning. What is unfortunate in a single application becomes a real problem as soon as multiple applications need to work together.
Every developer has their own idea of how a log message should be structured, what is important, and how it should be represented. Combining log files from multiple applications is therefore a lot of work, and the chance of missing the important insight is high.
This post is part of the Improve Your Log Messages series. You can find the other parts here:
- Part 1: The Missed Opportunities of Log Files
- Part 2: Structured Logging with Serilog
- Part 3: RavenDB as a Sink for Serilog
- Part 4: Seq as a Sink for Serilog
- Part 5: How to Influence the Output of Serilog
- Part 6: Monitor your Application with Seq
- Part 7: Debugging Serilog
- Part 8: Elasticsearch as a Sink for Serilog
- Part 9: Monitor your Applications with Kibana
- Part 10: Closing the Feedback Loop from Log Messages to Knowledge
- Part 11: How To Analyse IIS Log Files
- Part 12: Using Logstash to Analyse IIS Log Files with Kibana
- Part 13: Analysing Apache Log Files with Logstash and Kibana
- Part 14: How to Analyse SharePoint Log Files
Everything goes
On top of all that, there are tremendous differences in what is logged and how readable the entries are. Log files tend to be more understandable when the developer used them to figure out what the application does. In that case, the log may look like this:
```
20:46:29 INFO ataraxis.gui.AtaraxisStarter - ##### Entering splash screen at: Tue Jul 29 20:46:29 CEST 2014 #####
20:46:30 DEBUG ataraxis.crypt.AESKeyCreator - createSecretKey() - start
20:46:30 DEBUG ataraxis.crypt.AESKeyCreator - Size of AES-Key: 256
20:46:31 DEBUG ataraxis.crypt.AESKeyCreator - createSecretKey() - end
20:46:31 DEBUG ataraxis.crypt.AtaraxisCrypter - JuristicationTest was ok
20:46:31 INFO ataraxis.gui.AtaraxisStarter - Splashscreen: policyIsOK => true
20:46:31 DEBUG ataraxis.gui.AtaraxisStarter - ##### Exiting splash screen at: Tue Jul 29 20:46:29 CEST 2014 #####
```
Having a great debugger and therefore no need to read the produced log messages may easily result in something like this:
```
07/27/2014 22:23:25.33 OWSTIMER.EXE (0x1328) 0x19EC SharePoint Foundation Monitoring b4ly Medium Leaving Monitored Scope (Timer Job Health Statistics Updating). Execution Time=31.6552103689156 81eb6a9c-f96d-47e6-a7ba-0de19dd2c868
07/27/2014 22:23:25.37 wsstracing.exe (0x0384) 0x1EA4 SharePoint Foundation Unified Logging Service b9wt High Log retention limit reached. Log file 'C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\LOGS\SRV104-20140713-2153.log' has been deleted.
07/27/2014 22:23:25.37 wsstracing.exe (0x0384) 0x1EA4 SharePoint Foundation Tracing Controller Service 8096 Information Usage log retention limit reached. Some old usage log files have been deleted.
07/27/2014 22:23:25.50 OWSTIMER.EXE (0x1328) 0x1650 SharePoint Foundation Usage Infrastructure a5rv High
```
Even messages like those can be understood. All it takes is a lot of time, a good tool and a high threshold for frustration.
Implicit structure…
When you know more about the different log formats you can start to understand the structure behind them. Understanding the structure will give you insights on the purpose of the entries and from there you may get a good enough picture of what’s going on.
Unfortunately every developer has different ideas on how things should work. Therefore you will have to start from scratch as soon as you open the log files of another application. Even when they do the same things and should print out a similar set of entries, the implicit structure may result in completely different messages.
With more training you will need less time to understand another application, but it may still be too much time; especially when the application is down and you quickly need to understand the reason for the current crash. There must be a better solution than starting over and over again.
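To make the fragility of such implicit structure concrete, here is a minimal sketch: a regular expression that recovers the fields from the first log format shown above. The field names (`time`, `level`, `logger`, `message`) are my own labels, not anything the applications define; the point is that this hand-written parser matches exactly one application's format and fails on the next one.

```python
import re

# One hand-written pattern for the first format above, e.g.:
# "20:46:30 DEBUG ataraxis.crypt.AESKeyCreator - Size of AES-Key: 256"
LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+)\s+"
    r"(?P<logger>\S+)\s+-\s+"
    r"(?P<message>.*)$"
)

def parse(line):
    """Return the fields of one log line, or None if the format differs."""
    m = LINE.match(line)
    return m.groupdict() if m else None

entry = parse("20:46:30 DEBUG ataraxis.crypt.AESKeyCreator - Size of AES-Key: 256")
print(entry["level"], entry["message"])

# The same parser silently fails on the SharePoint format -
# every new application means a new pattern written from scratch.
other = parse("07/27/2014 22:23:25.37 wsstracing.exe (0x0384) 0x1EA4 ...")
print(other)
```

Multiply this by every application in your portfolio and every format change between versions, and the cost of implicit structure becomes obvious.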
…made explicit
It would all be simpler if the implicit structure of the log messages were made explicit. We wouldn’t have to guess what a field represents – we would simply read the description and know. Not only what it should contain, but what it actually contains.
With this information we could get all error codes from all log files that occurred near the time of the crash and we would instantly know which system was the source of the error.
To reach a state like this we not only need to have a description for all fields, but a structure that contains the data in a parsable format. Only then will it be possible to combine the different sources automatically and extract the knowledge in one single place. Adding another system to our application portfolio would only need an initial configuration for the mapping. From then on it would deliver its messages in the same useful way all the other applications already do. Would that not be great?
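As a small sketch of what that could look like: suppose two applications emit their entries as JSON with a shared timestamp and level convention (the field names and the `billing`/`gateway` applications below are invented for illustration, not a fixed standard). One generic parser then suffices to pull the error codes from all sources around the time of a crash.

```python
import json
from datetime import datetime, timedelta

# Hypothetical structured entries from two different applications.
# Same parsable format, so no per-application parser is needed.
app_a = '{"timestamp": "2014-07-29T20:46:31", "level": "Error", "app": "billing", "errorCode": 4021, "message": "Payment failed"}'
app_b = '{"timestamp": "2014-07-29T20:46:32", "level": "Error", "app": "gateway", "errorCode": 502, "message": "Upstream timeout"}'

def errors_near(crash_time, raw_lines, window=timedelta(minutes=5)):
    """Collect (app, errorCode) pairs from any source close to the crash."""
    result = []
    for line in raw_lines:
        entry = json.loads(line)          # one parser for every application
        ts = datetime.fromisoformat(entry["timestamp"])
        if entry["level"] == "Error" and abs(ts - crash_time) <= window:
            result.append((entry["app"], entry["errorCode"]))
    return result

crash = datetime(2014, 7, 29, 20, 46, 33)
codes = errors_near(crash, [app_a, app_b])
print(codes)
```

With the structure explicit, "which system was the source of the error" becomes a simple query instead of an archaeology session.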
And imagine what else we could get out of this data. We could even go so far as to use real data collected on our own systems to optimize our applications. Our log files would no longer be a useless collection of garbage; instead, they could be a valuable source for fact-based decisions.
Next
Over the next weeks I will show you different ways to write better log files (using Serilog), collect and integrate them (with RavenDB and Elasticsearch), and finally use tools to understand what’s going on in your application (using Seq or Kibana). The journey to this new level of understanding starts next week with Serilog.