Thursday, March 12, 2015

ESB Logging (part 2)

This second part of my article about ESB logging describes some techniques that can be adopted to meet the logging requirements described in part 1.

Different techniques and technical choices can be applied at each level of the following generic logging architecture, which should be self-explanatory:
The role of the Logging Transport is to decouple the Logging Publisher from the Logging Data Store and  is an essential one.  A tightly-coupled solution, in which the calling code would have to synchronously wait for data to be persisted to the Logging Data Store before carrying on with its work, would be  inefficient.

Note: in the picture above the logic that serves the UI to browse and search into the Logging Data Store is not explicitly represented;  however this logic must exist and be usable in order to access the data in the store in an efficient way.  This article does not treat this part of the solution.



Details of Logging Architecture Layers


We will now will deal with the three layers in sequence and how they can contribute to providing the  logging functionality we need.

Logging Publisher

The Logging Publisher component must be directly callable from any ESB component via a simple API.   It is imperative that:
  •  The logging call be simple. If not, developers will need too much effort to place logging calls throughout the integration code.  The simplicity of the logging call largely depends on the number of its mandatory inputs.  In most designs, the severity level and the log message  should be the only mandatory inputs that the developer is explicitly required to supply.
  • The logging call be efficient, imposing the minimum possible overhead on the calling code.  The Logging Publisher must decide in the minimum possible time whether to hand over the data to the Logging Transport and, if that's the case, the handover itself must be as quick as possible.
In addition the the "log"  API,  a "flush" API will be necessary if the Deferred Logging technique (described at the end of this article) is adopted.
These APIs constitute a very simple contract between the logging solutions and its clients: the logging solution can freely evolve as long an the contract is not broken.


Explicit and Implicit inputs to logging calls

Part of the logged data are passed explicitly by the caller:
  • Severity Level of the logging entry (e.g., DEBUG, INFO, WARNING, ERROR, FATAL)
  • A Log Message.  This can be a potentially long string of text, which can be include dynamic sections.  It is not, however, to be used to store a whole message (which is the purpose of the document attachments mentioned below)
  • A Log Id to tag the precise location in the code from which the logging call is made; use of this identifier may be optional or mandatory, depending on the logging design
  • One or more document attachments (representations of data objects treated by the ESB component, like for example a service request payload); the persistent storage of documents associated with logging entries is often called archiving and is treated separately later in this article.

Other data elements are normally extracted implicitly and automatically by the Logging Publisher function:
  • A logging time-stamp
  • The name of the ESB component calling the Logging Publisher function
  • The server and port associated with the execution of the ESB component
  • The user id under which the ESB component is executing 
  • The consumer id for which the ESB component is executing; this may not be present in case the ESB component does not implement a service (e.g., in the case the component is of the polling kind); also, depending on the authorization design, the consumer id may be the same as the user id mentioned above
  • Tracking or Correlation Identifiers that unequivocally relate to the service execution and correlate all logging entries originating from this execution (more on this later)
  • Optionally, an incremental sequence number that sequences logging entries generated by the same execution (i.e., that have the same correlation identifier value)
In order to retain all the above information across logging calls from the same execution the Logging Publisher must be stateful.  To achieve this, two main approaches are used:
  1. Deferring logging state to the client of the Logging Publisher API (i.e., the integration code), by having logging state information "carried around" throughout the service implementation logic in some kind of standard "control header" or "execution context" structure.
  2. Caching logging state.
Many times, the first approach is preferred for its simplicity (and marginally better performance).
However, relying on the integration code to always possess a valid reference to context information is not always simple, especially in the face of exception situations.  This requires very careful coding and good QA on the produced integration code.  
The other approach, which is to be considered a best practice, is described next.

Caching logging state 

Caching logging state in memory, using caching solution such as Ehcache,  makes the Logging Publisher component more robust as it does not rely on client code to retain state on its behalf. 
A memory cache (which could be local to the runtime environment or distributed) essentially behaves like a big hash table holding data objects can be accessed in near-constant time based on their corresponding keys.  In many cases, the keys are simply String objects and the values are any kind of serializable objects.
The logging state can be easily wrapped into one such object, and once it gets stored into the cache it can be retrieved reliably from any point of the service or component logic later on. The essential precondition for this is that the corresponding key must be:
  • A unique identifier for the execution instance
  • retrievable in a simple and efficient ways from the ESB runtime
Most ESB products have some form of unique execution identifier.  For example, the Software AG webMethods ESB maintains a globally unique root context id for each synchronous thread of execution.  In BPM contexts, the internal process instance id  (also normally a GUID) can be used for this purpose.

To prevent memory leaks, the ESB service/component/process logic must ensure that the logging state object is purged when a runtime execution instance (service, component, or process) ends, whether successfully or with an error.  The use a of a memory cache rather than a simple hash table is helpful here, as cache products usually allow the specification of a Time-To-Live (TTL) for cache entries.  Thus, even in the unlikely cases in which the ESB logic is not able to purge an entry, it will get purged automatically after expiration (i.e., when its age exceeds its TTL).

It goes without saying that the TTL must be tuned in order to exceed the maximum potential lifetime of any runtime instance.


Conditional logging based on Logging Level

When the Logging Publisher API is called, it will pass its data (received from the caller + collected automatically) to the Logging Transport  in order to persist it.

The decision to persist (or not) a logging entry normally depends on its Severity Level (as passed by the caller) relative to the Severity Level Threshold set at system or ESB component level.
It makes sense in most cases to immediately deliver to the Logging Transport any logging entry  whose severity exceeds the applicable Severity Level Threshold, so there is the shortest possible lag time between the generation and the persistence of these events. 

For better supportability, it is important that Severity Level Thresholds can be set into a data store that is accessed dynamically at runtime, so logging can be "turned up" and "turned down" on the fly by application support staff. 

The logging thresholds (global or by component) must be configurable in a persistent store (such as a DB), but must be accessed very frequently, every time the Logging Publisher function is called.  To avoid excessive overhead, caching is essential.   Application Support could use a custom UI to to alter the logging thresholds in the persistent store, and then force a refresh/reload of the cache to have the new values take effect in real time.

In most designs, logging events having severity lower than the applicable Severity Level Threshold are just ignored, but is some cases they may still be processed later in case of an error (see section Deferred Logging at the end of this article).
The association of sequence numbers to logging entries allows in any case to chronologically sort the entries from a certain execution instance, regardless of which entries were persisted first.


Logging Transport

Generally, once the Logging Publisher invokes the Logging Transport for a logging entry (or a set of related logging entries), the data passed to the transport is considered as committed and will be eventually persisted.  

The Logging Transport must:
  • expose a very simple API towards the Logging Publisher,
  • internally operate asynchronously with respect to the Logging Publisher . 
Once the Logging Publisher decides that an entry is to be persisted, it simply calls the transport API to enqueue the entry  and then returns very quickly to its caller.
The processing from the Logging Transport to the Logging Data Store is completely decoupled.

In the simplest implementations, the Logging Transport just consists in a wrapper around the access protocol of the Logging Data Store.  For example, with a relational database used as Logging Data Store, the Logging Transport API would just wrap a DB call (or set of calls), executed in separate thread.  This solution shows its shortcomings in the face of peaks of logging traffic, where the DB pools size is exhausted and logging entries must be discarded (with loss of potentially important logging information).

A robust solution for the Logging Transport should use persistent queues for temporary storage of log entries to be persisted into the Logging Data Store.   This provides the necessary buffer to cope with variable loads.
The queuing solution used by logging should be totally independent, both architecturally and infrastructurally, from the asynchronous messaging solution used for business critical messages (as part of the integration logic).  A very high logging volume should not compromise the throughput and latency of business critical messages, and certainly we want to completely avoid risk that a flood of log entries could bring the main messaging system to its knees.

The queuing infrastructure used by logging should be lightweight and low-latency, such as for example ZeroMQ, or Software AG / Terracotta Universal Messaging.
These are interesting possibilities for advanced use of the Logging Transport, such as:
  • Holding entries in queue until a "flush" methods is called, and releasing them to the Logging Data Store only then (this is required by a particular implementation of the Deferred Logging technique,  described at the end of this article).
  • Aggregating related entries from the queue before persisting them into the Logging Data Store.  This is useful above all when a NoSQL Document DB is used as Logging Data Store, in order to bundle multiple related entries into a single document.



Logging Data Store

As already stated in part 1, the logging data store must be logically centralized in order to be effective, even if it physically distributed. 
The most common way of implementing a centralized data store for logging is via some kind of database.  ESB tools mostly use relational DB tables for their standard logging solutions, but very often custom logging data stores are designed.  These designs range from a simple set of relational tables to the use of noSQL, document-oriented databases such as MongoDB.

What matters, really, is that the data store design is optimized for:
  • insertions
  • queries by a given set of indexed fields
The latter property is crucial to enable the efficient retrieval of related logging entries. Relationships will be based on the correlation identifiers described later, and possibly also on additional, configurable "classification keys" that may have functional meaning.


Logging Data Store and Message Archiving

If we take away the message archiving requirement, that is, the inclusion of potentially big message payloads into audit logging, then it makes sense to simply use a physically centralized logging data store.
Things change a bit when message archiving is is scope (as it is in most cases).  Having distributed processing nodes send big messages (which could be sometimes in the order of tens of MB or even larger) to a centralized store over the network is inefficient.

A more efficient approach is to store in the central DB everything with the exception of message payloads, and keep these locally at the node that generated them.  The centralized DB will hold location metadata about these distributed payloads:
  • hosting node 
  • unique local access path (typically including a globally unique file name)
The rationale behind this choice it that is is very efficient for the processing node to store a payload locally, while access to these payload will be very infrequent (only when the user browsing the logging data store "drills down" into the payload part of a given logging entry).

The downsides of distributed archiving are mainly:
  • the need for a technical service at each node that delivers a local payload on demand
  • the need for an agent at each node that manages payload retention to manage space according to retention policies (deleting old payloads)
In presence of massive amounts of messages to be archived, the use of the Hadoop Distributed File System (HDFS) can be considered.



Specific logging functionality

  

Correlation

This is a cornerstone aspect of each logging solution, which must be able to handle multiple correlation identifiers (all of which should be GUID's).

Local Correlation Identifier

This identifier  is generated at the beginning of each thread involved in the execution of a service.
If the service is not a simple synchronous service, then each  invocation originates more than one value for this identifier (one value for each synchronous execution context).
Only logging entries generated within the same synchronous execution context share the same value for this identifier. This local correlation identifier can be normally directly derived from the ESB runtime context (e.g., root context Id in webMethods or session id in Mule).

Global Correlation Identifier

This is is the most important correlation identifier. It has the value of the Local Correlation Identifier of the first ESB component that is executed as part of a service invocation.  The service logic must then propagate it across the whole invocation, regardless on whether the components involved are loosely or tightly coupled. This is achieved technically via protocol properties depending of the implementation of the service (e.g., SOAP headers, HTTP headers, JMS properties, etc.)
The ubiquitous availability of this correlation identifier throughout the execution chain is what allows all logging entries to be effectively correlated.
Finally, it is a best practice for services to return the Global Correlation Identifier as part of each service response, allowing consumer application to create a cross-reference between the ESB Global Correlation Identifier and any internal identifier (for instance the one used as "Tracking ID", described below).

External Correlation Identifier ("Consumer Tracking ID")

Technically, this identifier may be handled exactly like the Global Correlation Identifier, but there is an important difference: it is generated externally by the service consumer, not internally by the ESB. Its main purpose is to allow tracking via a "source" identifier that has meaning in the context of the consumer application.  It can thus be called the  "Consumer Tracking ID"

Even though the ESB may enforce that each consumer supply a Tracking ID (rejecting service requests that are lack this ID), the ESB cannot realistically enforce its uniqueness.  For this reason, it cannot normally be used in lieu of the Global Correlation Identifier.  Thus, rather than propagating the External Correlation Identifier throughout  service execution, it is sufficient to simply associate it with the  Global Correlation Identifier in the Logging Data Store.


Logging and BPM

When using a BPM solution to build Orchestrated Task Services, custom logging is usually less necessary as most BPMS tools offer ample process audit and monitoring capabilities.  For each process instance, the sequence of states (i.e., sequence of traversed process steps) is tracked, usually with the possibility of logging selected input and outputs for process model steps.

Virtually all BPM engines natively support the definition of a Process Correlation ID to join events to running process instances.  This correlation identifier normally has functional significance and is to be considered the equivalent of the External Correlation Identifier discussed above.

Each process instance is assigned a unique Process Instance ID (PID) by the BPMS, which is to be considered the equivalent of the Local Correlation Identifier (although it is not in general associated with a single thread of execution).   

For the Global Correlation Identifier, a similar logic can be followed as for non-BPM integrations, by adopting the PID of the "root" process instance and propagating this value to all sub-processes, regardless of whether they are detached or not.  Detached sub-processes will have their own instance ID (= Local Correlation Identifier), but can still share the propagated Global Correlation Identifier (and the Process Correlation ID = External Correlation Identifier).

In case the out-of-the-box functionality of the BPMS does not allow easy audit logging, the techniques described in this post are perfectly applicable to automated business processes,  with the equivalences just mentioned.


 Deferred Logging

In Production environments, it is common practice to log only an entry at the beginning and one entry at the end of the execution of a service or component is case of success (typically, archiving the input and output message at the same time).  The Severity Level Threshold is normally kept to "ERROR" or "WARNING" since the logging system would be swamped if the level would be set to "INFO or "DEBUG" across all services at typical Production volumes.

This normally limits the logging information available when an error occurs to two entries: the "execution start" entry and the "error occurred" entry, with no information in between.
When a component or service instance experiences an error,  it would be useful if the maximum amount of logging information generated prior to the error were made available irrespective of the Severity Level Threshold.
This is valuable even for a BPM solution, which intrinsically provides better tracking compared to a non-BPM solution, as it provides detailed visibility of what happens within process activities.

The technique called Deferred Logging consists in temporarily holding all logging information (of all logging levels right down to "DEBUG"), until the execution of the service/component/process instance terminates, at which point a "flushing" action is triggered:
  • If the instance terminates successfully, the pending logging entries are filtered based on the current Severity Level Threshold .  Only the entries meeting or exceeding the threshold are flushed to the Logging Data Store (e.g., if the threshold is set to "WARNING", then "INFO" and "DEBUG" entries are discarded and not persisted into the store).
  • If the instance terminates with an error, then all pending logging entries  are flushed to the Logging Data Store, this storing a complete "history" of what happened prior to the error.

There are two basic approaches to implementing Deferred Logging:
  • The Logging Publisher component caches the pending logging entries in memory (normally along with the rest of the logging state), and performs the "flushing" action at the end of the instance execution (this is triggered via a special Logging Publisher API method).
  • The Logging Transport is designed to not deliver to the Logging Data Store immediately, but to hold logging entries pending a flush action (or their expiration).  Here, also, the flush action is triggered at at the end of the instance execution via a Logging Publisher API method.  This second option is more complicated but more robust, as pending entries are kept in persistent storage rather than in a memory cache.

Conclusion

Once a flexible logging API is designed, its implementation (the logging solution) can be made more and more sophisticated over time.

The advice is to start small and evolve, based on the supportability requirements that are specific to your organization.
Depending on the requirements, some logging features could be skipped in the first implementation, but it is my opinion that the 3-tier logging architecture presented here (with asynchronous Logging Transport) is a good foundation for a robust and scalable solution.