Wednesday, October 22, 2014

ESB logging (part 1)

Logging is one of the fundamental cross-cutting concerns in any software system.

This three-part article discusses the requirements of ESB logging in part 1, proposes some techniques to address these requirements in part 2, and finally addresses in part 3 the implications of logging in a BPM environment (i.e., when services are orchestrated by a BPMS).

Functional and technical requirements of ESB logging

In ESB systems, there are several reasons why we need to do logging, the main ones being:
  1. To keep a functional audit of integration activities (sometimes called "transaction logging"); this is essentially a traceability requirement.
  2. To have detailed troubleshooting information available in case of errors or malfunctions
  3. To collect performance information, which helps us identify bottlenecks in our service implementation 
These three points can be considered very high-level functional requirements for ESB logging.  As always, depending on the specifics, each of them translates into technical requirements that must be appropriately addressed.  The following subsections discuss these three macro-requirements in more depth.

Functional audit

The creation of a functional audit trail is normally mandated by business policies and/or by external regulation.  In most cases, it is required to include a copy of the payload of the service (API) requests and responses.  In other cases, not only the payload but also all control information (in the form of SOAP headers for SOAP services, HTTP headers and status code for REST services, JMS properties for asynchronous JMS services, etc.) must be logged. Logging that includes payloads is sometimes called archiving.
Strictly speaking, this usage of the term "archiving" is not appropriate: the term should be reserved for cases where the system keeps persistent copies of the payloads as part of normal operation, which is typically the case for B2B gateways.  After a set period, these messages are moved from the "online" data store to the "offline" archive store.

Looking at the audit requirement more in detail, we need to define:
  • Where to log (audit points)
  • What to log (logging content)
  • How long to keep the logs (retention policy)

To discuss the first point (perhaps the most important of the three), consider the diagram below, which shows a rather simple example of an ESB service implemented via a single ESB component, composing native services of two different backend systems but not composing other ESB services:

[Diagram: an external service consumer invokes a service exposed by a single ESB component, which in turn invokes the native services of two backend systems; six audit logging points are marked on the message paths.]

The diagram shows six different audit logging points:
  • two at the ESB service endpoint, facing the consumer (logging its request and the response delivered to it)
  • four facing the backend systems, logging the requests and responses exchanged between the ESB (in a service consumer role) and the native endpoints of the two backend systems.
As a minimum, audit logging must be done at the two consumer-facing logging points, but for comprehensive logging all messages that enter and exit an ESB component must be covered:
  1. Inbound request from service consumer
  2. Outbound request to native backend service
  3. Inbound response (or fault) from native backend service
  4. Outbound request to composed ESB service
  5. Inbound response (or fault) from composed ESB service
  6. Outbound response (or fault) to service consumer
Note: by "fault" I here mean any structured set of error information, not necessarily a SOAP fault.

Clearly, logging points of types 2 through 5 are due to service composition, so the number of such logging points is proportional to the number of service operations being composed.
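To make the six audit point types concrete, here is a minimal sketch of how they could be represented in component code (the names are purely illustrative, not taken from any specific ESB product):

    /** The six types of audit logging points listed above (illustrative names). */
    public enum AuditPoint {
        CONSUMER_REQUEST,   // 1. inbound request from service consumer
        BACKEND_REQUEST,    // 2. outbound request to native backend service
        BACKEND_RESPONSE,   // 3. inbound response (or fault) from native backend service
        COMPOSED_REQUEST,   // 4. outbound request to composed ESB service
        COMPOSED_RESPONSE,  // 5. inbound response (or fault) from composed ESB service
        CONSUMER_RESPONSE   // 6. outbound response (or fault) to service consumer
    }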

It is necessary to correlate the log entries generated at each audit logging point, which can be done in two ways within a component:
  • by logging independently at each audit logging point (generating a separate log entry at each point), but having the log entries share a common unique identifier (a small sketch of this approach follows the list)
  • by grouping the log information generated at each audit logging point into a single log entry, which is written when the service completes execution (either with success or with a fault).  This can be called delayed logging.
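For the first approach, a minimal sketch, assuming an identifier generated when the request enters the component (class and method names are hypothetical):

    import java.util.UUID;

    /** One instance per service execution: entries are written immediately,
        but share an execution identifier that allows later correlation. */
    public class AuditLogger {
        private final String executionId = UUID.randomUUID().toString();  // created on inbound request

        public void logAt(AuditPoint point, String payload) {
            // Stand-in for the real log sink; every entry carries the shared identifier.
            System.out.printf("%d %s %s %s%n",
                    System.currentTimeMillis(), executionId, point, payload);
        }
    }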
The choice between minimal audit logging (consumer request and response only) and full audit logging should be configurable at the level of service and service operation (or resource / sub-resource for a REST service), with the usual pattern of configuration inheritance and override between parent and child (the setting for a service is inherited by its operations unless specifically overridden at operation level).
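A minimal sketch of such a hierarchical lookup, assuming a simple in-memory store keyed by "service" or "service/operation" (the keys, class name, and default are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative hierarchical audit configuration with parent/child override. */
    public class AuditConfig {
        public enum AuditLevel { MINIMAL, FULL }

        private final Map<String, AuditLevel> settings = new HashMap<>();

        public void set(String key, AuditLevel level) { settings.put(key, level); }

        public AuditLevel levelFor(String service, String operation) {
            AuditLevel op = settings.get(service + "/" + operation);
            if (op != null) return op;                      // operation-level override wins
            AuditLevel svc = settings.get(service);         // otherwise inherit from the service
            return svc != null ? svc : AuditLevel.MINIMAL;  // global default
        }
    }

For example, setting FULL on "OrderService" but MINIMAL on "OrderService/getStatus" would keep a high-volume polling operation out of the full audit trail while everything else on the service is fully logged.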


The second point relates to the content of the log entries.  A minimal set of information fields for the logging function will be suggested in part 2 of this article, but the important point to remark here is that the information logged at each audit logging point may be heterogeneous and vary greatly in size.

Often the information is already in textual form (e.g., XML payloads, JSON payloads, SOAP headers, HTTP headers), but in some cases we have binary information (e.g., scanned documents).
Even when the information is textual, it may not arrive as a String at the ESB endpoint (for example, a posted HTTP body received as a Java InputStream object); this mostly depends on the particular ESB tool used.
Sometimes message payloads can be very large, and this must be taken into account as well.

Thus, the format and size of the data to be logged has a significant impact on the design of ESB logging, which must be flexible in this respect without introducing excessive overhead.
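As an illustration, a sketch of how heterogeneous payloads could be normalized into a bounded, loggable string; the size cap and helper names are assumptions, and note that reading an InputStream consumes it, so the service flow must afterwards work from the buffered copy:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    /** Illustrative normalization of payloads for logging purposes. */
    public class PayloadForLogging {
        private static final int MAX_LOGGED_BYTES = 64 * 1024;  // illustrative cap

        /** Reads at most MAX_LOGGED_BYTES, e.g., from a posted HTTP body. */
        public static byte[] readCapped(InputStream in) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n, total = 0;
            while (total < MAX_LOGGED_BYTES
                    && (n = in.read(buf, 0, Math.min(buf.length, MAX_LOGGED_BYTES - total))) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
            return out.toByteArray();
        }

        public static String toLoggable(byte[] payload, boolean isText) {
            return isText
                    ? new String(payload, StandardCharsets.UTF_8)   // XML, JSON, headers...
                    : Base64.getEncoder().encodeToString(payload);  // binary, e.g. a scanned document
        }
    }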

The third point relates to the retention period of the entries created in the audit log.  Here, as with the determination of audit logging points, it should be possible to configure the retention period hierarchically by service / operation (or parent / child resource), so that information related to more critical operations or resources can be retained longer.  A significant plus, on top of this functionality, would be a retention policy that also takes into account the identity of the consumer.


Technical logging

While the preceding subsection was about audit logging (that is, logging that must be performed at every service invocation due to functional requirements), this section briefly discusses the technical logging that can be done in addition, to provide useful information about service behavior; such information can be essential to troubleshoot obscure malfunctions detected in the Production environment.

First of all, let's dispel the rather common misconception that conflates technical logging with error handling.

Technical logging is NOT the same as error handling.  Error handling is ESB logic that must be executed whenever an error occurs that requires manual recovery (as part of an Incident Management workflow), or that must in any case be notified to a human user (even when the error is recovered from automatically), because in these cases the error/fault response returned to the service consumer is not sufficient.

On the other hand, technical logging, when enabled, is performed by ESB components independently of whether error conditions are occurring or not.  It simply traces the execution logic within components at a fine-grained level of detail, providing detailed insight into what happens between the audit logging points.

This does not mean that the implementation of logging can never share anything with that of error handling.  In many ESB environments, for example, designers have decided to use the Log4J framework as part of both their logging and their error handling solutions.

The first consideration about this kind of logging is that, for performance reasons, it should be completely turned off in normal Production operation.  Such logging is normally enabled in the environments that support Unit Testing and System Integration Testing, but disabled not just in Production but also in every environment that must closely mirror Production conditions, such as those where Performance Testing takes place.

Still, enabling technical logging in a very granular way (e.g., per component or per component/consumer combination) can occasionally be invaluable to troubleshoot incorrect behavior in Production.  This must of course be done very carefully, due to the performance impact in case the component implements a heavily used service.

It is imperative here to have the capability of hierarchical, fine-grained configuration, so the appropriate level of logging can be set very flexibly.
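For instance, with Log4J (already mentioned above), the hierarchical logger namespace gives this kind of granularity essentially for free.  A minimal sketch, with hypothetical logger names; in practice the levels would come from the Log4J configuration file so they can be changed without redeployment:

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    public class TechnicalLoggingLevels {
        public static void main(String[] args) {
            // Everything under "esb.components" is silent by default...
            Logger.getLogger("esb.components").setLevel(Level.OFF);
            // ...but one misbehaving component is traced at DEBUG level.
            Logger.getLogger("esb.components.OrderService").setLevel(Level.DEBUG);

            // Child loggers inherit the most specific ancestor's level.
            Logger log = Logger.getLogger("esb.components.OrderService.CreateOrder");
            if (log.isDebugEnabled()) {                    // guard avoids message-building cost
                log.debug("entering mapping step, orderId=" + 42);
            }
        }
    }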

Normally, technical logging should not include payloads but just short dynamic log messages that can contain the values of selected variables in the component's logic.   

An interesting option for technical logging (in between the "on" and "off" settings) is conditional logging: technical log entries are written at the end of the execution of a component (that is, with delayed logging), but only if the component experienced an error.  By choosing this option we recognize that detailed tracking of successful service executions is superfluous, which is almost always the case.  On the other hand, "mass" errors, typically due to the unavailability of a backend resource, will under this option produce a very large amount of logging information that is utterly redundant (since all service invocations failed for the same reason).
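A minimal sketch of this conditional, delayed variant, assuming one buffer per component execution (all names hypothetical):

    import java.util.ArrayList;
    import java.util.List;

    /** Buffers technical log entries during one execution; emits them only on failure. */
    public class ConditionalLogBuffer {
        private final List<String> entries = new ArrayList<>();

        public void trace(String message) {
            entries.add(System.currentTimeMillis() + " " + message);  // cheap in-memory append
        }

        /** Called once when the component finishes (delayed logging). */
        public void flush(boolean failed) {
            if (failed) {
                for (String entry : entries) {
                    System.err.println(entry);  // stand-in for the real log sink
                }
            }
            entries.clear();  // traces of successful executions are discarded
        }
    }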

Performance logging (profiling)

Since log entries (both audit and technical) are associated with a timestamp, it is obvious that logging can be used to collect performance statistics (profiling) for service operations.  The relevant log entries are only those written at the audit logging points, for two reasons:
  • The technical logging points are normally too dense within a component
  • Technical logging in Production and during Performance Testing introduces an overhead which is normally unacceptable and that in any case will bias the profiling results (the act of measuring affects the measurements).
The main statistics (min, max, average, standard deviation, etc.) that can be collected are the following (a small computation sketch appears after the list):
  • those related to the distribution of service operation invocations (by time period, consumer, etc.)
  • those related to the total service operation execution time, within the ESB component exposing the operation
  • those related to the execution time of composed services; these, combined with the total service execution time statistics of the services being composed, can provide statistics for message transit times between components (significant when the services being composed are dispersed geographically)
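As an illustration of the second bullet, a sketch that derives execution-time statistics from pairs of audit timestamps (the class is hypothetical; standard deviation is computed from running sums):

    import java.util.ArrayList;
    import java.util.List;

    /** Aggregates execution times (response minus request timestamp) for one operation. */
    public class OperationStats {
        private final List<Long> durationsMillis = new ArrayList<>();

        public void record(long requestTimestamp, long responseTimestamp) {
            durationsMillis.add(responseTimestamp - requestTimestamp);
        }

        public String summary() {
            int n = durationsMillis.size();
            if (n == 0) return "no data";
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
            double sum = 0, sumOfSquares = 0;
            for (long d : durationsMillis) {
                min = Math.min(min, d);
                max = Math.max(max, d);
                sum += d;
                sumOfSquares += (double) d * d;
            }
            double avg = sum / n;
            double stdDev = Math.sqrt(sumOfSquares / n - avg * avg);  // population std deviation
            return String.format("n=%d min=%d max=%d avg=%.1f stddev=%.1f (ms)",
                    n, min, max, avg, stdDev);
        }
    }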



Log entry consolidation, correlation and sequencing

The diagram used above to illustrate the audit logging points shows a very simple situation where all logging points are reached sequentially within the execution thread of the (only) component.  In this case, correlation and sequencing of the produced log entries is trivial.

The situation changes a lot when we have to deal with complex service compositions, distributed ESB components, asynchronous interactions, clustered environments, and services orchestrated by BPM.

Even with a rather simple synchronous composition of ESB services, it is not guaranteed that these services run in the same thread: a "local" service invocation in a clustered environment may execute the service on a cluster node other than the one where the caller service executes.

In general, anyway, we must assume that composed services are implemented by components that are distributed in the network, whether they are invoked synchronously, asynchronously via a message broker (for example, in the case of SOAP-over-JMS invocations), or as business process steps within a process instance.

Each component involved in a service execution creates multiple log entries in the Logging Data Store, and all these entries must be mutually correlated and sequenced in order to give a clear picture to a person browsing the log.

An example, deliberately a bit involved, is given in the following diagram:

[Diagram: a distributed service composition spanning machines A, B, and C, combining functionality from four backend systems; component 1 on machine A composes three other services, and component 4 on machine C is reached via a message queue.]

In the picture above, a distributed service composition combines functionality from four different backend systems.
A main service (implemented in component 1) hosted on machine A directly or indirectly invokes three other services (one hosted on the same machine, and the other two on different machines).  The service hosted on machine C (component 4) is invoked asynchronously through a message queue.

Each service component performs logging at its audit logging points, and some components may also have technical logging enabled.  As a result, log entries can originate from different machines, and they must be correlated and consolidated into a single place (the Logging Data Store).  That, by itself, can pose a concern: we can see upfront that a trivial logging implementation, for example a separate synchronous INSERT into a central database for every individual log entry, can be inefficient and hurt our service response times.
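To make the concern concrete, a minimal sketch of the obvious alternative, asynchronous batched writing (part 2 discusses such techniques properly; the queue and batch sizes here are arbitrary):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    /** Decouples service threads from the Logging Data Store via an in-memory queue. */
    public class AsyncLogWriter {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10000);

        public AsyncLogWriter() {
            Thread writer = new Thread(this::drainLoop, "log-writer");
            writer.setDaemon(true);
            writer.start();
        }

        /** Called from service threads: never blocks; reports overflow instead of stalling. */
        public boolean log(String entry) {
            return queue.offer(entry);
        }

        private void drainLoop() {
            List<String> batch = new ArrayList<>();
            while (true) {
                try {
                    batch.add(queue.take());    // wait for at least one entry
                    queue.drainTo(batch, 99);   // then take up to 99 more
                    writeBatch(batch);          // e.g., one multi-row INSERT
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        private void writeBatch(List<String> batch) {
            System.out.println("writing " + batch.size() + " entries");  // stand-in for a JDBC batch
        }
    }

Note that whether overflow may silently drop entries (as this sketch does) or must block the caller is itself a requirement decision: dropping may be acceptable for technical logging, but usually not for the functional audit trail.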
 
The Logging Data Store does not need to be physically centralized, but it needs to be centralized logically, allowing a User Interface to access all of its information no matter where it is physically stored.

In order for the audit and logging information to make sense to somebody who browses and analyses the Logging Data Store, log entries originating from different components that relate, directly or indirectly, to the same invocation by the External Service Consumer must be (a supporting entry structure is sketched after the list):
  1. Logically grouped together
  2. Accessible as a whole in a location-independent fashion (i.e., even when not all related entries are stored in the same physical data store)
  3. Organized hierarchically based on service composition logic  (e.g., showing the execution of component 3 as a subordinate of the execution of component 2, which is in turn a subordinate of the execution of component 1)
  4. Sequenced correctly (regardless of deviation of the system clocks of the different machines).  This is necessary for proper profiling.
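One way to satisfy the four points above is to carry explicit correlation information in every entry and to propagate it across every hop (as a SOAP header, JMS property, HTTP header, etc.).  A minimal sketch of such an entry structure; the field names are illustrative, not a standard:

    /** Illustrative log entry fields supporting grouping, hierarchy, and sequencing. */
    public class LogEntry {
        String correlationId;       // shared by all entries caused by one external invocation (points 1-2)
        String executionId;         // identifies one component execution
        String parentExecutionId;   // the composing component's execution; null at the top (point 3)
        long   sequenceInExecution; // per-execution counter, immune to clock skew (point 4)
        long   timestamp;           // local clock; still usable for profiling on one machine
        String auditPoint;          // which audit logging point produced the entry
        String payload;             // normalized as discussed under "Functional audit"
    }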

Part 2 of this article will discuss techniques that help us achieve all of this without affecting (too much) the performance of our services.



