Wednesday, October 22, 2014

ESB logging (part 1)

Logging is one of the fundamental cross-cutting concerns in any software system.

This three-part article will discuss the requirements of ESB logging in part 1, propose some techniques to address these requirements in part 2, and finally address in part 3 the implications of logging in a BPM environment (i.e., when services are orchestrated by a BPMS).

Functional and technical requirements of ESB logging

In ESB systems, there are several reasons why we need to do logging, the main ones being:
  1. To keep a functional audit of integration activities (sometimes called "transaction logging"); this is essentially a traceability requirement.
  2. To have detailed troubleshooting information available in case of errors or malfunctions
  3. To collect performance information, which helps us identify bottlenecks in our service implementation 
These three points can be considered very high-level functional requirements for ESB logging.  As always, depending on the specifics, each of them translates into technical requirements that must be appropriately addressed.  The following subsections discuss these three macro-requirements in more depth.

Functional audit

The creation of a functional audit trail is normally mandated by business policies and/or by external regulation.  In most cases, it is required to include a copy of the payload of the service (API) requests and responses.  In other cases, not only the payload but also all control information (in the form of SOAP headers for SOAP services, HTTP headers and status code for REST services, JMS properties for asynchronous JMS services, etc.) must be logged. Logging that includes payloads is sometimes called archiving.
Strictly speaking, the usage of the term "archiving" is not appropriate as the term should be used only in those cases where the system keeps persistent copies of the payloads in normal operation, which is typically the case for B2B gateways.  After a set period, these messages are moved from the "online" data store to the "offline" archive store. 

Looking at the audit requirement more in detail, we need to define:
  • Where to log (audit points)
  • What to log (logging content)
  • How long to keep the logs (retention policy)

To discuss the first point (perhaps the most important of the three), consider the diagram below, which shows a rather simple example of an ESB service implemented via a single ESB component, composing native services of two different backend systems but not composing other ESB services:



 The diagram shows six different audit logging points:
  • two at the ESB service endpoint, facing the consumer (logging its request and the response delivered to it) 
  • four facing the backend systems, logging the requests and responses exchanged between the ESB (in a service consumer role) and the native endpoints of the two backend systems.
As a minimum, audit logging must be done at the two consumer-facing endpoints, but for comprehensive logging all messages that enter and exit an ESB component must be included:
  1. Inbound request from service consumer
  2. Outbound request to native backend service
  3. Inbound response (or fault) from native backend service
  4. Outbound request to composed ESB service
  5. Inbound response (or fault) from composed ESB service
  6. Outbound response (or fault) to service consumer
Note: by "fault" I here mean any structured set of error information, not necessarily a SOAP fault.

Clearly, logging points of types 2 through 5 are due to service composition, so the number of such logging points is proportional to the number of service operations being composed.

It is necessary to correlate the log entries generated at each audit logging point, which can be done in two ways within a component:
  • by logging independently at each audit logging point (generating a separate log entry at each point), but having the log entries share a common unique identifier
  • by grouping the log information generated at each audit logging point into a single log entry, which is written when the service completes execution (either with success or with a fault).  This can be called delayed logging.
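Purely as an illustration of the first option (all field names here are hypothetical, not taken from any specific product), two log entries written independently at different audit logging points could be tied together by a shared execution identifier:

  <logEntry>
    <executionId>a3f2c7e0-55b1-4c9e-9f21-7d3b9c1d0c11</executionId>  <!-- shared by all entries of this service execution -->
    <auditPoint>CONSUMER_REQUEST</auditPoint>
    <service>OrderService</service>
    <operation>createOrder</operation>
    <timestamp>2014-10-22T10:15:30.125Z</timestamp>
    <payload>...</payload>
  </logEntry>

  <logEntry>
    <executionId>a3f2c7e0-55b1-4c9e-9f21-7d3b9c1d0c11</executionId>  <!-- same identifier as above -->
    <auditPoint>BACKEND_REQUEST</auditPoint>
    <service>OrderService</service>
    <operation>createOrder</operation>
    <timestamp>2014-10-22T10:15:30.310Z</timestamp>
    <payload>...</payload>
  </logEntry>

With delayed logging, the same information would instead be accumulated in memory and written as a single entry (containing all the audit points) when the execution completes.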
The choice between minimal audit logging (consumer request and response only) and full audit logging should be configurable at the level of service and service operation (or resource / sub-resource level for a REST service), with the usual pattern of configuration inheritance and override between parent and child resource (the setting for service is inherited by service operations unless specifically overridden at operation level).
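As a sketch of what such a configuration could look like (element and attribute names are hypothetical, not those of any particular ESB product):

  <auditLoggingConfig>
    <!-- service-level default: minimal audit logging (consumer request/response only) -->
    <service name="OrderService" auditLevel="MINIMAL">
      <!-- operation-level override: full audit logging for this critical operation -->
      <operation name="createOrder" auditLevel="FULL"/>
      <!-- no override here: cancelOrder inherits MINIMAL from the service level -->
      <operation name="cancelOrder"/>
    </service>
  </auditLoggingConfig>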


The second point relates to the content of the log entries.  A minimal set of information fields for the logging function will be suggested in part 2 of this article, but the important point to note here is that the information logged at each audit logging point may be heterogeneous and vary greatly in size.

Often the information is already in textual form (e.g., XML payloads, JSON payload, SOAP headers, HTTP headers), but in some cases we have binary information (e.g., scanned documents).
Even when information is textual, it may not arrive as a String at the ESB endpoint (example: posted HTTP body received as a Java InputStream object), and that mostly depends on the particular ESB tool used.
Sometimes message payloads can be very large, and this must be taken into account as well.

Thus, the format and size of the data to be logged has a significant impact on the design of ESB logging, which must be flexible in this respect without introducing excessive overhead.

The third point relates to the retention period of the entries created into the audit log.  Here, like in the case of the determination of audit logging points, it should be possible to configure the retention period hierarchically by service / operation (or parent / child resource), so that information related to more critical operations or resources can be retained longer.   A significant plus, on top of this functionality, would be to allow a retention policy that also takes into account the identity of the consumer.
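Again purely as a hypothetical sketch, the retention policy could follow the same hierarchical configuration pattern, possibly extended with a consumer-specific override:

  <retentionConfig>
    <service name="OrderService" retentionDays="90">
      <!-- critical operation retained longer -->
      <operation name="createOrder" retentionDays="365"/>
      <!-- getOrderStatus inherits the 90-day service default -->
      <operation name="getOrderStatus"/>
      <!-- retention driven by the identity of the consumer -->
      <consumer id="ExternalPartnerX" retentionDays="730"/>
    </service>
  </retentionConfig>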


Technical logging

While the preceding subsection was about audit logging (that is, logging that must be done at every service invocation due to functional requirements), this section briefly discusses the technical logging that can be done in addition to audit logging to provide useful information about service behavior, which can be essential for troubleshooting obscure malfunctions detected in the Production environment.

First of all, let's dispel the rather common misconception that mixes together such technical logging and error handling.

Technical logging is NOT the same as error handling.  Error handling is ESB logic that must be executed in all cases where an error occurs that requires manual recovery (part of an Incident Management workflow), or that must anyway be notified to a human user (even when the error is recovered from automatically), and the error/fault response to the service consumer is not sufficient.

On the other hand, technical logging, when enabled, is done from ESB components independently of whether error conditions are happening or not.  It simply traces the execution logic within components at a fine-grained level of detail, providing detailed insight into what happens between the audit logging points.

This does not mean that the implementation of logging can never share anything with that of error handling.  In many ESB environments, for example, designers have decided to make use of the Log4J framework as part of both their logging and their error handling solutions.

The first consideration about this kind of logging is that, for performance reasons, it should be completely turned off in normal Production operation.  Such logging is normally enabled in environments that support Unit Testing and System Integration Testing, but normally disabled not just in Production but also in every environment that must closely mirror Production conditions, like in environments where Performance Testing happens.

Still, enabling technical logging in a very granular way (e.g., per component or component/consumer combination) can occasionally be invaluable to troubleshoot incorrect behavior in Production.  This must be done, of course, very carefully, due to the performance impact in case a component implements a service that is heavily used.

It is imperative here to have the capability of hierarchical, fine-grained configuration, so the appropriate level of logging can be set very flexibly.

Normally, technical logging should not include payloads but just short dynamic log messages that can contain the values of selected variables in the component's logic.   

An interesting option for technical logging (in between the "on" and "off" settings)  is that of conditional logging:  technical log entries are logged (at the end of the execution of a component, that is with delayed logging), but only if the component experienced an error.  By choosing this option we recognize that detailed tracking of successful service executions is superfluous, which is almost always the case.   On the other hand, "mass" errors, typically due to unavailability of a backend resource, will produce under this option a very large amount of logging information that is utterly redundant (since all service invocations failed for the same reason).

Performance logging (profiling)

Since logging entries (both audit and technical) are associated with a timestamp, it is obvious that logging can be used to collect performance statistics (profiling) for service operations.  The relevant log entries are only those logged at the audit logging points, for two reasons:
  • The technical logging points are normally too dense within a component
  • Technical logging in Production and during Performance Testing introduces an overhead which is normally unacceptable and that in any case will bias the profiling results (the act of measuring affects the measurements).
The main statistics (min, max, average, standard deviation, etc.) that can be collected are:
  • those related to the distribution of service operation invocations (by time period, consumer, etc.)
  • those related to the total service operation execution time, within the ESB component exposing the operation
  • those related to the execution time of composed services; these, combined with the total service execution time statistics of the services being composed, can provide statistics for message transit times between components (significant when the services being composed are dispersed geographically)



Log entry consolidation, correlation and sequencing

The diagram used above to illustrate the audit logging points shows a very simple situation where all logging points are reached sequentially within the execution thread of the (only) component.  In this case, correlation and sequencing of the produced log entries is trivial.

The situation changes a lot when we have to deal with complex service compositions, distributed ESB components, asynchronous interactions, clustered environments, and services orchestrated by BPM.

Even with a rather simple synchronous composition of ESB services, it is not guaranteed that these services run in the same thread:  a "local" service invocation in a clustered environment may execute the service in a cluster node other than the one where the caller service executes.

In general, anyway, we must assume that composed services are implemented by components that are distributed in the network, especially when they are invoked asynchronously via a message broker (for example, in the case of SOAP-over-JMS invocations) or when they are invoked as business process steps within a process instance.

Each component involved in a service execution creates multiple logging entries in the Logging Data Store, and all these entries must be mutually correlated and sequenced in order to give a clear picture to a person browsing the log.

An example, purposefully a bit involved, is given in the following diagram:


In the picture above a distributed service composition combines functionality from 4 different backend systems.
A main service (implemented on component 1) hosted on machine A directly or indirectly invokes three other services (one hosted on the same machine, and the other two on different machines).  The service hosted on machine C (component 4) is invoked asynchronously through a messaging queue.

Each service component performs logging at its audit logging points, and perhaps some components will also have technical logging enabled.   As a result, log entries can originate from different machines and must be correlated and consolidated into a single place (the Logging Data Store).  That, by itself, can pose a concern: we can realize upfront that a trivial logging implementation, like using a separate synchronous INSERT to a central database for every individual log entry, can be inefficient and hurt our service response times.
 
The Logging Data Store does not need to be physically centralized, but it needs to be centralized logically, allowing a User Interface to access all of its information no matter where it is physically stored.

In order for the audit and logging information to make sense to somebody who browses and analyses the Logging Data Store,  log entries originating from different components that relate, directly or indirectly, to the same invocation by the External Service Consumer must be:
  1. Logically grouped together
  2. Accessible as a whole in a location-independent fashion (i.e., even when not all related entries are stored in the same physical data store)
  3. Organized hierarchically based on service composition logic  (e.g., showing the execution of component 3 as a subordinate of the execution of component 2, which is in turn a subordinate of the execution of component 1)
  4. Sequenced correctly (regardless of deviation of the system clocks of the different machines).  This is necessary for proper profiling.
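As a sketch (field names are hypothetical), each log entry could carry three pieces of information that together address points 1 to 4 above: a correlation identifier shared by all entries triggered by the same external invocation, the identifier of the parent component execution, and a sequence number local to the execution:

  <logEntry>
    <correlationId>7c0df1a2-9b34-4d6e-8a55-2f4b6c8d0e51</correlationId>   <!-- same for every entry related to one consumer invocation -->
    <executionId>component2-exec-0042</executionId>                       <!-- identifies this component execution -->
    <parentExecutionId>component1-exec-0017</parentExecutionId>           <!-- builds the composition hierarchy -->
    <sequenceNumber>3</sequenceNumber>                                    <!-- ordering within the execution, independent of clocks -->
    <auditPoint>BACKEND_RESPONSE</auditPoint>
    <timestamp>2014-10-22T10:15:31.002Z</timestamp>
    <payload>...</payload>
  </logEntry>

The sequence numbers are local to each component execution, so combined with the parent/child relationships they yield a consistent overall ordering without relying on synchronized system clocks.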

Part 2 of this article will discuss techniques that help us achieve all of this without affecting (too much) the performance of our services.




Sunday, October 12, 2014

Message Payload Content Models: specific or generic (part 2)

This post continues the previous one (part 1), and presents further alternatives for modeling payload data structures for ESB events or services, using XML and XML Schema.

The second alternative of part 1 was called "Entity-specific definition and derived types", and relied on the xsi:type attribute for two purposes:
  1. To identify a restricted type derived from a generic entity-related XML Schema type (PurchaseOrderType in my example), which can be used for validation and other purposes
  2. To refine the semantics of the message in the context of the Business Object (the Purchase Order), for example specifying whether it represents a release, a cancellation, a modification, etc.
The payload looked like this:
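For a PO cancellation, for example, it was something along these lines (simplified, with illustrative values):

  <PurchaseOrder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:type="PurchaseOrderCancellationType">
    <PONumber>4500012345</PONumber>
    <CompanyCode>1000</CompanyCode>
    <CancellationDate>2014-10-10</CancellationDate>
    <CancellationReason>Ordered in error</CancellationReason>
  </PurchaseOrder>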

This approach has several benefits, but its biggest disadvantage is that, like the "Fully specific definition" approach, it still requires the definition of a large number of schema types:  the generic, "canonical" type (normally by Business Object), plus a potentially large number of restricted types (by BO/event type or by BO/service operation).

The next (more generic) approach to be examined still uses an entity-based "canonical" definition, but without any derived types.


Entity-specific definition with separate validation

We want to use the generic "canonical" type without the restricted schemas used under the previous approach.
Since we no longer use the standard xsi:type attribute (which was necessary to identify the restricted schema type), the schema must now define an explicit custom attribute to identify the semantics of the message.  In this example, the attribute is named eventType:
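A sketch of such a schema could be the following (simplified and without a target namespace; apart from eventType, the element names and the particular choice of optional elements are illustrative):

  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="PurchaseOrder" type="PurchaseOrderType"/>
    <xsd:complexType name="PurchaseOrderType">
      <xsd:sequence>
        <!-- elements populated in every PO scenario -->
        <xsd:element name="PONumber" type="xsd:string"/>
        <xsd:element name="CompanyCode" type="xsd:string"/>
        <xsd:element name="POType" type="xsd:string"/>
        <xsd:element name="VendorNumber" type="xsd:string"/>
        <!-- scenario-specific elements, all optional in the canonical type -->
        <xsd:element name="ReleaseDate" type="xsd:date" minOccurs="0"/>
        <xsd:element name="CancellationDate" type="xsd:date" minOccurs="0"/>
        <xsd:element name="CancellationReason" type="xsd:string" minOccurs="0"/>
      </xsd:sequence>
      <!-- custom attribute carrying the semantic context of the message -->
      <xsd:attribute name="eventType" type="xsd:string" use="required"/>
    </xsd:complexType>
  </xsd:schema>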

Below is a sample event payload conforming to this schema, for a Purchase Order cancellation event:
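For instance (values purely illustrative):

  <PurchaseOrder eventType="PurchaseOrderCancellation">
    <PONumber>4500012345</PONumber>
    <CompanyCode>1000</CompanyCode>
    <POType>NB</POType>
    <VendorNumber>100077</VendorNumber>
    <CancellationDate>2014-10-10</CancellationDate>
    <CancellationReason>Ordered in error</CancellationReason>
  </PurchaseOrder>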

As we can see, the message is practically the same as for the previous approach, just with a different attribute and without the XML Schema Instance (xsi) namespace declaration.

All considerations regarding mapping that apply to the previous approach also apply here.  All integration logic that has to do with purchase orders will use instances of the common PurchaseOrder document, mapping in each case those optional elements that pertain to the functional context at hand (the functional context identified through the value of the eventType attribute).

However, under this new approach, we no longer have the event-specific derived types.  In our example, we no longer have the PurchaseOrderCancellationType definition that enforces CancellationDate and CancellationReason as mandatory elements when eventType="PurchaseOrderCancellation".  The inability to validate the payload based on semantic context would be a serious functionality gap.

To fill this gap, we need ways to specify validation logic separately from the schema without writing custom validation code in the ESB layer (which is awkward and not very maintainable).  What we need is to specify constraints and validation rules in a declarative fashion.

In our example, the constraint is simple: the CancellationDate and CancellationReason elements must exist if the eventType attribute of the PurchaseOrder element has a value of "PurchaseOrderCancellation".   Nevertheless, there are several other types of constraints that would be useful for validation in a semantic context, and they cannot always be expressed using XML Schema.  They are called co-constraints in the literature.  Even more complex validation constraints can be those of an algorithmic nature (for instance, checking that for each order line the value of its GrossAmount element is the sum of the values of its NetAmount and TaxAmount elements).

Granted, one can easily get carried away and specify all kinds of complex business rules to be enforced, which brings us to the important debate on whether or not (and to what degree) it is appropriate to enforce business rules at the ESB layer (rather than at the level of provider applications and resources).  I'll leave this discussion for another blog post and focus on the simple need of validating mandatory elements by semantic event type, which is a totally legitimate need at the ESB level.

XML technologies come to the rescue to check co-constraints and algorithmic constraints.  Technologies like Schematron and Relax NG have been available for several years in the form of libraries that can be easily used by an ESB product.  However, the most common technology to implement this logic is XSLT (on which by the way most Schematron implementations are based).

A simple XSLT stylesheet to check the presence of the additional mandatory elements (CancellationDate and CancellationReason) for a PurchaseOrderCancellation event could be the following:
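A minimal sketch (XSLT 1.0; the name of the error element and the message texts are of course arbitrary):

  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>
    <!-- an empty output means that the document passed validation -->
    <xsl:template match="/PurchaseOrder">
      <xsl:if test="not(CancellationDate)">
        <validationError>CancellationDate is mandatory for a PurchaseOrderCancellation event</validationError>
      </xsl:if>
      <xsl:if test="not(CancellationReason)">
        <validationError>CancellationReason is mandatory for a PurchaseOrderCancellation event</validationError>
      </xsl:if>
    </xsl:template>
  </xsl:stylesheet>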

Its logic should be clear:  if the PurchaseOrder document contains the two elements, then the output resulting from the transformation will be empty (= validation OK), otherwise the output will contain one or more validation errors.
The ESB must, after parsing the document and  validating it against its canonical schema (which defines which elements are mandatory in all scenarios), dynamically invoke the appropriate validation stylesheet based on the runtime value of the attribute that defines the semantic context (eventType in our case).

The dynamic invocation of an XSLT transformation is simple to implement in most ESB tools and has the following advantages compared to using restricted XML schema types:
  • XSLT validation is much more powerful than XML schema validation to express co-constraints and can even express algorithmic constraints if really required
  • The stylesheets can be more compact than XML schema restricted type definitions (which must repeat all mandatory elements of the base canonical type)
  • XSLT stylesheets can be changed and deployed independently of ESB code and XML schemas (which in most ESB tools get "hardwired" into the ESB code at design time), bringing more flexibility to change validation rules without re-deploying whole ESB components
The disadvantages essentially consist of some additional complexity (the XSLT learning curve) and the performance hit of the XSLT transformation, but this approach is anyway a viable one in most cases.


Fully generic payload definition

In business environments where data structure definitions must be highly flexible (for example, in R&D departments),  people tend not to want to be bound to fixed schemas for message payloads, but instead favor a fully generic payload structure that can cover arbitrarily composed data entities.

It is true that XML Schema allows open-ended definitions via the <xsd:any> construct, but that has limitations (it cannot be preceded by any optional element definitions)  and is not supported by many ESB tools, so it is not considered here.

A very simple but extreme way to implement a generic data structure is to use a collection of key value pairs (equivalent to a hash map) qualified by a semantic context (eventType).  A simple definition of such a generic structure in XML Schema could be the following:
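For example (the element and attribute names are illustrative):

  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="GenericEvent">
      <xsd:complexType>
        <xsd:sequence>
          <!-- an arbitrary collection of key/value pairs -->
          <xsd:element name="Entry" minOccurs="0" maxOccurs="unbounded">
            <xsd:complexType>
              <xsd:simpleContent>
                <xsd:extension base="xsd:string">
                  <xsd:attribute name="key" type="xsd:string" use="required"/>
                </xsd:extension>
              </xsd:simpleContent>
            </xsd:complexType>
          </xsd:element>
        </xsd:sequence>
        <!-- the semantic context of the message -->
        <xsd:attribute name="eventType" type="xsd:string" use="required"/>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>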

A sample instance of this schema (again for the PO cancellation case) is:
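For instance (values illustrative):

  <GenericEvent eventType="PurchaseOrderCancellation">
    <Entry key="PONumber">4500012345</Entry>
    <Entry key="CompanyCode">1000</Entry>
    <Entry key="CancellationDate">2014-10-10</Entry>
    <Entry key="CancellationReason">Ordered in error</Entry>
  </GenericEvent>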

Now, the schema definition is no longer entity-specific, and the value of the eventType attribute must convey the complete semantic context of the message.

What are the implications of this approach?

On the plus side:
  • There is just one very simple canonical schema, which does not even need to be versioned (as opposed to any other type of schema)
  • All event and service interfaces just use the common generic schema
  • Mapping from instances is easy using XPath (looking just for the elements that are appropriate in the functional context and ignoring the remaining content)
  • Dynamic validation based on functional context can still be done using XSLT (or Schematron, or Relax NG), like for the entity-based approach above.  In XSLT we could have (note that now all mandatory elements must be validated via the stylesheet):
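A sketch along these lines, assuming the generic key/value structure shown above (the set of keys checked is illustrative):

  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>
    <xsl:template match="/GenericEvent">
      <!-- with a fully generic schema, even the "always mandatory" keys must be checked here -->
      <xsl:if test="not(Entry[@key='PONumber'])">
        <validationError>PONumber is mandatory</validationError>
      </xsl:if>
      <xsl:if test="not(Entry[@key='CompanyCode'])">
        <validationError>CompanyCode is mandatory</validationError>
      </xsl:if>
      <xsl:if test="not(Entry[@key='CancellationDate'])">
        <validationError>CancellationDate is mandatory for a cancellation event</validationError>
      </xsl:if>
      <xsl:if test="not(Entry[@key='CancellationReason'])">
        <validationError>CancellationReason is mandatory for a cancellation event</validationError>
      </xsl:if>
    </xsl:template>
  </xsl:stylesheet>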

On the other hand, the fully generic approach also has disadvantages:
  • Since service contracts and event definitions all use a common generic schema, they are not self-describing in terms of the data they exchange. 
  • Most importantly, components using a generic schema do not have message formats enforced at contract level, but everything is enforced at implementation level.   Service consumers, service providers, and event subscribers must be aware at runtime of what data model is coming their way.  In practice a version attribute is mandatory (in addition to eventType) so that a message recipient can apply the appropriate validation and mapping based on the combination of eventType and version
  • Mapping to a fully generic format is less convenient and cannot leverage the drag-and-drop mapping facilities available in virtually all ESB tools.  One needs to call a custom built map(key, value) helper method to add each KVP element to the payload being mapped
  • Mapping complex, hierarchical documents is more difficult.  In practice the easiest approach is to define hierarchical keys (e.g., under a PO document "/items[10]/productId" could identify the product ID of the 10th line of the order)
These points should be carefully taken into consideration before going with the fully generic approach: do you really need that degree of flexibility?


Conclusion

The choices made in the area of message modeling have a major impact on your SOA and your ESB designs.  Be sure to consider all pros and cons in the context of your enterprise (degree of agility required, business processes, existing data models, features of available tools).


Saturday, October 4, 2014

Message Payload Content Models: specific or generic (part 1)

When modeling payload data structures for messages, events, services interfaces, or API interfaces, there are some hard choices to be made, which have implications on simplicity, reuse, readability, performance, ease of validation, and ease of data mapping.
A very basic and important choice has to do with the specific vs. generic nature of the schemas we define.

A  payload schema that's highly specific defines the exact set of data fields required by a service or API signature, and in the case of an event it defines the exact set of data fields that are relevant for that event.
This being said, the actual degree of specificity of the payload schema depends on how semantically specific the service/API/event is:  there is a big difference, for instance, between a CustomerUpdated event and a CustomerBlockedForBadCredit event.  The content model for the first event may potentially include a very large set of customer master data elements (as it is not a semantically specific event), while the content model for the second could just contain the identification of the customer, and credit related data.

On the other hand, a completely generic schema would define just an open, flexible set of key/value pairs whose composition varies based on the service/API or event type it is used in, with no semantics implied by the schema definition per se.

This article (divided in two blog entries, of which this is Part 1) will consider a number of options, going from the most specific to the most generic side of the spectrum, using Purchase Order (PO) events as examples.  For example, we want to model on our ESB an event carrying the information that a certain purchase order has been released to be sent to the vendor, or another event signalling that the PO has been cancelled (before the vendor has shipped it).

In the examples of both Part 1 and Part 2 XML data models will be used, omitting the use of namespaces in order to make the examples simpler to read.  For the same reason, technical envelopes with control fields have been purposefully omitted, although real ESB architectures normally define custom envelopes.
However, element field names are standardized across the examples (as they should be, see article The Importance of Data Glossaries).
Despite the fact that I use XML and XSD here, high level considerations may apply to other formats as well. 

In this first part of the article, I present two alternatives, which both lean on the "specific" side, leaving the more generic alternatives to Part 2.


Fully specific definition

A straightforward approach to model a PO Release message would be to use a payload like this:
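For example (simplified, with illustrative values):

  <PurchaseOrderSendReleaseEvent>
    <PONumber>4500012345</PONumber>
    <CompanyCode>1000</CompanyCode>
    <POType>NB</POType>
    <VendorNumber>100077</VendorNumber>
    <ReleaseDate>2014-10-01</ReleaseDate>
  </PurchaseOrderSendReleaseEvent>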

The root element name in the event payload identifies at the same time the business object affected by the event and the nature of the event. This is very specific, although the payload does not include much data in the event message beyond the identification of the PO.



To enrich the event with detailed approval data (e.g., for each approval level:  approval date, approver user id, approver remarks) it would be necessary to call a query service (Content Enricher pattern).  The choice between information-rich event payloads and "essentials" events would be the subject of a separate discussion.

We would of course define an XML Schema  resource in our SOA repository that would include something like the element definition below:
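A sketch of that definition (the exact element list is illustrative):

  <xsd:element name="PurchaseOrderSendReleaseEvent">
    <xsd:complexType>
      <xsd:sequence>
        <!-- all child elements are mandatory for this specific event -->
        <xsd:element name="PONumber" type="xsd:string"/>
        <xsd:element name="CompanyCode" type="xsd:string"/>
        <xsd:element name="POType" type="xsd:string"/>
        <xsd:element name="VendorNumber" type="xsd:string"/>
        <xsd:element name="ReleaseDate" type="xsd:date"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>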

This is an excerpt from the complete XSD file, with type definitions simplified (in practice most of them would refer to separate simpleType definitions).

Here all child elements of PurchaseOrderSendReleaseEvent are mandatory, as we intend this information to be supplied by the event publisher for every single event of this particular type.  The schema could of course define additional optional elements, but nevertheless the content model is tailored to the specific combination of business object and event.
A similar schema for the request message of an update service operation could enforce that all required inputs are present for the operation, as they correspond to mandatory child elements.

Let's see how this approach fares against the criteria listed at the beginning of the article.

The approach is clearly extremely simple; it has very good readability and ease of validation (it is clear from the schema definitions which elements are mandatory in which functional context, and standard XSD validation can be readily applied).

The main drawback is that it scores low in reuse:  it is possible to share the simpleType definitions for the individual data elements but nothing more.  It is necessary to define and manage many schemas, due to the high number of Business Object / event combinations  (for events) and specific service operations (for services).
From the point of view of data mapping, mapping is straightforward, but many similar mappings necessarily need to be maintained, one for each individual schema.  Here, too, there is no reuse and maintenance is heavy.
Performance is good as only data related to the specific event is included in the document.


Entity-specific definition and derived types

A first step in the direction of having a more reusable payload is to use a common root element that only reflects the Business Object affected, while the event (or service operation) could be expressed via an attribute value.

With XML, we can take advantage of the standard xsi:type attribute of the XML Schema to indicate the event type (avoiding the use of a custom attribute for this), and we can have the two payload instances below for a PO release and for a PO cancellation, respectively:
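They could look like the following (values and derived type names are illustrative):

  <PurchaseOrder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:type="PurchaseOrderSendReleaseType">
    <PONumber>4500012345</PONumber>
    <CompanyCode>1000</CompanyCode>
    <POType>NB</POType>
    <VendorNumber>100077</VendorNumber>
    <ReleaseDate>2014-10-01</ReleaseDate>
  </PurchaseOrder>

  <PurchaseOrder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:type="PurchaseOrderCancellationType">
    <PONumber>4500012345</PONumber>
    <CompanyCode>1000</CompanyCode>
    <POType>NB</POType>
    <VendorNumber>100077</VendorNumber>
    <CancellationDate>2014-10-10</CancellationDate>
    <CancellationReason>Ordered in error</CancellationReason>
  </PurchaseOrder>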

Here, although the two payloads are composed differently, we would like to use a common schema for mapping and ESB broker publication, while retaining the possibility of validating each payload against a restricted schema that is specific to the event type.

The schema definition could contain a generic PurchaseOrderType, having as its mandatory elements only those elements that are populated in every possible PO integration scenario (e.g., PONumber, CompanyCode, POType, VendorNumber), and the other elements defined as optional, as they may be present in one particular scenario but not in others. 
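A sketch of such a canonical definition (simplified, without namespaces or versioning concerns):

  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <!-- mandatory in every PO integration scenario -->
      <xsd:element name="PONumber" type="xsd:string"/>
      <xsd:element name="CompanyCode" type="xsd:string"/>
      <xsd:element name="POType" type="xsd:string"/>
      <xsd:element name="VendorNumber" type="xsd:string"/>
      <!-- optional: present only in some scenarios -->
      <xsd:element name="ReleaseDate" type="xsd:date" minOccurs="0"/>
      <xsd:element name="CancellationDate" type="xsd:date" minOccurs="0"/>
      <xsd:element name="CancellationReason" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:element name="PurchaseOrder" type="PurchaseOrderType"/>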


NOTE: this being an example, only three optional elements are shown.  But in a real case there would be a very large number of optional elements.

From the above generic or "canonical" definition of purchase order, we can derive restricted schema definitions for each PO event or service based on the generic schema.  For example, for events SendRelease and Cancellation:
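A sketch of the two derived types, using XSD derivation by restriction (which optional elements become mandatory in each derived type is of course an illustrative choice):

  <xsd:complexType name="PurchaseOrderSendReleaseType">
    <xsd:complexContent>
      <xsd:restriction base="PurchaseOrderType">
        <xsd:sequence>
          <xsd:element name="PONumber" type="xsd:string"/>
          <xsd:element name="CompanyCode" type="xsd:string"/>
          <xsd:element name="POType" type="xsd:string"/>
          <xsd:element name="VendorNumber" type="xsd:string"/>
          <!-- ReleaseDate becomes mandatory for this event; the cancellation-specific
               optional elements are omitted, so they may not appear at all -->
          <xsd:element name="ReleaseDate" type="xsd:date"/>
        </xsd:sequence>
      </xsd:restriction>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:complexType name="PurchaseOrderCancellationType">
    <xsd:complexContent>
      <xsd:restriction base="PurchaseOrderType">
        <xsd:sequence>
          <xsd:element name="PONumber" type="xsd:string"/>
          <xsd:element name="CompanyCode" type="xsd:string"/>
          <xsd:element name="POType" type="xsd:string"/>
          <xsd:element name="VendorNumber" type="xsd:string"/>
          <!-- the cancellation-specific elements become mandatory for this event -->
          <xsd:element name="CancellationDate" type="xsd:date"/>
          <xsd:element name="CancellationReason" type="xsd:string"/>
        </xsd:sequence>
      </xsd:restriction>
    </xsd:complexContent>
  </xsd:complexType>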

Compared with the fully specific schema, this approach doesn't seem to have any advantages at first sight:  it still uses a restricted complex type definition for every PO event or service, plus the added generic complex type PurchaseOrderType.

However, in most ESB implementations (e.g., Software AG webMethods), it would be possible to perform mapping and broker publication against the generic (canonical) schema definition and additionally:
  • have broker subscribers based on the same generic schema with filters based on xsi:type
  • "downcast" an instance of the generic type into an instance of a restricted type via a single map operation (without having to map field by field, which is error-prone and inefficient)
  • dynamically validate a payload instance against a restricted type
  • extend the generic (canonical) type with additional optional fields with no  impact on the definition of the derived types,  the event / services definitions using the derived types, and the existing mappings between generic type and derived types
Thus, this solution is more flexible than the fully specific definition, while keeping similar levels of readability and performance on the wire.  Schema extensibility is good, as it is possible to decouple extended versions of a restricted type from extended versions of the corresponding base type (as long as the derived type version uses elements that are also defined in the base type version, of course).

The concern remains about having to manage a large number of derived schemas, but with an important difference.
The availability of the generic (canonical) schema definitions makes it much easier in this case to automatically generate the derived schemas based on simple specifications.  It would not be too difficult to develop a program that reads the XSD for a generic schema from the SOA repository, also reads some kind of data store (as simple as a property file or as sophisticated as a custom DB) holding the information regarding which elements are mandatory for which derived type, and thus automatically generates the schema definitions for the derived types in the repository.


Conclusion

The two alternatives presented here offer very good functionality for mapping and validation in exchange for a maintenance burden (taking also into account the practical necessity to version the schema definitions).

The upcoming part 2 of the article will describe alternatives that offer a different tradeoff.