Wednesday, October 5, 2016

A categorization for error handling and recovery

Proper handling of error situations is an absolute must for any production-grade integration solution.

This post attempts to categorize the significant cases that occur in the domain of non-BPM integration, considering both synchronous integrations (API, request-reply services, etc.) and asynchronous integrations (pub/sub, polling, etc.)

For each category we describe what features the integration solution should possess to achieve resiliency to errors and ease of error recovery (which is necessary for production support). This post does not discuss in any detail how these features can be realized in ESB tools:  these topics will be addressed in future posts.



Categorization criteria

First of all, the categorization criteria must be clearly presented.

Type of integration (synchronous vs. asynchronous)

Synchronous integrations (such as APIs) differ from asynchronous integrations in one fundamental aspect: the invoking client cannot wait long for a response.

This means that, in the case of temporary unavailability of a target resource, we generally cannot wait and retry after a delay while the client code hangs waiting for our response.
Waiting would create usability and performance issues on the client side, and the SLA of our API would probably be breached anyway.

A synchronous integration (i.e., one that follows the Request-Response exchange pattern) must in almost every case immediately return an error response to the client, regardless of the type of error.  Retrying is normally not an option unless the retries are attempted within a very short (sub-second) time interval, which makes them of limited use, since a transient issue is likely to last longer than that.

In case of a transient error (see below) it is important that the error response can clearly indicate that a retry by the client is desired, as the client always bears the responsibility for retrying the invocation of synchronous integrations.
For a REST API, the most appropriate HTTP status code for transient errors is 503 Service Unavailable, ideally accompanied by a Retry-After response header.
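As a minimal sketch of this (using JAX-RS purely for illustration; the retry delay and the JSON body are assumptions, not part of any prescribed contract), a transient backend failure could be surfaced to the client like this:

import javax.ws.rs.core.Response;

public class TransientErrorResponses {

    // Sketch: build a 503 response carrying a Retry-After hint for a transient failure.
    public static Response serviceUnavailable(long retryAfterSeconds) {
        return Response.status(Response.Status.SERVICE_UNAVAILABLE)
                .header("Retry-After", String.valueOf(retryAfterSeconds))
                .entity("{\"error\":\"temporarily unavailable\",\"retryable\":true}")
                .build();
    }
}

A well-behaved client can then schedule its own retry after the indicated delay instead of failing hard.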


Asynchronous integrations have the big advantage of looser coupling but involve complications, one of which is the frequent requirement of guaranteed delivery to the target resources.

Once our integration fetches a message from a source JMS queue and acknowledges it, or sends HTTP response status 202 Accepted to a REST client (just two examples), it accepts responsibility for delivery.
An async integration should always automatically retry updates in the face of transient errors (see below).

If the error is not transient, the source message cannot be dropped, but must be parked somewhere for later analysis and reprocessing.  The only exceptions to this rule are non-business-critical messages (advisory, statistics, etc.)

While for sync integrations error reprocessing is invariably initiated from the client or source system, for async integrations this is not necessarily true: the integration solution must often provide usable reprocessing capabilities.


Type of update

It is useful to distinguish three cases with respect to the type of target resources:
  1. Single resource
  2. Multiple resources (transactional) 
  3. Multiple resources (transaction not possible)
The first case is clearly the easiest: since there is just one (logical) target resource affected, the update is intrinsically atomic.  At worst the outcome of the update may be in doubt (see below).

If the integration updates more than one target resource, it may be possible to have a transaction span these updates.  This is typically the case when multiple tables must be updated in the same relational DB, or when we need to send to multiple target JMS destinations. 

In some cases, distributed (XA) transactions across heterogeneous resources may be viable (please bear in mind, however, that XA transactions can have adverse scalability and performance impacts).

A key advantage of transactionality is the automated rollback that takes place in the event of an error, ensuring state consistency across the target resources.   The only exception is the unlucky case of an "in doubt" transaction.

Finally, we have the trickiest case, in which it is not technically possible to encompass the multi-resource update within a single transaction.  In this case, if it is required to ensure mutual consistency across related resources at all times, the integration must attempt compensating updates to undo the partial update committed before the error occurred.


Type of error 

When discussing error handling, several categorizations are possible, for example:
  • "technical" errors vs. "functional" errors
  • "data-related" functional errors vs. "configuration-related" functional errors
  • "explicitly-handled" errors vs. "unhandled" errors 
  • "transient" errors vs. "non-transient" errors
Not all of these categorizations are equally useful.  In particular the technical vs. functional distinction is often fuzzy.
Some errors are clearly of a functional nature:
  •  Data-related:  errors due to incorrect or incomplete data in request messages;  any reprocessing with the same input is doomed to fail again,  and only a corrected source message can go through successfully;
  •  Configuration-related: errors due to incorrect or incomplete configuration in a target system (e.g., necessary master data missing) or in the ESB (e.g., a missing cross-reference entry); in this case resubmitting the same source message can succeed as long as the relevant configuration has been corrected/completed in the meantime
However, in addition to these cases there is almost always a "gray area" where it is difficult to classify an error as technical or functional.
To simplify things, many practitioners call "technical errors" only those errors that are due to unavailability or unreachability of a target resource (i.e., the same as a "transient error"), and say that all other errors are "functional".

For the sake of this discussion, I distinguish only three cases:
  1. Transient error:  the target resource is momentarily unavailable or lacks connectivity, but it is expected that this situation will not last long; we definitely know that the update we attempted did NOT take place due to the unavailability / connectivity issue;  these errors manifest themselves as transport exceptions at runtime.
  2. Non-transient error:  every error, other than the transient ones, where we definitely know that the update we attempted did NOT take place (typically because the target resource returned an error response).
  3. In-doubt error situations: the update attempt produces a TIMEOUT response but we are not 100% sure that the update did not go through.  Also "in-doubt" transactions fall into this category.
"Ordinary" integration solutions, which are not using automated diagnoses via a rule base or similar, are only able to automatically recover from transient errors by retrying the update periodically.
Indiscriminate automated retry of failed updates for any kind of error is a recipe for infinite loops whenever errors are systematic (as in the case of a data-related error: repeated submission of the same bad data is pointless).

Even in the case of transient errors, the integration normally does not retry indefinitely.  Most of the time a sensible retry limit is configured: beyond this limit, the error becomes non-transient.


Idempotency of target resources

Idempotency is the capability of any resource (target resource or integration endpoint) to recognize and reject duplicate requests or messages.

Idempotency is a precious asset for production support: it allows us to retry an update when in doubt without worrying about the risk of a duplicate posting.  If we happen to submit something twice, the resource will reject it, thus making the operation safe.

Technically, a target resource can implement idempotency by permanently storing a unique "update key" along with every committed update.  If a request comes in with an update key value that is already found in the store, then it is ignored as a duplicate (without raising an error).
This seems easy, but there is more to it than meets the eye: for the whole thing to work, the "contract" that we expose to our integration source(s) (e.g., API clients or source systems) must include a correct definition of our update key.
In fact, this concept and its correct implementation are so important that I will devote another blog post only to the subject of idempotency (plus the issue of "stale update rejection", i.e. rejecting out-of-date requests).
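As an illustration of the target-resource side, here is a minimal JDBC-flavored sketch; the table and column names are hypothetical, and the key assumption is a unique constraint on the update_key column:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class IdempotentTargetUpdate {

    // Sketch: the unique constraint on update_key makes a duplicate insert fail with a
    // constraint violation, which is silently ignored instead of being raised as an error.
    public void apply(Connection conn, String updateKey, String payload) throws Exception {
        String sql = "INSERT INTO bookings (update_key, payload) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, updateKey);
            ps.setString(2, payload);
            ps.executeUpdate();
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            // The update key is already stored: treat the request as a duplicate and do nothing.
        }
    }
}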

If a target resource is not idempotent, then it is possible to build an idempotency layer around it (or in front of it) in our integration logic (as described later in this post).



Error handling / recovery features in ESB

I illustrate in this section the different features for error handling and recovery that may be built into an integration solution.
This prepares for the next section ("Analysis of relevant cases"), which explains which capabilities are necessary based on the categorization criteria defined above.

It is worthwhile noting that these features need to be applied consistently to many integrations and are therefore to be treated as "cross-cutting concerns" (in AOP parlance).
The integration / ESB platform must include shared components and utilities to support these features consistently across the integration landscape.


Automated update retry

Automated retry logic can attempt an operation (we are talking about updates here, but it is applicable to queries as well) until it succeeds, or up to a set maximum number of times.

Update attempts are separated by a configured time interval, and sometimes it is possible to have retry intervals become longer at each unsuccessful retry (exponential backoff, for example retries after 5s, 10s, 20s, 40s, etc.)

As stated earlier, automated retries are normally applied only to errors that are surely transient (in practice, only to transport errors).  If the configured maximum number of retries is exhausted, the error becomes non-transient and must be handled as such.
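A minimal sketch of such retry logic follows; the TransientException type is a stand-in for whatever transport-level exception the integration platform raises, and the doubling backoff is just one possible policy:

public class RetryingUpdater {

    // Illustrative unchecked exception representing a transport-level (transient) failure.
    public static class TransientException extends RuntimeException {
        public TransientException(String message, Throwable cause) { super(message, cause); }
    }

    // Retry the update only on transient failures, doubling the delay each time
    // (e.g., 5s, 10s, 20s, ...); once the attempts are exhausted, the error is rethrown
    // and must be handled as non-transient.
    public void updateWithRetry(Runnable update, int maxAttempts, long initialDelayMillis)
            throws InterruptedException {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                update.run();
                return;                         // success, no further attempts
            } catch (TransientException e) {
                if (attempt == maxAttempts) {
                    throw e;                    // retries exhausted: now a non-transient error
                }
                Thread.sleep(delay);
                delay *= 2;                     // exponential backoff
            }
        }
    }
}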

In some cases it is justified or even required to build logic to identify additional standardized error types for which recovery can be automated as well.  This is done by matching error responses against a configured "error rulebase".
This more sophisticated approach also requires that a detailed audit log be kept of all integration actions (the original update attempt and all retries).  Otherwise it will be difficult for support staff to figure out what happened if something went wrong and the automated logic could not recover.

 

Idempotency at ESB level

As stated earlier, if a target resource is not idempotent, it is possible to build idempotency into the integration. 

For each incoming request message, a functionally suitable unique "update key" must be extracted from it.  The update keys are held in a persistent "duplicate detection" store with fast read access (a conventional RDBMS may be used, but a NoSQL solution such as Redis is a better fit).

At each execution of the integration, the duplicate detection store is (efficiently) checked for the existence of the update key.  If the key is found, the update is skipped; otherwise the key is inserted into the store and the update is attempted.  If the update fails, the entry is removed from the duplicate detection store; if the update succeeds, the entry stays in the store for a "reasonable" period of time (beyond which the chance of duplicates becomes acceptably low).

In-doubt updates need special treatment: if we are not sure whether the update went through or not, then the entry in the duplicate detection store (for the update key) must be marked as "in-doubt".  When a new update request later comes in for the same update key, the target resource must be queried to ascertain whether the original posting actually went through, in order to decide whether to repeat the update or not. 

Duplicate detection stores always have a retention period so old entries are purged. The advantage of a data store like Redis is that there is no need for a periodic purge procedure since entries can be inserted with an expiration period enforced by the DB itself.
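To make the flow concrete, here is a minimal sketch; the in-memory map stands in for the persistent duplicate detection store (Redis or an RDBMS in a real solution, where the retention period would be enforced by the store itself), and the "in-doubt" marking described above is omitted:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DuplicateDetectionFilter {

    // In-memory stand-in for the persistent duplicate detection store.
    private final Map<String, String> processedKeys = new ConcurrentHashMap<>();

    public void process(String updateKey, Runnable update) {
        // Check-and-insert in one step: if the key is already present, skip the duplicate.
        if (processedKeys.putIfAbsent(updateKey, "DONE") != null) {
            return;
        }
        try {
            update.run();                       // attempt the actual target update
        } catch (RuntimeException e) {
            processedKeys.remove(updateKey);    // failed update: allow later reprocessing
            throw e;
        }
        // On success the key simply stays in the store until its retention period expires.
    }
}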


Transactionality

If an update must encompass multiple resources and transactional update is possible, then transactions should be used by the ESB to ensure consistency of state.

This guideline always holds when dealing with "homogeneous" resources (e.g.,  tables in the same DB,  JMS destinations in the same JMS provider, etc.).  If, however, distributed (XA) transactions would be required, then the data consistency advantages that they bring should be carefully weighed against complexity, performance, and scalability considerations.
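For the homogeneous case, a minimal JDBC sketch (table names and statements are purely illustrative) shows the pattern: both updates are committed together or rolled back together:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class HomogeneousMultiResourceUpdate {

    // Two tables in the same database updated within one local transaction.
    public void updateBoth(Connection conn, long orderId) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement updOrder = conn.prepareStatement(
                     "UPDATE orders SET status = 'SHIPPED' WHERE id = ?");
             PreparedStatement insEvent = conn.prepareStatement(
                     "INSERT INTO order_events (order_id, event) VALUES (?, 'SHIPPED')")) {
            updOrder.setLong(1, orderId);
            updOrder.executeUpdate();
            insEvent.setLong(1, orderId);
            insEvent.executeUpdate();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();                    // automated rollback keeps both tables consistent
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}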

Lastly, it is important to remark that using transactions does not remove the need for idempotency because:
  • It is still a good idea to guard against the risk that a client or source system sends a duplicate request by mistake, even if the original request was processed correctly
  • In the case of an "in-doubt" transaction we must be able to reprocess without concerns

 

 Compensation

As mentioned above, compensation logic is required when we need to keep related resources mutually consistent at all times and transactions are not an option.
This is far from an ideal solution because, among other things:
  • compensating updates (or "compensating transactions" as they are often called) add complexity and may in turn fail; if the original update failed for technical reasons then compensation will also most likely fail;
  • many systems of record (such as ERP systems) cannot delete a committed transaction such as a financial booking, so the cancellation must create an offsetting posting that exists purely for technical reasons
Therefore, compensating transactions are often avoided provided that the update that's "missing" due to the failure can be made to succeed within a reasonable time (via error reprocessing).
In other words: if the purpose of our integration is to update resources A and B in a coordinated fashion, and only resource B could not be updated, then our goal is to update B as soon as we can, as long as the temporary inconsistency (A updated, B not yet) does not have a real business and/or regulatory impact.  We should use compensation only when even a temporary data inconsistency is unacceptable.


Error notification 

Generation, routing, and dispatching of error notifications or alerts are part of basic integration infrastructure, and the way they are implemented must of course fit into the organization's Incident Management process.

Please note that error notification is a functionality that is logically separate from logging (the latter is always necessary for both sync and async integrations).   From a technical standpoint,  error notification is frequently triggered from the logging logic (typically based on the severity of the logging entry), but logically it is distinct due to its "push" nature.

Many implementations are possible among which:
  • direct notification to a support group (via old-fashioned email or a more modern channel such as Slack)
  • indirect notification through error logging and delegation to a log analysis / operational intelligence tool like Splunk
  • direct injection into an incident management / ticketing system  
Each of these approaches has pros and cons (which may be discussed in a future article), but for sure we must have the notification functionality covered in some way.

For synchronous integrations, error notification is limited in most cases to returning a useful error response to the client.  Good practices are easily accessible online for this (see for example the error response of the Twilio REST API).  Notifying individual errors to Support is not normally done for synchronous interactions, as it is the client's responsibility to act on every error.
Nevertheless, logging and monitoring are key to immediately identifying technical problems that affect the health of our APIs and their adherence to SLA's.  API gateway products can really assist here.

With asynchronous integrations, every non-transient error normally triggers an error event that must be suitably routed and dispatched.  Many criteria can be used for routing and dispatching such error events, for example:
  • the identity of the integration
  • organizational information in the context of the integration that failed (e.g.,  company code)
  • the priority of the source message (regardless of how it is determined)

Sophisticated error handling logic may be able to generate a single notification for a burst of errors of the same type that occur within a given (short) period of time ("error bundling").  This kind of logic is essential if we want to feed error events directly into our ticketing system and avoid flooding it with a large number of identical errors as a result of an outage. 
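One simple way to approximate such bundling is to suppress repeat notifications for the same error type within a configurable window; the sketch below (with a hypothetical error-type key) illustrates the idea, although a complete implementation would also count the suppressed occurrences and attach them to the single notification:

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ErrorNotificationThrottle {

    private final Map<String, Instant> lastNotified = new ConcurrentHashMap<>();
    private final Duration window;

    public ErrorNotificationThrottle(Duration window) {
        this.window = window;
    }

    // Allow at most one notification per error type within the bundling window.
    public boolean shouldNotify(String errorType) {
        Instant now = Instant.now();
        Instant previous = lastNotified.get(errorType);
        if (previous == null || Duration.between(previous, now).compareTo(window) >= 0) {
            lastNotified.put(errorType, now);
            return true;
        }
        return false;
    }
}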

 

DLQ and Error Recovery

Most asynchronous integrations are "fire-and-forget" from the standpoint of the source system: if a message can be successfully sent, the source system assumes that the integration logic will take care of the rest.  No response is expected.

Even when asynchronous ACK messages are interfaced from the ESB back to the source system,  they are mostly used for audit only.

Since in the systems integration domain we are mostly dealing with business critical flows,  every single non-transient error (including transient errors for which retries have been exhausted) must be notified (as explained above) and "parked" for later error analysis and recovery.

All information for error analysis and recovery must be placed in a persistent store that is commonly called a Dead Letter Queue (DLQ), although it does not technically have to be a queue.

Some considerations:
  • it must be possible for support staff to browse the DLQ
  • it must be possible to quickly find the DLQ entry from the contents of an error notification (which ideally should contain some kind of URL pointing to the DLQ entry)
  • the DLQ entry should link to all available information needed to diagnose the error, including logging entries created during the execution that led to the error (see my earlier blog post on logging for more info)
  • in order to be able to replay a failed integration, the complete runtime context must be available in the DLQ entry, not just the message payload.  For example, an integration invoked via HTTP normally needs the URI, query parameters, request headers, etc.: just re-injecting the HTTP body will not work in general.  An integration triggered through JMS will need the JMS headers, etc. (a sketch of such a DLQ entry follows this list)
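As an indication of what such a DLQ entry could carry (the field names below are illustrative, not a prescribed schema):

import java.time.Instant;
import java.util.Map;

// Sketch of a DLQ entry holding the full runtime context needed for a replay,
// not just the message payload.
public class DlqEntry {
    public String integrationName;                 // identity of the failed integration
    public String correlationId;                   // links the entry to the related logging entries
    public Instant failedAt;
    public String errorSummary;
    public byte[] payload;                         // original message body
    public Map<String, String> transportHeaders;   // e.g., HTTP request headers or JMS headers
    public Map<String, String> transportContext;   // e.g., HTTP URI, query parameters, JMS destination
}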
The topic of error recovery as part of Incident Management is not a trivial one and cannot be be fully covered here.  However, a few points can be noted:
  • As already stated, the reprocessing of a synchronous integration can only be initiated by the client which originally made the failing invocation.
  • In non-BPM integration solutions, it is practically impossible to resume an integration execution halfway, as there is no way to resume execution from a reliable persisted state.  This possibility does exist, however, when using process-like ESB tools such as TIBCO BusinessWorks (with proper checkpointing implemented).
  • For asynchronous integrations, reprocessing on non-transient errors can in theory occur either from the source system or from the ESB's DLQ.  However, the correction of data-related errors requires submitting a corrected (altered) source message and this should not be the responsibility of the team supporting the ESB (no manual manipulation of production messages should be allowed!)  Therefore, the reprocessing of such errors needs to be initiated in the source system.
  • Idempotency is a necessary pre-requisite for reprocessing when partial updates have occurred that could not be rolled back or compensated for. Example: if an error was thrown on updating target resource B after resource A was successfully updated, then it is not possible to replay the integration execution as a whole unless we can rest assured that no duplicate update of A will be made.
  • The risk of "stale updates" is always present when reprocessing master data synchronization messages.  Example: if a message synchronizing product data for product X fails (for any reason) and is parked in the DLQ, we should not reprocess this message if more up-to-date messages were successfully interfaced for the same product, otherwise we would be updating the target with old data. 

 

Analysis of relevant cases

This section matches the categorization of integrations / resources / errors with the necessary error handling / recovery features. 

To make things more understandable, the following decision tree can be used:

The coding clearly shows which features are required in each case.

The rationale can be inferred from all the discussion presented so far, but it is still worthwhile emphasizing the key points for synchronous and asynchronous integrations.

 

Synchronous integrations

Automated update retry capability on transient errors is normally not applicable (reason: most calling clients cannot afford to hang and wait for seconds for a response without serious repercussions on their side).

No DLQ is necessary as reprocessing must be initiated from the source.
Logging and monitoring are necessary although error notification is normally not done in addition to the required error response to the caller.

Idempotency (whether supported natively by the targets or added in the integration layer) is always necessary when the integration updates multiple resources and transactions are not possible.
Even in the other cases, idempotency is useful when "in-doubt" situations arise.

Finally, compensating actions are to be applied only when transactions are not possible and, at the same time, it is mandatory that target resources keep a mutually consistent state at all times.

Asynchronous integrations

As is apparent from the figure above, asynchronous integrations require as a rule a larger set of features related to error handling and recovery.

The first obvious difference with the synchronous case is the use of automated retry logic to mitigate transient errors.

The second big difference is the necessity of some form of (more or less sophisticated) error notification to alert the appropriate actors (support, key users) that something went wrong.

Thirdly, Dead Letter Queue functionality is often put in place to allow repeating the execution of a failed integration with the original input data ("resubmission" or "reprocessing").  Such functionality can only be forfeited if we decide at design time that all reprocessing will always occur from the source system, but it is still advisable to have the DLQ resubmission option available.

Finally, the considerations about idempotency and compensating transactions essentially stay the same as in the sync case.


Conclusions

Hopefully, this article will help  practitioners rationalize integration design decisions based on the technical and functional context at hand.

However, since the integration landscape for medium-sized and large enterprises is virtually guaranteed to be wide and varied, it is highly advisable that we endow our integration platform with all key capabilities (automated retry, idempotency, transaction support, error notification, DLQ, etc.) in a generic and reusable fashion.
This way, we will be able to easily build a given capability into an integration whenever we need it.

Future posts will address how some of these capabilities can be implemented with ESB tools.



Tuesday, September 15, 2015

Structured Mapping Specifications

Data mapping and transformation constitute a key part of the vast majority of integration solutions.
Notable exceptions are the cases where the same data model is shared across the whole integration, from service consumers to backend systems (a case not very frequent in practice), and the case of pure file transfers.

In virtually all other cases there is always the need to map source messages into target messages.

A high-level view of mapping

Below are a few definitions and assumptions to define the context of this article.

A Message is a tree-structured data container modeled after the Message Enterprise Integration Pattern.
All practical message formats (XML, JSON, flat file, EDI file) can be parsed into a tree-structure representation.

A Simple Message Mapping
  •  transforms exactly one source message into exactly one target message (1-1 mapping)
  •  can achieve both structural and value transformations (value transformations can include any function including value computation and value cross-referencing, for example)
  • only uses information that is present in the source message or in the integration environment (e.g., timestamp generation, UUID generation, use of environment properties, etc.)
A Complex Message Mapping can be of these different kinds:
  • 1-N mapping: multiple target messages are produced from the same source message. This case is trivial as it can be simply expressed as the "parallel" combination of multiple simple (1-1) message mappings.
  • N-1 mapping: in this case multiple source messages are transformed into a single target message.  The source messages can result from a message join or correlation (typically in a BPM solution), but can also include secondary source messages produced as output by data enrichment functions that are invoked based on a "main" source message as input.  Without analyzing this case in detail (it may be the subject of a subsequent post), one can intuitively grasp that such an N-1 mapping can be implemented as a "network" of 1-1 mappings (some of which provide the inputs for one or more Data Enrichers).  Once the source messages are defined, a mapping rule can be expressed by qualifying each source field with the source message it belongs to.
  • N-M mappings:  these are the most general, as they produce M target messages from N source messages; they can be defined as a set of N-1 mappings. 
The rest of the article discusses ways of specifying  simple message mapping only, leaving the topic of complex message mapping specification to a future post.


Mapping specification issues

Traditionally, mapping specifications are delivered to integration developers by analysts in the form of mapping spreadsheets (the so-called Mapping Sheets), and integration developers need to figure out how to best implement the specification using a combination of tools, based on the integration platform they are working on (tool-specific visual mapping configuration, Java code using mapping utility classes, XSLT sheets, etc.)

Almost always, these mapping specifications are ambiguous to a certain degree. Thus, they require multiple subsequent clarifications with the analyst and/or they are misinterpreted by the developer (so it turns out during integration testing that the implemented mapping was not what the analyst intended and needs to be reworked).

This article suggests some ways to improve on this situation.

Let's look first at the way mapping sheets are generally structured, and some reasons why they are ambiguous (I do not have the ambition to be exhaustive here, as people can be very creative when it comes to being unclear with their specs).
What we generally find are Excel sheets structured in three sections, from left to right:
  • Source: specifies the source field for the mapping  (generally via some XPath-like notation)
  • Transformation: specifies the transformation logic (if any, beyond simple value copy) necessary to obtain the value for the target field from the value of the corresponding source field
  • Target:  specifies the target field for the mapping  (again, mostly via some XPath-like notation)
This structure is more than adequate for simple cases, where the source and target messages are simple records (without repeating sub-records) and each target field is obtained either from a single source field or is hardcoded (hardcoding is usually expressed by specifying the fixed value in the Transformation section and leaving the Source section empty).

However, as we need to handle more complicated message structures and more sophisticated mappings, some limitations become apparent.

Source-oriented mapping specs are not convenient

Many analysts build their mapping sheets by listing all source message fields (in the "topological" order in which they appear in the source message) and then add the corresponding target message fields with transformation information on the right.    This is one of the worst possible things to do.
The purpose of data mapping is to generate target messages, and the developer will have to decompose the implementation of any nontrivial mapping based on the structure of the target message.
Structuring the mapping sheet according to the source message makes things difficult for the developer, who will likely find the information needed for mapping a target structure scattered across the mapping sheet.   When the same source field, for example a currency code, is mapped to different target fields in completely different places within the target message structure, grouping these mapping lines by source field is really confusing and counterproductive.

Multi-input mappings not expressed cleanly

Whenever a target field needs to be mapped from more than one source field (even if the mapping logic is something as simple as a string concatenation or a sum), then expressing the mapping as Source field - Transformation - Target field is inadequate.   People usually end up specifying one of the mapping inputs as the source field and the others in the Transformation rule (in an unstructured way).  Therefore, it is not immediately clear to the developer which is the complete set of inputs that go into the mapping.

Ambiguity when mapping repetitive structures

Only the simplest mappings do not involve repeating data structures.  Most business documents contain lists of items and quite often there are nested lists (for example, an invoice can contain multiple invoice items, and each invoice item normally contains a list of item-level references and a list of item-level discounts and/or surcharges).
This is probably the area where mapping specifications are most ambiguous, as it is often not clear from mapping sheets what drives a repetitive mapping, and consequently over what the mapping code needs to loop in order to implement it.
In most cases, the repetitive mapping must iterate on a list in the source message (i.e., we have to map one target structure instance for each instance of a source structure), but that is often not well expressed in the mapping sheet, which may not contain clearly marked sections for repetitive mappings.  In some cases, XPath-like expressions used in the sheets do not denote which structure is a sequence (e.g., /SalesOrder/Items/ProductCode instead of /SalesOrder/Items[]/ProductCode, with the developer expected to know that Items is a sequence).

Ambiguity when using filtering conditions

Filtering conditions in mapping specifications come essentially in two forms:
  1. Conditional mapping condition:  a precondition for the mapping to be executed at all  (if it evaluates to false, then the specified target remains unmapped).  For example, within an invoice mapping, the mapping of a "related invoice reference" is only performed if the billing document is not a normal invoice but rather a credit or debit note that references an earlier invoice.
  2. Sequence filtering/selection conditions:  within the mapping of a repeating structure (i.e., a list of strings, records, or documents), we map only from the instances in the list that satisfy a certain condition.  For example, when mapping Ship To party information from an EDIFACT document, we normally have to map from the particular instance of the NAD (Name And Address) segment whose qualifier value equals "DP" (Delivery Party).  The filtering condition can result in zero, one, or more target structure instances being mapped. 
In many mapping sheets, these two types of conditions are not clearly distinguished.



Principles for better mapping specifications

This section gives some advice for improvement, so that the developer has an easier time implementing the mapping as specified.

Target orientation

Since the goal is to produce target documents at runtime, it is logical to start a mapping sheet by listing, on its left side, the structures and fields within the target message that need to be mapped.

Do not list all the existing target message fields, including those that do not need to be mapped: this will lead to an unreasonably long list of fields which is unwieldy and distracting (especially when dealing with huge standards-based message schemas).  Keep the mapping sheet as concise as possible.  Using mapping sheets for the additional purpose of documenting data structures is not good practice:  data structures should be documented separately, ideally in a Data Glossary.

Since the target message will in most cases contain sub-structures, group the target fields into sections within the mapping sheet (one section per substructure), in the same order in which each field and substructure appears within the target schema definition. This is in preparation for the step below (identification of repetitive structures).
It is good practice to leave one blank row between adjacent sections in the mapping sheet for better clarity.

Top-down mapping decomposition

Top-down decomposition is a fundamental principle in all engineering activities.  As applied in this context, the mapping of a complex target message needs to be decomposed into sub-mappings, namely the sub-mappings for the main sub-structures within the target message.
For example, the mapping of an EDI message (definitely a complex structure) must be decomposed into the mappings of its segments and sub-segments (only those in mapping scope, of course).

This decomposition, driven by the structure of the target message, should be explicit already in the mapping sheet.  Even if the developer will not strictly adhere to it in his/her implementation (typically for performance optimization reasons), it will provide a clear, structured view of the mapping.

Clear identification of multi-input mappings

Conceptually, one can take the "functional" view that a target data structure is a function of the source message, and more specifically of a defined set of source fields.
The simplest and most common case is the 1-1 assignment, where the value of a target field is a function of just one source field, and the mapping function (transformation) is a simple copy.

In general, however, a target field value can be a function of more than one source field value (T=f(S1, S2, ..., Sn)).  It is very important to group the field paths of the source fields (on the right side of the mapping sheet) under the target field whose mapping they contribute to (that target field path will be on the left side of the mapping sheet).

Each such grouping forms a small mapping subsection in the mapping sheet, which contains the specification of the mapping function or logic to be applied to the mapping inputs (e.g., replacement in string pattern, sum, maximum, etc.) in the Transformation column.
To avoid ambiguity, when the transformation logic operates on multiple source fields, it should refer to these fields as {1}, {2}, ... where the number in braces represents the order of the source field rows (from top to bottom) within the mapping subsection (not the complete mapping sheet).
Conceptually, each single mapping corresponds to a mapping subsection with the paths of all the necessary source fields listed in separate rows on the RHS of the subsection.   Each mapping subsection must be visually marked in some way (for example using cell borders), to clearly show which set of source fields participates in the mapping. 
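For example (field names purely illustrative): if the target field /Invoice/PaymentReference is built by concatenating a company code and a document number listed as the first and second source rows of its subsection, the Transformation cell could simply read CONCAT({1}, "-", {2}).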

In some cases, one of the mapping inputs could be a target field that has already been mapped, rather than a source field.  This is mostly done when the mapping logic to obtain a value used multiple times is complex and should not be repeated. A mapping sheet can cope with this practice by specifying as source field something like target:/<path to target field already mapped>.  It is better practice, however, to factor out complex mapping logic into "complex mapping functions" (see subsection below), and just reference these functions multiple times in the mapping sheet, each time clearly specifying the inputs to be passed, the latter being strictly source field paths.

If the same source field contributes to the mapping of multiple target fields, then it will have to appear multiple times in the mapping sheet (on the RHS), while a target field should normally appear only once on the LHS of the sheet.

Identification of repetitive structures, repetition triggers, and scopes

Having grouped (as mentioned above) the target fields by the structures they belong to within the target message schema, it is necessary to identify which structures are repeating within the target schema (in general, their multiplicity), and then what drives the repetitions.

Sometimes there is a fixed number of repetitions that is known in advance.  Keeping to the EDIFACT invoice example, we may have to map within the EDI message five instances of the NAD (Name And Address) segment: for the Seller party, the Sold-To party,  the Ship-To party,  the Invoice Recipient party, and the Payer party.
Here there would be five separate mapping sections for the /INVOIC/NAD[] repeating structure, each mapping exactly one instance of the NAD segment from different source data.
Each section should be preceded by the identification of the instance that is being mapped, via an index.  Example:
/INVOIC/NAD[0]   ---> mapping of NAD instance from Seller party
/INVOIC/NAD[1]   ---> mapping of NAD instance from Sold-To party
/INVOIC/NAD[2]   ---> mapping of NAD instance from Ship-To party
etc.
More frequently the mapping of a repeating structure in the target message is driven by a corresponding repeating structure in the source message.  For example, for each invoice line item in the source message we need to instantiate and map a corresponding line item in the target message (in the case of an EDIFACT message, a LIN segment).  The instances of the source repeating structure constitute the repetition triggers for the mapping of the corresponding instances of the target  structure. 
If the source sequence is empty, there is no trigger, so no mapping occurs.

This type of mapping logic is sometimes called "push mapping": the occurrence of structure instances within the source message "pushes" the generation of corresponding structure instances in the target message.  Conversely, "pull mapping" is the way of working discussed above that expresses a given target data item as a function of one or more source data items.
Push mapping is implemented either by looping on the instances of the repeating source structure (via an explicitly programmed loop or <xsl:for-each> in XSLT),  or by some type of pattern matching rule (e.g., <xsl:template match ="..."> in XSLT).
It is key that the sections for the repeating structures are clearly identified within the mapping sheet, with the target sequence on the leftmost column (along with all target fields) and the corresponding source sequence (repetition trigger) in one of the middle columns (to the left of the source fields).
The path to the repeating structure must be in the first row of its section, ending with index-less brackets ([]) to explicitly denote that we are dealing with a sequence that is to be iterated on in a repetitive mapping.   In addition, the rows for repeating structures should be marked visually via a different background color.

Within a repetitive mapping,  we normally map many fields and sub-structures, so we must identify the source and target mapping scopes.   The source scope is the particular instance of the source repetitive structure that is being mapped from in the given iteration.  The target scope is the particular instance of the target repetitive structure that is being mapped to in the given iteration.
In many cases, values are being mapped from the source scope to the target scope (e.g., from canonical document line items to application specific document line items).
In addition, some repetitive mappings can be from outside the source scope, because they are hardcoded or are from fields that are not part of the source scope (a typical case is the mapping of a currency code to all lines of a target message from the same field in the header section of the source message).
Finally, repetitive mappings can be nested (e.g., a discount and surcharge sequence within an invoice item sequence), thus leading to nested source and target mapping scopes.
Examples:
  • /INVOIC/LIN[] denotes the repetitive LIN structure within the INVOIC message  
  • /INVOIC/LIN[]/QTY[]  denotes the repetitive QTY structure within the repetitive LIN structure within the invoice message (nested scope). 
  • /INVOIC/LIN[0]/QTY[]  denotes the repetitive QTY structure within first instance of the LIN structure within the invoice message  
Note: one could correctly argue that the message root (/INVOIC) is the outermost scope, so every scope within it is a nested scope.
Normally, within each scope, we have to map multiple fields, so we need to specify source and target field paths for these mappings.  Writing out the complete, absolute field paths that start from the message root (e.g., /INVOIC/LIN[]/QTY[]/C186/6060) provides better clarity, although it carries some redundancy, as it repeats the path of the containing repeating structure (/INVOIC/LIN[]/QTY[]).
Note that the brackets in a field path such as /INVOIC/LIN[]/QTY[]/C186/6060 do not denote a whole sequence, but just the current instance of the sequence within the mapping scope.
To completely avoid ambiguity in this respect, one can use a dot notation to indicate "current instance", as in /INVOIC/LIN[.]/QTY[.]/C186/6060.   With this notation, /INVOIC/LIN[.]/QTY[] would represent the complete QTY sequence within the current line item (LIN) of the EDIFACT invoice.
Providing field paths relative to the current scope (e.g. just C186/6060 in this example) is more concise but forces the developer to "reconstruct" that complete field path. 
Except in cases where names are especially long and structures nested very deeply, using the complete path for each field is probably the best option.

Clear distinction between conditional mapping and sequence filtering

In the mapping sheet, there should be separate columns for Mapping Condition and Filtering Condition.
  • Mapping Condition: condition that must be satisfied in order for the mapping to be executed at all.  This can apply both to simple field mappings (e.g., map only if the source value is not 0) and to repetitive mappings (e.g., map the target sequence only if another source field, not part of the source sequence, meets a certain condition)
  • Filtering/Selection Condition: condition that must be satisfied in the current source scope of a repetitive mapping in order for the current source scope (in other words, the current instance of the source sequence) to be used in the mapping.  For example, when mapping the Sold-To customer from an EDIFACT inbound sales order (ORDERS) message, we need to select the NAD segment to map from based on its qualifier value being equal to "BY" (= Buyer = Sold-To party).  The column for such filtering/selection conditions can be populated for any field mapping, not just in "sequence" rows.
Quite intuitively, empty mapping or filtering/selection conditions indicate that the mapping is executed unconditionally.
However, to reduce overhead in mapping sheets, the mapping condition "map only if the source value is populated" is normally implicit.  If a null source value is a possible situation, distinct from the "source value not populated" situation, then such a mapping condition may have to be explicitly spelled out.

The source fields tested by a condition must be listed together with the source fields used in the actual mapping on the RHS of the sheet, in the mapping subsection of the target field.  As in the Transformation logic, when writing conditions one must uniquely identify each input value via the {n} notation (denoting the n-th source field within the mapping subsection for the target field).
Note that if the condition is very complex, it may be expressed via a Complex Mapping Function (see below).

Isolation of Complex Mapping Functions

When the mapping logic is complex and/or needs to be reused in multiple places within a mapping sheet, it is good practice to factor out this logic by defining a named Complex Mapping Function with zero or more inputs and one or more outputs (usually one).
This keeps the clutter in the Transformation column of the sheet to a minimum and avoids redundancy when the logic is applied more than once.
Having specified the function, the analyst just has to reference its name (and the name of the output, if the function produces more than one) in the relevant cells in the Transformation column. If the function has more than one output, a notation such as <output name>=<function name>(...) should be used to indicate which output is used for the mapping.
Defining multi-output functions is useful to reduce redundancy where the logic would be largely shared across two or more functions (e.g., there is a complex procedure to determine a first target value, and then a second target value is obtained by some computation from the first).

In case of multiple inputs, the mapping function should be defined with multiple named input parameters that must be associated with corresponding source field paths (the "actual" parameters) in the RHS of the mapping sheet.  This is a case of multi-input mapping as described above, so the transformation would be expressed as follows:
[<output name>=]<function name>(<parameter name 1>={n}, <parameter name 2>={m}, ...)
where again {n} and {m} represent the source fields in the n-th and m-th position within the mapping subsection.  The part [<output name>=] is only used in  case the function produces multiple outputs.
When the mapping function has a single input parameter the notation above is unnecessary and one can just write <function name>(.) in the Transformation cell (or <output name>=<function name>(.) if the function has a single input but more than one output).
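For example (names purely illustrative): a Transformation cell containing TaxCode=DetermineTaxData(country={1}, customerGroup={2}) would invoke the multi-output function DetermineTaxData, passing the first and second source fields of the subsection as actual parameters and mapping its TaxCode output to the target field.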

Complex mapping functions are normally defined at the bottom of the mapping sheet, but if there are several of them and they are shared across multiple mapping sheets of a mapping workbook, then they may warrant a dedicated sheet or tab in the workbook.

It goes without saying that the definition of such a function should be a specification for the developer, who can technically implement it in different ways as long as the specification is met.

Resulting mapping sheet structure

Hierarchical organization at row level

The different groupings of mapping sheet rows described above lead to the following hierarchy at row level:
  1. Row sections for Repeating structures (with possible nesting of sections for sub-structures)
  2. Row sections for single target field mappings (with multiple mapping source fields)
  3. Rows for individual source fields
The mapping sheet section with the Complex Mapping Functions definitions is not part of this hierarchy and is below the mapping rows or in a separate tab.

Suggested columns in a mapping sheet

Based on the practices suggested above in the article, I suggest the following columns (from left to right) for a mapping sheet:
  1. Short Description or Source Data Glossary Link
  2. Target Field/Structure Path
  3. Mapping Condition
  4. Transformation  
  5. {n}  (source input number)   
  6. Source Field/Structure Path
  7. Filtering/Selection Condition 
For column #1, it is of course preferable to use links to a central glossary to document all or most data fields and structures used in integration work, as explained in the previous blog article The importance of Data Glossaries.  Only one Description field is provided to supply general functional information, as it is not the purpose of the mapping sheet to define source or target data formats.

Beyond Mapping Sheets

While the vast majority of mapping specifications consist of Excel sheets, a case can be made for expressing these specifications in a machine-readable Domain Specific Language (DSL) that is still readable by humans.   A YAML-based DSL could be defined, for example.
Although Excel sheets can be exported as CSV for easy machine readability, the hierarchical structure of a mapping specification is not simple to express cleanly without using visual characteristics (borders for scopes, cell coloring, etc.)  which would be lost when exporting to CSV.

One use of a mapping DSL could be to automatically generate mapping code that would be completed and refined by the developer.
Automatic generation of good mapping code is a challenging topic, and not just technically, as it will place clear responsibility for the formal definition of the mapping logic on the analyst, making the analyst in effect a meta-programmer.  Optimization of the generated code is also an issue, unless the analyst has intelligently factored out shared logic.

Another, more realistic, application would be having software tools process such formal mapping specifications for impact analysis in Change Management.





 

Thursday, March 12, 2015

ESB Logging (part 2)

This second part of my article about ESB logging describes some techniques that can be adopted to meet the logging requirements described in part 1.

Different techniques and technical choices can be applied at each level of the following generic logging architecture, which should be self-explanatory:
The role of the Logging Transport is to decouple the Logging Publisher from the Logging Data Store and  is an essential one.  A tightly-coupled solution, in which the calling code would have to synchronously wait for data to be persisted to the Logging Data Store before carrying on with its work, would be  inefficient.

Note: in the picture above the logic that serves the UI to browse and search into the Logging Data Store is not explicitly represented;  however this logic must exist and be usable in order to access the data in the store in an efficient way.  This article does not treat this part of the solution.



Details of Logging Architecture Layers


We will now deal with the three layers in sequence and see how each can contribute to providing the logging functionality we need.

Logging Publisher

The Logging Publisher component must be directly callable from any ESB component via a simple API.   It is imperative that:
  •  The logging call be simple. If not, it will take developers too much effort to place logging calls throughout the integration code.  The simplicity of the logging call largely depends on the number of its mandatory inputs.  In most designs, the severity level and the log message should be the only mandatory inputs that the developer is explicitly required to supply.
  • The logging call be efficient, imposing the minimum possible overhead on the calling code.  The Logging Publisher must decide in the minimum possible time whether to hand over the data to the Logging Transport and, if that's the case, the handover itself must be as quick as possible.
In addition to the "log" API, a "flush" API will be necessary if the Deferred Logging technique (described at the end of this article) is adopted.
These APIs constitute a very simple contract between the logging solution and its clients: the logging solution can freely evolve as long as the contract is not broken.
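As a sketch of what this contract could look like (method and type names are illustrative, not those of any particular ESB product):

// Minimal contract a Logging Publisher could expose to integration code.
public interface LoggingPublisher {

    enum Severity { DEBUG, INFO, WARNING, ERROR, FATAL }

    // Minimal call: severity and message are the only mandatory inputs; everything else
    // (timestamp, component name, correlation identifiers, ...) is collected implicitly.
    void log(Severity severity, String message);

    // Richer variant: a log id tagging the call site plus optional document attachments.
    void log(Severity severity, String message, String logId, byte[]... attachments);

    // Needed only if deferred logging is adopted: pushes buffered low-severity entries
    // to the Logging Transport, typically when an error occurs.
    void flush();
}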


Explicit and Implicit inputs to logging calls

Part of the logged data are passed explicitly by the caller:
  • Severity Level of the logging entry (e.g., DEBUG, INFO, WARNING, ERROR, FATAL)
  • A Log Message.  This can be a potentially long string of text, which can include dynamic sections.  It is not, however, to be used to store a whole message (which is the purpose of the document attachments mentioned below)
  • A Log Id to tag the precise location in the code from which the logging call is made; use of this identifier may be optional or mandatory, depending on the logging design
  • One or more document attachments (representations of data objects treated by the ESB component, like for example a service request payload); the persistent storage of documents associated with logging entries is often called archiving and is treated separately later in this article.

Other data elements are normally extracted implicitly and automatically by the Logging Publisher function:
  • A logging time-stamp
  • The name of the ESB component calling the Logging Publisher function
  • The server and port associated with the execution of the ESB component
  • The user id under which the ESB component is executing 
  • The consumer id for which the ESB component is executing; this may not be present in case the ESB component does not implement a service (e.g., in the case the component is of the polling kind); also, depending on the authorization design, the consumer id may be the same as the user id mentioned above
  • Tracking or Correlation Identifiers that unequivocally relate to the service execution and correlate all logging entries originating from this execution (more on this later)
  • Optionally, an incremental sequence number that sequences logging entries generated by the same execution (i.e., that have the same correlation identifier value)
In order to retain all the above information across logging calls from the same execution, the Logging Publisher must be stateful.  To achieve this, two main approaches are used:
  1. Delegating logging state to the client of the Logging Publisher API (i.e., the integration code), by having logging state information "carried around" throughout the service implementation logic in some kind of standard "control header" or "execution context" structure.
  2. Caching logging state.
Many times, the first approach is preferred for its simplicity (and marginally better performance).
However, relying on the integration code to always possess a valid reference to context information is not always simple, especially in the face of exception situations.  This requires very careful coding and good QA on the produced integration code.  
The other approach, which is to be considered a best practice, is described next.

Caching logging state 

Caching logging state in memory, using a caching solution such as Ehcache, makes the Logging Publisher component more robust, as it does not rely on client code to retain state on its behalf. 
A memory cache (which could be local to the runtime environment or distributed) essentially behaves like a big hash table holding data objects that can be accessed in near-constant time based on their corresponding keys.  In many cases, the keys are simply String objects and the values are any kind of serializable objects.
The logging state can be easily wrapped into one such object, and once it gets stored into the cache it can be retrieved reliably from any point of the service or component logic later on. The essential precondition for this is that the corresponding key must be:
  • A unique identifier for the execution instance
  • retrievable in a simple and efficient way from the ESB runtime
Most ESB products have some form of unique execution identifier.  For example, the Software AG webMethods ESB maintains a globally unique root context id for each synchronous thread of execution.  In BPM contexts, the internal process instance id  (also normally a GUID) can be used for this purpose.

To prevent memory leaks, the ESB service/component/process logic must ensure that the logging state object is purged when a runtime execution instance (service, component, or process) ends, whether successfully or with an error.  The use of a memory cache rather than a simple hash table is helpful here, as cache products usually allow the specification of a Time-To-Live (TTL) for cache entries.  Thus, even in the unlikely cases in which the ESB logic is not able to purge an entry, it will get purged automatically after expiration (i.e., when its age exceeds its TTL).

It goes without saying that the TTL must be tuned in order to exceed the maximum potential lifetime of any runtime instance.
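A minimal sketch of the approach follows; a plain in-memory map stands in for a real cache such as Ehcache (which would additionally enforce the TTL), and the state fields are illustrative:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LoggingStateCache {

    // Illustrative logging state retained per execution instance.
    public static class LoggingState {
        public String correlationId;
        public int nextSequenceNumber;
    }

    private final Map<String, LoggingState> cache = new ConcurrentHashMap<>();

    // The key is the ESB's unique execution identifier (e.g., a root context id
    // or a process instance id).
    public LoggingState getOrCreate(String executionId) {
        return cache.computeIfAbsent(executionId, id -> new LoggingState());
    }

    // Must be called when the execution ends, successfully or with an error, to avoid leaks.
    public void purge(String executionId) {
        cache.remove(executionId);
    }
}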


Conditional logging based on Logging Level

When the Logging Publisher API is called, it will pass its data (received from the caller + collected automatically) to the Logging Transport  in order to persist it.

The decision to persist (or not) a logging entry normally depends on its Severity Level (as passed by the caller) relative to the Severity Level Threshold set at system or ESB component level.
It makes sense in most cases to immediately deliver to the Logging Transport any logging entry whose severity meets or exceeds the applicable Severity Level Threshold, so that the lag between the generation and the persistence of these events is as short as possible.

For better supportability, it is important that Severity Level Thresholds can be kept in a data store that is read dynamically at runtime, so that logging can be "turned up" and "turned down" on the fly by application support staff.

The logging thresholds (global or per component) must be configurable in a persistent store (such as a DB), but they are read very frequently, every time the Logging Publisher function is called.  To avoid excessive overhead, caching is essential.  Application Support could use a custom UI to alter the logging thresholds in the persistent store, and then force a refresh/reload of the cache to have the new values take effect in real time.

In most designs, logging events having severity lower than the applicable Severity Level Threshold are simply ignored, but in some cases they may still be processed later if an error occurs (see the section Deferred Logging at the end of this article).
The association of sequence numbers with logging entries makes it possible in any case to sort chronologically the entries from a given execution instance, regardless of which entries were persisted first.
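A minimal Java sketch of this threshold check is shown below; the severity levels, the transport interface and the method names are assumptions used only to illustrate the decision logic:

    // Sketch of the severity threshold check inside the Logging Publisher.
    public class LoggingPublisher {

        public enum Severity { DEBUG, INFO, WARNING, ERROR }

        public interface LoggingTransport {      // see the Logging Transport section below
            void enqueue(String entry);
        }

        private final LoggingTransport transport;
        private volatile Severity threshold = Severity.WARNING;  // cached from the persistent store

        public LoggingPublisher(LoggingTransport transport) {
            this.transport = transport;
        }

        public void refreshThreshold(Severity newThreshold) {    // called after a config cache reload
            this.threshold = newThreshold;
        }

        public void log(Severity severity, String globalCorrelationId, int sequenceNumber, String message) {
            if (severity.ordinal() < threshold.ordinal()) {
                return;  // below the threshold: ignored (or held back for Deferred Logging, see later)
            }
            String entry = globalCorrelationId + "|" + sequenceNumber + "|" + severity + "|" + message;
            transport.enqueue(entry);  // hand over to the asynchronous Logging Transport
        }
    }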


Logging Transport

Generally, once the Logging Publisher invokes the Logging Transport for a logging entry (or a set of related logging entries), the data passed to the transport is considered committed and will eventually be persisted.

The Logging Transport must:
  • expose a very simple API towards the Logging Publisher,
  • internally operate asynchronously with respect to the Logging Publisher.
Once the Logging Publisher decides that an entry is to be persisted, it simply calls the transport API to enqueue the entry and then returns very quickly to its caller.
The processing from the Logging Transport to the Logging Data Store is completely decoupled.

In the simplest implementations, the Logging Transport just consists of a wrapper around the access protocol of the Logging Data Store.  For example, with a relational database used as the Logging Data Store, the Logging Transport API would just wrap a DB call (or set of calls) executed in a separate thread.  This solution shows its shortcomings during peaks of logging traffic, when the DB pool is exhausted and logging entries must be discarded (with the loss of potentially important logging information).

A robust solution for the Logging Transport should use persistent queues for temporary storage of log entries to be persisted into the Logging Data Store.   This provides the necessary buffer to cope with variable loads.
The queuing solution used by logging should be totally independent, both architecturally and infrastructurally, from the asynchronous messaging solution used for business-critical messages (as part of the integration logic).  A very high logging volume should not compromise the throughput and latency of business-critical messages, and we certainly want to avoid the risk that a flood of log entries brings the main messaging system to its knees.

The queuing infrastructure used by logging should be lightweight and low-latency, such as for example ZeroMQ, or Software AG / Terracotta Universal Messaging.
There are also interesting possibilities for advanced use of the Logging Transport, such as:
  • Holding entries in the queue until a "flush" method is called, and releasing them to the Logging Data Store only then (this is required by a particular implementation of the Deferred Logging technique, described at the end of this article).
  • Aggregating related entries from the queue before persisting them into the Logging Data Store.  This is useful above all when a NoSQL Document DB is used as Logging Data Store, in order to bundle multiple related entries into a single document.
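The following minimal Java sketch shows the shape of such a decoupled transport: the publisher-facing enqueue call returns immediately, and a background writer drains the queue in batches towards the Logging Data Store.  For simplicity the sketch uses an in-process queue; a production implementation would put the entries on a persistent, dedicated queue (such as the products mentioned above), but the API towards the Logging Publisher would be the same:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Minimal in-memory sketch of an asynchronous Logging Transport.
    public class QueuedLoggingTransport {

        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100_000);
        private final Thread writer;

        public QueuedLoggingTransport() {
            writer = new Thread(this::drainLoop, "logging-transport-writer");
            writer.setDaemon(true);
            writer.start();
        }

        /** Called by the Logging Publisher; returns immediately. */
        public void enqueue(String entry) {
            if (!queue.offer(entry)) {
                System.err.println("Logging queue full, entry dropped (a persistent queue avoids this)");
            }
        }

        private void drainLoop() {
            List<String> batch = new ArrayList<>();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    batch.add(queue.take());          // block until at least one entry is available
                    queue.drainTo(batch, 99);         // opportunistically batch related entries
                    persist(batch);                   // single round trip to the Logging Data Store
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        private void persist(List<String> entries) {
            // placeholder: INSERT into the Logging Data Store (DB, document store, ...)
            entries.forEach(System.out::println);
        }
    }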



Logging Data Store

As already stated in part 1, the logging data store must be logically centralized in order to be effective, even if it is physically distributed.
The most common way of implementing a centralized data store for logging is via some kind of database.  ESB tools mostly use relational DB tables for their standard logging solutions, but very often custom logging data stores are designed.  These designs range from a simple set of relational tables to the use of NoSQL, document-oriented databases such as MongoDB.

What matters, really, is that the data store design is optimized for:
  • insertions
  • queries by a given set of indexed fields
The latter property is crucial to enable the efficient retrieval of related logging entries. Relationships will be based on the correlation identifiers described later, and possibly also on additional, configurable "classification keys" that may have functional meaning.


Logging Data Store and Message Archiving

If we take away the message archiving requirement, that is, the inclusion of potentially big message payloads into audit logging, then it makes sense to simply use a physically centralized logging data store.
Things change a bit when message archiving is in scope (as it is in most cases).  Having distributed processing nodes send big messages (which can sometimes be on the order of tens of MB or even larger) to a centralized store over the network is inefficient.

A more efficient approach is to store in the central DB everything with the exception of message payloads, and keep these locally at the node that generated them.  The centralized DB will hold location metadata about these distributed payloads:
  • hosting node 
  • unique local access path (typically including a globally unique file name)
The rationale behind this choice is that it is very efficient for the processing node to store a payload locally, while access to these payloads will be very infrequent (only when a user browsing the logging data store "drills down" into the payload part of a given logging entry).
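A minimal sketch of such location metadata, as it could be stored alongside each logging entry in the central DB, follows; the class and field names are assumptions:

    // Sketch of the location metadata kept centrally for a payload archived on a processing node.
    public class PayloadLocation implements java.io.Serializable {

        private final String hostingNode;  // node that produced and stores the payload
        private final String localPath;    // unique local access path, including a globally unique file name

        public PayloadLocation(String hostingNode, String localPath) {
            this.hostingNode = hostingNode;
            this.localPath = localPath;
        }

        public String getHostingNode() { return hostingNode; }
        public String getLocalPath()   { return localPath; }
    }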

The downsides of distributed archiving are mainly:
  • the need for a technical service at each node that delivers a local payload on demand
  • the need for an agent at each node that enforces the retention policies, freeing space by deleting expired payloads
In the presence of massive amounts of messages to be archived, the use of the Hadoop Distributed File System (HDFS) can be considered.



Specific logging functionality

  

Correlation

Correlation is a cornerstone of any logging solution, which must be able to handle multiple correlation identifiers (all of which should be GUIDs).

Local Correlation Identifier

This identifier is generated at the beginning of each thread involved in the execution of a service.
If the service is not a simple synchronous service, then each invocation produces more than one value for this identifier (one value for each synchronous execution context).
Only logging entries generated within the same synchronous execution context share the same value for this identifier. This local correlation identifier can normally be derived directly from the ESB runtime context (e.g., the root context id in webMethods or the session id in Mule).

Global Correlation Identifier

This is the most important correlation identifier. It has the value of the Local Correlation Identifier of the first ESB component that is executed as part of a service invocation.  The service logic must then propagate it across the whole invocation, regardless of whether the components involved are loosely or tightly coupled. This is achieved technically via protocol properties, depending on the implementation of the service (e.g., SOAP headers, HTTP headers, JMS properties, etc.).
The ubiquitous availability of this correlation identifier throughout the execution chain is what allows all logging entries to be effectively correlated.
Finally, it is a best practice for services to return the Global Correlation Identifier as part of each service response, allowing consumer applications to create a cross-reference between the ESB Global Correlation Identifier and any internal identifier (for instance the one used as "Tracking ID", described below).
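As an illustration of the propagation mechanism, the Java sketch below sets the identifier as an HTTP header on an outbound call.  The header name is an assumption; any agreed-upon SOAP header, HTTP header or JMS property serves the same purpose:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch of propagating the Global Correlation Identifier on an outbound HTTP call.
    public class CorrelationPropagationExample {

        private static final String CORRELATION_HEADER = "X-Global-Correlation-Id"; // assumed name

        public static String callBackend(String url, String body, String globalCorrelationId)
                throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header(CORRELATION_HEADER, globalCorrelationId)   // propagate the identifier
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
            // For a JMS hop, the same value would be set as a message property, e.g.
            // message.setStringProperty("GlobalCorrelationId", globalCorrelationId);
        }
    }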

External Correlation Identifier ("Consumer Tracking ID")

Technically, this identifier may be handled exactly like the Global Correlation Identifier, but there is an important difference: it is generated externally by the service consumer, not internally by the ESB. Its main purpose is to allow tracking via a "source" identifier that has meaning in the context of the consumer application.  It can thus be called the "Consumer Tracking ID".

Even though the ESB may enforce that each consumer supply a Tracking ID (rejecting service requests that lack this ID), the ESB cannot realistically enforce its uniqueness.  For this reason, it cannot normally be used in lieu of the Global Correlation Identifier.  Thus, rather than propagating the External Correlation Identifier throughout the service execution, it is sufficient to simply associate it with the Global Correlation Identifier in the Logging Data Store.


Logging and BPM

When using a BPM solution to build Orchestrated Task Services, custom logging is usually less necessary, as most BPMS tools offer ample process audit and monitoring capabilities.  For each process instance, the sequence of states (i.e., the sequence of traversed process steps) is tracked, usually with the possibility of logging selected inputs and outputs of process model steps.

Virtually all BPM engines natively support the definition of a Process Correlation ID to join events to running process instances.  This correlation identifier normally has functional significance and is to be considered the equivalent of the External Correlation Identifier discussed above.

Each process instance is assigned a unique Process Instance ID (PID) by the BPMS, which is to be considered the equivalent of the Local Correlation Identifier (although it is not in general associated with a single thread of execution).   

For the Global Correlation Identifier, a similar logic can be followed as for non-BPM integrations, by adopting the PID of the "root" process instance and propagating this value to all sub-processes, regardless of whether they are detached or not.  Detached sub-processes will have their own instance ID (= Local Correlation Identifier), but can still share the propagated Global Correlation Identifier (and the Process Correlation ID = External Correlation Identifier).

In case the out-of-the-box functionality of the BPMS does not allow easy audit logging, the techniques described in this post are perfectly applicable to automated business processes,  with the equivalences just mentioned.


 Deferred Logging

In Production environments, it is common practice to log only one entry at the beginning and one entry at the end of the execution of a service or component in case of success (typically archiving the input and output messages at the same time).  The Severity Level Threshold is normally kept at "ERROR" or "WARNING", since the logging system would be swamped if the level were set to "INFO" or "DEBUG" across all services at typical Production volumes.

This normally limits the logging information available when an error occurs to two entries: the "execution start" entry and the "error occurred" entry, with no information in between.
When a component or service instance experiences an error,  it would be useful if the maximum amount of logging information generated prior to the error were made available irrespective of the Severity Level Threshold.
This is valuable even for a BPM solution, which intrinsically provides better tracking than a non-BPM solution, because it gives detailed visibility into what happens within process activities.

The technique called Deferred Logging consists of temporarily holding all logging information (at all logging levels, right down to "DEBUG") until the execution of the service/component/process instance terminates, at which point a "flushing" action is triggered:
  • If the instance terminates successfully, the pending logging entries are filtered based on the current Severity Level Threshold.  Only the entries meeting or exceeding the threshold are flushed to the Logging Data Store (e.g., if the threshold is set to "WARNING", then "INFO" and "DEBUG" entries are discarded and not persisted into the store).
  • If the instance terminates with an error, then all pending logging entries are flushed to the Logging Data Store, thus storing a complete "history" of what happened prior to the error.

There are two basic approaches to implementing Deferred Logging:
  • The Logging Publisher component caches the pending logging entries in memory (normally along with the rest of the logging state), and performs the "flushing" action at the end of the instance execution (this is triggered via a special Logging Publisher API method).
  • The Logging Transport is designed to not deliver to the Logging Data Store immediately, but to hold logging entries pending a flush action (or their expiration).  Here, too, the flush action is triggered at the end of the instance execution via a Logging Publisher API method.  This second option is more complicated but more robust, as pending entries are kept in persistent storage rather than in a memory cache.  A sketch of the first, in-memory approach follows.
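The Java sketch below illustrates the first approach: entries of any severity are buffered during the execution, and the flush decides what is persisted depending on the outcome.  Class and method names are assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Minimal sketch of Deferred Logging: entries are buffered in memory along with the rest
    // of the logging state and flushed when the execution instance ends.
    public class DeferredLoggingBuffer {

        public enum Severity { DEBUG, INFO, WARNING, ERROR }

        private static final class Pending {
            final Severity severity;
            final String message;
            Pending(Severity severity, String message) { this.severity = severity; this.message = message; }
        }

        private final List<Pending> pending = new ArrayList<>();

        /** Called by the Logging Publisher for every entry, regardless of severity. */
        public void add(Severity severity, String message) {
            pending.add(new Pending(severity, message));
        }

        /** Called once at the end of the execution instance. */
        public void flush(boolean endedWithError, Severity threshold, Consumer<String> store) {
            for (Pending p : pending) {
                // on error everything is persisted; on success only entries at or above the threshold
                if (endedWithError || p.severity.ordinal() >= threshold.ordinal()) {
                    store.accept(p.severity + "|" + p.message);
                }
            }
            pending.clear();
        }
    }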

Conclusion

Once a flexible logging API is designed, its implementation (the logging solution) can be made more and more sophisticated over time.

The advice is to start small and evolve, based on the supportability requirements that are specific to your organization.
Depending on the requirements, some logging features could be skipped in the first implementation, but it is my opinion that the 3-tier logging architecture presented here (with asynchronous Logging Transport) is a good foundation for a robust and scalable solution.

Wednesday, October 22, 2014

ESB logging (part 1)

Logging is one of the fundamental cross-cutting concerns in any software system.

This three-part article will discuss the functional and technical requirements of ESB logging in part 1, propose some techniques to address these requirements in part 2, and finally address in part 3 the implications of logging in a BPM environment (i.e., when services are orchestrated by a BPMS).

Functional and technical requirements of ESB logging

In ESB systems, there are several reasons why we need to do logging, the main ones being:
  1. To keep a functional audit of integration activities (sometimes called "transaction logging"); this is essentially a traceability requirement.
  2. To have detailed troubleshooting information available in case of errors or malfunctions
  3. To collect performance information, which helps us identify bottlenecks in our service implementation 
These three points can be considered very high-level functional requirements for ESB logging.  As always, depending on the specifics, each of them translates into technical requirements that must be appropriately addressed.  The following subsections discuss these three macro-requirements in more depth.

Functional audit

The creation of a functional audit trail is normally mandated by business policies and/or by external regulation.  In most cases, it is required to include a copy of the payload of the service (API) requests and responses.  In other cases, not only the payload but also all control information (in the form of SOAP headers for SOAP services, HTTP headers and status codes for REST services, JMS properties for asynchronous JMS services, etc.) must be logged. Logging that includes payloads is sometimes called archiving.
Strictly speaking, this usage of the term "archiving" is not entirely appropriate, as the term should be reserved for cases where the system keeps persistent copies of the payloads as part of normal operation, which is typically the case for B2B gateways.  After a set period, these messages are moved from the "online" data store to the "offline" archive store.

Looking at the audit requirement more in detail, we need to define:
  • Where to log (audit points)
  • What to log (logging content)
  • How long to keep the logs (retention policy)

For discussing the first point (perhaps the most important of the three), consider the diagram below that shows a rather simple example of an ESB service implemented via a single ESB component, composing native services of two different backend systems but not composing other ESB services:



The diagram shows six different audit logging points:
  • two at the ESB service endpoint, facing the consumer (logging its request and the response delivered to it)
  • four facing the backend systems, logging the requests and responses exchanged between the ESB (in a service consumer role) and the native endpoints of the backend systems.
As a minimum, audit logging must be done at the two consumer-facing endpoints, but for comprehensive logging all messages that enter and exit an ESB component must be included:
  1. Inbound request from service consumer
  2. Outbound request to native backend service
  3. Inbound response (or fault) from native backend service
  4. Outbound request to composed ESB service
  5. Inbound response (or fault) from composed ESB service
  6. Outbound request (or fault) to service consumer
Note: by "fault" I here mean any structured set of error information, not necessarily a SOAP fault.

Clearly, logging points of types 2 through 5 are due to service composition, so the number of such logging points is proportional to the number of service operations being composed.

It is necessary to correlate the log entries generated at each audit logging point, which can be done in two ways within a component:
  • by logging independently at each audit logging point (generating a separate log entry at each point), but having the log entries share a common unique identifier
  • by grouping the log information generated at each audit logging point into a single log entry, which is written when the service completes execution (either with success or with a fault).  This can be called delayed logging.
The choice between minimal audit logging (consumer request and response only) and full audit logging should be configurable at the level of service and service operation (or resource / sub-resource for a REST service), with the usual pattern of configuration inheritance and override between parent and child resources (the setting for a service is inherited by its operations unless specifically overridden at the operation level), as sketched below.
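A minimal Java sketch of such a hierarchical lookup is shown below; the key format and the AuditMode values are assumptions:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    // Sketch of service / operation configuration with inheritance and override:
    // the operation-level setting wins if present, otherwise the service-level setting,
    // otherwise a global default.
    public class AuditLoggingConfig {

        public enum AuditMode { MINIMAL, FULL }

        private final Map<String, AuditMode> settings = new HashMap<>();
        private final AuditMode globalDefault = AuditMode.MINIMAL;

        public void set(String key, AuditMode mode) {   // e.g. "OrderService" or "OrderService/create"
            settings.put(key, mode);
        }

        public AuditMode resolve(String service, String operation) {
            return Optional.ofNullable(settings.get(service + "/" + operation))  // operation override
                    .orElse(Optional.ofNullable(settings.get(service))           // inherited from service
                    .orElse(globalDefault));                                     // global default
        }
    }

    // Usage: set("OrderService", MINIMAL); set("OrderService/create", FULL);
    // resolve("OrderService", "create") -> FULL, resolve("OrderService", "cancel") -> MINIMAL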


The second point relates to the content of the log entries.  A minimal set of information fields for the logging function will be suggested in part 2 of this article, but the important point to remark here is that the information logged at each audit logging point may be heterogeneous and vary greatly in size.

Often the information is already in textual form (e.g., XML payloads, JSON payload, SOAP headers, HTTP headers), but in some cases we have binary information (e.g., scanned documents).
Even when information is textual, it may not arrive as a String at the ESB endpoint (example: posted HTTP body received as a Java InputStream object), and that mostly depends on the particular ESB tool used.
Sometimes message payloads can be very large, and this must be taken into account as well.

Thus, the format and size of the data to be logged has a significant impact on the design of ESB logging, which must be flexible in this respect without introducing excessive overhead.

The third point relates to the retention period of the entries created in the audit log.  Here, as with the determination of audit logging points, it should be possible to configure the retention period hierarchically by service / operation (or parent / child resource), so that information related to more critical operations or resources can be retained longer.  A significant plus, on top of this functionality, would be a retention policy that also takes into account the identity of the consumer.


Technical logging

While the preceding subsection was about audit logging (that is, logging that must be done at every service invocation due to functional requirements), this section briefly discusses the additional technical logging that can be done to provide useful information about service behavior, which can be essential to troubleshoot obscure malfunctions detected in the Production environment.

First of all, let's clear up the rather common misconception that conflates technical logging with error handling.

Technical logging is NOT the same as error handling.  Error handling is ESB logic that must be executed whenever an error occurs that requires manual recovery (as part of an Incident Management workflow), or that must in any case be notified to a human user (even when the error is recovered from automatically) and for which the error/fault response to the service consumer is not sufficient.

On the other hand, technical logging, when enabled, is done from ESB components independently of whether error conditions are happening or not.  It simply traces the execution logic within components at a fine-grained level of detail, providing detailed insight into what happens between the audit logging points.

This does not mean that the implementation of logging can never share anything with that of error handling.  In many ESB environments, for example, designers have decided to make use of the Log4J framework as part of both their logging and their error handling solutions.

The first consideration about this kind of logging is that, for performance reasons, it should be completely turned off in normal Production operation.  Such logging is normally enabled in environments that support Unit Testing and System Integration Testing, but disabled not just in Production but also in every environment that must closely mirror Production conditions, such as those where Performance Testing takes place.

Still, enabling technical logging in a very granular way (e.g., per component or per component/consumer combination) can occasionally be invaluable to troubleshoot incorrect behavior in Production.  This must of course be done very carefully, due to the performance impact in case the component implements a heavily used service.

It is imperative here to have the capability of hierarchical, fine-grained configuration, so the appropriate level of logging can be set very flexibly.

Normally, technical logging should not include payloads but just short dynamic log messages that can contain the values of selected variables in the component's logic.   

An interesting option for technical logging (in between the "on" and "off" settings) is that of conditional logging: technical log entries are logged (at the end of the execution of a component, that is, with delayed logging), but only if the component experienced an error.  By choosing this option we recognize that detailed tracking of successful service executions is superfluous, which is almost always the case.  On the other hand, "mass" errors, typically due to the unavailability of a backend resource, will under this option produce a very large amount of logging information that is utterly redundant (since all service invocations failed for the same reason).

Performance logging (profiling)

Since logging entries (both audit and technical) are associated with a timestamp, it is obvious that logging can be used to collect performance statistics (profiling) for service operations.  The relevant log entries are only those logged at the audit logging points, for two reasons:
  • The technical logging points are normally too dense within a component
  • Technical logging in Production and during Performance Testing introduces an overhead which is normally unacceptable and that in any case will bias the profiling results (the act of measuring affects the measurements).
The main statistics (min, max, average, standard deviation, etc.) that can be collected are:
  • those related to the distribution of service operation invocations (by time period, consumer, etc.)
  • those related to the total service operation execution time, within the ESB component exposing the operation
  • those related to the execution time of composed services; these, combined with the total service execution time statistics of the services being composed, can provide statistics for message transit times between components (significant when the services being composed are dispersed geographically)
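As a minimal illustration, once the start and end timestamps of each invocation are extracted from the audit entries, basic statistics can be derived as in the Java sketch below (standard deviation would need an extra pass; the timestamps here are made up):

    import java.util.DoubleSummaryStatistics;
    import java.util.List;

    // Sketch of deriving profiling statistics from pairs of audit log timestamps
    // (request received, response sent), expressed in milliseconds.
    public class ProfilingExample {

        public static DoubleSummaryStatistics stats(List<long[]> startEndTimestamps) {
            return startEndTimestamps.stream()
                    .mapToDouble(pair -> pair[1] - pair[0])   // execution time per invocation
                    .summaryStatistics();                     // min, max, average, count
        }

        public static void main(String[] args) {
            DoubleSummaryStatistics s = stats(List.of(
                    new long[]{0, 120}, new long[]{0, 95}, new long[]{0, 240}));
            System.out.printf("min=%.0f ms, max=%.0f ms, avg=%.1f ms%n",
                    s.getMin(), s.getMax(), s.getAverage());
        }
    }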



Log entry consolidation, correlation and sequencing

The diagram used above to illustrate the audit logging points shows a very simple situation where all logging points are reached sequentially within the execution thread of the (only) component.  In this case, correlation and sequencing of the produced log entries is trivial.

The situation changes a lot when we have to deal with complex service compositions, distributed ESB components, asynchronous interactions, clustered environments, and services orchestrated by BPM.

Even with a rather simple synchronous composition of ESB services, it is not guaranteed that these services run in the same thread: a "local" service invocation in a clustered environment may execute the service on a cluster node other than the one where the caller service executes.

In general, though, we must assume that composed services are implemented by components that are distributed across the network, especially when they are invoked asynchronously via a message broker (for example, in the case of SOAP-over-JMS invocations) or when they are invoked as business process steps within a process instance.

Each component involved in a service execution creates multiple logging entries in the Logging Data Store, and all these entries must be mutually correlated and sequenced in order to give a clear picture to a person browsing the log.

An example, purposefully a bit involved, is given in the following diagram:


In the picture above a distributed service composition combines functionality from 4 different backend systems.
A main service (implemented on component 1) hosted on machine A directly or indirectly invokes three other services (one hosted on the same machine, and the other two on different machines).  The service hosted on machine C (component 4) is invoked asynchronously through a messaging queue.

Each service component performs logging at its audit logging points, and perhaps some components will also have technical logging enabled.  As a result, logging entries can originate from different machines and must be correlated and consolidated into a single place (the Logging Data Store).  That, by itself, can pose a concern: we realize upfront that a trivial logging implementation (for example, a separate synchronous INSERT into a central database for every individual log entry) can be inefficient and hurt our service response times.
 
The Logging Data Store does not need to be physically centralized, but it needs to be centralized logically, allowing a User Interface to access all of its information no matter where it is physically stored.

In order for the audit and logging information to make sense to somebody who browses and analyses the Logging Data Store,  log entries originating from different components that relate, directly or indirectly, to the same invocation by the External Service Consumer must be:
  1. Logically grouped together
  2. Accessible as a whole in a location-independent fashion (i.e., even when not all related entries are stored in the same physical data store)
  3. Organized hierarchically based on service composition logic  (e.g., showing the execution of component 3 as a subordinate of the execution of component 2, which is in turn a subordinate of the execution of component 1)
  4. Sequenced correctly (regardless of deviations of the system clocks of the different machines).  This is necessary for proper profiling.

Part 2 of this article will discuss techniques that help us achieve all this without affecting (too much) the performance of our services.