Tuesday, September 15, 2015

Structured Mapping Specifications

Data mapping and transformation constitute a key part of the vast majority of integration solutions.
Notable exceptions are the cases where the same data model is shared across the whole integration, from service consumers to backend systems (a case not very frequent in practice), and the case of pure file transfers.

In virtually all other cases there is always the need to map source messages into target messages.

A high-level view of mapping

Below are a few definitions and assumptions to define the context of this article.

A Message is a tree-structured data container modeled after the Message Enterprise Integration Pattern.
All practical message formats (XML, JSON, flat file, EDI file) can be parsed into a tree-structured representation.

A Simple Message Mapping
  •  transforms exactly one source message into exactly one target message (1-1 mapping)
  •  can achieve both structural and value transformations (value transformations can include any function including value computation and value cross-referencing, for example)
  • only uses information that is present in the source message or in the integration environment (e.g., timestamp generation, UUID generation, use of environment properties, etc.)
A Complex Message Mapping can be of these different kinds:
  • 1-N mapping: multiple target messages are produced from the same source message. This case is trivial as it can be simply expressed as the "parallel" combination of multiple simple (1-1) message mappings.
  • N-1 mapping: in this case multiple source messages are transformed into a single target message.  The source messages can result from a message join or correlation (typically in a BPM solution), but can also include secondary source messages produced as output by data enrichment functions that are invoked with a "main" source message as input.  Without analyzing this case in detail (it may be the subject of a subsequent post), one can intuitively grasp that such an N-1 mapping can be implemented as a "network" of 1-1 mappings (some of which provide the inputs for one or more Data Enrichers).  Once the source messages are defined, a mapping rule can be expressed by qualifying each source field with the source message it belongs to.
  • N-M mappings:  these are the most general, as they produce M target messages from N source messages, and can be defined as a set of N-1 mappings. 
The rest of the article discusses ways of specifying  simple message mapping only, leaving the topic of complex message mapping specification to a future post.
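As a minimal illustration of these definitions, the sketch below expresses a simple (1-1) mapping and the "parallel" combination that yields a 1-N mapping in Java. The Message and mapping types are hypothetical placeholders, not the API of any particular integration platform.

    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical tree-structured message container (see the Message pattern above).
    interface Message { }

    // A Simple Message Mapping: exactly one source message in, exactly one target message out.
    interface SimpleMessageMapping {
        Message map(Message source);
    }

    // A 1-N mapping expressed as the "parallel" combination of several 1-1 mappings.
    class OneToNMapping {
        private final List<SimpleMessageMapping> mappings;

        OneToNMapping(List<SimpleMessageMapping> mappings) {
            this.mappings = mappings;
        }

        List<Message> map(Message source) {
            return mappings.stream()
                    .map(m -> m.map(source))
                    .collect(Collectors.toList());
        }
    }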


Mapping specification issues

Traditionally, mapping specifications are delivered to integration developers by analysts in the form of mapping spreadsheets (the so-called Mapping Sheets), and integration developers need to figure out how best to implement the specification using a combination of tools, depending on the integration platform they are working on (tool-specific visual mapping configuration, Java code using mapping utility classes, XSLT sheets, etc.).

Almost always, these mapping specifications are ambiguous to some degree. As a result, they either require multiple rounds of clarification with the analyst or are misinterpreted by the developer (it then turns out during integration testing that the implemented mapping was not what the analyst intended and must be reworked).

This article suggests some ways to improve on this situation.

Let's look first at the way mapping sheets are generally structured, and at some reasons why they are ambiguous (I do not have the ambition to be exhaustive here, as people can be very creative when it comes to being unclear with their specs).
What we generally find are Excel sheets structured in three sections, from left to right:
  • Source: specifies the source field for the mapping  (generally via some XPath-like notation)
  • Transformation: specifies the transformation logic (if any, beyond simple value copy) necessary to obtain the value for the target field from the value of the corresponding source field
  • Target:  specifies the target field for the mapping  (again, mostly via some XPath-like notation)
This structure is more than adequate for simple cases, where the source and target messages are simple records (without repeating sub-records) and each target field is obtained either from a single source field or is hardcoded (hardcoding is usually expressed by specifying the fixed value in the Transformation section and leaving the Source section empty).
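For instance (with hypothetical field paths), two rows in such a sheet could read:
  Source: /Order/Currency     Transformation: copy                 Target: /Invoice/CurrencyCode
  Source: (empty)             Transformation: fixed value "EUR"    Target: /Invoice/PriceListCurrency
The second row illustrates the hardcoding convention just mentioned.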

However, as we need to handle more complicated message structures and more sophisticated mappings, some limitations become apparent.

Source-oriented mapping specs are not convenient

Many analysts build their mapping sheets by listing all source message fields (in the "topological" order in which they appear in the source message) and then add the corresponding target message fields with transformation information on the right.    This is one of the worst possible things to do.
The purpose of data mapping is to generate target messages, and the developer will have to decompose the implementation of any nontrivial mapping based on the structure of the target message.
Structuring the mapping sheet according to the source message makes things difficult for the developer, who will likely find the information needed for mapping a target structure scattered across the mapping sheet.   When the same source field, for example a currency code, is mapped to different target fields in completely different places within the target message structure, grouping these mapping lines by source field is confusing and counterproductive.

Multi-input mappings not expressed cleanly

Whenever a target field needs to be mapped from more than one source field (even if the mapping logic is something as simple as a string concatenation or a sum), then expressing the mapping as Source field - Transformation - Target field is inadequate.   People usually end up specifying one of the mapping inputs as the source field and the others in the Transformation rule (in an unstructured way).  Therefore, it is not immediately clear to the developer what the complete set of inputs going into the mapping is.

Ambiguity when mapping repetitive structures

Only the simplest mappings do not involve repeating data structures.  Most business documents contain lists of items and quite often there are nested lists (for example, an invoice can contain multiple invoice items, and each invoice item normally contains a list of item-level references and a list of item-level discounts and/or surcharges).
This is probably the area where mapping specifications are most ambiguous, as it is often not clear from mapping sheets what drives a repetitive mapping, and consequently over what the mapping code needs to loop in order to implement it.
In most cases, the repetitive mapping must iterate over a list in the source message (i.e., we have to map one target structure instance for each instance of a source structure), but that is often not well expressed in the mapping sheet, which may not contain clearly marked sections for repetitive mappings.  In some cases, the XPath-like expressions used in the sheets do not denote which structure is a sequence (e.g., /SalesOrder/Items/ProductCode instead of /SalesOrder/Items[]/ProductCode, with the developer expected to know that Items is a sequence).

Ambiguity when using filtering conditions

Filtering conditions in mapping specifications come essentially in two forms:
  1. Conditional mapping condition:  a precondition for the mapping to be executed at all  (if it evaluates to false, then the specified target remains unmapped).  For example, within an invoice mapping, the mapping of a "related invoice reference" is only performed if the billing document is not a normal invoice but rather a credit or debit note that references an earlier invoice.
  2. Sequence filtering/selection conditions:  within the mapping of a repeating structure (i.e., a list of strings, records, or documents), we map only from the instances in the list that satisfy a certain condition.  For example, when mapping Ship-To party information from an EDIFACT document, we normally have to map from the particular instance of the NAD (Name And Address) segment whose qualifier value equals "DP" (Delivery Party).  The filtering condition can result in zero, one, or more target structure instances being mapped. 
In many mapping sheets, these two types of conditions are not clearly distinguished.



Principles for better mapping specifications

This section gives some advice for improvement, so that the developer has an easier time implementing the mapping as specified.

Target orientation

Since the goal is to produce target documents at runtime, it is logical to start a mapping sheet by listing, on its left side, the structures and fields within the target message that need to be mapped.

Do not list all the existing target message fields, including those that do not need to be mapped: this would lead to an unreasonably long list of fields which is unwieldy and distracting (especially when dealing with huge standards-based message schemas).  Keep the mapping sheet as concise as possible.  Using mapping sheets for the additional purpose of documenting data structures is not good practice:  data structures should be documented separately, ideally in a Data Glossary.

Since the target message will in most cases contain sub-structures, group the target fields into sections within the mapping sheet (one section per substructure), in the same order in which each field and substructure appears within the target schema definition. This is in preparation for the step below (identification of repetitive structures).
It is good practice to leave one blank row between adjacent sections in the mapping sheet for better clarity.

Top-down mapping decomposition

Top-down decomposition is a fundamental principle in all engineering activities.  As applied in this context, the mapping of a complex target message needs to be decomposed into sub-mappings, namely the sub-mappings for the main sub-structures within the target message.
For example, the mapping of an EDI message (definitely a complex structure) must be decomposed into the mappings of its segments and sub-segments (only those in mapping scope, of course).

This decomposition, driven by the structure of the target message, should already be explicit in the mapping sheet.  Even if the developer does not strictly adhere to it in the implementation (typically for performance optimization reasons), it will provide a clear, structured view of the mapping.

Clear identification of multi-input mappings

Conceptually, one can take the "functional" view that a target data structure is a function of the source message, and more specifically of a defined set of source fields.
The simplest and most common case is the 1-1 assignment, where the value of a target field is a function of just one source field, and the mapping function (transformation) is a simple copy.

In general, however, a target field value can be a function of more than one source field value (T=f(S1, S2, ..., Sn)).  It is very important to group the field paths of the source fields (on the right side of the mapping sheet) under the target field whose mapping they contribute to (that target field path will be on the left side of the mapping sheet).

Each such grouping forms a small mapping subsection in the mapping sheet, which contains the specification of the mapping function or logic to be applied to the mapping inputs (e.g., replacement in string pattern, sum, maximum, etc.) in the Transformation column.
To avoid ambiguity, when the transformation logic operates on multiple source fields, it should refer to these fields as {1}, {2}, ... where the number in braces represents the order of the source field rows (from top to bottom) within the mapping subsection (not the complete mapping sheet).
Conceptually, each single mapping corresponds to a mapping subsection with the paths of all the necessary source fields listed in separate rows on the RHS of the subsection.   Each mapping subsection must be visually marked in some way (for example using cell borders), to clearly show which set of source fields participates in the mapping. 
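For example (hypothetical paths), a mapping subsection for a two-input mapping could look like this, with the target field on the LHS and the two contributing source fields grouped on the RHS:
  Target: /Invoice/Header/TotalAmount      Transformation: {1} + {2}
      {1}  /Order/Totals/NetAmount
      {2}  /Order/Totals/TaxAmount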

In some cases, one of the mapping inputs could be a target field already mapped, rather than a source field.  This is mostly done when the logic to obtain a value that is used multiple times is complex and should not be repeated. A mapping sheet can cope with this practice by specifying as source field something like target:/<path to target field already mapped>.  It is better practice, however, to factor out complex mapping logic into "complex mapping functions" (see subsection below), and just reference these functions multiple times in the mapping sheet, each time clearly specifying the inputs to be passed, the latter being strictly source field paths.

If the same source field contributes to the mapping of multiple target fields, then it will have to appear multiple times in the mapping sheet (on the RHS), while a target field should normally appear only once on the LHS of the sheet.

Identification of repetitive structures, repetition triggers, and scopes

Having grouped (as mentioned above) the target fields by the structures they belong to within the target message schema, it is necessary to identify which structures are repeating within the target schema (in general, their multiplicity), and then what drives the repetitions.

Sometimes there is a fixed number of repetitions that is known in advance.  Keeping to the EDIFACT invoice example, we may have to map within the EDI message five instances of the NAD (Name And Address) segment: for the Seller party, the Sold-To party, the Ship-To party, the Invoice Recipient party, and the Payer party.
Here there would be five separate mapping sections for the /INVOIC/NAD[] repeating structure, each mapping exactly one instance of the NAD segment from different source data.
Each section should be preceded by the identification of the instance that is being mapped, via an index.  Example:
/INVOIC/NAD[0]   ---> mapping of NAD instance from Seller party
/INVOIC/NAD[1]   ---> mapping of NAD instance from Sold-To party
/INVOIC/NAD[2]   ---> mapping of NAD instance from Ship-To party
etc.
More frequently the mapping of a repeating structure in the target message is driven by a corresponding repeating structure in the source message.  For example, for each invoice line item in the source message we need to instantiate and map a corresponding line item in the target message (in the case of an EDIFACT message, a LIN segment).  The instances of the source repeating structure constitute the repetition triggers for the mapping of the corresponding instances of the target  structure. 
If the source sequence is empty, there is no trigger, so no mapping occurs.

This type of mapping logic is sometimes called "push mapping": the occurrence of structure instances within the source message "pushes" the generation of corresponding structure instances in the target message.  Conversely, "pull mapping" is the way of working discussed above that expresses a given target data item as a function of one or more source data items.
Push mapping is implemented either by looping over the instances of the repeating source structure (via an explicitly programmed loop or <xsl:for-each> in XSLT), or by some type of pattern matching rule (e.g., <xsl:template match="..."> in XSLT).
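Outside XSLT, the same push-mapping loop can be sketched with the standard Java DOM and XPath APIs. The /SalesOrder/Items[] example from above is reused here; the element names and the simplistic one-field mapping are purely illustrative.

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class PushMappingSketch {

        public static Document map(Document salesOrder) throws Exception {
            Document invoice = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = invoice.createElement("Invoice");
            invoice.appendChild(root);

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Each instance of the repeating source structure (the repetition trigger)
            // "pushes" the creation of a corresponding instance in the target message.
            NodeList items = (NodeList) xpath.evaluate(
                    "/SalesOrder/Items", salesOrder, XPathConstants.NODESET);
            for (int i = 0; i < items.getLength(); i++) {
                Element sourceItem = (Element) items.item(i);        // current source scope
                Element targetLine = invoice.createElement("Line");  // current target scope
                targetLine.setAttribute("productCode",
                        xpath.evaluate("ProductCode", sourceItem));
                root.appendChild(targetLine);
            }
            return invoice;
        }
    }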
It is key that the sections for the repeating structures are clearly identified within the mapping sheet, with the target sequence on the leftmost column (along with all target fields) and the corresponding source sequence (repetition trigger) in one of the middle columns (to the left of the source fields).
The path to a repeating structure must be in the first row of its section, ending with index-less brackets ([]) to explicitly denote that we are dealing with a sequence that is to be iterated over in a repetitive mapping.   In addition, the rows for repeating structures should be marked visually via a different background color.

Within a repetitive mapping,  we normally map many fields and sub-structures, so we must identify the source and target mapping scopes.   The source scope is the particular instance of the source repetitive structure that is being mapped from in the given iteration.  The target scope is the particular instance of the target repetitive structure that is being mapped to in the given iteration.
In many cases, values are being mapped from the source scope to the target scope (e.g., from canonical document line items to application specific document line items).
In addition, some repetitive mappings can be from outside the source scope, because they are hardcoded or map from fields that are not part of the source scope (a typical case is the mapping of a currency code to all lines of a target message from the same field in the header section of the source message).
Finally, repetitive mappings can be nested (e.g., a discount and surcharge sequence within an invoice item sequence), thus leading to nested source and target mapping scopes.
Examples:
  • /INVOIC/LIN[] denotes the repetitive LIN structure within the INVOIC message  
  • /INVOIC/LIN[]/QTY[]  denotes the repetitive QTY structure within the repetitive LIN structure within the invoice message (nested scope). 
  • /INVOIC/LIN[0]/QTY[]  denotes the repetitive QTY structure within the first instance of the LIN structure within the invoice message  
Note: one could correctly argue that the message root (/INVOIC) is the outermost scope, so every scope within it is a nested scope.
Normally, within each scope, we have to map multiple fields, so we need to specify source and target field paths for these mappings.  Writing out the complete, absolute field paths that start from the message root (e.g., /INVOIC/LIN[]/QTY[]/C186/6060) provides better clarity, although it carries some redundancy, as it repeats the path of the containing repeating structure (/INVOIC/LIN[]/QTY[]).
Note that the brackets in a field path such as /INVOIC/LIN[]/QTY[]/C186/6060 do not denote a whole sequence, but just the current instance of the sequence within the mapping scope.
To completely avoid ambiguity in this respect, one can use a dot notation to indicate "current instance", as in /INVOIC/LIN[.]/QTY[.]/C186/6060.   With this notation, /INVOIC/LIN[.]/QTY[] would represent the complete QTY sequence within the current line item (LIN) of the EDIFACT invoice.
Providing field paths relative to the current scope (e.g. just C186/6060 in this example) is more concise but forces the developer to "reconstruct" that complete field path. 
Except in cases where names are especially long and structures very deeply nested, using the complete path for each field is probably the best option.

Clear distinction between conditional mapping and sequence filtering

In the mapping sheet, there should be separate columns for Mapping Condition and Filtering Condition.
  • Mapping Condition: condition that must be satisfied  in order for the mapping to be executed at all.  This can apply to both simple field mappings (e.g., map only if the source value is not 0) and for repetitive mappings (e.g., map the target sequence only if another source field, not part of the source sequence, meets a certain condition)
  • Filtering/Selection Condition: condition that must be satisfied in the current source scope of a repetitive mapping in order for the current source scope (in other words, the current instance of the source sequence) to be used in the mapping.  For example, when mapping the Sold-To customer from an EDIFACT inbound sales order (ORDERS) message, we need to select the NAD segment to map from based on its qualifier value being equal to "BY" (= Buyer = Sold-To party).  The column for such filtering/selection conditions can be populated for any field mapping, not just in "sequence" rows.
Quite intuitively, empty mapping or filtering/selection conditions indicate that the mapping is executed unconditionally.
However, to reduce overhead in mapping sheets, the mapping condition "map only if source value is populated" is normally implicit.  If a null source value is a possible situation, distinct from the "source value not populated" situation, then such a mapping condition may have to be spelled out explicitly.

The source fields tested by a condition must be listed together with the source fields used in the actual mapping on the RHS of the sheet, in the mapping subsection of the target field.  As in the Transformation logic, when writing conditions one must uniquely identify each input value via the {n} notation (denoting the n-th source field within the mapping subsection for the target field).
Note that if the condition is very complex, it may be expressed via a Complex Mapping Function (see below).
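For example, for the Sold-To mapping just mentioned, the Filtering/Selection Condition cell could simply contain {1} = "BY", where {1} is the row (within that mapping subsection) holding the path of the NAD qualifier field.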

Isolation of Complex Mapping Functions

When the mapping logic is complex and/or needs to be reused in multiple places within a mapping sheet, it is good practice to factor it out by defining a named Complex Mapping Function with zero or more inputs and one or more outputs (usually one).
This keeps the clutter in the Transformation column of the sheet to a minimum and avoids redundancy when the logic is applied more than once.
Having specified the function, the analyst just has to reference its name (and the name of the output, if the function produces more than one) in the relevant cells of the Transformation column. If the function has more than one output, a notation such as <output name>=<function name>(...) should be used to indicate which output is used for the mapping.
Defining multi-output functions is useful to reduce redundancy where the logic would be largely shared across two or more functions (e.g., there is a complex procedure to determine a first target value, and then a second target value is obtained by some computation from the first).

In case of multiple inputs, the mapping function should be defined with named input parameters that must be associated with the corresponding source field paths (the "actual" parameters) on the RHS of the mapping sheet.  This is a case of multi-input mapping as described above, so the transformation would be expressed as follows:
[<output name>=]<function name>(<parameter name 1>={n}, <parameter name 2>={m}, ...)
where again {n} and {m} represent the source fields in the n-th and m-th position within the mapping subsection.  The part [<output name>=] is only used in  case the function produces multiple outputs.
When the mapping function has a single input parameter the notation above is unnecessary and one can just write <function name>(.) in the Transformation cell (or <output name>=<function name>(.) if the function has a single input but more than one output).
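For example (hypothetical function and parameter names), VatCode=DetermineTaxData(country={1}, productGroup={2}) would reference the VatCode output of a two-input, multi-output function, while DetermineRegion(.) would reference a single-input, single-output function.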

Complex mapping functions are normally defined at the bottom of the mapping sheet, but if there are several of them and they are shared across multiple mapping sheets of a mapping workbook, then they may warrant a dedicated sheet or tab in the workbook.

It goes without saying that the definition of such a function is a specification for the developer, who can technically implement it in different ways as long as the specification is met.

Resulting mapping sheet structure

Hierarchical organization at row level

The different groupings of mapping sheet rows described above lead to the following hierarchy at row level:
  1. Row sections for Repeating structures (with possible nesting of sections for sub-structures)
  2. Row sections for single target field mappings (with multiple mapping source fields)
  3. Rows for individual source fields
The mapping sheet section with the Complex Mapping Functions definitions is not part of this hierarchy and is below the mapping rows or in a separate tab.

Suggested columns in a mapping sheet

Based on the practices suggested above in the article, I suggest the following columns (from left to right) for a mapping sheet:
  1. Short Description or Source Data Glossary Link
  2. Target Field/Structure Path
  3. Mapping Condition
  4. Transformation  
  5. {n}  (source input number)   
  6. Source Field/Structure Path
  7. Filtering/Selection Condition 
For column #1, it is of course preferable to use links to a central glossary documenting all or most data fields and structures used in integration work, as explained in the previous blog article The importance of Data Glossaries.  Only one Description field is provided, to supply general functional information: it is not the purpose of the mapping sheet to define source or target data formats.
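Putting the columns together, a hypothetical subsection for a two-input field mapping might read (one row per source field, values shown column by column):
  Description: Invoice document number
  Target Field/Structure Path: /Invoice/Header/DocumentNumber
  Mapping Condition: (empty = unconditional)
  Transformation: concat({1}, "-", {2})
  {1}  Source Field Path: /SalesOrder/Header/OrderNumber
  {2}  Source Field Path: /SalesOrder/Header/OrderVersion
  Filtering/Selection Condition: (empty)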

Beyond Mapping Sheets

While the vast majority of mapping specifications consist of Excel sheets, a case can be made for expressing these specifications in a machine-readable Domain Specific Language (DSL) that is still readable by humans.   A YAML-based DSL could be defined, for example.
Although Excel sheets can be exported as CSV for easy machine readability, the hierarchical structure of a mapping specification is not simple to express cleanly without using visual characteristics (borders for scopes, cell coloring, etc.)  which would be lost when exporting to CSV.

One use of a mapping DSL could be to automatically generate mapping code that would be completed and refined by the developer.
Automatic generation of good mapping code is a challenging topic, and not just technically: it places responsibility for the formal definition of the mapping logic squarely on the analyst, making the analyst in effect a meta-programmer.  Optimization of the generated code is also an issue, unless the analyst has intelligently factored out shared logic.

Another, more realistic, application would be to have software tools process such formal mapping specifications for impact analysis in Change Management.





 

Thursday, March 12, 2015

ESB Logging (part 2)

This second part of my article about ESB logging describes some techniques that can be adopted to meet the logging requirements described in part 1.

Different techniques and technical choices can be applied at each level of the following generic logging architecture (Logging Publisher, Logging Transport, Logging Data Store), which should be self-explanatory:
The role of the Logging Transport is to decouple the Logging Publisher from the Logging Data Store, and it is an essential one.  A tightly coupled solution, in which the calling code would have to wait synchronously for data to be persisted to the Logging Data Store before carrying on with its work, would be inefficient.

Note: in the picture above, the logic that serves the UI to browse and search the Logging Data Store is not explicitly represented; however, this logic must exist in order to access the data in the store in an efficient way.  This article does not treat this part of the solution.



Details of Logging Architecture Layers


We will now deal with the three layers in sequence and see how each can contribute to providing the logging functionality we need.

Logging Publisher

The Logging Publisher component must be directly callable from any ESB component via a simple API.   It is imperative that:
  •  The logging call be simple. If it is not, placing logging calls throughout the integration code will cost developers too much effort.  The simplicity of the logging call largely depends on the number of its mandatory inputs.  In most designs, the severity level and the log message should be the only mandatory inputs that the developer is explicitly required to supply.
  • The logging call be efficient, imposing the minimum possible overhead on the calling code.  The Logging Publisher must decide in the minimum possible time whether to hand over the data to the Logging Transport and, if that's the case, the handover itself must be as quick as possible.
In addition to the "log" API, a "flush" API will be necessary if the Deferred Logging technique (described at the end of this article) is adopted.
These APIs constitute a very simple contract between the logging solution and its clients: the logging solution can freely evolve as long as the contract is not broken.
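A minimal sketch of such a contract is shown below. The names are hypothetical; the only mandatory inputs are the severity level and the log message, with the rest either optional or collected implicitly (see the next section).

    import java.util.Map;

    // Hypothetical severity levels, as discussed below.
    enum Severity { DEBUG, INFO, WARNING, ERROR, FATAL }

    // The simple contract between the logging solution and its clients.
    interface LoggingPublisher {

        // Minimal call: only severity level and message are mandatory.
        void log(Severity level, String message);

        // Extended call: optional log id and document attachments (for archiving).
        void log(Severity level, String message, String logId, Map<String, byte[]> attachments);

        // Needed only if the Deferred Logging technique is adopted.
        void flush(boolean executionFailed);
    }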


Explicit and Implicit inputs to logging calls

Part of the logged data is passed explicitly by the caller:
  • Severity Level of the logging entry (e.g., DEBUG, INFO, WARNING, ERROR, FATAL)
  • A Log Message.  This can be a potentially long string of text, which can include dynamic sections.  It is not, however, to be used to store a whole message (which is the purpose of the document attachments mentioned below)
  • A Log Id to tag the precise location in the code from which the logging call is made; use of this identifier may be optional or mandatory, depending on the logging design
  • One or more document attachments (representations of data objects treated by the ESB component, like for example a service request payload); the persistent storage of documents associated with logging entries is often called archiving and is treated separately later in this article.

Other data elements are normally extracted implicitly and automatically by the Logging Publisher function:
  • A logging time-stamp
  • The name of the ESB component calling the Logging Publisher function
  • The server and port associated with the execution of the ESB component
  • The user id under which the ESB component is executing 
  • The consumer id for which the ESB component is executing; this may not be present in case the ESB component does not implement a service (e.g., in the case the component is of the polling kind); also, depending on the authorization design, the consumer id may be the same as the user id mentioned above
  • Tracking or Correlation Identifiers that unequivocally relate to the service execution and correlate all logging entries originating from this execution (more on this later)
  • Optionally, an incremental sequence number that sequences logging entries generated by the same execution (i.e., that have the same correlation identifier value)
In order to retain all the above information across logging calls from the same execution the Logging Publisher must be stateful.  To achieve this, two main approaches are used:
  1. Deferring logging state to the client of the Logging Publisher API (i.e., the integration code), by having logging state information "carried around" throughout the service implementation logic in some kind of standard "control header" or "execution context" structure.
  2. Caching logging state.
Many times, the first approach is preferred for its simplicity (and marginally better performance).
However, relying on the integration code to always possess a valid reference to context information is not always simple, especially in the face of exception situations.  This requires very careful coding and good QA on the produced integration code.  
The other approach, which is to be considered a best practice, is described next.

Caching logging state 

Caching logging state in memory, using a caching solution such as Ehcache, makes the Logging Publisher component more robust, as it does not rely on client code to retain state on its behalf. 
A memory cache (which could be local to the runtime environment or distributed) essentially behaves like a big hash table holding data objects that can be accessed in near-constant time via their corresponding keys.  In many cases, the keys are simply String objects and the values are any kind of serializable objects.
The logging state can be easily wrapped into one such object, and once it gets stored into the cache it can be retrieved reliably from any point of the service or component logic later on. The essential precondition for this is that the corresponding key must be:
  • A unique identifier for the execution instance
  • Retrievable in a simple and efficient way from the ESB runtime
Most ESB products have some form of unique execution identifier.  For example, the Software AG webMethods ESB maintains a globally unique root context id for each synchronous thread of execution.  In BPM contexts, the internal process instance id  (also normally a GUID) can be used for this purpose.

To prevent memory leaks, the ESB service/component/process logic must ensure that the logging state object is purged when a runtime execution instance (service, component, or process) ends, whether successfully or with an error.  The use of a memory cache rather than a simple hash table is helpful here, as cache products usually allow the specification of a Time-To-Live (TTL) for cache entries.  Thus, even in the unlikely cases in which the ESB logic is not able to purge an entry, it will get purged automatically after expiration (i.e., when its age exceeds its TTL).
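A sketch of this approach using the Ehcache 2.x API is shown below. The cache name, the LoggingState fields, and the sizing values are hypothetical; the point is the keyed put/get/remove access and the TTL-based expiration.

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;
    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    // Hypothetical holder for the per-execution logging state.
    class LoggingState implements Serializable {
        String globalCorrelationId;
        String componentName;
        int sequenceNumber;
        List<String> pendingEntries = new ArrayList<>();  // used only with Deferred Logging
    }

    class LoggingStateCache {
        private final Cache cache;

        LoggingStateCache(long ttlSeconds) {
            // name, maxElementsInMemory, overflowToDisk, eternal, timeToLiveSeconds, timeToIdleSeconds
            this.cache = new Cache("loggingState", 100000, false, false, ttlSeconds, 0);
            CacheManager.create().addCache(cache);
        }

        void put(String executionId, LoggingState state) {
            cache.put(new Element(executionId, state));
        }

        LoggingState get(String executionId) {
            Element e = cache.get(executionId);            // near-constant-time lookup by key
            return e == null ? null : (LoggingState) e.getObjectValue();
        }

        void purge(String executionId) {                   // call when the execution instance ends
            cache.remove(executionId);
        }
    }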

It goes without saying that the TTL must be tuned in order to exceed the maximum potential lifetime of any runtime instance.


Conditional logging based on Logging Level

When the Logging Publisher API is called, it will pass its data (received from the caller + collected automatically) to the Logging Transport  in order to persist it.

The decision to persist (or not) a logging entry normally depends on its Severity Level (as passed by the caller) relative to the Severity Level Threshold set at system or ESB component level.
It makes sense in most cases to immediately deliver to the Logging Transport any logging entry whose severity meets or exceeds the applicable Severity Level Threshold, so that there is the shortest possible lag between the generation and the persistence of these entries. 

For better supportability, it is important that Severity Level Thresholds can be set into a data store that is accessed dynamically at runtime, so logging can be "turned up" and "turned down" on the fly by application support staff. 

The logging thresholds (global or by component) must be configurable in a persistent store (such as a DB), but they are read very frequently, every time the Logging Publisher function is called.  To avoid excessive overhead, caching is essential.   Application Support could use a custom UI to alter the logging thresholds in the persistent store, and then force a refresh/reload of the cache to have the new values take effect in real time.
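A sketch of this check, with the thresholds kept in a refreshable in-memory map (in a real solution the map would be loaded from the persistent store and refreshed on demand; all names are hypothetical):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    enum Severity { DEBUG, INFO, WARNING, ERROR, FATAL }

    class SeverityFilter {
        private final ConcurrentMap<String, Severity> thresholds = new ConcurrentHashMap<>();
        private final Severity defaultThreshold;

        SeverityFilter(Severity defaultThreshold) {
            this.defaultThreshold = defaultThreshold;
        }

        // Called (e.g., from a custom Application Support UI) to turn logging up or down on the fly.
        void setThreshold(String componentName, Severity threshold) {
            thresholds.put(componentName, threshold);
        }

        // Called by the Logging Publisher on every logging call: must be cheap.
        boolean shouldPersist(String componentName, Severity level) {
            Severity threshold = thresholds.getOrDefault(componentName, defaultThreshold);
            return level.ordinal() >= threshold.ordinal();
        }
    }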

In most designs, logging events with a severity lower than the applicable Severity Level Threshold are simply ignored, but in some cases they may still be processed later when an error occurs (see the section Deferred Logging at the end of this article).
Associating sequence numbers with logging entries makes it possible in any case to sort chronologically the entries from a given execution instance, regardless of which entries were persisted first.


Logging Transport

Generally, once the Logging Publisher invokes the Logging Transport for a logging entry (or a set of related logging entries), the data passed to the transport is considered committed and will eventually be persisted.  

The Logging Transport must:
  • expose a very simple API towards the Logging Publisher,
  • internally operate asynchronously with respect to the Logging Publisher. 
Once the Logging Publisher decides that an entry is to be persisted, it simply calls the transport API to enqueue the entry  and then returns very quickly to its caller.
The processing from the Logging Transport to the Logging Data Store is completely decoupled.
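The decoupling can be sketched with a simple in-memory queue and a background worker (the persist() body is a placeholder; as discussed below, a robust solution would use persistent queues instead of an in-memory buffer):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of an asynchronous Logging Transport: enqueue() returns immediately,
    // while a background worker drains the queue into the Logging Data Store.
    class LoggingTransport {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100000);

        LoggingTransport() {
            Thread worker = new Thread(this::drain, "logging-transport-worker");
            worker.setDaemon(true);
            worker.start();
        }

        // Called by the Logging Publisher; must be as quick as possible.
        boolean enqueue(String serializedEntry) {
            return queue.offer(serializedEntry);   // non-blocking; returns false if the buffer is full
        }

        private void drain() {
            try {
                while (true) {
                    persist(queue.take());
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }

        private void persist(String entry) {
            // Placeholder for the write into the Logging Data Store (relational or document DB).
            System.out.println("persisted: " + entry);
        }
    }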

In the simplest implementations, the Logging Transport just consists of a wrapper around the access protocol of the Logging Data Store.  For example, with a relational database used as Logging Data Store, the Logging Transport API would just wrap a DB call (or set of calls), executed in a separate thread.  This solution shows its shortcomings during peaks of logging traffic, when the DB connection pool is exhausted and logging entries must be discarded (with the loss of potentially important logging information).

A robust solution for the Logging Transport should use persistent queues for temporary storage of log entries to be persisted into the Logging Data Store.   This provides the necessary buffer to cope with variable loads.
The queuing solution used by logging should be totally independent, both architecturally and infrastructurally, from the asynchronous messaging solution used for business-critical messages (as part of the integration logic).  A very high logging volume should not compromise the throughput and latency of business-critical messages, and we certainly want to avoid the risk that a flood of log entries brings the main messaging system to its knees.

The queuing infrastructure used by logging should be lightweight and low-latency, such as for example ZeroMQ, or Software AG / Terracotta Universal Messaging.
There are interesting possibilities for advanced use of the Logging Transport, such as:
  • Holding entries in the queue until a "flush" method is called, and releasing them to the Logging Data Store only then (this is required by a particular implementation of the Deferred Logging technique, described at the end of this article).
  • Aggregating related entries from the queue before persisting them into the Logging Data Store.  This is useful above all when a NoSQL Document DB is used as Logging Data Store, in order to bundle multiple related entries into a single document.



Logging Data Store

As already stated in part 1, the logging data store must be logically centralized in order to be effective, even if it is physically distributed. 
The most common way of implementing a centralized data store for logging is via some kind of database.  ESB tools mostly use relational DB tables for their standard logging solutions, but very often custom logging data stores are designed.  These designs range from a simple set of relational tables to the use of NoSQL, document-oriented databases such as MongoDB.

What matters, really, is that the data store design is optimized for:
  • insertions
  • queries by a given set of indexed fields
The latter property is crucial to enable the efficient retrieval of related logging entries. Relationships will be based on the correlation identifiers described later, and possibly also on additional, configurable "classification keys" that may have functional meaning.


Logging Data Store and Message Archiving

If we take away the message archiving requirement, that is, the inclusion of potentially big message payloads into audit logging, then it makes sense to simply use a physically centralized logging data store.
Things change a bit when message archiving is in scope (as it is in most cases).  Having distributed processing nodes send big messages (which can sometimes be in the order of tens of MB or even larger) to a centralized store over the network is inefficient.

A more efficient approach is to store in the central DB everything with the exception of message payloads, and keep these locally at the node that generated them.  The centralized DB will hold location metadata about these distributed payloads:
  • hosting node 
  • unique local access path (typically including a globally unique file name)
The rationale behind this choice is that it is very efficient for the processing node to store a payload locally, while access to these payloads will be very infrequent (only when a user browsing the logging data store "drills down" into the payload part of a given logging entry).

The downsides of distributed archiving are mainly:
  • the need for a technical service at each node that delivers a local payload on demand
  • the need for an agent at each node that manages payload retention, deleting old payloads according to the retention policies
In the presence of massive amounts of messages to be archived, the use of the Hadoop Distributed File System (HDFS) can be considered.



Specific logging functionality

  

Correlation

This is a cornerstone aspect of any logging solution, which must be able to handle multiple correlation identifiers (all of which should be GUIDs).

Local Correlation Identifier

This identifier  is generated at the beginning of each thread involved in the execution of a service.
If the service is not a simple synchronous service, then each invocation originates more than one value for this identifier (one value for each synchronous execution context).
Only logging entries generated within the same synchronous execution context share the same value for this identifier. The local correlation identifier can normally be derived directly from the ESB runtime context (e.g., the root context id in webMethods or the session id in Mule).

Global Correlation Identifier

This is the most important correlation identifier. It has the value of the Local Correlation Identifier of the first ESB component that is executed as part of a service invocation.  The service logic must then propagate it across the whole invocation, regardless of whether the components involved are loosely or tightly coupled. This is achieved technically via protocol properties, depending on the implementation of the service (e.g., SOAP headers, HTTP headers, JMS properties, etc.)
The ubiquitous availability of this correlation identifier throughout the execution chain is what allows all logging entries to be effectively correlated.
Finally, it is a best practice for services to return the Global Correlation Identifier as part of each service response, allowing consumer applications to create a cross-reference between the ESB Global Correlation Identifier and any internal identifier (for instance the one used as "Tracking ID", described below).
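For loosely coupled hops, the propagation can be sketched with the standard JMS and HTTP APIs; the property and header names used here are hypothetical conventions, not a standard.

    import java.net.HttpURLConnection;
    import java.util.UUID;
    import javax.jms.JMSException;
    import javax.jms.Message;

    class CorrelationPropagation {

        // Generated once (as a GUID) by the first ESB component executed in the invocation.
        static String newGlobalCorrelationId() {
            return UUID.randomUUID().toString();
        }

        // Propagation over JMS: carry the identifier as a message property.
        static void propagate(Message jmsMessage, String globalCorrelationId) throws JMSException {
            jmsMessage.setStringProperty("GlobalCorrelationId", globalCorrelationId);
        }

        // Propagation over HTTP: carry the identifier as a request header.
        static void propagate(HttpURLConnection connection, String globalCorrelationId) {
            connection.setRequestProperty("X-Global-Correlation-Id", globalCorrelationId);
        }
    }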

External Correlation Identifier ("Consumer Tracking ID")

Technically, this identifier may be handled exactly like the Global Correlation Identifier, but there is an important difference: it is generated externally by the service consumer, not internally by the ESB. Its main purpose is to allow tracking via a "source" identifier that has meaning in the context of the consumer application.  It can thus be called the "Consumer Tracking ID".

Even though the ESB may enforce that each consumer supply a Tracking ID (rejecting service requests that lack this ID), the ESB cannot realistically enforce its uniqueness.  For this reason, it cannot normally be used in lieu of the Global Correlation Identifier.  Thus, rather than propagating the External Correlation Identifier throughout the service execution, it is sufficient to simply associate it with the Global Correlation Identifier in the Logging Data Store.


Logging and BPM

When using a BPM solution to build Orchestrated Task Services, custom logging is usually less necessary, as most BPMS tools offer ample process audit and monitoring capabilities.  For each process instance, the sequence of states (i.e., the sequence of traversed process steps) is tracked, usually with the possibility of logging selected inputs and outputs for process model steps.

Virtually all BPM engines natively support the definition of a Process Correlation ID to join events to running process instances.  This correlation identifier normally has functional significance and is to be considered the equivalent of the External Correlation Identifier discussed above.

Each process instance is assigned a unique Process Instance ID (PID) by the BPMS, which is to be considered the equivalent of the Local Correlation Identifier (although it is not in general associated with a single thread of execution).   

For the Global Correlation Identifier, a similar logic can be followed as for non-BPM integrations, by adopting the PID of the "root" process instance and propagating this value to all sub-processes, regardless of whether they are detached or not.  Detached sub-processes will have their own instance ID (= Local Correlation Identifier), but can still share the propagated Global Correlation Identifier (and the Process Correlation ID = External Correlation Identifier).

In case the out-of-the-box functionality of the BPMS does not allow easy audit logging, the techniques described in this post are perfectly applicable to automated business processes,  with the equivalences just mentioned.


 Deferred Logging

In Production environments, it is common practice to log only one entry at the beginning and one entry at the end of the execution of a service or component in case of success (typically archiving the input and output messages at the same time).  The Severity Level Threshold is normally kept at "ERROR" or "WARNING", since the logging system would be swamped if the level were set to "INFO" or "DEBUG" across all services at typical Production volumes.

This normally limits the logging information available when an error occurs to two entries: the "execution start" entry and the "error occurred" entry, with no information in between.
When a component or service instance experiences an error,  it would be useful if the maximum amount of logging information generated prior to the error were made available irrespective of the Severity Level Threshold.
This is valuable even for a BPM solution, which intrinsically provides better tracking than a non-BPM solution, because it adds detailed visibility into what happens within process activities.

The technique called Deferred Logging consists of temporarily holding all logging information (of all logging levels, right down to "DEBUG") until the execution of the service/component/process instance terminates, at which point a "flushing" action is triggered:
  • If the instance terminates successfully, the pending logging entries are filtered based on the current Severity Level Threshold.  Only the entries meeting or exceeding the threshold are flushed to the Logging Data Store (e.g., if the threshold is set to "WARNING", then "INFO" and "DEBUG" entries are discarded and not persisted into the store).
  • If the instance terminates with an error, then all pending logging entries are flushed to the Logging Data Store, thus storing a complete "history" of what happened prior to the error.

There are two basic approaches to implementing Deferred Logging:
  • The Logging Publisher component caches the pending logging entries in memory (normally along with the rest of the logging state), and performs the "flushing" action at the end of the instance execution (this is triggered via a special Logging Publisher API method).
  • The Logging Transport is designed not to deliver to the Logging Data Store immediately, but to hold logging entries pending a flush action (or their expiration).  Here, also, the flush action is triggered at the end of the instance execution via a Logging Publisher API method.  This second option is more complicated but more robust, as pending entries are kept in persistent storage rather than in a memory cache.
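A sketch of the first approach (in-memory buffering inside the Logging Publisher) is shown below; the types and the string-based entries are hypothetical simplifications.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    enum Severity { DEBUG, INFO, WARNING, ERROR, FATAL }

    // Deferred Logging, first approach: buffer everything down to DEBUG in memory,
    // then flush once when the service/component/process instance terminates.
    class DeferredLogBuffer {

        static class Entry {
            final Severity level;
            final String message;
            Entry(Severity level, String message) { this.level = level; this.message = message; }
        }

        private final List<Entry> pending = new ArrayList<>();

        void log(Severity level, String message) {
            pending.add(new Entry(level, message));      // nothing is persisted yet
        }

        // Called once at the end of the instance execution (the "flush" API).
        void flush(boolean executionFailed, Severity threshold, Consumer<Entry> store) {
            for (Entry e : pending) {
                // On error: flush everything. On success: apply the normal threshold.
                if (executionFailed || e.level.ordinal() >= threshold.ordinal()) {
                    store.accept(e);                     // hand over to the Logging Transport
                }
            }
            pending.clear();
        }
    }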

Conclusion

Once a flexible logging API is designed, its implementation (the logging solution) can be made more and more sophisticated over time.

The advice is to start small and evolve, based on the supportability requirements that are specific to your organization.
Depending on the requirements, some logging features could be skipped in the first implementation, but it is my opinion that the 3-tier logging architecture presented here (with asynchronous Logging Transport) is a good foundation for a robust and scalable solution.