Wednesday, October 22, 2014

ESB logging (part 1)

Logging is one of the fundamental cross-cutting concerns in any software system.

This three-part article will discuss the requirements for ESB logging in part 1, propose some techniques to address them in part 2, and finally address in part 3 the implications of logging in a BPM environment (i.e., when services are orchestrated by a BPMS).

Functional and technical requirements of ESB logging

In ESB systems, there are several reasons why we need to do logging, the main ones being:
  1. To keep a functional audit of integration activities (sometimes called "transaction logging"); this is essentially a traceability requirement.
  2. To have detailed troubleshooting information available in case of errors or malfunctions
  3. To collect performance information, which helps us identify bottlenecks in our service implementation 
These three points can be considered very high-level functional requirements for ESB logging.  As always, depending on the specifics, each of them translates into technical requirements that must be appropriately addressed.  The following subsections discuss these three macro-requirements in more depth.

Functional audit

The creation of a functional audit trail is normally mandated by business policies and/or by external regulation.  In most cases, it is required to include a copy of the payload of the service (API) requests and responses.  In other cases, not only the payload but also all control information (in the form of SOAP headers for SOAP services, HTTP headers and status code for REST services, JMS properties for asynchronous JMS services, etc.) must be logged. Logging that includes payloads is sometimes called archiving.
Strictly speaking, the usage of the term "archiving" is not appropriate as the term should be used only in those cases where the system keeps persistent copies of the payloads in normal operation, which is typically the case for B2B gateways.  After a set period, these messages are moved from the "online" data store to the "offline" archive store. 

Looking at the audit requirement more in detail, we need to define:
  • Where to log (audit points)
  • What to log (logging content)
  • How long to keep the logs (retention policy)

For discussing the first point (perhaps the most important of the three), consider the diagram below that shows a rather simple example of an ESB service implemented via a single ESB component, composing native services of two different backend systems but not composing other ESB services:



 The diagram shows six different audit logging points:
  • two at the ESB service endpoint, facing the consumer (logging its request and the response delivered to it)
  • four facing the backend systems, logging the requests and responses exchanged between the ESB (in a service consumer role) and the native endpoints of the backend systems.
As a minimum, audit logging must be done at the two consumer-facing endpoints, but for comprehensive logging all messages that enter and exit an ESB component must be included.  In general, these are:
  1. Inbound request from service consumer
  2. Outbound request to native backend service
  3. Inbound response (or fault) from native backend service
  4. Outbound request to composed ESB service
  5. Inbound response (or fault) from composed ESB service
  6. Outbound response (or fault) to service consumer
Note: by "fault" I here mean any structured set of error information, not necessarily a SOAP fault.

Clearly, logging points of types 2 through 5 are due to service composition, so the number of such logging points is proportional to the number of service operations being composed.

It is necessary to correlate the log entries generated at each audit logging point, which can be done in two ways within a component:
  • by logging independently at each audit logging point (generating a separate log entry at each point), but having the log entries share a common unique identifier
  • by grouping the log information generated at each audit logging point into a single log entry, which is written when the service completes execution (either with success or with a fault).  This can be called delayed logging (a rough sketch of this option follows below).
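As a rough illustration of the delayed-logging option (class and method names here are my own invention, not taken from any specific ESB product), the entries produced at each audit logging point can be accumulated under a single correlation identifier and written out once, when the service execution completes:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Minimal sketch of "delayed logging": entries are buffered per service execution
// and flushed as a single consolidated log record at the end (success or fault).
public class DelayedAuditLogger {

    private final String correlationId = UUID.randomUUID().toString(); // shared by all entries of this execution
    private final List<String> bufferedEntries = new ArrayList<>();

    // Called at each audit logging point (points 1..6 in the list above).
    public void log(String auditPoint, String payloadSummary) {
        bufferedEntries.add(auditPoint + " | " + payloadSummary);
    }

    // Called once, when the service completes with success or with a fault.
    public void flush(AuditStore store) {
        store.write(correlationId, bufferedEntries); // single write instead of one per logging point
    }

    // Abstraction over the physical Logging Data Store (database, file, queue, ...).
    public interface AuditStore {
        void write(String correlationId, List<String> entries);
    }
}

The same correlationId can of course also be used with the first option, simply by writing each entry immediately instead of accumulating it.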
The choice between minimal audit logging (consumer request and response only) and full audit logging should be configurable at the level of service and service operation (or resource / sub-resource for a REST service), with the usual pattern of configuration inheritance and override between parent and child (the setting for the service is inherited by its operations unless specifically overridden at operation level).


The second point relates to the content of the log entries.  A minimal set of information fields for the logging function will be suggested in part 2 of this article, but the important point to remark here is that the information logged at each audit logging point may be heterogeneous and vary greatly in size.

Often the information is already in textual form (e.g., XML payloads, JSON payload, SOAP headers, HTTP headers), but in some cases we have binary information (e.g., scanned documents).
Even when information is textual, it may not arrive as a String at the ESB endpoint (example: posted HTTP body received as a Java InputStream object), and that mostly depends on the particular ESB tool used.
Sometimes message payloads can be very large, and this must be taken into account as well.

Thus, the format and size of the data to be logged has a significant impact on the design of ESB logging, which must be flexible in this respect without introducing excessive overhead.
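Purely as a sketch (the conversion details depend heavily on the ESB product and its pipeline model), a helper along the following lines can normalize heterogeneous payloads into a bounded, loggable string:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: normalize whatever the ESB endpoint hands us (String, byte[], InputStream)
// into a bounded string suitable for an audit log entry.
public final class PayloadNormalizer {

    private static final int MAX_LOGGED_CHARS = 100_000; // cap so very large payloads stay manageable

    public static String toLoggable(Object payload) throws IOException {
        String text;
        if (payload instanceof String s) {
            text = s;
        } else if (payload instanceof byte[] bytes) {
            // Binary content (e.g., a scanned document): log it Base64-encoded.
            text = Base64.getEncoder().encodeToString(bytes);
        } else if (payload instanceof InputStream in) {
            // Streamed HTTP body: drain it (in a real ESB, take care not to consume the stream twice).
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            in.transferTo(out);
            text = out.toString(StandardCharsets.UTF_8);
        } else {
            text = String.valueOf(payload);
        }
        return text.length() > MAX_LOGGED_CHARS
                ? text.substring(0, MAX_LOGGED_CHARS) + "...[truncated]"
                : text;
    }
}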

The third point relates to the retention period of the entries created into the audit log.  Here, like in the case of the determination of audit logging points, it should be possible to configure the retention period hierarchically by service / operation (or parent / child resource), so that information related to more critical operations or resources can be retained longer.   A significant plus, on top of this functionality, would be to allow a retention policy that also takes into account the identity of the consumer.
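A hypothetical configuration fragment (element and attribute names invented for illustration) could express both the audit level and the retention period hierarchically, with operation-level overrides:

<auditConfig>
  <service name="PurchaseOrderService" auditLevel="minimal" retentionDays="90">
    <!-- operations inherit the service-level settings unless they override them -->
    <operation name="cancelPurchaseOrder" auditLevel="full" retentionDays="365"/>
    <operation name="getPurchaseOrder"/>
  </service>
</auditConfig>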


Technical logging

While the preceding subsection dealt with audit logging (that is, logging that must be done at every service invocation due to functional requirements), this section briefly discusses the additional technical logging that can be done on top of it to provide useful information about service behavior, which can be essential for troubleshooting obscure malfunctions detected in the Production environment.

First of all, let's clear the field of a rather common misconception that mixes up technical logging and error handling.

Technical logging is NOT the same as error handling.  Error handling is ESB logic that must be executed whenever an error occurs that requires manual recovery (as part of an Incident Management workflow), or that must in any case be notified to a human user (even when the error is recovered from automatically), and for which the error/fault response returned to the service consumer is not sufficient.

On the other hand, technical logging, when enabled, is done from ESB components independently of whether error conditions are occurring or not.  It simply traces the execution logic within components at a fine-grained level of detail, providing detailed insight into what happens between the audit logging points.

This does not mean that the implementation of logging can never share anything with that of error handling.  In many ESB environments, for example, designers have decided to make use of the Log4J framework as part of both their logging and their error handling solutions.

The first consideration about this kind of logging is that, for performance reasons, it should be completely turned off in normal Production operation.  Such logging is normally enabled in environments that support Unit Testing and System Integration Testing, but disabled not just in Production but also in every environment that must closely mirror Production conditions, such as the environments where Performance Testing takes place.

Still, enabling technical logging in a very granular way (e.g., per component or per component/consumer combination) can occasionally be invaluable to troubleshoot incorrect behavior in Production.  This must be done, of course, very carefully, due to the performance impact in case a component implements a service that is heavily used.

It is imperative here to have the capability of hierarchical, fine-grained configuration, so the appropriate level of logging can be set very flexibly.

Normally, technical logging should not include payloads but just short dynamic log messages that can contain the values of selected variables in the component's logic.   

An interesting option for technical logging (in between the "on" and "off" settings)  is that of conditional logging:  technical log entries are logged (at the end of the execution of a component, that is with delayed logging), but only if the component experienced an error.  By choosing this option we recognize that detailed tracking of successful service executions is superfluous, which is almost always the case.   On the other hand, "mass" errors, typically due to unavailability of a backend resource, will produce under this option a very large amount of logging information that is utterly redundant (since all service invocations failed for the same reason).
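A minimal sketch of conditional technical logging (again with invented names): trace entries are buffered in memory during execution and written out only if the component ends with an error.

import java.util.ArrayList;
import java.util.List;

// Sketch of conditional technical logging: trace entries are buffered during execution
// and written out only if the component ends with an error.
public class ConditionalTraceLogger {

    private final List<String> buffer = new ArrayList<>();

    public void trace(String message) {
        buffer.add(System.currentTimeMillis() + " " + message); // cheap in-memory append
    }

    // Called at the end of the component's execution.
    public void complete(boolean errorOccurred) {
        if (errorOccurred) {
            buffer.forEach(System.out::println); // in a real ESB: write to the Logging Data Store
        }
        buffer.clear(); // successful executions leave no technical log behind
    }
}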

Performance logging (profiling)

Since log entries (both audit and technical) are associated with a timestamp, it is obvious that logging can be used to collect performance statistics (profiling) for service operations.  The relevant log entries are only those written at the audit logging points, for two reasons:
  • The technical logging points are normally too dense within a component
  • Technical logging in Production and during Performance Testing introduces an overhead which is normally unacceptable and that in any case will bias the profiling results (the act of measuring affects the measurements).
The main statistics (min, max, average, standard deviation, etc.) that can be collected are:
  • those related to the distribution of service operation invocations (by time period, consumer, etc.)
  • those related to the total service operation execution time, within the ESB component exposing the operation
  • those related to the execution time of composed services; these, combined with the total execution time statistics of the services being composed, can provide statistics for message transit times between components (significant when the services being composed are geographically dispersed).  A sketch of how such statistics can be derived from logged durations follows below.
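As an illustration only (assuming the execution durations have already been computed from pairs of audit log timestamps), the basic statistics can be derived as follows:

import java.util.DoubleSummaryStatistics;
import java.util.List;

// Sketch: derive basic profiling statistics (min, max, average, standard deviation)
// from execution durations obtained from paired audit log timestamps.
public final class ProfilingStats {

    public static void report(List<Long> durationsMillis) {
        DoubleSummaryStatistics s = durationsMillis.stream()
                .mapToDouble(Long::doubleValue)
                .summaryStatistics();
        double mean = s.getAverage();
        double variance = durationsMillis.stream()
                .mapToDouble(d -> (d - mean) * (d - mean))
                .average()
                .orElse(0.0);
        System.out.printf("count=%d min=%.0f max=%.0f avg=%.1f stddev=%.1f%n",
                s.getCount(), s.getMin(), s.getMax(), mean, Math.sqrt(variance));
    }
}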



Log entry consolidation, correlation and sequencing

The diagram used above to illustrate the audit logging points shows a very simple situation where all logging points are reached sequentially within the execution thread of the (only) component.  In this case, correlation and sequencing of the produced log entries is trivial.

The situation changes a lot when we have to deal with complex service compositions, distributed ESB components, asynchronous interactions, clustered environments, and services orchestrated by BPM.

Even with a rather simple synchronous composition of ESB services, it is not guaranteed that these services run in the same thread:  a "local" service invocation in a clustered environment may execute the service in a cluster node other than the one where the caller service executes.

In general, in any case, we must assume that composed services are implemented by components that are distributed across the network, whether they are invoked synchronously, invoked asynchronously via a message broker (for example, in the case of SOAP-over-JMS invocations), or invoked as business process steps within a process instance.

Each component involved in a service execution creates multiple log entries in the Logging Data Store, and all these entries must be mutually correlated and sequenced in order to give a clear picture to a person browsing the log.

An example, purposefully a bit involved, is given in the following diagram:


In the picture above a distributed service composition combines functionality from 4 different backend systems.
A main service (implemented on component 1) hosted on machine A directly or indirectly invokes three other services (one hosted on the same machine, and the other two on different machines).  The service hosted on machine C (component 4) is invoked asynchronously through a messaging queue.

Each service component performs logging at its audit logging points, and perhaps some components will also have technical logging enabled.   As a result, log entries can originate from different machines and must be correlated and consolidated into a single place (the Logging Data Store).  That, by itself, can pose a concern: we can see upfront that a trivial logging implementation (for example, a separate synchronous INSERT into a central database for every individual log entry) can be inefficient and hurt our service response times.
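One common mitigation, sketched below with invented names, is to hand log entries to an in-memory queue and let a background thread write them to the Logging Data Store in batches, so that service threads never wait for the write:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: asynchronous, batched writing of log entries so that service threads
// never block on a synchronous INSERT into a central database.
public class AsyncLogWriter implements Runnable {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);

    // Called from the service thread: non-blocking, reports whether the entry was accepted.
    public boolean submit(String logEntry) {
        return queue.offer(logEntry);
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>();
        try {
            while (true) {
                batch.add(queue.take());     // wait for at least one entry
                queue.drainTo(batch, 499);   // then grab up to 500 entries per round trip
                writeBatch(batch);           // e.g., one multi-row INSERT or bulk API call
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void writeBatch(List<String> batch) {
        // placeholder for the actual write to the (logically centralized) Logging Data Store
        System.out.println("writing " + batch.size() + " log entries");
    }
}

The trade-off is that a small number of entries may be lost if the process crashes before the buffer is flushed; whether that is acceptable depends on how strict the audit requirements are.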
 
The Logging Data Store does not need to be physically centralized, but it needs to be centralized logically, allowing a User Interface to access all of its information no matter where it is physically stored.

In order for the audit and logging information to make sense to somebody who browses and analyses the Logging Data Store, log entries originating from different components that relate, directly or indirectly, to the same invocation by the External Service Consumer must be (see the sketch of a log entry structure after this list):
  1. Logically grouped together
  2. Accessible as a whole in a location-independent fashion (i.e., even when not all related entries are stored in the same physical data store)
  3. Organized hierarchically based on service composition logic  (e.g., showing the execution of component 3 as a subordinate of the execution of component 2, which is in turn a subordinate of the execution of component 1)
  4. Sequenced correctly (regardless of deviation of the system clocks of the different machines).  This is necessary for proper profiling.
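The sketch below (field names are purely illustrative) shows the kind of identifiers a log entry needs to carry so that the Logging Data Store can satisfy these four requirements:

// Sketch of a log entry carrying the identifiers needed for consolidation:
// - globalTransactionId groups all entries of one external consumer invocation (requirements 1 and 2)
// - parentExecutionId/executionId allow the composition hierarchy to be reconstructed (requirement 3)
// - sequenceNumber orders entries within one execution without relying on machine clocks (requirement 4)
public record LogEntry(
        String globalTransactionId, // generated by the first component, propagated to all composed services
        String executionId,         // unique per component execution
        String parentExecutionId,   // executionId of the calling component, null for the top-level service
        int sequenceNumber,         // incremented locally at each logging point
        String componentName,
        String hostName,
        long timestampMillis,
        String auditPoint,
        String payloadSummary) {
}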

Part 2 of this article will discuss techniques that help us achieve this without affecting (too much) the performance of our services.




Sunday, October 12, 2014

Message Payload Content Models: specific or generic (part 2)

This post continues the previous one (part 1), and presents further alternatives for modeling payload data structures for ESB events or services, using XML and XML Schema.

The second alternative of part 1 was called "Entity-specific definition and derived types", and relied on the xsi:type attribute for two purposes:
  1. To designate a restricted type derived from a generic entity-related XML Schema type (PurchaseOrderType in my example), which can then be used for validation and other purposes
  2. To refine the semantics of the message in the context of the Business Object (the Purchase Order), for example specifying whether it represents a release, a cancellation, a modification, etc.
The payload looked like this:
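For the cancellation case, such a payload might look like the following (element names reused from later in this article, values purely illustrative):

<PurchaseOrder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:type="PurchaseOrderCancellationType">
  <PONumber>4500012345</PONumber>
  <CompanyCode>1000</CompanyCode>
  <POType>NB</POType>
  <VendorNumber>100077</VendorNumber>
  <CancellationDate>2014-10-10</CancellationDate>
  <CancellationReason>Ordered in error</CancellationReason>
</PurchaseOrder>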

This approach has several benefits, but its biggest disadvantage is that, like the "Fully specific definition" approach, it still requires the definition of a large number of schema types:  the generic, "canonical" type (normally by Business Object), plus a potentially large number of restricted types (by BO/event type or by BO/service operation).

The next (more generic) approach to be examined still uses an entity-based "canonical" definition, but without any derived types.


Entity-specific definition with separate validation

Under this approach, we want to use the generic "canonical" type without the restricted schemas used under the previous approach.
Since we no longer use the standard xsi:type attribute (which was necessary to identify the restricted schema type), the schema must now define an explicit custom attribute to identify the semantics of the message.  In this example, the attribute is named eventType:
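A condensed sketch of such a schema (structure assumed, element names borrowed from the examples in this article) is:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="PurchaseOrder">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="PONumber" type="xsd:string"/>
        <xsd:element name="CompanyCode" type="xsd:string"/>
        <xsd:element name="POType" type="xsd:string"/>
        <xsd:element name="VendorNumber" type="xsd:string"/>
        <xsd:element name="CancellationDate" type="xsd:date" minOccurs="0"/>
        <xsd:element name="CancellationReason" type="xsd:string" minOccurs="0"/>
      </xsd:sequence>
      <!-- custom attribute carrying the semantic context of the message -->
      <xsd:attribute name="eventType" type="xsd:string" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>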



















Below is a sample event payload conforming to this schema, for a Purchase Order cancellation event:
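A reconstructed sample (values purely illustrative) could be:

<PurchaseOrder eventType="PurchaseOrderCancellation">
  <PONumber>4500012345</PONumber>
  <CompanyCode>1000</CompanyCode>
  <POType>NB</POType>
  <VendorNumber>100077</VendorNumber>
  <CancellationDate>2014-10-10</CancellationDate>
  <CancellationReason>Ordered in error</CancellationReason>
</PurchaseOrder>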











As we can see, the message is practically the same as for the previous approach, just with a different attribute and without the XML Schema Instance (xsi) namespace declaration.

All considerations regarding mapping that apply to the previous approach also apply here.  All integration logic that has to do with purchase orders will use instances of the common PurchaseOrder document, mapping in each case those optional elements that pertain to the functional context at hand (the functional context identified through the value of the eventType attribute).

However, under this new approach we no longer have the event-specific derived types.  In our example, we no longer have the PurchaseOrderCancellationType definition that enforces CancellationDate and CancellationReason as mandatory elements when eventType="PurchaseOrderCancellation".  The inability to validate the payload based on the semantic context would be a serious functionality gap.

To fill this gap, we need a way to specify validation logic separately from the schema, without writing custom validation code in the ESB layer (which is awkward and not very maintainable).  What we need is to specify constraints and validation rules in a declarative fashion.

In our example, the constraint is simple: the CancellationDate and CancellationReason elements must exist if the eventType attribute of the PurchaseOrder element has a value of "PurchaseOrderCancellation".   Nevertheless, there are several other types of constraints that would be useful for validation in a semantic context and that cannot always be expressed using XML Schema.  They are called co-constraints in the literature.  Even more complex validation constraints are those of an algorithmic nature (for instance, checking that for each order line the value of its GrossAmount element is the sum of the values of its NetAmount and TaxAmount elements).

Granted, one can easily get carried away and specify all kinds of complex business rules to be enforced, which brings us to the important debate on whether (and to what degree) it is appropriate to enforce business rules at the ESB layer rather than at the level of provider applications and resources.  I'll leave this discussion for another blog post and focus on the simple need of validating mandatory elements by semantic event type, which is a totally legitimate need at the ESB level.

XML technologies come to the rescue to check co-constraints and algorithmic constraints.  Technologies like Schematron and Relax NG have been available for several years in the form of libraries that can be easily used by an ESB product.  However, the most common technology to implement this logic is XSLT (on which by the way most Schematron implementations are based).

A simple XSLT stylesheet to check the presence of the additional mandatory elements (CancellationDate and CancellationReason) for a PurchaseOrderCancellation event could be the following:
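A minimal sketch of such a stylesheet (the name of the error element is my own choice) is:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- Emit an error element for each missing mandatory element; emit nothing if the payload is valid -->
  <xsl:template match="/PurchaseOrder">
    <xsl:if test="not(CancellationDate)">
      <ValidationError>CancellationDate is mandatory for a PurchaseOrderCancellation event</ValidationError>
    </xsl:if>
    <xsl:if test="not(CancellationReason)">
      <ValidationError>CancellationReason is mandatory for a PurchaseOrderCancellation event</ValidationError>
    </xsl:if>
  </xsl:template>
  <!-- Suppress the default copying of text nodes -->
  <xsl:template match="text()"/>
</xsl:stylesheet>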









Its logic should be clear:  if the PurchaseOrder document contains the two elements, then the output resulting from the transformation will be empty (= validation OK); otherwise the output will contain one or more validation errors.
The ESB must, after parsing the document and  validating it against its canonical schema (which defines which elements are mandatory in all scenarios), dynamically invoke the appropriate validation stylesheet based on the runtime value of the attribute that defines the semantic context (eventType in our case).

The dynamic invocation of an XSLT transformation is simple to implement in most ESB tools and has the following advantages compared to using restricted XML schema types:
  • XSLT validation is much more powerful than XML schema validation to express co-constraints and can even express algorithmic constraints if really required
  • The stylesheets can be more compact than XML schema restricted type definitions (which must repeat all mandatory elements of the base canonical type)
  • XSLT stylesheets can be changed and deployed independently of ESB code and XML schemas (which in most ESB tools get "hardwired" into the ESB code at design time), bringing more flexibility to change validation rules without re-deploying whole ESB components
The disadvantages essentially consist of some additional complexity (the XSLT learning curve) and the performance hit of the XSLT transformation, but this approach is nevertheless a viable one in most cases.


Fully generic payload definition

In business environments where data structure definitions must be highly flexible (for example, in R&D departments),  people tend not to want to be bound to fixed schemas for message payloads, but instead favor a fully generic payload structure that can cover arbitrarily composed data entities.

It is true that XML Schema allows open-ended definitions via the <xsd:any> construct, but that has limitations (it cannot be preceded by any optional element definitions)  and is not supported by many ESB tools, so it is not considered here.

A very simple but extreme way to implement a generic data structure is to use a collection of key value pairs (equivalent to a hash map) qualified by a semantic context (eventType).  A simple definition of such a generic structure in XML Schema could be the following:
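One possible definition (the element names GenericEvent and Entry are my own choice) is:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="GenericEvent">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Entry" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="key" type="xsd:string" use="required"/>
            <xsd:attribute name="value" type="xsd:string" use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
      <xsd:attribute name="eventType" type="xsd:string" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>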





















A sample instance of this schema (again for the PO cancellation case) is:
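Assuming the element names above, a sample instance (values purely illustrative) could be:

<GenericEvent eventType="PurchaseOrderCancellation">
  <Entry key="PONumber" value="4500012345"/>
  <Entry key="CancellationDate" value="2014-10-10"/>
  <Entry key="CancellationReason" value="Ordered in error"/>
</GenericEvent>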





Now, the schema definition is no longer entity-specific, and the value of the eventType attribute must convey the complete semantic context of the message.

What are the implications of this approach?

On the plus side:
  • There is just one very simple canonical schema, which does not even need to be versioned (as opposed to any other type of schema)
  • All event and service interfaces just use the common generic schema
  • Mapping from instances is easy using XPath (looking just for the elements that are appropriate in the functional context and ignoring the remaining content)
  • Dynamic validation based on functional context can still be done using XSLT (or Schematron, or Relax NG), like for the entity-based approach above.  In XSLT we could have (note that now all mandatory elements must be validated via the stylesheet):
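A sketch of such a stylesheet, assuming the GenericEvent/Entry structure used above (the ESB selects this stylesheet based on the eventType value), could be:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/GenericEvent">
    <xsl:if test="not(Entry[@key='PONumber'])">
      <ValidationError>PONumber is mandatory</ValidationError>
    </xsl:if>
    <xsl:if test="not(Entry[@key='CancellationDate'])">
      <ValidationError>CancellationDate is mandatory for a PurchaseOrderCancellation event</ValidationError>
    </xsl:if>
    <xsl:if test="not(Entry[@key='CancellationReason'])">
      <ValidationError>CancellationReason is mandatory for a PurchaseOrderCancellation event</ValidationError>
    </xsl:if>
    <!-- further checks (CompanyCode, POType, VendorNumber, ...) omitted for brevity -->
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>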

 On the other hand, the fully generic approach also has disadvantages:
  • Since service contracts and event definitions all use a common generic schema, they are not self-describing in terms of the data they exchange. 
  • Most importantly, components using a generic schema do not have message formats enforced at contract level, but everything is enforced at implementation level.   Service consumers, service providers, and event subscribers must be aware at runtime of what data model is coming their way.  In practice a version attribute is mandatory (in addition to eventType) so that a message recipient can apply the appropriate validation and mapping based on the combination of eventType and version
  • Mapping to a fully generic format is less convenient and cannot leverage the drag-and-drop mapping facilities available in virtually all ESB tools.  One needs to call a custom built map(key, value) helper method to add each KVP element to the payload being mapped
  • Mapping complex, hierarchical documents is more difficult.  In practice the easiest approach is to define hierarchical keys (e.g., under a PO document "/items[10]/productId" could identify the product ID of the 10th line of the order)
These points should be carefully taken into consideration before going with the fully generic approach: do you really need that degree of flexibility?


Conclusion

The choices made in the area of message modeling have a major impact on your SOA and your ESB designs.  Be sure to consider all pros and cons in the context of your enterprise (degree of agility required, business processes, existing data models, features of available tools).


Saturday, October 4, 2014

Message Payload Content Models: specific or generic (part 1)

When modeling payload data structures for messages, events, service interfaces, or API interfaces, there are some hard choices to be made, which have implications on simplicity, reuse, readability, performance, ease of validation, and ease of data mapping.
A very basic and important choice has to do with the specific vs. generic nature of the schemas we define.

A  payload schema that's highly specific defines the exact set of data fields required by a service or API signature, and in the case of an event it defines the exact set of data fields that are relevant for that event.
This being said, the actual degree of specificity of the payload schema depends on how semantically specific the service/API/event is:  there is a big difference, for instance, between a CustomerUpdated event and a CustomerBlockedForBadCredit event.  The content model for the first event may potentially include a very large set of customer master data elements (as it is not a semantically specific event), while the content model for the second could just contain the identification of the customer, and credit related data.

On the other hand, a completely generic schema would just define an open, flexible set of key/value pairs whose composition varies based on the service/API or event type it is used in, with no semantics implied by the schema definition per se.

This article (divided into two blog entries, of which this is Part 1) will consider a number of options, going from the most specific to the most generic side of the spectrum, using Purchase Order (PO) events as examples.  For example, we want to model on our ESB an event carrying the information that a certain purchase order has been released for sending to the vendor, or another event indicating that the PO has been cancelled (before the vendor has shipped it).

In the examples of both Part 1 and Part 2, XML data models will be used, omitting namespaces in order to make the examples simpler to read.  For the same reason, technical envelopes with control fields have been purposefully omitted, although real ESB architectures normally define custom envelopes.
However, element field names are standardized across the examples (as they should be, see article The Importance of Data Glossaries).
Despite the fact that I use XML and XSD here, high level considerations may apply to other formats as well. 

In this first part of the article, I present two alternatives, which both lean on the "specific" side, leaving the more generic alternatives to Part 2.


Fully specific definition

A straightforward approach to model a PO Release message would be to use a payload like this:
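For instance (element names partly reused from later examples, values purely illustrative):

<PurchaseOrderSendReleaseEvent>
  <PONumber>4500012345</PONumber>
  <CompanyCode>1000</CompanyCode>
  <POType>NB</POType>
  <VendorNumber>100077</VendorNumber>
  <ReleaseDate>2014-10-01</ReleaseDate>
</PurchaseOrderSendReleaseEvent>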

The root element name in the event payload identifies at the same time the business object affected by the event and the nature of the event. This is very specific, although the payload does not include much data in the event message beyond the identification of the PO.



To enrich the event with detailed approval data (e.g., for each approval level:  approval date, approver user id, approver remarks) it would be necessary to call a query service (Content Enricher pattern).  The choice between information-rich event payloads and "essentials-only" events would be the subject of a separate discussion.

We would of course define an XML Schema  resource in our SOA repository that would include something like the element definition below:
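A sketch of such a definition (simplified, with assumed types) could be:

<xsd:element name="PurchaseOrderSendReleaseEvent">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="PONumber" type="xsd:string"/>
      <xsd:element name="CompanyCode" type="xsd:string"/>
      <xsd:element name="POType" type="xsd:string"/>
      <xsd:element name="VendorNumber" type="xsd:string"/>
      <xsd:element name="ReleaseDate" type="xsd:date"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>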




This is an excerpt from the complete XSD file, with type definitions simplified (in practice most of them would refer to separate simpleType definitions).






Here all child elements of PurchaseOrderSendReleaseEvent are mandatory, as we intend this information to be supplied by the event publisher for every single event of this particular type.  The schema could of course define additional optional elements, but nevertheless the content model is tailored to the specific combination of business object and event.
A similar schema for the request message of an update service operation could enforce that all required inputs are present for the operation, as they correspond to mandatory child elements.

Let's see how this approach fares against the criteria listed at the beginning of the article.

The approach is clearly extremely simple, with very good readability and ease of validation (it is clear from the schema definitions which elements are mandatory in which functional context, and standard XSD validation can be readily applied).

The main drawback is that it scores low in reuse:  it is possible to share the simpleType definitions for the individual data elements but nothing more.  It is necessary to define and manage many schemas, due to the high number of Business Object / event combinations  (for events) and specific service operations (for services).
From the point of view of data mapping, each individual mapping is straightforward, but many similar mappings need to be maintained, one per schema.  Here, too, there is no reuse and maintenance is heavy.
Performance is good as only data related to the specific event is included into the document.


Entity-specific definition and derived types

A first step in the direction of having a more reusable payload is to use a common root element that only reflects the Business Object affected, while the event (or service operation) could be expressed via an attribute value.

With XML, we can take advantage of the standard xsi:type attribute defined by XML Schema to indicate the event type (avoiding the use of a custom attribute for this), and we can have the two payload instances below for a PO release and for a PO cancellation, respectively:
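Two such instances might look like the following (the derived type names are assumptions consistent with the rest of the article, the values purely illustrative):

<PurchaseOrder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:type="PurchaseOrderSendReleaseType">
  <PONumber>4500012345</PONumber>
  <CompanyCode>1000</CompanyCode>
  <POType>NB</POType>
  <VendorNumber>100077</VendorNumber>
  <ReleaseDate>2014-10-01</ReleaseDate>
</PurchaseOrder>

<PurchaseOrder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:type="PurchaseOrderCancellationType">
  <PONumber>4500012345</PONumber>
  <CompanyCode>1000</CompanyCode>
  <POType>NB</POType>
  <VendorNumber>100077</VendorNumber>
  <CancellationDate>2014-10-10</CancellationDate>
  <CancellationReason>Ordered in error</CancellationReason>
</PurchaseOrder>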













Here, although the two payloads are composed differently, we would like to use a common schema for mapping and ESB broker publication, while retaining the possibility of validating each payload against a restricted schema that is specific to the event type.

The schema definition could contain a generic PurchaseOrderType, having as its mandatory elements only those elements that are populated in every possible PO integration scenario (e.g., PONumber, CompanyCode, POType, VendorNumber), and the other elements defined as optional, as they may be present in one particular scenario but not in others. 


NOTE: this being an example, only three optional elements are shown.  But in a real case there would be a very large number of optional elements.
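A sketch of such a generic type (the three optional elements shown, the rest assumed):

<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="PONumber" type="xsd:string"/>
    <xsd:element name="CompanyCode" type="xsd:string"/>
    <xsd:element name="POType" type="xsd:string"/>
    <xsd:element name="VendorNumber" type="xsd:string"/>
    <xsd:element name="ReleaseDate" type="xsd:date" minOccurs="0"/>
    <xsd:element name="CancellationDate" type="xsd:date" minOccurs="0"/>
    <xsd:element name="CancellationReason" type="xsd:string" minOccurs="0"/>
  </xsd:sequence>
</xsd:complexType>
<xsd:element name="PurchaseOrder" type="PurchaseOrderType"/>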

 





From the above generic or "canonical" definition of purchase order, we can derive restricted schema definitions for each PO event or service based on the generic schema.  For example, for events SendRelease and Cancellation:
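A sketch of the Cancellation restriction (the SendRelease restriction would be analogous, making ReleaseDate mandatory and dropping the cancellation elements):

<xsd:complexType name="PurchaseOrderCancellationType">
  <xsd:complexContent>
    <xsd:restriction base="PurchaseOrderType">
      <xsd:sequence>
        <xsd:element name="PONumber" type="xsd:string"/>
        <xsd:element name="CompanyCode" type="xsd:string"/>
        <xsd:element name="POType" type="xsd:string"/>
        <xsd:element name="VendorNumber" type="xsd:string"/>
        <!-- optional in the base type, mandatory here -->
        <xsd:element name="CancellationDate" type="xsd:date"/>
        <xsd:element name="CancellationReason" type="xsd:string"/>
      </xsd:sequence>
    </xsd:restriction>
  </xsd:complexContent>
</xsd:complexType>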



 

 

 

 

 

 





Compared with the fully specific schema, this approach doesn't seem to have any advantages at first sight:  it still uses a restricted complex type definition for every PO event or service, plus the added generic complex type PurchaseOrderType.

However, in most ESB implementations (e.g., Software AG webMethods), it would be possible to perform mapping and broker publication against the generic (canonical) schema definition and additionally:
  • have broker subscribers based on the same generic schema with filters based on xsi:type
  • "downcast" an instance of the generic type into an instance of a restricted type via a single map operation (without having to map field by field, which is error-prone and inefficient)
  • dynamically validate a payload instance against a restricted type
  • extend the generic (canonical) type with additional optional fields with no  impact on the definition of the derived types,  the event / services definitions using the derived types, and the existing mappings between generic type and derived types
Thus, this solution is more flexible than the fully specific definition, while keeping similar levels of readability and performance on the wire.  Schema extensibility is good, as it is possible to decouple extended versions of a restricted type from extended versions of the corresponding base type (as long as the derived type version uses elements that are also defined in the base type version, of course).

The concern remains about having to manage a large number of derived schemas, but with an important difference.
The availability of the generic (canonical) schema definitions makes it much easier in this case to automatically generate the derived schemas from simple specifications.  It would not be too difficult to develop a program that reads the XSD for a generic schema from the SOA repository, reads some kind of data store (as simple as a property file or as sophisticated as a custom DB) that holds the information about which elements are mandatory for which derived type, and automatically generates the schema definitions for the derived types in the repository.


Conclusion

The two alternatives presented here offer very good functionality for mapping and validation in exchange for a maintenance burden (taking also into account the practical necessity of versioning the schema definitions).

The upcoming part 2 of the article will describe alternatives that offer a different tradeoff. 

Tuesday, September 23, 2014

The importance of Data Glossaries

In the Structured Integration approach:


Data is a foundational element, underlying the other three. 

Services (which in my terminology include APIs) exchange data in the form of JSON or XML messages, whose fields/elements must be defined and documented.

Events, no matter whether we deal with fine-grained Domain Events within an application or with coarse-grained events (such as an SAP IDoc published by a middleware adapter) also contain collections of fields.

In this post I am not going to discuss the governance of complex message or event structures (defined via XSD or JSON Schema), but I rather intend to talk about something more basic: the simple data field (generic string, numeric string, or "true"/"false" boolean string).


Documenting simple fields

Every time a simple data value must be handled, three aspects come into play: encoding, syntax, and semantics.

While the binary representation and syntax aspects are well covered by tools, the same cannot be said about the semantic aspect.  Restricting the discussion to integration contexts using self-describing formats, the meaning of a data field is communicated mainly via one of the following:
  • <xs:annotation> elements in the XSD with an embedded <xs:documentation> element.  An XSD annotation is typically added as a child element of an <xs:simpleType> declaration, which defines the syntax of the element or attribute
  • "description" fields in JSON Schema definitions
  • description fields of API query parameters, which are very similar in most API definition notations (RAML, Swagger, etc.)
These documentation elements allow unstructured content; they may be omitted, or they may be filled with imprecise and inconsistent descriptions, leading to miscommunication between business stakeholders, analysts, and developers.

Clearly communicating the business meaning of the data fields that partake in any IT-enabled process is key to bridging one of the many instances of the infamous "gap between Business and IT".

In the area of systems integration, the above is true whether the value is part of a JSON API response, part of a SOAP Body message, part of a JMS message payload that is being sent out in an event-driven fashion, and so on.

It would be very beneficial if the aforementioned documentation elements were made mandatory and referred to a web resource that unambiguously defines the meaning of the data, acting as a Single Source of Truth.

 For example, consider the following JSON Schema excerpt:

"replacement_product_id": {
  "description" : "http://mycompany.com/glossary/dataelements/replacement_product_id"  
  "type": "string",
  "pattern": "^(\d)+$"
} 
  
The resource URL in the description field of the example could be accessing a Glossary API (see resource path /dataelements/{element}) rather than linking to a static page.  That would open up several possibilities beyond shared documentation, such as schema validation and schema completion, through tools that leverage this API.
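Purely as an illustration (the response structure is entirely hypothetical), a GET on /glossary/dataelements/replacement_product_id might return something like:

{
  "name": "replacement_product_id",
  "definition": "Identifier of the product offered as a substitute for a product that is discontinued or unavailable",
  "dataDomain": "product_id",
  "type": "string",
  "pattern": "^(\\d)+$",
  "owner": "Product Master Data team",
  "businessRules": [
    "Must reference an active product",
    "Must differ from the product it replaces"
  ]
}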

The following sections go into some more detail about two distinct key features that a Glossary Application could expose via its API:  Data Domains and Semantic Data Elements.

Data Domains

When considering a simple data field definition (which is the subject of this blog post), beyond its immediate syntactical attributes (type, length, pattern, and nullability),  a very important aspect is represented by the set of values that the described field is allowed to assume.

This set of allowed values is sometimes constrained through a value enumeration (e.g., for ISO unit of measure codes), but it is often an open-ended value set, like, for instance, the set of customer IDs for all the customers of our company.   Regardless of this, it is important to see the concept of this value set in conjunction with the immediate syntactical attributes of the field, a combination that can be called a Data Domain.

Looking at individual applications, it is obvious that we end up with many application specific data domains, which can be called Technical Data Domains as they link to technical metadata of specific applications (taking their specific configuration/customizing into account).

Example:
In the world of SAP, this concept appears in the form of SAP Data Dictionary (DDIC) Domains, which may specify a value set either via a code list or by referencing a "check table" (i.e., a SAP table that defines the value set as the set of its primary key values).     SAP data dictionary domain MTART, for instance, defines the set of possible material/product types, which are defined in table T134 (check table for the domain).

Still, this concept can be made more general by abstracting away from any specific application and modeling in each enterprise a set of Business Data Domains, which are application agnostic: all the relevant values are defined purely in business terms.  For example:

Business Domain     Business Value            Technical (SAP) Domain     Technical (SAP) Value
product_type        FINISHED_PRODUCT   ==>    MTART                      FERT
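In a glossary application, such a business data domain and its technical mappings might be represented along these lines (structure and field names purely illustrative):

{
  "businessDomain": "product_type",
  "description": "Classification of products from a business point of view",
  "values": [
    {
      "businessValue": "FINISHED_PRODUCT",
      "technicalMappings": [
        { "application": "SAP", "domain": "MTART", "value": "FERT" }
      ]
    }
  ]
}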

A Business Data Domain can be part of a Common (a.k.a. Canonical) Data Model, however defining application agnostic data values across the board is a massive undertaking.


Semantic Data Elements

Data domains (as defined above) are not specific enough to convey the semantics of a field.  

A Unit of Measure (UoM) field, for example, can be used in a multitude of business processes, services, APIs, and event messages, but mostly with restrictions of its general semantics.   For example, we can have an Order UoM (used in a line of an order) or a Pricing UoM (the UoM to which a price refers).

A Semantic Data Element thus represents a specialization of a Data Domain, based on the role that a field based on the element plays in business processes.
Although the underlying data domain is the same, multiple data elements may be subject to different business rules.   For example, one company may decide that since it sells only finished products, only "discrete" units of measure like Pieces and Boxes may apply to order_unit fields, and UoM values like Kilograms or Liters do not apply.

Any business-relevant simple value used in BPM process instances, API documents, service operations messages, or event documents, should be associated with a Semantic Data Element that defines its business meaning and associated business rules.

If the Data Elements are strictly based on Business Data Domains (application agnostic), then we have true application independence for these elements, since the applicable values, too, are defined by business analysts independently of applications.   This is a characteristic of a "pure" Canonical Data Model.

However, in almost all cases, complete application independence is not realistic given the massive necessary effort and the agility required by integration initiatives.


Glossary Applications

One could define such an application as one that allows the key terms that should be shared between business and IT to be precisely defined, at different levels of granularity.

As such, the scope of a glossary can range from broad business process areas to fine-grained data element definitions.   At a basic level, even a standard CMS like MS SharePoint can serve the purpose, although such a solution is normally not flexible enough without substantial customization.

On the other hand, commercial business glossary tools are normally sold as part of more extensive data governance and MDM suites, which are acquired to support a wide range of IT initiatives.
A list of criteria for the selection of such tools is given here.

As far as using such a tool (also) to support data modeling for integration is concerned, the most important requirements should be:
  • Support for hierarchical definitions (e.g., functional areas, data domains, data elements)
  • Support of technical metadata and business rules
  • Functionality exposed via a REST API (for consumption by tools)
  • Collaborative features (to allow multiple stakeholders to define glossary items jointly)
  • Metadata export
  • Ability to run in the Cloud

 

Conclusion

The roll-out of a collaborative Data Glossary application with functionality exposed via an API and integrated with schema authoring tools can be a significant step to align business people, IT analysts, and developers, even if just considered in the context of design-time governance processes.

More opportunities lie ahead if we consider in addition the potential for run-time governance (as implemented at Policy Enforcement Points such as API gateways).  Selective domain value translation and selective data validation against business rules are just two examples.   However, these will be the subject of a future post.




Friday, September 19, 2014

Systems Integration, today

The publication of the seminal book about application integration, Enterprise Integration Patterns (by Hohpe and Woolf), dates back to 2004, exactly a decade ago.  That book (commonly called the "EIP book") systematically laid out and categorized a sizable body of knowledge that Enterprise Application Integration practitioners had accrued and applied informally over several years, mainly using Message Oriented Middleware (MOM) tools.  It has also influenced the design of recent commercial ESB products.

Even though by the early 2000s businesses had already been communicating via EDI for some 20 years, the Systems Integration space looked vastly simpler back then.   The categorization most commonly applied in the field was:
  • A2A (Application to Application) a.k.a. EAI
  • B2B (Business to Business)
  • B2C (Business to Consumer)
First of all, there were no Cloud-based applications: each corporate network was an "island" (containing many application systems) that communicated with other "corporate islands" via EDI or B2B exchanges, and with consumers via good old Web 1.0 websites.    EAI occurred within each corporate island (although across multiple geographical sites).

Secondly, since the capabilities of Web 1.0 sites were so limited from the SI point of view (page scripts mainly doing basic data validation and "cosmetic stuff", with no capability to make server calls), the B2C domain was in effect separate from the other two and mainly the concern of a different type of professional, the portal or website designer.

Integration today

Fast forward ten years: we have had the evolution of Web Service standards (WS-*) together with the wide acceptance of the SOA paradigm, then the emergence of the REST paradigm (contextually with the explosion of integrated mobile applications), the rise of Clouds and SaaS, of Public APIs, and finally the wonderful (or scary, depending on your viewpoint) new world of the Internet of Things (IoT) and Machine to Machine (M2M) interaction.   In each area, many new patterns have been proposed and documented, building on preexisting knowledge and experience.  On top of this, the world of Business Process Management (BPM) has come out of its old silos, represented by old-style workflow management systems, and has claimed its "orchestrating" role in the space of systems integration.
Today's reality is that organizations need more and more IT entities to inter-operate across different protocols, technology stacks, and trust domains in order to support innovative processes. 

The complexity and heterogeneity of the landscape that confronts a System Integrator today and the explosion of the number of integration endpoints is unprecedented, while the more traditional challenges related to bridging data syntactical and semantic differences have by no means gone away.

This calls for structured approaches and governance to underlie any SI initiative,  bringing together the multiple inter-related facets of integration, but necessarily in an agile and pragmatic way.

Design patterns remain the basic tools through which architects and designers document and share their ideas, but it is crucial to understand which of the many patterns to apply when, in which combination, and at which level, given the requirements of the process we need to enable through IT. 

Without structure and governance, the big promise of new technologies could result into a much bigger chaos than the one represented by the "spaghetti" integration architecture that EAI was born to address.