Tuesday, September 23, 2014

The importance of Data Glossaries

In the Structured Integration approach:


Data is a foundational element, underlying the other three. 

Services (which in my terminology include APIs) exchange data in the form of JSON or XML messages, whose fields/elements must be defined and documented.

Events, whether fine-grained Domain Events within an application or coarse-grained events (such as an SAP IDoc published by a middleware adapter), also contain collections of fields.

In this post I am not going to discuss the governance of complex message or event structures (defined via XSD or JSON Schema); rather, I intend to talk about something more basic: the simple data field (generic string, numeric string, or "true"/"false" boolean string).


Documenting simple fields

Every time a simple data value must be handled, three aspects come into play: encoding, syntax, and semantics.

While the binary representation and syntax aspects are well covered by tools, the same cannot be said about the semantic aspect.  Restricting the discussion to integration contexts using self-describing formats, the meaning of a data field is communicated mainly via one of the following:
  • <xs:annotation> elements in the XSD with an embedded <xs:documentation> element.  An XSD annotation is typically added as a child element of an <xs:simpleType> declaration, which defines the syntax of the element or attribute
  • "description" fields in JSON Schema definitions
  • "description" attributes of API query parameters, which are very similar across most API definition notations (RAML, Swagger, etc.)
These documentation elements allow unstructured content; they may be omitted, or they may be filled with imprecise and inconsistent descriptions, leading to miscommunication between business stakeholders, analysts, and developers.

Clearly communicating the business meaning of data fields that partake in any IT-enabled process is key to bridging one of the many instances of the infamous "gap between Business and IT".

In the area of systems integration, the above is true whether the value is part of a JSON API response, part of a SOAP Body message, part of a JMS message payload sent out in an event-driven fashion, and so on.

It would be very beneficial if the aforementioned documentation elements were made mandatory and required to refer to a web resource that unambiguously defines the meaning of the data, acting as a Single Source of Truth.

 For example, consider the following JSON Schema excerpt:

"replacement_product_id": {
  "description": "http://mycompany.com/glossary/dataelements/replacement_product_id",
  "type": "string",
  "pattern": "^(\\d)+$"
}
  
The resource URL in the "description" field of the example could point to a Glossary API (see resource path /dataelements/{element}) rather than link to a static page.  That would open up several possibilities beyond shared documentation, such as schema validation and schema completion, through tools that leverage this API.
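A tool consuming such a schema would first need to recognize a glossary reference and extract the element name before calling the Glossary API. The following minimal sketch shows how that could look; the host name, path prefix, and helper function are assumptions for illustration:

```python
from urllib.parse import urlparse

# assumed location of the Glossary API resources
GLOSSARY_HOST = "mycompany.com"
GLOSSARY_PATH_PREFIX = "/glossary/dataelements/"

def glossary_element(schema_property: dict):
    """Return the glossary element name referenced by a JSON Schema
    property's description, or None if it is not a glossary URL."""
    parsed = urlparse(schema_property.get("description", ""))
    if parsed.netloc == GLOSSARY_HOST and parsed.path.startswith(GLOSSARY_PATH_PREFIX):
        return parsed.path[len(GLOSSARY_PATH_PREFIX):]
    return None

prop = {
    "description": "http://mycompany.com/glossary/dataelements/replacement_product_id",
    "type": "string",
    "pattern": "^(\\d)+$",
}
print(glossary_element(prop))  # replacement_product_id
```

A schema-validation or schema-completion tool could then use the returned name to fetch the element's definition from the Glossary API.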

The following sections go into some more detail about two distinct key features that a Glossary Application could expose via its API:  Data Domains and Semantic Data Elements.

Data Domains

When considering a simple data field definition (which is the subject of this blog post), beyond its immediate syntactical attributes (type, length, pattern, and nullability),  a very important aspect is represented by the set of values that the described field is allowed to assume.

This set of allowed values is sometimes constrained through a value enumeration (e.g., for ISO unit of measure codes), but it is often an open-ended value set, such as the set of customer IDs for all the customers of our company.   Regardless, it is important to see this value set in conjunction with the immediate syntactical attributes of the field, a combination that can be called a Data Domain.
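The combination of syntactical attributes and allowed value set can be sketched as a small data structure. The class and field names below are illustrative, and the sample domains (a closed UoM code list, an open-ended customer ID set) are assumed examples:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataDomain:
    """A simple field's syntactical attributes plus its allowed value set."""
    name: str
    base_type: str                            # e.g. "string"
    pattern: Optional[str] = None             # syntax constraint (regex)
    enumeration: Optional[frozenset] = None   # closed value set; None = open-ended

    def allows(self, value: str) -> bool:
        """Check a value against both syntax and value-set constraints."""
        if self.pattern and not re.fullmatch(self.pattern, value):
            return False
        if self.enumeration is not None and value not in self.enumeration:
            return False
        return True

# closed domain: unit-of-measure codes (illustrative subset)
uom = DataDomain("unit_of_measure", "string",
                 enumeration=frozenset({"PCE", "BX", "KGM", "LTR"}))
# open-ended domain: customer IDs, constrained only by syntax
customer_id = DataDomain("customer_id", "string", pattern=r"\d{8}")
```

For open-ended domains, a real Glossary API would not enumerate values but could point to the system of record that owns them.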

Looking at individual applications, it is obvious that we end up with many application-specific data domains, which can be called Technical Data Domains, as they link to the technical metadata of specific applications (taking their specific configuration/customizing into account).

Example:
In the world of SAP, this concept appears in the form of SAP Data Dictionary (DDIC) Domains, which may specify a value set either via a code list or by referencing a "check table" (i.e., an SAP table that defines the value set as the set of its primary key values).     SAP data dictionary domain MTART, for instance, defines the set of possible material/product types, whose values are defined in table T134 (the check table for the domain).

Still, this concept can be generalized by abstracting away from any specific application and modeling, in each enterprise, a set of Business Data Domains, which are application agnostic; in other words, all the relevant values are defined purely in business terms.  For example:

Business Domain    Business Value           Technical (SAP) Domain    Technical (SAP) Value
product_type       FINISHED_PRODUCT   ==>   MTART                     FERT
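A Glossary API serving such mappings would enable value translation in both directions. The sketch below uses the product_type/MTART example above; the dictionary stands in for mapping data the (hypothetical) Glossary API could serve, and the function names are assumptions:

```python
# business-to-technical value mapping per business domain,
# as a Glossary API might serve it (SAP MTART example from the text)
BUSINESS_TO_SAP = {
    "product_type": {"FINISHED_PRODUCT": "FERT"},
}

def to_technical(domain: str, business_value: str) -> str:
    """Translate a business domain value to its SAP counterpart."""
    return BUSINESS_TO_SAP[domain][business_value]

def to_business(domain: str, technical_value: str) -> str:
    """Reverse lookup: translate an SAP value back to the business value."""
    reverse = {v: k for k, v in BUSINESS_TO_SAP[domain].items()}
    return reverse[technical_value]
```

Such translations are one of the run-time governance opportunities touched on in the conclusion.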

A Business Data Domain can be part of a Common (a.k.a. Canonical) Data Model; however, defining application-agnostic data values across the board is a massive undertaking.


Semantic Data Elements

Data domains (as defined above) are not specific enough to convey the semantics of a field.  

A Unit of Measure (UoM) field, for example, can be used in a multitude of business processes, services, APIs, and event messages, but usually with restrictions on its general semantics.   For example, we can have an Order UoM (used in a line of an order) or a Pricing UoM (the UoM to which a price refers).

A Semantic Data Element thus represents a specialization of a Data Domain, based on the role that a field based on the element plays in business processes.
Although the underlying data domain is the same, multiple data elements may be subject to different business rules.   For example, one company may decide that, since it sells only finished products, only "discrete" units of measure like Pieces and Boxes may apply to order_unit fields, while UoM values like Kilograms or Liters do not apply.
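The relationship between a Data Domain and the Semantic Data Elements that restrict it can be sketched as follows. The concrete UoM codes and the "discrete only" rule follow the order_unit example above, but the specific values are illustrative assumptions:

```python
# underlying UoM data domain (illustrative subset of codes)
UOM_DOMAIN = {"PCE", "BX", "KGM", "LTR"}

# business rule attached to the order_unit semantic element:
# only "discrete" units of measure apply
DISCRETE_UOMS = {"PCE", "BX"}

def valid_order_unit(value: str) -> bool:
    """order_unit semantic element: only discrete UoMs from the domain."""
    return value in UOM_DOMAIN and value in DISCRETE_UOMS

def valid_pricing_unit(value: str) -> bool:
    """pricing_unit semantic element: any UoM from the domain applies."""
    return value in UOM_DOMAIN
```

Both elements share the same Data Domain, but each carries its own validation rule, which is exactly the specialization the text describes.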

Any business-relevant simple value used in BPM process instances, API documents, service operations messages, or event documents, should be associated with a Semantic Data Element that defines its business meaning and associated business rules.

If the Data Elements are strictly based on Business Data Domains (application agnostic), then we have true application independence for these elements, as the applicable values are also defined by business analysts independently of applications.   This is a characteristic of a "pure" Canonical Data Model.

However, in almost all cases, complete application independence is not realistic, given the massive effort required and the agility demanded by integration initiatives.


Glossary Applications

One could define such an application as one that allows the key terms shared between business and IT to be precisely defined, at different levels of granularity.

As such, the scope of a glossary can range from broad business process areas to fine-grained data element definitions.   At a basic level, even a standard CMS like MS SharePoint can serve the purpose, although such a solution is normally not flexible enough without substantial customization.

On the other hand, commercial business glossary tools are normally sold as part of more extensive data governance and MDM suites, which are acquired to support a wide range of IT initiatives.
A list of criteria for the selection of such tools is given here.

As for using such a tool (also) to support data modeling for integration, the most important requirements are:
  • Support for hierarchical definitions (e.g., functional areas, data domains, data elements)
  • Support of technical metadata and business rules
  • Functionality exposed via a REST API (for consumption by tools)
  • Collaborative features (to allow multiple stakeholders to define glossary items jointly)
  • Metadata export
  • Ability to run in the Cloud
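To make the REST API requirement concrete, here is a minimal in-memory sketch of the read side of such a Glossary API. The resource layout (dataelements, datadomains) matches the resource path used earlier in this post, but the entry contents and the lookup helper are assumptions; a real service would expose these lookups over HTTP:

```python
# in-memory stand-in for the glossary store; entry contents are illustrative
GLOSSARY = {
    "datadomains": {
        "product_id": {"type": "string", "pattern": "^(\\d)+$"},
    },
    "dataelements": {
        "replacement_product_id": {
            "domain": "product_id",
            "definition": "ID of the product offered as a replacement",
        },
    },
}

def get_resource(path: str):
    """Resolve a relative resource path such as
    'dataelements/replacement_product_id'; None if not found."""
    collection, _, item = path.partition("/")
    return GLOSSARY.get(collection, {}).get(item)
```

A schema-authoring tool could call such an endpoint to validate that every "description" URL in a schema resolves to an existing glossary element.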

 

Conclusion

The roll-out of a collaborative Data Glossary application with functionality exposed via an API and integrated with schema authoring tools can be a significant step to align business people, IT analysts, and developers, even if just considered in the context of design-time governance processes.

More opportunities lie ahead if we consider in addition the potential for run-time governance (as implemented at Policy Enforcement Points such as API gateways).  Selective domain value translation and selective data validation against business rules are just two examples.   However, these will be the subject of a future post.




Friday, September 19, 2014

Systems Integration, today

The publication of the seminal book about application integration, Enterprise Integration Patterns (by Hohpe and Woolf), dates back to 2004, a decade ago.  That book (commonly called the "EIP book") systematically laid out and categorized a sizable body of knowledge that Enterprise Application Integration practitioners had accrued and applied informally over several years, mainly using Message Oriented Middleware (MOM) tools.  It has also influenced the design of recent commercial ESB products.

Even though by the early 2000s businesses had already been communicating via EDI for 20 years, the Systems Integration space looked vastly simpler back then.   The categorization most commonly applied in the field was:
  • A2A (Application to Application) a.k.a. EAI
  • B2B (Business to Business)
  • B2C (Business to Consumer)
First of all, there were no Cloud-based applications: each corporate network was an "island" (containing many application systems) that communicated with other "corporate islands" via EDI or B2B exchanges, and with consumers via good old Web 1.0 websites.    EAI occurred within each corporate island (although across multiple geographical sites).

Secondly, since the capabilities of Web 1.0 sites were so limited from the SI point of view (page scripts mainly doing basic data validation and "cosmetic stuff", with no capability to make server calls), the B2C domain was in effect separate from the other two and mainly the concern of a different type of professional, the portal or website designer.

Integration today

Fast forward ten years: we have had the evolution of Web Service standards (WS-*) together with the wide acceptance of the SOA paradigm, then the emergence of the REST paradigm (concurrently with the explosion of integrated mobile applications), the rise of Clouds and SaaS,  of Public APIs, and finally the wonderful (or scary, depending on your viewpoint) new world of the Internet of Things (IoT) and Machine to Machine (M2M) interaction.   In each area, many new patterns have been proposed and documented, building on preexisting knowledge and experience.  On top of this, the world of Business Process Management (BPM) has come out of its old silos represented by old-style workflow management systems, and has claimed its "orchestrating" role in the space of systems integration.
Today's reality is that organizations need more and more IT entities to inter-operate across different protocols, technology stacks, and trust domains in order to support innovative processes. 

The complexity and heterogeneity of the landscape that confronts a System Integrator today and the explosion of the number of integration endpoints is unprecedented, while the more traditional challenges related to bridging data syntactical and semantic differences have by no means gone away.

This calls for structured approaches and governance to underlie any SI initiative,  bringing together the multiple inter-related facets of integration, but necessarily in an agile and pragmatic way.

Design patterns remain the basic tools through which architects and designers document and share their ideas, but it is crucial to understand which of the many patterns to apply when, in which combination, and at which level, given the requirements of the process we need to enable through IT. 

Without structure and governance, the big promise of new technologies could result in a much bigger chaos than the one represented by the "spaghetti" integration architecture that EAI was born to address.