Tuesday, September 15, 2015

Structured Mapping Specifications

Data mapping and transformation constitute a key part of the vast majority of integration solutions.
Notable exceptions are the cases where the same data model is shared across the whole integration, from service consumers to backend systems (a case not very frequent in practice), and the case of pure file transfers.

In virtually all other cases there is always the need to map source messages into target messages.

A high-level view of mapping

Below are a few definitions and assumptions to define the context of this article.

A Message is a tree-structured data container modeled after the Message Enterprise Integration Pattern.
All practical message formats (XML, JSON, flat file, EDI file) can be parsed into a tree-structure representation.

A Simple Message Mapping
  •  transforms exactly one source message into exactly one target message (1-1 mapping)
  •  can achieve both structural and value transformations (value transformations can include any function including value computation and value cross-referencing, for example)
  • only uses information that is present in the source message or in the integration environment (e.g., timestamp generation, UUID generation, use of environment properties, etc.)
A Complex Message Mapping can be of these different kinds:
  • 1-N mapping: multiple target messages are produced from the same source message. This case is trivial as it can be simply expressed as the "parallel" combination of multiple simple (1-1) message mappings.
  • N-1 mapping: in this case multiple source messages are transformed into a single target message.  The source messages can result from a message join or correlation (typically in a BPM solution), but can also include secondary source messages produced as output by data enrichment functions that are invoked based on a "main" source message as input.  Without analyzing this case in detail (this may be the subject of a subsequent post) one can be intuitively grasp that such a N-1 mapping can be implemented as a "network" of 1-1  mappings (some of while providing the inputs for one or more Data Enrichers).  Once the source messages are defined, a mapping rule can be expressed by qualifying each source field with the source message it belongs to.
  • N-M mappings:  these are the most general as they produce M target messages from N source messages, can be defined as a set of N-1 mappings. 
The rest of the article discusses ways of specifying  simple message mapping only, leaving the topic of complex message mapping specification to a future post.


Mapping specification issues

Traditionally, mapping specifications are being delivered to integration developers by analysts in the form of mapping spreadsheets (the so-called Mapping Sheets), and integration developers need to figure out how to best implement the specification using a combination of tools, based on the integration platform they are working on (tool specific visual mapping configuration, Java code using mapping utility classes, XSLT sheets, etc.)

Almost always, these mapping specifications are ambiguous to a certain degree. Thus, they require multiple subsequent clarifications with the analyst and/or they are misinterpreted by the developer (so it turns out during integration testing that the implemented mapping was not what the analyst intended and needs to be reworked).

This article suggest some ways to improve on this situation.

Let's look first at the way mapping sheets are generally structured, and some reason why they are ambiguous (I do not have the ambition to be exhaustive here, as people can be very creative when it comes to being unclear with their specs).
What we generally find are Excel sheets structured in three sections, from left to right:
  • Source: specifies the source field for the mapping  (generally via some XPath-like notation)
  • Transformation: specifies the transformation logic (if any, beyond simple value copy) necessary to obtain the value for the target field from the value of the corresponding source field
  • Target:  specifies the target field for the mapping  (gain, mostly via some XPath-like notation)
This structure is more than adequate for simple cases, where the source and target messages are simple records (without repeating sub-records) and each target field is obtained either from a single source field or is hardcoded (hardcoding is usually expressed by specifying the fixed value in the Transformation section and leaving the Source section empty).

However, as we need to handle more complicated message structures and more sophisticated mappings, some limitations become apparent.

Source-oriented mapping specs are not convenient

Many analysts build their mapping sheets by listing all source message fields (in the "topological" order in which they appear in the source message) and then add the corresponding target message fields with transformation information on the right.    This is one of the worst possible things to do.
The purpose of data mapping is to generate target messages, and the developer will have to decompose the implementation of any nontrivial mapping based on the structure of the target message.
Structuring the mapping sheet according to the source message makes thing difficult for the developer, who will likely find the information needed for mapping a target structure scattered across the mapping sheet.   When the same source field, for example a currency code, is mapped to different target fields in completely different places within the target message structure, grouping these mapping lines by source field is really confusing and counterproductive.

Multi-input mappings not expressed cleanly

Whenever a target field needs to be mapped from more than one source field (even if the mapping logic is something as simple as a string concatenation or a sum), then expressing the mapping as Source field - Transformation - Target field is inadequate.   People usually end up specifying one of the mapping inputs as the source field and the others in the Transformation rule (in an unstructured way).  Therefore, it is not immediately clear to the developer which is the complete set of inputs that go into the mapping.

Ambiguity when mapping repetitive structures

Only the simplest mappings do not involve repeating data structures.  Most business documents contain lists of items and quite often there are nested lists (for example, an invoice can contain multiple invoice items, and each invoice item normally contains a list of item-level references and a list of item-level discounts and/or surcharges).
This is probably the area where mapping specifications are most ambiguous, as it is often not clear from mapping sheets what drives a repetitive mapping, and consequently over what the mapping code needs to loop in order to implement it.
In most cases, the repetitive mapping must iterate on a list in the source message (i.e., we have to map one target structure instance for each instance of a source structure), but that is often not well expressed in the mapping sheet, which may not contain clearly marked sections for repetitive mappings.  In some cases, XPath-like expressions used in the sheets do not denote which structure is a sequence (e.g.  /SalesOrder/Items/ProductCode instead of /SalesOrder/Items[]/ProductCode, with the developer being supposed to know that Items is a sequence).

Ambiguity when using filtering conditions

Filtering conditions in mapping specifications come essentially in two forms:
  1. Conditional mapping condition:  a precondition for mapping to be executed at all  (if it evaluates to false, then the specified target remains unmapped).  For example, within an invoice mapping, the mapping of a "related invoice reference" is only performed if the billing document  is not a normal invoice but rather a credit or debit note that reference an earlier invoice.
  2. Sequence filtering/selection conditions:  within the mapping of repeating structure (i.e., a list of strings, records, or documents), we map only from the instances in the list that satisfy a certain condition.  For example, when mapping Ship To party information from an EDIFACT document, we normally have to map from the particular instance of the NAD (Name And Address) segment whose qualifier value equals "DP" (Delivery Party).  The filtering condition can result into zero, one or more target structure instances being mapped. 
In many mapping sheets, these two types of conditions are not clearly distinguished.



Principles for better mapping specifications

This section gives some advice for improvement, so that the developer has an easier time implementing the mapping as specified.

Target orientation

Since the goal is to produce target documents at runtime, it is logical to start a mapping sheet by listing, on its left side, the structures and fields within the target message that need to be mapped.

Do not list all the existing target message fields including those which do not need being mapped to: this will lead to an unreasonably long list of fields which is unwieldy and distracting (especially when dealing with huge standards-based message schemas).  Keep the mapping sheet as concise as possible.  Using mapping sheets for the additional purpose of documenting data structures is not good practice:  data structures should be documented separately, ideally in a Data Glossary.

Since the target message will in most cases contain sub-structures, group the target fields into sections within the mapping sheet (one section per substructure), in the same order in which each field and substructure appears within the target schema definition. This in in preparation of step below (identification of repetitive structures).
It is good practice to leave one blank row between adjacent sections in the mapping sheet for better clarity.

Top-down mapping decomposition

Top-down decomposition is a fundamental principle in all engineering activities.  As applied in this context, the mapping of a complex target message needs to be decomposed into sub-mappings, namely the sub-mappings for the main sub-structures within the target message.
For example, the mapping of an EDI message (definitely a complex structure) must be decomposed into the mappings of its segments and sub-segments (only those in mapping scope, of course).

This decomposition, driven by the structure of the target message, should be explicit already in the mapping sheet.  Even if the developer will not strictly adhere to it in his/her implementation (typically for performance optimization reasons), it will provide a clear, structured view of the mapping.

Clear identification of multi-input mappings

Conceptually, one can take the "functional" view that a target data structure is a function of the source message, and more specifically of a defined set of source fields.
The simplest and most common case is the 1-1 assignment, where the value of a target field is a function of just one source field, and the mapping function (transformation) is a simple copy.

In general, however, a target field value can be a function of more that one source field value (T=f(S1, S2, ..., Sn)).  It is very important to group the field paths of the source fields (on the right side of the mapping sheet) under the target field whose mapping they contribute to (that target field path will be on the left side of the mapping sheet).

Each such grouping forms a small mapping subsection in the mapping sheet, which contains the specification of the mapping function or logic to be applied to the mapping inputs (e.g., replacement in string pattern, sum, maximum, etc.) in the Transformation column.
To avoid ambiguity, when the transformation logic operates on multiple source fields, it should refer to these fields as {1}, {2}, ... where the number in braces represents the order of the source field rows (from top to bottom) within the mapping subsection (not the complete mapping sheet).
Conceptually, each single mapping corresponds to a mapping subsection with the paths of all the necessary source fields listed in separate rows on the RHS of the subsection.   Each mapping subsection must be visually marked in some way (from example using cells borders), to clearly show which set of  source fields participates in the mapping. 

In some cases one of mapping inputs could be a target field already mapped, rather that a source field.  This is mostly done when the mapping logic to obtain a value to be used multiple times is complex, and should not be repeated. A mapping sheet can cope with this practice, by specifying as source field something like target:/<path to target field already mapped>.  It is better practice however to factor out complex mapping logic into "complex mapping functions" (see subsection below), and just reference these functions multiple times in the mapping sheet, each time clearly specifying the inputs to be passed, the latter being strictly source field paths.

If the same source field concurs to the mapping of multiple target fields, then it will have to appear multiple times in the mapping sheet (on the RHS), while a target field should normally appear only once on the LHS of the sheet.

Identification of repetitive structures, repetition triggers, and scopes

Having grouped (as mentioned above) the target fields by the structures they belong to within the target message schema, it is necessary to identify which structures are repeating within the target schema (in general, their multiplicity), and then what drives the repetitions.

Sometimes there is a fixed number of repetitions that is known in advance.  Keeping to the EDIFACT invoice example, we may have to map within the EDI message five instances of the NAD (Name And Address) segment: for the Seller party, the Sold-To party,  the Ship-To party,  the Invoice Receipient party, and the Payer party.
Here there would be five separate mapping sections for the /INVOIC/NAD[] repeating structure, one mapping exactly one instance of the NAD segment from different source data.
Each section should be preceded by the identification of the instance that is being mapped, via an index.  Example:
/INVOIC/NAD[0]   ---> mapping of NAD instance from Seller party
/INVOIC/NAD[1]   ---> mapping of NAD instance from Sold-To party
/INVOIC/NAD[2]   ---> mapping of NAD instance from Ship-To party
etc.
More frequently the mapping of a repeating structure in the target message is driven by a corresponding repeating structure in the source message.  For example, for each invoice line item in the source message we need to instantiate and map a corresponding line item in the target message (in the case of an EDIFACT message, a LIN segment).  The instances of the source repeating structure constitute the repetition triggers for the mapping of the corresponding instances of the target  structure. 
If the source sequence is empty, there is no trigger, so no mapping occurs.

This type of mapping logic is sometimes called "push mapping", the occurrence of structure instances within the source message "pushes" the generation of corresponding structure instances in the target message.  Conversely, "pull mapping" is the way of working discussed above that expresses a given target data item as a function of one or more source data items.
Push mapping is implemented either by looping on the of the repeating source structure (via an explicitly programmed loop or <xsl:for-each> in XSLT),  or by some type of pattern matching rule (e.g., <xsl:template match ="..."> in XSLT).
It is key that the sections for the repeating structures are clearly identified within the mapping sheet, with the target sequence on the leftmost column (along with all target fields) and the corresponding source sequence (repetition trigger) in one of the middle columns (to the left of the source fields).
The path to the repeating structures must be in the first row of its section, ending with index-less brackets ([]) to explicitly denote that we are dealing with a sequence that is to be iterated on in a repetitive mapping.   In addition, the rows for repeating structures should be marked visually via a different background color.

Within a repetitive mapping,  we normally map many fields and sub-structures, so we must identify the source and target mapping scopes.   The source scope is the particular instance of the source repetitive structure that is being mapped from in the given iteration.  The target scope is the particular instance of the target repetitive structure that is being mapped to in the given iteration.
In many cases, values are being mapped from the source scope to the target scope (e.g., from canonical document line items to application specific document line items).
In addition, some repetitive mappings can be be from outside the source scope, because they are hardcoded or are from fields that are not part of the source scope (typical case is the mapping of a currency code to all lines of a target message from the same fields in the header section of the source message).
Finally, repetitive mapping can be nested (e.g. discount and surcharge sequence within an invoice item sequence), thus leading to nested source and target mapping scopes.
Examples:
  • /INVOIC/LIN[] denotes the repetitive LIN structure within the INVOIC message  
  • /INVOIC/LIN[]/QTY[]  denotes the repetitive QTY structure within the repetitive LIN structure within the invoice message (nested scope). 
  • /INVOIC/LIN[0]/QTY[]  denotes the repetitive QTY structure within first instance of the LIN structure within the invoice message  
Note: one could correctly argue that the message root (/INVOIC) is the outermost scope, so every scope within it is a nested scope.
Normally, within each scope, we have to map multiple fields, so we need to specify source and target field paths for these mappings.  Writing out the complete, absolute field paths that start from the message root (e.g., /INVOIC/LIN[]/QTY[]/C186/6060) provides better clarity although it carries come redundancy as the path for the containing repeating structure (/INVOIC/LIN[]/QTY[]).
Note that the braces in a field path such as /INVOIC/LIN[]/QTY[]/C186/6060 do not denote a whole sequence, but just the current instance of the sequence within the mapping scope.
To completely avoid ambiguity in this respect, one can use a dot notation to indicate "current instance" as is /INVOIC/LIN[.]/QTY[.]/C186/6060.   With this notation, /INVOIC/LIN[.]/QTY[] would represent the complete QTY sequence within the current line item (LIN) of the EDIFACT invoice.
Providing field paths relative to the current scope (e.g. just C186/6060 in this example) is more concise but forces the developer to "reconstruct" that complete field path. 
Except in cases where names are especially long and structures very deeply, using the complete path for each field is probably the best option.

Clear distinction between conditional mapping and sequence filtering

In the mapping sheet, these should be separate columns for Mapping Condition and Filtering Condition.
  • Mapping Condition: condition that must be satisfied  in order for the mapping to be executed at all.  This can apply to both simple field mappings (e.g., map only if the source value is not 0) and for repetitive mappings (e.g., map the target sequence only if another source field, not part of the source sequence, meets a certain condition)
  • Filtering/Selection Condition: condition that must be satisfied in the current source scope of a repetitive mapping in order for the current source scope (in other words, the current instance of the source sequence) to be used in the mapping.  For example when mapping the Sold-To customer from an EDIFACT inbound sales order (ORDERS) message, we need to select the NAD segment to map from based its qualifier value being equal to " BY" (= Buyer = Sold-To party).  The column for such filtering/selection conditions can be populated for any field mapping, not just in "sequence" rows.
Quite intuitively, empty mapping or filtering/selection conditions indicate that the mapping is executed unconditionally.
However, to reduce overhead in mapping sheets, the mapping condition "map only if source value is populated" is normally implicit,  If a null source value is a possible situation, distinct from the "source value not populated" situation, then such a mapping condition may have to be explicitly spelled out.

The source fields tested by a condition must be listed together with the source fields used in the actual mapping on the RHS of the sheet, in the mapping subsection of the target field.  As in the Transformation logic, also when writing condition one must uniquely identify the input value via the {n} notation  (denoting the nth source field within the mapping subsection for the target field).
Note that if the condition is very complex, it may be expressed via a  Complex mapping Function (see below).

Isolation of Complex Mapping Functions

When the mapping logic is complex and/or is needs to be reused in multiple places within a mapping sheet, it is good practice to factor out this logic by defining a named Complex Mapping Function with zero or more inputs and one or more output (usually one).
This keeps the clutter in the Transformation column of the sheet to a minimum and avoids redundancy when the logic is applied more than once.
Having specified the function, the analyst just has to reference its name (and the name of the output, it the function produces more than one) in the relevant cells in the Transformation column. If the function has more that one output, a notation such as <output name>=<function name>(...) should be used to indicate which output is used for the mapping.
Defining multi-output functions is useful to reduce redundancy where the logic would be largely shared across two or more functions (e.g.,  there is a complex procedure to determine a first target value, then a second target value is obtained by some computation from the first)

In case of multiple inputs, mapping function should be defined with multiple named input parameters that must be associated with  corresponding source field paths (the "actual" parameters) in the RHS of the mapping sheet.  This is a case of multi-input mapping as described above, so the transformation would be expressed as follows:
[<output name>=]<function name>(<parameter name 1>={n}, <parameter name 2>={m}, ...)
where again {n} and {m} represent the source fields in the n-th and m-th position within the mapping subsection.  The part [<output name>=] is only used in  case the function produces multiple outputs.
When the mapping function has a single input parameter the notation above is unnecessary and one can just write <function name>(.) in the Transformation cell (or <output name>=<function name>(.) if the function has a single input but more than one output).

Complex mapping functions are normally defined at the bottom of the mapping sheet, but if there are several of them and they are shared across multiple mapping sheets of a mapping workbook, then they may warrant a dedicated sheet or tab in the workbook.

It goes without saying that the definition of such function should be a specification for the developer, who can technically implement it in different ways as long as the specification is met.

Resulting mapping sheet structure

Hierarchical organization at row level

The different grouping of mapping sheet rows as described above lead to the following hierarchy at row level:
  1. Row sections for Repeating structures (with possible nesting of sections for sub-structures)
  2. Row sections for single target field mappings (with multiple mapping source fields)
  3. Rows for individual source fields
The mapping sheet section with the Complex Mapping Functions definitions is not part of this hierarchy and is below the mapping rows or in a separate tab.

Suggested columns in a mapping sheet

Based on the practices suggested above in the article, I suggest the following columns (from left to right) for a mapping sheet:
  1. Short Description or Source Data Glossary Link
  2. Target Field/Structure Path
  3. Mapping Condition
  4. Transformation  
  5. {n}  (source input number)   
  6. Source Field/Structure Path
  7. Filtering/Selection Condition 
For column #1, it is of course preferable to use links to a central glossary to document all or most data fields and structures used in integration work, as explained in previous blog article The importance of Data Glossaries.  Only one Description field is provided to supply general functional information, as it is not the purpose of the mapping sheet to define source or target data formats.

Beyond Mapping Sheets

While the vast majority of mapping specifications consist of Excel sheets, a case can be made for expressing these specifications in a machine-readable Domain Specific Language (DSL) that is still readable by humans.   A YAML-based DSL could be defined, for example.
Although Excel sheets can be exported as CSV for easy machine readability, the hierarchical structure of a mapping specification is not simple to express cleanly without using visual characteristics (borders for scopes, cell coloring, etc.)  which would be lost when exporting to CSV.

One use of a mapping DSL could be to automatically generate mapping code that would be completed and refined by the developer.
Automatic generation of good mapping code is a challenging topic not just technically, as it will place clear responsibility for formal definition of the mapping logic on the analyst, making the analyst in effect a meta-programmer.  Optimization of the generated code is also an issue unless the analyst intelligently factored out shared logic.

Another, more realistic application, would be having software tools process such formal mapping specifications for impact analysis in Change Management.   





 

No comments :

Post a Comment