Friday, November 11, 2016

Decoupling APIs from backend systems (part 1)

Once published,  an API should be stable and long-lived, so its clients can depend on it long-term.  A great deal of literature has rightfully pointed out that an API definition constitutes a contract between the API provider and its clients (a.k.a. API consumers). 

In many large enterprises, an API layer is being built on top of the existing application landscape via an ESB and an API gateway.  The implementation of the APIs relies on the existing backend systems (ERP systems, cloud-based solutions, custom databases and other legacy systems) integrated via the ESB.

This contrasts with the green-field architectures that can be adopted by a start-up, in which there are no legacy systems or heavyweight ERPs to deal with. As a result, these architectures can be built from the ground up using microservices and APIs.

This two-part article discusses two patterns by which we can decouple a REST API from the backend systems that underlie its implementation.
The first pattern is the subject of this blog post (part 1), the second pattern builds on the first and will be described in part 2.

Typical API enablement architecture

 A common realization of an API layer on top of enterprise backend systems is shown in the figure below.

Figure 1: Typical high-level architecture for API enablement
API invocations are synchronous, and they are mediated by the API Gateway and the ESB:
  • The API Gateway deals with API policies (authentication, usage monitoring, etc.)
  • The ESB provides the API implementation by orchestrating interactions with one or more backends, and by mapping between backend data formats and the API data format (i.e., the resource representations, since we are talking about REST APIs)
These two architectural components give our APIs important degrees of decoupling from the backends, notably in terms of security, data format independence, and protocol independence.

In sophisticated API architectures, the ESB implements up to three tiers or layers of APIs:
  1. "System" APIs, exposing raw backend data resource representations (e.g.,  JSON or XML representations of DB record sets)
  2. "Process" APIs, exposing "business domain" resources using some kind of backend-agnostic, "canonical" representation
  3. "Experience" APIs, exposing customized views of the Process APIs for specific types of clients; these API's are typically just a "projection" of the Process APIs
Such layering is advocated by MuleSoft in this white paper.  A layered API architecture fosters re-use of  lower-layer API's and can help partition the responsibility for bridging the gap between backend data and the resource representations we want to expose to API clients.

The gap between legacy systems and our desired resource representations can be quite wide, especially if we want our API to follow the RESTful HATEOAS principle and expose resource representations following JSON API or HAL, for example. 


But ... we still have coupling!

Despite its many advantages, an architecture for REST APIs as described above can still possess a fairly high degree of coupling.
There are in my view two main sources of residual coupling between API clients and our (theoretically hidden) backends:
  • Time coupling associated with simple synchronous request-response
  • Data coupling in resource URIs and resource representations
This part 1 of the article addresses the first kind of coupling, part 2 will be addressing the second.

Quite clearly,  the presence of a synchronous invocation scope that spans all the way into the backend systems couples API clients with the backends that are part of the API implementation.  
This can be problematic for a number of reasons:
  1. The backend systems used by the API implementation must be highly available in order for the API to meet its agreed availability SLA.
  2. The performance of a synchronous chain of invocations can be compromised by a single "weak link", and in many cases a backend ends up being that weak link.  A slow response from a single backend can cause every API based on it to breach its response-time SLA.
  3. Sometimes synchronous interactions with systems of record are not encouraged, especially for create/update operations.  For example, it is common for SAP teams to limit or forbid usage of synchronous BAPIs to post update transactions, demanding instead the use of IDocs (which are normally processed asynchronously) for this purpose.
  4. Some legacy backends may not even support synchronous request-response interactions (using staging DB tables or even files to exchange data).

 

Asynchronous update APIs

Here I describe an asynchronous approach when using APIs to update backend resources (above all for POST and PUT operations, but possibly also for DELETE operations).
This is more complex than the simple synchronous approach but gains us the precious advantage of decoupling between API clients and backend systems.

For query (GET) operations it is usually sufficient to implement paging appropriately to cope with queries that are long-running and may yield a large result set.  However, if the backend endpoint does not support paging or does not support a synchronous interaction, then the asynchronous approach I explain below may apply as well.

Behavior of an async update API

The essence of the proposed approach is that the API implementation submits the request message to the backend(s) via asynchronous, one-way protocols and immediately returns to the client a URI representing a "handle" or "promise" that allows the client to retrieve the API invocation result later (in a subsequent request).

This "promise" pattern loosely takes after the Promise concept in Javascript/Node.js, where a Promise object encapsulates the result of an asynchronous operation that will be completed in the future.

Sticking to the REST update scenario, I can illustrate the proposed mode of operation with the following example.  Given the following request:

     POST /api/billing/v1/vendorinvoices HTTP/1.1
     Host: mycompany.com
     Content-Type: application/json

     {Data}

The response may look like the following (omitting some "customary" response headers such as Date etc.):

     HTTP/1.1 202 Accepted
     Location: https://mycompany.com/api/billing/v1/vendorinvoices/f894eb7a-f2fc-4803-9a7d-644a0261010f
Where f894eb7a-f2fc-4803-9a7d-644a0261010f is a UUID generated by the API infrastructure and stored in a high-speed correlation store.  After some time (possibly suggested to the client in the API response via a Retry-After header), the client should check for the results like this:

     GET /api/billing/v1/vendorinvoices/f894eb7a-f2fc-4803-9a7d-644a0261010f HTTP/1.1
     Host: mycompany.com

On receiving this GET request,  the ESB would look up the correlation store and check whether a response has been posted there in the meantime (after being asynchronously mapped from backend data).  There are three possible outcomes:

1. Result still pending
This should generate the same response as to the original invocation (with HTTP status 202).

2. Success
The response should be either:
     HTTP/1.1 201 Created
     Location: https://mycompany.com/api/billing/v1/vendorinvoices/5105600101
     {Optional Data} 
Or:
     HTTP/1.1 303 See Other
     Location: https://mycompany.com/api/billing/v1/vendorinvoices/5105600101
     {Optional Data} 
The important point here is that the "handle" or "promise" URI gets exchanged for the actual URI of the created resource, which contains the document ID generated by the target backend (i.e., 5105600101).
A subsequent GET request to the resource URI would not go through the correlation store anymore, as standard ESB logic would recognize it as an actual resource URI.

3. Error
The correlation store needs to record any error response produced as a result of the execution of the API against a backend system.  This error information would be returned in the response body with an appropriate HTTP error status code in the 4xx or 5xx range.
In case no result (positive or negative) gets recorded against the "handle" within a given timeout period, an error is returned to the API client.
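
From the client's point of view, the whole interaction boils down to a simple polling loop.  Below is a minimal client-side sketch (Python with the requests library); the URL, the 5-second default wait, and the overall give-up time are illustrative assumptions, not part of the proposed contract:

     import time
     import requests

     def create_invoice_async(payload):
         """POST the new resource, then poll the returned 'handle' URI until a final outcome is known."""
         resp = requests.post("https://mycompany.com/api/billing/v1/vendorinvoices", json=payload)
         resp.raise_for_status()
         handle_uri = resp.headers["Location"]              # the "handle" / "promise" URI

         deadline = time.time() + 300                       # client-side give-up time (illustrative)
         while time.time() < deadline:
             time.sleep(int(resp.headers.get("Retry-After", 5)))
             resp = requests.get(handle_uri, allow_redirects=False)
             if resp.status_code == 202:                    # outcome 1: result still pending
                 continue
             if resp.status_code in (201, 303):             # outcome 2: success, actual resource URI returned
                 return resp.headers["Location"]
             resp.raise_for_status()                        # outcome 3: error recorded against the handle
         raise TimeoutError("no result available for " + handle_uri)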



Implementation

The figure below shows how this "promise" pattern can be implemented (the "API implementation block" makes use of the ESB and the API Gateway):

Figure 2: Design for asynchronous API (note: dashed arrows express asynchronous flows)

The original client request (1.1) generates a UUID.
The UUID is mapped as part of the asynchronous update message to the backend  (1.2), for correlation purposes.
A correlation entry is created in the correlation store for the "handle" URI, incorporating the UUID (1.3).
The 202 Accepted response is returned to the client with the "handle" URI in the Location header.
When the new resource is created within the backend system, an asynchronous notification message is sent out (2.1) from which the actual resource URI can be derived. KEY POINT: This message MUST incorporate the UUID so it can be correlated with an existing correlation store entry.  Any posting errors resulting from the processing of the request within the backend  must also be notified via the same channel.
The actual resource URI (in case of success) or alternatively the error information is saved in the correlation store (2.2).

When the client makes a GET request with the "handle" URI (3.1), this is used to look up the correlation store (3.2), and the actual URI is returned to the client (3.3) via the customary Location header (or the error information is returned in the response body).
If the 3.1 request arrives later than a configured timeout after the creation timestamp of the correlation entry, and the entry still has a "null" result (i.e., no actual resource URI and no error info), then an error can be returned in step 3.3.  Retention periods in the correlation store are discussed below.

In case of success, the client now holds the actual resource URI and can then query it (steps 4.1 to 4.4).
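
To make the flow above more concrete, here is a minimal sketch of the API implementation side.  A Python dict stands in for the high-speed correlation store and a queue.Queue for the asynchronous channel to the backend; in a real implementation these would be something like Redis and a JMS destination, and all names, structures, and status codes are illustrative:

     import queue
     import time
     import uuid

     correlation_store = {}           # stand-in for a high-speed correlation store (e.g., Redis)
     backend_queue = queue.Queue()    # stand-in for the asynchronous channel to the backend (e.g., a JMS queue)

     def handle_post(resource_path, payload):
         """Steps 1.1-1.3: generate the UUID, dispatch the async update, create the correlation entry."""
         correlation_id = str(uuid.uuid4())                            # 1.1
         backend_queue.put({"correlation_id": correlation_id,          # 1.2: the UUID travels with the message
                            "payload": payload})
         correlation_store[correlation_id] = {"created": time.time(),  # 1.3
                                              "resource_uri": None, "error": None}
         handle_uri = "https://mycompany.com%s/%s" % (resource_path, correlation_id)
         return 202, {"Location": handle_uri}                          # 202 Accepted with the "handle" URI

     def on_backend_notification(notification):
         """Steps 2.1-2.2: the backend notification carries the UUID plus the new document ID or an error."""
         entry = correlation_store[notification["correlation_id"]]
         if notification.get("error"):
             entry["error"] = notification["error"]
         else:
             entry["resource_uri"] = "https://mycompany.com%s/%s" % (
                 notification["resource_path"], notification["backend_id"])

     def handle_get_by_handle(resource_path, correlation_id, timeout_seconds=3600):
         """Steps 3.1-3.3: resolve the 'handle' URI against the correlation store."""
         entry = correlation_store.get(correlation_id)
         if entry is None:
             return 404, {}                                            # handle unknown or already evicted
         if entry["error"]:
             return 500, {"error": entry["error"]}                     # error recorded from the backend
         if entry["resource_uri"]:
             return 303, {"Location": entry["resource_uri"]}           # exchange handle for the actual URI
         if time.time() - entry["created"] > timeout_seconds:
             return 500, {"error": "no result within the configured timeout"}
         return 202, {}                                                # result still pending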

"Hybrid" implementation with synchronous backend-facing logic

At first sight, the complexity of this design seems to reside in the use of the correlation store, but this is not really the case: a sophisticated ESB implementation will normally leverage a reliable, high-speed, highly available data store anyway, for caching, idempotency, and other purposes that require storing state.

The most critical point is actually getting a notification out of the backend system (2.1) containing the correlation UUID that was injected via 1.2.  There may be technological constraints within the backend that make this difficult.

In cases where the backend system does allow a synchronous request-response interaction, the best option is to take advantage of the best practice that demands that the API implementation be layered into a "system" API layer and a "business" API layer, and avoid propagating the UUID through the backend altogether. 
This is shown in the figure below, where the System ("backend-facing") API interacts with the backend synchronously but communicates with the upper Business API layer asynchronously.
Figure 3: Decoupling within API implementation
The bulk of the API logic (which maps between resource representations and backend formats, among other things) is hosted in the Business API layer.  It is there that the UUID is generated and then sent with the asynchronous request to the System API layer (message A).  After the A message is sent (typically on a JMS queue), the Business API can insert the "handle" URI into the correlation store and then return it to the client.  The Business API layer thus always stays very responsive to clients.
The System API implementation is triggered by message A and is completely decoupled from the interaction between the API client and the Business API layer.  The backend-specific integration logic in the System API can update the backend synchronously (even if it has to wait relatively long for the backend response), and can automatically retry the invocation until successful in case  the backend is temporarily unavailable.  All this does not affect the API client in any way: the client holds the "handle" URI,  i.e. the "promise" to retrieve the eventual outcome of the operation.
Once the System API has a "final" (positive or negative) response from the backend (C), it forwards this response asynchronously (again, typically via a JMS queue), including the UUID that was kept in the System API's context.  This is message D, which triggers the Business API to update the correlation store with the result.
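
A sketch of the System API side of this hybrid design is shown below, again with queue.Queue standing in for the JMS queues that carry messages A and D; call_backend is a placeholder for the backend-specific synchronous call (automated retries on temporary backend unavailability are left out for brevity):

     import queue

     request_queue = queue.Queue()    # carries message A (Business API layer -> System API layer)
     result_queue = queue.Queue()     # carries message D (System API layer -> Business API layer)

     def system_api_worker(call_backend):
         """Consume message A, update the backend synchronously (C), publish the outcome as message D."""
         message_a = request_queue.get()
         try:
             backend_response = call_backend(message_a["payload"])     # synchronous backend call (C)
             result = {"correlation_id": message_a["correlation_id"],  # the UUID never enters the backend
                       "backend_id": backend_response["document_id"],
                       "error": None}
         except Exception as exc:                                      # posting errors are reported via D too
             result = {"correlation_id": message_a["correlation_id"],
                       "backend_id": None,
                       "error": str(exc)}
         result_queue.put(result)                                      # message D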


Retention period in the Correlation Store

Under this design,  correlation store entries are not meant to be kept long-term, but should be automatically evicted from the store after a configurable retention period.
The retention period must be guaranteed to be longer than both the following time intervals:
  1. The longest possible interval between the asynchronous backend  request and the asynchronous backend response (i.e., between events 1.2 and 2.1 in Figure 2 above).
  2. The longest possible interval from the original client request to the latest possible request to exchange the "handle" URI for the effective resource URI (i.e., between events 1.1 and 3.1 in Figure 2 above).  This second interval is the more binding one, as it normally needs to be longer than the first.
A retention interval on the order of minutes would be sufficient with most backends, but if the API implementation incorporates automated retries in case of backend unavailability (which is advisable for business-critical update operations), then the interval could be substantially longer (on the order of hours or more).

If the API receives a GET request (3.1 in Figure 2) with a "handle" URI that no longer exists in the correlation store, the client receives an error.  In any case, after retrieving the actual resource URI, a client is expected to keep hold of it and not query the "handle" URI anymore.


Conclusion

The biggest disadvantage of such a design (besides of course the added complexity) is that API clients need multiple API calls to get hold of the actual resource. 
If the client is too "eager" to obtain the final result of the operation, it may poll the API too often while the API result is not yet available in the correlation store.

A REST API still relies on HTTP(S), which, unlike a WebSocket, does not allow the server to notify the client of an asynchronous event (in this case, that the API result is ready).

Nevertheless, it is so important to decouple API clients from the internal IT landscape of the enterprise that this pattern should be adopted more widely.

Part 2 of the article, which addresses the issue of data representation dependencies, will build on this pattern, showing how what we called the "handle" or "promise" URI may well become the effective, permanent resource URI.

 

 


Wednesday, October 5, 2016

A categorization for error handling and recovery

Proper handling of error situations is an absolute must for any production-grade integration solution.

This post attempts to categorize the significant cases that occur in the domain of non-BPM integration, considering both synchronous integrations (API, request-reply services, etc.) and asynchronous integrations (pub/sub, polling, etc.)

For each category we describe what features the integration solution should possess to achieve resiliency to errors and ease of error recovery (which is necessary for production support). This post does not discuss in any detail how these features can be realized in ESB tools:  these topics will be addressed in future posts.



Categorization criteria

First of all, the categorization criteria must be clearly presented.

Type of integration (synchronous vs. asynchronous)

Synchronous integrations (such as APIs) differ from asynchronous integrations in one fundamental aspect: the invoking client cannot wait long for a response.

This means that in the case of temporary unavailability of a target resource we cannot in general wait and retry (after a delay) while the client code hangs waiting for our response.
Waiting would create usability and performance issues on the client side, and the SLA of our API would probably be breached anyway.

A synchronous integration (i.e., one that follows the Request-Response exchange pattern) must in almost any case immediately return an error response to the client, regardless of the type of error.  Retrying is normally not an option unless the retries are attempted within a very short (sub-second) time interval, which makes them of limited use, since a transient issue is likely to last longer than that.

In case of a transient error (see below) it is important that the error response clearly indicates that a retry by the client is desired, as the client always bears the responsibility for retrying the invocation of synchronous integrations.
For a REST API, the most appropriate HTTP status code for transient errors is 503 Service Unavailable, ideally accompanied by a Retry-After response header.
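
For example, such a transient-error response might look like this (the body format is purely illustrative):

     HTTP/1.1 503 Service Unavailable
     Retry-After: 120
     Content-Type: application/json

     {"error": "Backend temporarily unavailable, please retry later"}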


Asynchronous integrations have the big advantage of looser coupling but involve complications, one of which is the frequent requirement of guaranteed delivery to the target resources.

Once our integration fetches a message from a source JMS queue and acknowledges it, or sends HTTP status 202 Accepted to a REST client (to name just two examples), it accepts delivery responsibility.
An async integration should always automatically retry updates in the face of transient errors (see below).

If the error is not transient, the source message cannot be dropped, but must be parked somewhere for later analysis and reprocessing.  The only exception to this rule is non-business-critical messages (advisory, statistics, etc.).

While for sync integrations error reprocessing is invariably initiated from the client or source system, for async integrations this is not necessarily true: the integration solution must often provide usable reprocessing capabilities.


Type of update

It is useful to distinguish three cases with respect to the type of target resources:
  1. Single resource
  2. Multiple resources (transactional) 
  3. Multiple resources (transaction not possible)
The first case is clearly the easiest: since there is just one (logical) target resource affected, the update is intrinsically atomic.  At worst the outcome of the update may be in doubt (see below).

If the integration updates more than one target resource, it may be possible to have a transaction span these updates.  This is typically the case when multiple tables must be updated in the same relational DB, or when we need to send to multiple target JMS destinations. 

In some cases, distributed (XA) transactions across heterogeneous resources may be viable (please bear in mind, however, that XA transactions can have adverse scalability and performance impacts).

A key advantage of transactionality is the automated rollback that takes place in the event of an error, ensuring state consistency across the target resources.   The only exception is the unlucky case of an "in doubt" transaction.

Finally, we have the trickiest case, in which it is not technically possible to encompass the multi-resource update within a single transaction.  In this case, if it is required to ensure mutual consistency across related resources at all times, the integration must attempt compensating updates to undo the partial update committed before the error occurred.


Type of error 

When discussing error handling, several categorizations are possible, for example:
  • "technical" errors vs. "functional" errors
  • "data-related" functional errors vs. "configuration-related" functional errors
  • "explicitly-handled" errors vs. "unhandled" errors 
  • "transient" errors vs. "non-transient" errors
Not all of these categorizations are equally useful.  In particular the technical vs. functional distinction is often fuzzy.
Some errors are clearly of a functional nature:
  •  Data-related:  errors due to incorrect or incomplete data in request messages;  any reprocessing with the same input is doomed to fail again,  and only a corrected source message can go through successfully;
  •  Configuration-related: errors due to incorrect or incomplete configuration in a target system (e.g., necessary master data missing) or in the ESB (e.g., a missing cross-reference entry); in this case, resubmitting the same source message can succeed as long as the relevant configuration has been corrected/completed in the meantime
However, in addition to these cases there is almost always a "gray area" where it is difficult to classify an error as technical or functional.
To simplify things, many practitioners call "technical errors" only those errors that are due to unavailability or unreachability of a target resource (i.e., the same as a "transient error"), and say that all other errors are "functional".

For the sake of this discussion, I distinguish only three cases:
  1. Transient error:  the target resource is momentarily unavailable or lacks connectivity, but it is expected that this situation will not last long; we definitely know that the update we attempted did NOT take place due to the unavailability / connectivity issue;  these errors manifest themselves as transport exceptions at runtime.
  2. Non-transient error:   EVERY error other than the transient ones where we definitely know that the update we attempted did NOT take place (typically because the target resource returned an error response).
  3. In-doubt error situations: the update attempt produces a TIMEOUT response, but we are not 100% sure that the update did not go through.  "In-doubt" transactions also fall into this category.
"Ordinary" integration solutions, which are not using automated diagnoses via a rule base or similar, are only able to automatically recover from transient errors by retrying the update periodically.
Indiscriminate automated retry of failed updates for any kind of error is a recipe for infinite loops whenever errors are systematic (as in the case of a data-related error: repeated submission of the same bad data is pointless).

Even in the case of transient errors, the integration does not normally retry indefinitely.  Most of the time a sensible retry limit is configured: once this limit is exceeded, the error becomes non-transient.


Idempotency of target resources

Idempotency is the capability of any resource (target resource or integration endpoint) to recognize and reject duplicate requests or messages.

Idempotency is a precious asset for production support: it allows us to retry an update when in doubt without worrying about the risk of a duplicate posting.  If we happen to submit something twice, the resource will reject it, thus making the operation safe.

Technically, a target resource can implement idempotency by permanently storing a unique "update key" along with every committed update.  If a request comes in with an update key value that is already found in the store, then it is ignored as a duplicate (without raising an error).
This seems easy, but there is more to it than meets the eye: for the whole thing to work, the "contract" that we expose to our integration source(s) (e.g., API clients or source systems) must include a correct definition of the update key.
In fact, this concept and its correct implementation are so important that I will devote a separate blog post to the subject of idempotency (plus the issue of "stale update rejection", i.e., rejecting out-of-date requests).

If a target resource is not idempotent, then it is possible to build an idempotency layer around it (or in front of it) in our integration logic (as described later in this post).



Error handling / recovery features in ESB

I illustrate in this section the different features for error handling and recovery that may be built into an integration solution.
This prepares for the next section ("Analysis of relevant cases"), which explains which capabilities are necessary based on the categorization criteria defined above.

It is worthwhile noting that these features need to be applied consistently to many integrations and are therefore to be treated as "cross-cutting concerns" (in AOP parlance).
The integration / ESB platform must include shared components and utilities to support these features consistently across the integration landscape.


Automated update retry

Automated retry logic can attempt an operation (we are talking about updates here, but it is applicable to queries as well) until it succeeds, up to a set maximum number of times.

Update attempts are separated by a configured time interval, and sometimes it is possible to have retry intervals become longer at each unsuccessful retry (exponential backoff, for example retries after 5s, 10s, 20s, 40s, etc.).

As stated earlier, automated retries are normally applied only to errors that are surely transient (in practice, only to transport errors).  If the configured maximum number of retries is exhausted, then the error becomes non-transient and must be handled as such.
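
A minimal sketch of such retry logic with exponential backoff is shown below (Python; the exception type treated as transient, the retry limit, and the backoff schedule are assumptions to be tuned per integration):

     import time

     def retry_update(update, max_retries=5, first_delay=5.0):
         """Retry an update on transient (transport-level) errors with exponential backoff: 5s, 10s, 20s, ..."""
         delay = first_delay
         for attempt in range(max_retries + 1):
             try:
                 return update()                  # the actual update call against the target resource
             except ConnectionError:              # treat only transport errors as transient
                 if attempt == max_retries:
                     raise                        # retry limit exhausted: the error becomes non-transient
                 time.sleep(delay)
                 delay *= 2                       # exponential backoff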

In some cases it is justified or even required to build logic to identify additional standardized error types for which recovery can be automated as well.  This is done by matching error responses against a configured "error rulebase".
This more sophisticated approach also requires that detailed audit logging is kept of all integration actions (original update attempt and all retries).  Otherwise it is going to be difficult for support staff to figure out what happened if something went wrong and the automated logic could not recover.

 

Idempotency at ESB level

As stated earlier, if a target resource is not idempotent, it is possible to build idempotency into the integration. 

For each incoming request message, a functionally suitable unique "update key" must be extracted from it and inserted into a persistent "duplicate detection" store with fast read access (a conventional RDBMS may be used, but a NoSQL solution such as Redis is a better fit). 

At each execution of the integration, the duplicate detection store is (efficiently) checked for the existence of the update key.  If the key is found, the update is skipped; otherwise the key is inserted into the store and the update is attempted.  If the update fails, the entry is removed from the duplicate detection store; otherwise (if the update was OK) the entry stays in the store for a "reasonable" period of time (beyond which the chance of duplicates becomes acceptably low).

In-doubt updates need special treatment: if we are not sure whether the update went through or not, then the entry in the duplicate detection store (for the update key) must be marked as "in-doubt".  When a new update request later comes in for the same update key, the target resource must be queried to ascertain whether the original posting actually took place, in order to decide whether to repeat the update or not. 

Duplicate detection stores always have a retention period so old entries are purged. The advantage of a data store like Redis is that there is no need for a periodic purge procedure since entries can be inserted with an expiration period enforced by the DB itself.
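
A minimal sketch of such a duplicate detection check using redis-py is shown below; the key prefix, retention period, and the "in-doubt" marking convention are illustrative.  The atomic SET with NX and EX both claims the key and sets its expiration in a single call, so no separate existence check or purge procedure is needed:

     import redis

     r = redis.Redis()
     RETENTION_SECONDS = 7 * 24 * 3600          # illustrative retention period

     def process_once(update_key, do_update):
         """Attempt the update only if this update key has not been processed before."""
         key = "idem:" + update_key
         # SET ... NX EX atomically claims the key and sets its expiration in a single call
         if not r.set(key, "IN_PROGRESS", nx=True, ex=RETENTION_SECONDS):
             return "DUPLICATE"                  # key already present: skip the update
         try:
             do_update()
         except TimeoutError:
             r.set(key, "IN_DOUBT", ex=RETENTION_SECONDS)   # outcome unknown: mark for special treatment
             raise
         except Exception:
             r.delete(key)                       # update surely failed: allow a later reprocessing attempt
             raise
         r.set(key, "DONE", ex=RETENTION_SECONDS)
         return "PROCESSED"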


Transactionality

If an update must encompass multiple resources and transactional update is possible, then transactions should be used by the ESB to ensure consistency of state.

This guideline always holds when dealing with "homogeneous" resources (e.g., tables in the same DB, JMS destinations in the same JMS provider, etc.).  If, instead, distributed (XA) transactions would be required, then the data consistency advantages that they bring should be carefully weighed against complexity, performance, and scalability considerations.

Lastly, it is important to remark that using transactions does not remove the need for idempotency because:
  • It is still a good idea to guard against the risk that a client or source system sends a duplicate request by mistake, even if the original request was processed correctly
  • In the case of an "in-doubt" transaction we must be able to reprocess without concerns

 

 Compensation

As mentioned above, compensation logic is required when we need to keep related resources mutually consistent at all times and transactions are not an option.
This is far from an ideal solution because, among other things:
  • compensating updates (or "compensating transactions" as they are often called) add complexity and may in turn fail; if the original update failed for technical reasons then compensation will also most likely fail;
  • many systems of record (such as ERP systems) cannot delete a committed transaction such as a financial booking, so the cancellation must create an offsetting posting that exists purely for technical reasons
Therefore, compensating transactions are often avoided provided that the update that's "missing" due to the failure can be made to succeed within a reasonable time (via error reprocessing).
In other words: if the purpose of our integration is to update resources A and B in a coordinated fashion, and only resource B could not be updated, then the goal is simply to update B as soon as we can, as long as the temporary inconsistency (A updated, B not yet) does not have a real business and/or regulatory impact.  We should use compensation only when even a temporary data inconsistency is unacceptable.


Error notification 

Generation, routing, and dispatching of error notifications or alerts are part of basic integration infrastructure, and the way they are implemented must of course fit into the organization's Incident Management process.

Please note that error notification is a functionality that is logically separate from logging (the latter is always necessary for both sync and async integrations).   From a technical standpoint,  error notification is frequently triggered from the logging logic (typically based on the severity of the logging entry), but logically it is distinct due to its "push" nature.

Many implementations are possible among which:
  • direct notification to a support group (via old-fashioned email or a more modern channel such as Slack)
  • indirect notification through error logging and delegation to a log analysis / operational intelligence tool like Splunk
  • direct injection into an incident management / ticketing system  
Each of these approaches has pros and cons (which may be discussed in a future article), but for sure we must have the notification functionality covered in some way.

For synchronous integrations, error notification is limited in most cases to returning a useful error response to the client.  Good practices are easily accessible online for this (see for example the error response of the Twilio REST API).  Notifying individual errors to Support is not normally done for synchronous interactions, as it is the client's responsibility to act on every error.
Nevertheless, logging and monitoring are key to immediately identifying technical problems that affect the health of our APIs and their adherence to SLAs.  API gateway products can really assist here.

With asynchronous integrations, every non-transient error normally triggers an error event that must be suitably routed and dispatched.  Many criteria can be used for routing and dispatching such error events, for example:
  • the identity of the integration
  • organizational information in the context of the integration that failed (e.g.,  company code)
  • the priority of the source message (regardless of how it is determined)

Sophisticated error handling logic may be able to generate a single notification for a burst of errors of the same type that occur within a given (short) period of time ("error bundling").  This kind of logic is essential if we want to directly integrate error events into our ticketing system and avoid flooding it with a large number of identical errors as a result of an outage. 
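
A very basic form of error bundling can be sketched as follows (the bundling window and the keying on integration name plus error type are illustrative design choices):

     import time

     BUNDLE_WINDOW_SECONDS = 300
     _last_notified = {}              # (integration, error_type) -> timestamp of the last notification

     def maybe_notify(integration, error_type, send_notification):
         """Emit at most one notification per (integration, error type) within the bundling window."""
         key = (integration, error_type)
         now = time.time()
         if now - _last_notified.get(key, 0.0) < BUNDLE_WINDOW_SECONDS:
             return False             # an identical error was recently notified: suppress this one
         _last_notified[key] = now
         send_notification(integration, error_type)
         return True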

 

DLQ and Error Recovery

Most asynchronous integrations are "fire-and-forget" from the standpoint of the source system: if a message can be successfully sent, the source system assumes that the integration logic will take care of the rest.  No response is expected.

Even when asynchronous ACK messages are interfaced from the ESB back to the source system,  they are mostly used for audit only.

Since in the systems integration domain we are mostly dealing with business critical flows,  every single non-transient error (including transient errors for which retries have been exhausted) must be notified (as explained above) and "parked" for later error analysis and recovery.

All information for error analysis and recovery must be placed in a persistent store that is commonly called a Dead Letter Queue (DLQ), although it does not technically have to be a queue.

Some considerations:
  • it must be possible for support staff to browse the DLQ
  • it must be possible to quickly find the DLQ entry from the contents of an error notification (which ideally should contain some kind of URL pointing to the DLQ entry)
  • the DLQ entry should link to all available information needed to diagnose the error, including logging entries created during the execution that led to the error (see my earlier blog post on logging for more details)
  • in order to be able to replay a failed integration, the complete runtime context must be available in the DLQ entry, not just the message payload (see the sketch after this list).  For example, an integration invoked via HTTP normally needs the URI, query parameters, request headers, etc.; just re-injecting the HTTP body will not work in general.  An integration triggered through JMS will need the JMS headers, etc.
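
By way of illustration, a DLQ entry for an HTTP-triggered integration could capture something like the following (all field names are hypothetical):

     # Illustrative shape of a DLQ entry for an HTTP-triggered integration
     dlq_entry = {
         "integration": "billing.vendorinvoices.create",
         "failed_at": "2016-11-11T10:15:30Z",
         "error": {"category": "NON_TRANSIENT", "message": "Missing vendor master data"},
         "log_correlation_id": "<execution correlation id>",   # links to the logging entries
         "runtime_context": {                 # everything needed for a replay, not just the payload
             "method": "POST",
             "uri": "/api/billing/v1/vendorinvoices",
             "query_params": {},
             "headers": {"Content-Type": "application/json"},
         },
         "payload": "{ ... original request body ... }",
     }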
The topic of error recovery as part of Incident Management is not a trivial one and cannot be fully covered here.  However, a few points can be noted:
  • As already stated, the reprocessing of a synchronous integration can only be initiated by the client which originally made the failing invocation.
  • In non-BPM integration solutions, it is practically impossible to resume an integration execution halfway, as there is no way to resume execution from a reliable persisted state.  This possibility does exist when using process-like ESB tools such as TIBCO BusinessWorks (with proper checkpointing implemented).
  • For asynchronous integrations, reprocessing of non-transient errors can in theory occur either from the source system or from the ESB's DLQ.  However, the correction of data-related errors requires submitting a corrected (altered) source message, and this should not be the responsibility of the team supporting the ESB (no manual manipulation of production messages should be allowed!).  Therefore, the reprocessing of such errors needs to be initiated in the source system.
  • Idempotency is a necessary pre-requisite for reprocessing when partial updates have occurred that could not be rolled back or compensated for. Example: if an error was thrown on updating target resource B after resource A was successfully updated, then it is not possible to replay the integration execution as a whole unless we can rest assured that no duplicate update of A will be made.
  • The risk of "stale updates" is always present when reprocessing master data synchronization messages.  Example: if a message synchronizing product data for product X fails (for any reason) and is parked in the DLQ, we should not reprocess this message if more up-to-date messages were successfully interfaced for the same product, otherwise we would be updating the target with old data. 

 

Analysis of relevant cases

This section matches the categorization of integrations / resources / errors with the necessary error handling / recovery features. 

To make things more understandable, the following decision tree can be used:

The coding clearly shows which features are required in each case.

The rationale can be inferred from the discussion presented so far, but it is still worthwhile to emphasize the key points for synchronous and asynchronous integrations.

 

Synchronous integrations

Automated update retry capability on transient errors is normally not applicable (reason: most calling clients cannot afford to hang and wait for seconds for a response without serious repercussions on their side).

No DLQ is necessary as reprocessing must be initiated from the source.
Logging and monitoring are necessary, although error notification is normally not done beyond the required error response to the caller.

Idempotency (whether supported natively by the targets or added in the integration layer) is always necessary when the integration updates multiple resources and transactions are not possible.
Even in the other cases, idempotency is useful when "in-doubt" situations arise.

Finally, compensating actions are to be applied only when transactions are not possible and, at the same time, it is mandatory that target resources keep a mutually consistent state at all times.

Asynchronous integrations

As is apparent from the figure above, asynchronous integrations as a rule require a larger set of features related to error handling and recovery.

The first obvious difference with the synchronous case is the use of automated retry logic to mitigate transient errors.

The second big difference is the necessity of some form of (more or less sophisticated) error notification to alert the appropriate actors (support, key users) that something went wrong.

Thirdly, Dead Letter Queue functionality is often put in place to allow repeating the execution of a failed integration with the original input data ("resubmission" or "reprocessing").  Such functionality can only be forfeited if we decide at design time that all reprocessing will always occur from the source system, but it is still advisable to have the DLQ resubmission option available.

Finally, the considerations about idempotency and compensating transactions essentially stay the same as in the sync case.


Conclusions

Hopefully, this article will help  practitioners rationalize integration design decisions based on the technical and functional context at hand.

However, since the integration landscape of medium-sized and large enterprises is virtually guaranteed to be wide and varied, it is highly advisable to endow our integration platform with all key capabilities (automated retry, idempotency, transaction support, error notification, DLQ, etc.) in a generic and reusable fashion.
This way, we will be able to easily build a given capability into an integration when we need it.

Future posts will address how some of these capabilities can be implemented with ESB tools.