Wednesday, October 5, 2016

A categorization for error handling and recovery

Proper handling of error situations is an absolute must for any production-grade integration solution.

This post attempts to categorize the significant cases that occur in the domain of non-BPM integration, considering both synchronous integrations (API, request-reply services, etc.) and asynchronous integrations (pub/sub, polling, etc.).

For each category we describe what features the integration solution should possess to achieve resiliency to errors and ease of error recovery (which is necessary for production support). This post does not discuss in any detail how these features can be realized in ESB tools:  these topics will be addressed in future posts.



Categorization criteria

First of all, the categorization criteria must be clearly presented.

Type of integration (synchronous vs. asynchronous)

Synchronous integrations (such as APIs) differ from asynchronous integrations in one fundamental aspect: the invoking client cannot wait long for a response.

This means that, in the case of temporary unavailability of a target resource, we cannot in general wait and retry (after a delay) while the client code hangs waiting for our response.
Waiting would create usability and performance issues on the client side, and the SLA of our API would probably be breached anyway.

A synchronous integration (i.e., one that follows the Request-Response exchange pattern) must in almost every case immediately return an error response to the client, regardless of the type of error. Retrying is normally not an option unless the retries are attempted within a very short (sub-second) time interval, which makes them less useful since the transient issue is likely to last longer.

In the case of a transient error (see below) it is important that the error response clearly indicates that a retry by the client is desired, as the client always bears the responsibility for retrying the invocation of synchronous integrations.
For a REST API, the most appropriate HTTP status code for transient errors is 503 Service Unavailable, ideally accompanied by a Retry-After response header.
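
As a minimal illustration, assuming a JAX-RS based API (the mapped exception type and the 30-second hint are arbitrary choices of mine), a transient connectivity failure towards the backend could be translated as follows:

    import java.net.ConnectException;
    import javax.ws.rs.core.Response;
    import javax.ws.rs.ext.ExceptionMapper;
    import javax.ws.rs.ext.Provider;

    // Sketch only: map a connection failure towards the target resource to
    // 503 Service Unavailable plus a Retry-After hint for the client.
    @Provider
    public class TransientErrorMapper implements ExceptionMapper<ConnectException> {

        @Override
        public Response toResponse(ConnectException e) {
            return Response.status(Response.Status.SERVICE_UNAVAILABLE)      // HTTP 503
                    .header("Retry-After", "30")                             // suggest retrying after 30 s
                    .entity("{\"error\":\"target system temporarily unavailable, please retry\"}")
                    .build();
        }
    }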


Asynchronous integrations have the big advantage of looser coupling but involve complications, one of which is the frequent requirement of guaranteed delivery to the target resources.

Once our integration fetches a message from a source JMS queue and acknowledges it, or sends an HTTP response with status 202 Accepted to a REST client (just two examples), it accepts delivery responsibility.
An async integration should always automatically retry updates in the face of transient errors (see below).
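
To make the point about delivery responsibility concrete, here is a rough JMS sketch (queue name and the processing call are placeholders): the message is acknowledged only after it has been processed successfully, so a failure before that point leaves it on the source queue.

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;

    // Sketch only: consume with CLIENT_ACKNOWLEDGE so the message is acknowledged
    // (i.e., delivery responsibility is accepted) only after successful processing.
    public class SourceQueueConsumer {

        public void consumeOne(ConnectionFactory factory) throws Exception {
            Connection connection = factory.createConnection();
            try {
                connection.start();
                Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
                MessageConsumer consumer = session.createConsumer(session.createQueue("SOURCE.QUEUE"));

                Message message = consumer.receive(5000);   // wait up to 5 s for a message
                if (message != null) {
                    process(message);                       // deliver to the target resource(s)
                    message.acknowledge();                  // only now do we take over responsibility
                }
                // If process() throws, the message is never acknowledged and will be redelivered.
            } finally {
                connection.close();
            }
        }

        private void process(Message message) throws Exception {
            // hand over to the integration logic
        }
    }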

If the error is not transient, the source message cannot be dropped, but must be parked somewhere for later analysis and reprocessing.  The only exceptions to this rule are non-business-critical messages (advisory, statistics, etc.).

While for sync integrations error reprocessing is invariably initiated from the client or source system, for async integrations this is not necessarily true: the integration solution must often provide usable reprocessing capabilities.


Type of update

It is useful to distinguish three cases with respect to the type of target resources:
  1. Single resource
  2. Multiple resources (transactional) 
  3. Multiple resources (transaction not possible)
The first case is clearly the easiest: since there is just one (logical) target resource affected, the update is intrinsically atomic.  At worst the outcome of the update may be in doubt (see below).

If the integration updates more than one target resource, it may be possible to have a transaction span these updates.  This is typically the case when multiple tables must be updated in the same relational DB, or when we need to send to multiple target JMS destinations. 

In some cases, distributed (XA) transactions across heterogeneous resources may be viable (please bear in mind, however, that XA transactions can have adverse scalability and performance impacts).

A key advantage of transactionality is the automated rollback that takes place in the event of an error, ensuring state consistency across the target resources.   The only exception is the unlucky case of an "in doubt" transaction.

Finally, we have the trickiest case, in which it is not technically possible to encompass the multi-resource update within a single transaction.  In this case, if it is required to ensure mutual consistency across related resources at all times, the integration must attempt compensating updates to undo the partial update committed before the error occurred.


Type of error 

When discussing error handling, several categorizations are possible, for example:
  • "technical" errors vs. "functional" errors
  • "data-related" functional errors vs. "configuration-related" functional errors
  • "explicitly-handled" errors vs. "unhandled" errors 
  • "transient" errors vs. "non-transient" errors
Not all of these categorizations are equally useful.  In particular the technical vs. functional distinction is often fuzzy.
Some errors are clearly of a functional nature:
  •  Data-related: errors due to incorrect or incomplete data in request messages; any reprocessing with the same input is doomed to fail again, and only a corrected source message can go through successfully;
  •  Configuration-related: errors due to incorrect or incomplete configuration in a target system (e.g., necessary master data missing) or in the ESB (e.g., a missing cross-reference entry); in this case resubmitting the same source message can succeed, as long as the relevant configuration has been corrected/completed in the meantime.
However, in addition to these cases there is almost always a "gray area" where it is difficult to classify an error as technical or functional.
To simplify things, many practitioners call "technical errors" only those errors that are due to unavailability or unreachability of a target resource (i.e., the same as a "transient error"), and say that all other errors are "functional".

For the sake of this discussion, I distinguish only three cases:
  1. Transient error:  the target resource is momentarily unavailable or lacks connectivity, but it is expected that this situation will not last long; we definitely know that the update we attempted did NOT take place due to the unavailability / connectivity issue; these errors manifest themselves as transport exceptions at runtime (see the classification sketch after this list).
  2. Non-transient error:  EVERY error other than the transient ones where we definitely know that the update we attempted did NOT take place (typically because the target resource returned an error response).
  3. In-doubt error situations: the update attempt produces a TIMEOUT response but we are not 100% sure that the update did not go through.  "In-doubt" transactions also fall into this category.
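
As a rough illustration of how such a classification might look in code (the enum and the choice of exception types are my own assumptions and depend on the transports actually used):

    import java.net.ConnectException;
    import java.net.SocketTimeoutException;
    import java.net.UnknownHostException;

    // Sketch: classify a caught exception into the three categories discussed above.
    // Which exception types count as "transient" is an assumption, not a fixed rule.
    public class ErrorClassifier {

        public enum ErrorCategory { TRANSIENT, NON_TRANSIENT, IN_DOUBT }

        public ErrorCategory classify(Exception e) {
            if (e instanceof ConnectException || e instanceof UnknownHostException) {
                // Target unreachable: the update certainly did not happen.
                return ErrorCategory.TRANSIENT;
            }
            if (e instanceof SocketTimeoutException) {
                // The request may have reached the target: outcome unknown.
                return ErrorCategory.IN_DOUBT;
            }
            // Anything else (e.g. an error response from the target): do not retry blindly.
            return ErrorCategory.NON_TRANSIENT;
        }
    }
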
"Ordinary" integration solutions, which are not using automated diagnoses via a rule base or similar, are only able to automatically recover from transient errors by retrying the update periodically.
Indiscriminate automated retry of failed updates for any kind of error is a recipe for infinite loops whenever errors are systematic (as in the case of a data-related error: repeated submission of the same bad data is pointless).

Even in the case of transient errors, the integration does not normally retry indefinitely.  Most of the time a sensible retry limit is configured: beyond this limit the error becomes non-transient.


Idempotency of target resources

Idempotency is the capability of any resource (target resource or integration endpoint) to recognize and reject duplicate requests or messages.

Idempotency is a precious asset for production support: it allows us to retry an update when in doubt without worrying about the risk of a duplicate posting.  If we happen to submit something twice, the resource will reject it, thus making the operation safe.

Technically, a target resource can implement idempotency by permanently storing a unique "update key" along with every committed update.  If a request comes in with an update key value that is already found in the store, it is ignored as a duplicate (without raising an error).
This seems easy, but there is more to it than meets the eye: for the whole thing to work, the "contract" that we expose to our integration source(s) (e.g., API clients or source systems) must include a correct definition of our update key.
In fact, this concept and its correct implementation are so important that I will devote a separate blog post to the subject of idempotency (plus the issue of "stale update rejection", i.e., rejecting out-of-date requests).
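
A minimal sketch of the target-side mechanism, assuming a relational store with a UNIQUE constraint on the update key (table and column names are purely illustrative, and some JDBC drivers report duplicates as a generic SQLException rather than the subclass used here):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLIntegrityConstraintViolationException;

    // Sketch: rely on a UNIQUE constraint on processed_updates.update_key.
    // If the key was already stored, the insert fails and the request is
    // silently treated as a duplicate instead of raising an error.
    public class IdempotentTarget {

        public boolean recordIfNew(Connection db, String updateKey) throws Exception {
            String sql = "INSERT INTO processed_updates (update_key) VALUES (?)";
            try (PreparedStatement stmt = db.prepareStatement(sql)) {
                stmt.setString(1, updateKey);
                stmt.executeUpdate();          // in a real target this runs in the same
                return true;                   // transaction as the business update
            } catch (SQLIntegrityConstraintViolationException duplicate) {
                return false;                  // duplicate request: skip the update, no error
            }
        }
    }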

If a target resource is not idempotent, it is possible to build an idempotency layer around it (or in front of it) in our integration logic (as described later in this post).



Error handling / recovery features in ESB

I illustrate in this section the different features for error handling and recovery that may be built into an integration solution.
This prepares for the next section ("Analysis of relevant cases"), which explains which capabilities are necessary based on the categorization criteria defined above.

It is worthwhile noting that these features need to be applied consistently to many integrations and are therefore to be treated as "cross-cutting concerns" (in AOP parlance).
The integration / ESB platform must include shared components and utilities to support these features consistently across the integration landscape.


Automated update retry

Automated retry logic can attempt an operation (we are talking about updates here, but it is applicable to queries as well) until it succeeds, up to a set maximum number of times.

Update attempts are separated by a configured time interval, and sometimes it is possible to have retry intervals become longer after each unsuccessful retry (exponential backoff, for example retries after 5s, 10s, 20s, 40s, etc.).

As stated earlier, automated retries are normally applied only to errors that are surely transient (in practice, only to transport errors).  If the configured maximum number of retries is exhausted, the error becomes non-transient and must be handled as such.
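
A bare-bones sketch of such retry logic with exponential backoff (the retry limit, the initial delay, and the use of ConnectException as the marker of a transient error are illustrative assumptions):

    import java.net.ConnectException;
    import java.util.concurrent.Callable;

    // Sketch: retry an update on transient (transport) errors only, doubling the
    // delay after each attempt, up to a configured maximum number of attempts.
    public class RetryingExecutor {

        public <T> T executeWithRetry(Callable<T> update, int maxAttempts, long initialDelayMillis)
                throws Exception {
            long delay = initialDelayMillis;
            for (int attempt = 1; ; attempt++) {
                try {
                    return update.call();
                } catch (ConnectException transientError) {
                    if (attempt >= maxAttempts) {
                        // Retries exhausted: the error is now handled as non-transient
                        // (notification + DLQ, as discussed later in this post).
                        throw transientError;
                    }
                    Thread.sleep(delay);   // e.g. 5 s, 10 s, 20 s, 40 s ...
                    delay *= 2;            // exponential backoff
                }
                // Any other exception propagates immediately: no blind retry.
            }
        }
    }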

In some cases it is justified, or even required, to build logic that identifies additional standardized error types for which recovery can be automated as well.  This is done by matching error responses against a configured "error rulebase".
This more sophisticated approach also requires that detailed audit logging be kept of all integration actions (the original update attempt and all retries).  Otherwise it will be difficult for support staff to figure out what happened when something went wrong and the automated logic could not recover.

 

Idempotency at ESB level

As stated earlier, if a target resource is not idempotent, it is possible to build idempotency into the integration. 

From each incoming request message, a functionally suitable unique "update key" must be extracted and inserted into a persistent "duplicate detection" store with fast read access (a conventional RDBMS may be used, but a NoSQL solution such as Redis is better).

At each execution of the integration, the duplicate detection store is (efficiently) checked for the existence of the update key.  If the key is found, the update is skipped; otherwise the key is inserted into the store and the update is attempted.  If the update fails, the entry is removed from the duplicate detection store; otherwise (if the update was OK) the entry stays in the store for a "reasonable" period of time (beyond which the chance of duplicates becomes acceptably low).

In-doubt updates need special treatment: if we are not sure whether the update went through or not, the entry in the duplicate detection store (for the update key) must be marked as "in-doubt".  When a new update request later comes in for the same update key, the target resource must be queried to ascertain whether the posting actually exists, in order to decide whether to repeat the update or not.

Duplicate detection stores always have a retention period so old entries are purged. The advantage of a data store like Redis is that there is no need for a periodic purge procedure since entries can be inserted with an expiration period enforced by the DB itself.
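
A possible sketch of this duplicate check with Redis, using the Jedis client (key prefix, retention period, and the in-doubt marker are assumptions of mine):

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.params.SetParams;

    // Sketch: duplicate detection at ESB level backed by Redis.
    // SET ... NX EX atomically inserts the update key only if it is not yet
    // present, with an expiration enforced by Redis itself (no purge job needed).
    public class DuplicateDetectionStore {

        private static final int RETENTION_SECONDS = 7 * 24 * 3600;   // "reasonable" retention, assumption

        private final Jedis redis;

        public DuplicateDetectionStore(Jedis redis) {
            this.redis = redis;
        }

        /** Returns true if the key is new (update should proceed), false if it is a duplicate. */
        public boolean tryRegister(String updateKey) {
            String result = redis.set("dupcheck:" + updateKey, "DONE",
                    SetParams.setParams().nx().ex(RETENTION_SECONDS));
            return "OK".equals(result);
        }

        /** Called when the update fails, so a later resubmission is not seen as a duplicate. */
        public void unregister(String updateKey) {
            redis.del("dupcheck:" + updateKey);
        }

        /** Called when the outcome of the update is in doubt. */
        public void markInDoubt(String updateKey) {
            redis.set("dupcheck:" + updateKey, "IN_DOUBT", SetParams.setParams().ex(RETENTION_SECONDS));
        }
    }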


Transactionality

If an update must encompass multiple resources and transactional update is possible, then transactions should be used by the ESB to ensure consistency of state.

This guideline always holds when dealing with "homogeneous" resources (e.g., tables in the same DB, JMS destinations in the same JMS provider, etc.).  If, instead, distributed (XA) transactions would be required, the data consistency advantages they bring should be carefully weighed against complexity, performance, and scalability considerations.
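
For the homogeneous case, here is a minimal sketch of a local transaction spanning two tables in the same database (table and column names are purely illustrative):

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Sketch: two related updates in the same relational DB committed atomically.
    // If either statement fails, the rollback restores a consistent state.
    public class OrderWriter {

        public void writeOrder(Connection db, String orderId, String line) throws Exception {
            db.setAutoCommit(false);
            try (PreparedStatement header = db.prepareStatement(
                         "INSERT INTO order_header (order_id) VALUES (?)");
                 PreparedStatement detail = db.prepareStatement(
                         "INSERT INTO order_line (order_id, line_data) VALUES (?, ?)")) {

                header.setString(1, orderId);
                header.executeUpdate();

                detail.setString(1, orderId);
                detail.setString(2, line);
                detail.executeUpdate();

                db.commit();                 // both updates become visible together
            } catch (Exception e) {
                db.rollback();               // automated rollback keeps state consistent
                throw e;
            }
        }
    }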

Lastly, it is important to remark that using transactions does not remove the need for idempotency because:
  • It is still a good idea to guard against the risk that a client or source system sends a duplicate request by mistake, even if the original request was processed correctly
  • In the case of an "in-doubt" transaction we must be able to reprocess without concerns

 

 Compensation

As mentioned above, compensation logic is required when we need to keep related resources mutually consistent at all times and transactions are not an option.
This is far from an ideal solution because, among other things:
  • compensating updates (or "compensating transactions", as they are often called) add complexity and may in turn fail; if the original update failed for technical reasons, then compensation will also most likely fail;
  • many systems of record (such as ERP systems) cannot delete a committed transaction such as a financial booking, so the cancellation must create an offsetting posting that exists purely for technical reasons
Therefore, compensating transactions are often avoided provided that the update that's "missing" due to the failure can be made to succeed within a reasonable time (via error reprocessing).
In other words: if the purpose of our integration is to update resources A and B in a coordinated fashion, and only resource B could not be updated, then the goal is simply to update B as soon as we can, as long as the temporary inconsistency (A updated, B not yet updated) does not have a real business and/or regulatory impact.  We should use compensation only when even a temporary data inconsistency is unacceptable.
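
For completeness, a schematic sketch of compensation logic (the two target systems and their operations are hypothetical placeholders):

    // Sketch only: update A, then B; if B fails, post an offsetting (compensating)
    // update to A. The service interface below is a hypothetical placeholder.
    public class CoordinatedUpdater {

        interface TargetSystem {
            String post(String payload) throws Exception;          // returns an id for the posting
            void postOffsetting(String originalPostingId) throws Exception;
        }

        private final TargetSystem systemA;
        private final TargetSystem systemB;

        public CoordinatedUpdater(TargetSystem systemA, TargetSystem systemB) {
            this.systemA = systemA;
            this.systemB = systemB;
        }

        public void update(String payload) throws Exception {
            String postingInA = systemA.post(payload);
            try {
                systemB.post(payload);
            } catch (Exception errorOnB) {
                // B failed: undo the committed update on A with an offsetting posting.
                // Note that this compensating call may itself fail and must be monitored.
                systemA.postOffsetting(postingInA);
                throw errorOnB;
            }
        }
    }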


Error notification 

Generation, routing, and dispatching of error notifications or alerts are part of basic integration infrastructure, and the way they are implemented must of course fit into the organization's Incident Management process.

Please note that error notification is a functionality that is logically separate from logging (the latter is always necessary for both sync and async integrations).   From a technical standpoint,  error notification is frequently triggered from the logging logic (typically based on the severity of the logging entry), but logically it is distinct due to its "push" nature.

Many implementations are possible, among which:
  • direct notification to a support group (via old-fashioned email or a more modern channel such as Slack)
  • indirect notification through error logging and delegation to a log analysis / operational intelligence tool like Splunk
  • direct injection into an incident management / ticketing system  
Each of these approaches has pros and cons (which may be discussed in a future article), but for sure we must have the notification functionality covered in some way.

For synchronous integrations, error notification is limited in most cases to returning a useful error response to the client.  Good practices are easily accessible online for this (see for example the error response of the Twilio REST API).  Notifying individual errors to Support is not normally done for synchronous interactions, as it is the client's responsibility to act on every error.
Nevertheless, logging and monitoring are key to immediately identifying technical problems that affect the health of our APIs and their adherence to SLA's.  API gateway products can really assist here.

With asynchronous integrations, every non-transient error normally triggers an error event that must be suitably routed and dispatched.  Many criteria can be used for routing and dispatching such error events, for example:
  • the identity of the integration
  • organizational information in the context of the integration that failed (e.g.,  company code)
  • the priority of the source message (regardless of how it is determined)

Sophisticated error handling logic may be able to generate a single notification for a burst of errors of the same type that occur within a given (short) period of time ("error bundling").  This kind of logic is essential if we want to feed error events directly into our ticketing system and avoid flooding it with a large number of identical errors as a result of an outage.
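
One possible sketch of such bundling logic (the bundling key and the 10-minute window are assumptions; in a real platform this would live in a shared notification component):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: suppress repeated notifications for the same (integration, error type)
    // pair within a short time window, so an outage produces one alert, not hundreds.
    public class ErrorNotificationBundler {

        private static final long WINDOW_MILLIS = 10 * 60 * 1000;   // 10-minute window, assumption

        private final Map<String, Long> lastNotified = new ConcurrentHashMap<>();

        /** Returns true if a notification should be sent for this error occurrence. */
        public boolean shouldNotify(String integrationId, String errorType) {
            String key = integrationId + "|" + errorType;
            long now = System.currentTimeMillis();
            Long previous = lastNotified.get(key);
            if (previous != null && now - previous < WINDOW_MILLIS) {
                return false;                     // same error type already notified recently
            }
            lastNotified.put(key, now);
            return true;
        }
    }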

 

DLQ and Error Recovery

Most asynchronous integrations are "fire-and-forget" from the standpoint of the source system: if a message can be successfully sent, the source system assumes that the integration logic will take care of the rest.  No response is expected.

Even when asynchronous ACK messages are interfaced from the ESB back to the source system,  they are mostly used for audit only.

Since in the systems integration domain we are mostly dealing with business critical flows,  every single non-transient error (including transient errors for which retries have been exhausted) must be notified (as explained above) and "parked" for later error analysis and recovery.

All information for error analysis and recovery must be placed in a persistent store that is commonly called a Dead Letter Queue (DLQ), although it does not technically have to be a queue.

Some considerations:
  • it must be possible for support staff to browse the DLQ
  • it must be possible to quickly find the DLQ entry from the contents of an error notification (which ideally should contain some kind of URL pointing to the DLQ entry)
  • the DLQ entry should link to all available information needed to diagnose the error, including logging entries created during the execution that led to the error (see my earlier blog post on logging for more info)
  • in order to be able to replay a failed integration, the complete runtime context must be available in the DLQ entry, not just the message payload.  For example, an integration invoked via HTTP normally needs the URI, query parameters, request headers, etc.; just re-injecting the HTTP body will not work in general.  An integration triggered through JMS will need the JMS headers, etc. (see the sketch after this list)
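
As an illustration, a DLQ entry could carry something like the following (field names are illustrative; the exact context obviously depends on the triggering transport):

    import java.time.Instant;
    import java.util.Map;

    // Sketch: a DLQ entry keeps the full runtime context needed for replay,
    // not just the message payload. Field names are illustrative.
    public class DeadLetterEntry {

        public String entryId;                 // referenced (e.g. as a URL) from the error notification
        public String integrationId;           // which integration failed
        public Instant failedAt;
        public String errorSummary;
        public String correlationId;           // link to the audit/log entries of the failed execution

        // Transport context required to re-trigger the integration faithfully
        public String transport;               // e.g. "HTTP" or "JMS"
        public String requestUri;              // HTTP: full URI including query parameters
        public Map<String, String> headers;    // HTTP headers or JMS headers/properties
        public byte[] payload;                 // original message body, unmodified
    }
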
The topic of error recovery as part of Incident Management is not a trivial one and cannot be fully covered here.  However, a few points can be noted:
  • As already stated, the reprocessing of a synchronous integration can only be initiated by the client which originally made the failing invocation.
  • In non-BPM integration solutions, it is practically impossible to resume an integration execution halfway, as there is no way to resume execution from a reliable persisted state.  This possibility does exist, however, when using process-like ESB tools such as TIBCO BusinessWorks (with proper checkpointing implemented).
  • For asynchronous integrations, reprocessing on non-transient errors can in theory occur either from the source system or from the ESB's DLQ.  However, the correction of data-related errors requires submitting a corrected (altered) source message and this should not be the responsibility of the team supporting the ESB (no manual manipulation of production messages should be allowed!)  Therefore, the reprocessing of such errors needs to be initiated in the source system.
  • Idempotency is a necessary pre-requisite for reprocessing when partial updates have occurred that could not be rolled back or compensated for. Example: if an error was thrown on updating target resource B after resource A was successfully updated, then it is not possible to replay the integration execution as a whole unless we can rest assured that no duplicate update of A will be made.
  • The risk of "stale updates" is always present when reprocessing master data synchronization messages.  Example: if a message synchronizing product data for product X fails (for any reason) and is parked in the DLQ, we should not reprocess this message if more up-to-date messages were successfully interfaced for the same product, otherwise we would be updating the target with old data (a possible check is sketched after this list).
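
A very simplified sketch of such a stale-update check (in reality the "last processed" information would live in a persistent store, not in memory; names are illustrative):

    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: before reprocessing a parked message, compare its business timestamp
    // with the timestamp of the last successfully processed message for the same
    // entity key (e.g. product number).
    public class StaleUpdateGuard {

        private final Map<String, Instant> lastProcessed = new ConcurrentHashMap<>();

        public void recordSuccess(String entityKey, Instant messageTimestamp) {
            lastProcessed.merge(entityKey, messageTimestamp,
                    (existing, candidate) -> candidate.isAfter(existing) ? candidate : existing);
        }

        /** Returns true if the parked message is older than what was already interfaced. */
        public boolean isStale(String entityKey, Instant messageTimestamp) {
            Instant newest = lastProcessed.get(entityKey);
            return newest != null && messageTimestamp.isBefore(newest);
        }
    }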

 

Analysis of relevant cases

This section matches the categorization of integrations / resources / errors with the necessary error handling / recovery features.

To make things more understandable, the following decision tree can be used:

The coding clearly shows which features are required in each case.

The rationale can be inferred from the discussion presented so far, but it is still worthwhile emphasizing the key points for synchronous and asynchronous integrations.

 

Synchronous integrations

Automated update retry capability on transient errors is normally not applicable (reason: most calling clients cannot afford to hang and wait for seconds for a response without serious repercussions on their side).

No DLQ is necessary as reprocessing must be initiated from the source.
Logging and monitoring are necessary although error notification is normally not done in addition to the required error response to the caller.

Idempotency (whether supported natively by the targets or added in the integration layer) is always necessary when the integration updates multiple resources and transactions are not possible.
Even in the other cases, idempotency is useful when "in-doubt" situations arise.

Finally, compensating actions are to be applied only when transactions are not possible and, at the same time, it is mandatory that target resources keep a mutually consistent state at all times.

Asynchronous integrations

As is apparent from the figure above, asynchronous integrations require, as a rule, a larger set of features related to error handling and recovery.

The first obvious difference with the synchronous case is the use of automated retry logic to mitigate transient errors.

The second big difference is the necessity of some form of (more or less sophisticated) error notification to alert the appropriate actors (support, key users) that something went wrong.

Thirdly, Dead Letter Queue functionality is often put in place to allow repeating the execution of a failed integration with the original input data ("resubmission" or "reprocessing").  Such functionality can only be forfeited if we decide at design time that all reprocessing will always occur from the source system, but it is advisable to still have the DLQ resubmission option available.

Finally, the considerations about idempotency and compensating transactions essentially stay the same as in the sync case.


Conclusions

Hopefully, this article will help  practitioners rationalize integration design decisions based on the technical and functional context at hand.

However, since the integration landscape of medium-sized and large enterprises is virtually guaranteed to be wide and varied, it is highly advisable to endow our integration platform with all the key capabilities (automated retry, idempotency, transaction support, error notification, DLQ, etc.) in a generic and reusable fashion.
This way, we will be able to easily build a given capability into an integration when we need it.

Future posts will address how some of these capabilities can be implemented with ESB tools.