Evaluating message brokers – Amazon SQS

Message brokers are specialised pieces of enterprise middleware designed to support integrating applications in a decoupled (in both time and location) manner, using messaging channels, implemented as either queues (for a single consumer) or topics (for multiple consumers). Whilst deploying and operating a broker has an additional, ongoing cost for a business, if the scale of integration in your system and the non-functional requirements warrant it, a broker can provide a more flexible, better performing and more scalable solution than the alternative of implementing message queues in your database, especially if, as is often the case, the latter is already overloaded.

There are a considerable number of proven message brokers available today from a variety of vendors. Before committing to building your integrations on a particular broker, you should give careful consideration to how well it satisfies your requirements. I recently went through such an evaluation exercise, and ended up choosing Amazon Simple Queue Service (SQS). While we haven’t regretted this decision, I did learn a few things along the way. In this post I’ll share a list of the functional and non-functional (technical) requirements that you should consider when evaluating a message broker, along with my opinion on how SQS measures up to other brokers in each case, and the trade-offs.

Amazon SQS

Before we get into message broker evaluation and how SQS fares against others, I’ll briefly recap the major characteristics of SQS (as of 05/2016).

SQS is a distributed, queue-based, messaging service. Let’s break that description down and consider what it means.

SQS is a messaging service. For an ongoing usage fee (see below), Amazon deploy and operate the message broker for you, including monitoring, maintaining (patching, housekeeping, backup etc) and scaling the message broker. This is an alternative to deploying and operating a product-based message broker yourself.

SQS is queue-based. It supports building (loosely coupled) integrations between two applications, in which messages only need be consumed once (by a single app), via a queue of messages.

A relatively basic set of queue operations is supported, including publishing a message (with optional delay), reading a message (with support for exclusive consumption based on a message visibility timeout), and deleting (acknowledging) previously read messages using a receipt handle. Given that the broker is operated as a remote service, these operations are exposed as a set of web service APIs (SendMessage, ReceiveMessage, DeleteMessage).
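To make the interplay between reading, the visibility timeout and deleting more concrete, here is a minimal in-memory sketch. It is a toy model written for this post, not the SQS API or SDK: the ToyQueue class, its method names and the injectable clock are all assumptions made for illustration.

```python
import itertools
import time


class ToyQueue:
    """Toy in-memory model of send / receive / delete with a visibility timeout.

    This is NOT the SQS API; class and method names are hypothetical.
    """

    def __init__(self, visibility_timeout=30.0, clock=time.monotonic):
        self._visibility_timeout = visibility_timeout
        self._clock = clock                  # injectable so the model can be tested
        self._messages = {}                  # message id -> [body, visible_at]
        self._receipts = {}                  # receipt -> message id
        self._counter = itertools.count(1)

    def send(self, body, delay=0.0):
        """Store a message; it becomes readable after the optional delay."""
        message_id = next(self._counter)
        self._messages[message_id] = [body, self._clock() + delay]
        return message_id

    def receive(self):
        """Return (body, receipt) for the first visible message, or None."""
        now = self._clock()
        for message_id, entry in self._messages.items():
            if entry[1] <= now:
                # Hide the message from other consumers until the timeout expires.
                entry[1] = now + self._visibility_timeout
                receipt = f"receipt-{next(self._counter)}"
                self._receipts[receipt] = message_id
                return entry[0], receipt
        return None

    def delete(self, receipt):
        """Acknowledge a previously received message so it is never redelivered."""
        message_id = self._receipts.pop(receipt, None)
        if message_id is not None:
            self._messages.pop(message_id, None)
```

A consumer that receives a message but never calls delete (for example because it crashed) will see the same message become readable again once the visibility timeout has elapsed, which is what makes redelivery possible.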

SQS is distributed. Each queue is distributed across multiple Amazon SQS servers, resulting in a copy of each message being stored in multiple locations for redundancy. This is illustrated in the following diagram taken from the SQS Developer Guide, which shows a single queue, containing messages (A-E) –

[Diagram: amazon-sqs-distributed-queue (a single queue, holding messages A-E, distributed across multiple SQS servers)]

SQS has been architected this way to achieve high levels of scalability and redundancy. This is essential to meet the requirements of most businesses. However, the technical architecture required to support these requirements comes with trade-offs in terms of limitations in the feature set that’s supported by the message broker, which is the reason for the ‘Simple’ in SQS.  These trade-offs and limitations are discussed in more detail below as part of evaluating individual requirements.

The functional and non-functional requirements that I suggest you consider when evaluating message brokers are each elaborated below, along with my observations on how well they’re supported by SQS.

Messaging Styles

A primary consideration when evaluating a message broker is how well it supports the styles of messaging (types of message channels) needed to meet your requirements.

If you want to build (loosely coupled) point-to-point integrations, in which messages are only consumed by one app; and / or  you require a queue of tasks for worker nodes to process asynchronously, then your message broker needs to support message queues.

If you need to integrate several apps, using multiple message types, and some of those messages need to be consumed by multiple apps, as might be the case in an event-driven architecture, then your message broker needs to support the publish-subscribe (pub-sub) style of messaging, typically using topic-based message channels.

Our immediate requirement was only for queues, to support a worker queue for a point-to-point integration. However, with one eye on the future, we were also interested in pub-sub to reduce the number of point-to-point integrations we were building, and to allow integrations to be extended without changing the publishing apps.

SQS only supports point-to-point messaging via a queue. However, pub-sub messaging can be achieved relatively easily by integrating SQS queues with Amazon Simple Notification Service (SNS).

Message Payload

Support for the payloads you need to include in messages exchanged between your apps is one of the first criteria to consider when choosing a message broker. What content / media-types do you need to be able to send, and what’s the maximum size of the payload you need to send? For example, do you need to be able to include binary payloads in your messages, or if not, what character set(s) need to be supported?

Amazon SQS supports ‘text’ (non-binary) message bodies only, including most of the characters in the Unicode character set. (The supported character set is defined by the XML spec, because under the hood SQS uses XML for its data-exchange format). If you need to send binary payloads then your message producers and consumers will need to agree on and apply a binary encoding scheme (e.g. base64) themselves. (Binary SQS Message Attributes are supported by SQS, but these are akin to message headers, and using them instead of the message body to store your payload, whilst feasible, will reduce the efficiency with which messages can be consumed).

By default, SQS’ maximum message size is 256KB, including Message Attributes. Is that big enough to support your use-cases? If not, there is support for sending larger payloads of up to 2GB, but it requires using an alternative ‘Extended’ client library, which stores the data ‘out of band’ in S3 and so incurs additional costs. I don’t currently have much experience of this option, so can’t comment on the trade-offs, such as the effect on latency. For more details see the Managing Large Amazon SQS Messages Using Amazon S3 section of the SQS Dev Guide.
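As a rough sketch of what agreeing on a binary encoding scheme might look like, the helpers below base64-encode a binary payload into a text body and check the result against the 256KB limit mentioned above. The function names and the MAX_MESSAGE_BYTES constant are my own assumptions; the check also ignores the space taken by any Message Attributes, which count towards the limit.

```python
import base64

# Limit quoted in this post; in practice Message Attributes also count towards it.
MAX_MESSAGE_BYTES = 256 * 1024


def encode_body(payload: bytes) -> str:
    """Encode a binary payload as ASCII text suitable for a text-only message body."""
    body = base64.b64encode(payload).decode("ascii")
    # Base64 output is pure ASCII, so its length in characters equals its size in bytes.
    if len(body) > MAX_MESSAGE_BYTES:
        raise ValueError(
            f"encoded body is {len(body)} bytes, over the {MAX_MESSAGE_BYTES} byte limit"
        )
    return body


def decode_body(body: str) -> bytes:
    """Reverse encode_body on the consuming side."""
    return base64.b64decode(body)
```

Bear in mind that base64 inflates the payload by roughly a third, so a binary payload of about 192KB is the most that will fit in a single message body under this scheme.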

In our case, all of the payloads we needed to send were text-based and relatively small (a few KB), so the SQS message payload restrictions were not a showstopper.

Latency

When integrating your applications via a message broker, rather than directly, you’re adding at least a couple of extra network calls between the apps, as well as some additional processing overhead for the broker to store (assuming durability is required) and route the messages.

When evaluating message brokers you should therefore measure the latency they introduce and check whether it’s low enough to meet any performance requirements you have. This includes the response time of any synchronous request that enqueues a message before returning a response. It also includes the maximum time before a message can be acted on, which covers the time it takes for the message to transit end-to-end through the broker and become available for reading from the destination message channel.

The latency introduced by different message brokers will vary depending on a no. of factors, including whether they’re deployed locally or remotely, and how they’re architected.

SQS is a remote service with queues accessed by web service calls made over the internet. The latency for sending and reading messages will therefore always be an order of magnitude higher than a product-based message broker that’s deployed in your data-centre, on a node on the same high-speed LAN as your apps. The SQS FAQ (Features, Functionality, and Interfaces) states that typical latencies for API requests alone (excluding any transit time in the broker) are “in the tens or low hundreds of milliseconds”. The latency will likely be at its lowest if you deploy your producing and consuming apps to EC2 in the same region and AZ as your SQS queue, as traffic should be routed within the AZ rather than transit the internet.

In our case, latency was an important consideration in our choice of message broker, as messages needed to be processed in near real-time to support an interactive use-case, and messages were being published by a synchronous API. Prior to adopting SQS we ran some load tests to check the latency. As with other AWS services, we found the latency was higher when the queue was used sporadically, but improved significantly once the queue was placed under constant load. What we found was pretty consistent with the aforementioned statement in the SQS FAQ. In 95% of cases new messages were available for processing on the destination queue within 150ms, and in 99% of cases within 300ms.
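If you run a similar load test, percentile figures like the 95% and 99% values above can be derived from the raw latency samples you record. The helper below is a minimal sketch using the nearest-rank method; it is not the tooling we used, and the function name is hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that is >= pct% of all samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that very small percentiles still return a sample.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, given a list of end-to-end latencies in milliseconds, percentile(latencies, 95) and percentile(latencies, 99) give the values that 95% and 99% of messages came in under.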

These latency times were acceptable for our “near real-time” use-cases. However, if you have more challenging requirements e.g. latency times in the 10s of ms, then SQS is not the right broker for you.

Availability

If all or part of your message broker were to become inaccessible (i.e. you couldn’t publish and/or consume messages from your queues), what impact (cost) would this have on your business, and how would that increase with the length of the outage?

If the message broker supports a real-time service for end users, as is true in our case, then even a short outage could have a big impact on your business. In such a case you need to check what mitigating high-availability (HA) features your message broker supports. Support for HA is itself a trade-off as it incurs additional complexity and hence cost.

Amazon SQS has good HA support. The message broker (queues) is deployed across multiple, redundant Availability Zones within the same region, which prevents the failure of a single node, network or even a whole data-centre from making messages inaccessible.

If you choose to deploy a message broker product on-premises, then you’ll need to set up and operate it in an HA configuration. This typically entails clustering instances of the broker running across multiple nodes with redundant (RAID) storage. This is a non-trivial (complex) exercise, which requires operational experience and testing to achieve a reliable solution.

This should be taken into account when comparing the cost of a self-managed, on-premises message broker with messaging-as-a-service like Amazon SQS or Google Cloud Pub/Sub.

Scalability

Do you have non-functional requirements for throughput (transactions per second (TPS)) that your system needs to be capable of attaining? What are your (average and peak) message rates? And what is the projected growth in the volume of messages?

Message brokers are centralised infrastructure components used to integrate multiple apps. As the number of integrated apps and the rate of messages sent between them grows, there is a risk that the message broker will become a bottleneck. You need to be sure that the broker is capable of not only achieving the throughput required today, but also has the potential to scale to support projected future growth in your system. You also need to consider the cost-effectiveness of the broker’s scalability – how much you’ll need to continue to spend on more hardware to achieve the increase in TPS you want.

The ease and effectiveness with which you can scale a message broker will depend on how it has been architected. The need to cluster a message broker to achieve high availability may reduce its scalability, since the broker state needs to be maintained across the clustered nodes.

Amazon claim their message queue service is highly scalable, and capable of scaling up to a throughput of “many thousands of messages per second”. This is one of the benefits of their decision to use a distributed architecture for SQS. As explained in the SQS Developer Guide, to achieve these levels of throughput you need to overcome the aforementioned inherent latency in SQS’ remote API calls, by using two techniques in your messaging apps:

  1. Scale the number of message senders and consumers to overcome the maximum throughput that a single thread can achieve given the latency of each API call.
  2. Send, receive and delete messages in batches to reduce the relative overhead of latency per message.
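The effect of these two techniques can be estimated with some simple arithmetic. The sketch below splits messages into batches and computes an upper bound on throughput for a given number of workers, batch size and per-request latency. The function names are hypothetical; the batch size of 10 reflects the limit on SQS batch requests, and the estimate assumes every worker issues requests back to back with no other overhead.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 10  # SQS batch requests carry at most 10 messages each


def batches(messages: Sequence[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Split messages into consecutive batches of at most `size` items."""
    for start in range(0, len(messages), size):
        yield list(messages[start:start + size])


def estimated_throughput(workers: int, batch_size: int, latency_seconds: float) -> float:
    """Upper bound on messages per second.

    Assumes each worker sends one request after another, each request carries
    batch_size messages, and each request takes latency_seconds to complete.
    """
    return workers * batch_size / latency_seconds
```

With a request latency of 100ms, a single thread sending one message per request tops out at about 10 messages per second, whereas 4 workers sending batches of 10 could in principle reach about 400.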

Before adopting your message broker, check its scalability by running a load test that encompasses your message producers and consumers. We did this for SQS and confirmed that by scaling our producers and consumers we were easily able to achieve several hundred TPS, before we cut the test short.

We also found that SQS’ scalability latency is very low – there was no perceptible delay in being able to scale to higher throughputs. This is in contrast to other services like EC2, which take a few minutes to scale-up compute nodes.

One of the major benefits that a PAYG messaging service like SQS has over deploying and managing your own message broker is how much more cost-effective it is to scale, especially if like us the load on your system is spiky. If you’re operating your own message broker in your own data-centre, then you need to commit to purchasing servers which are capable of scaling to your projected peak throughput at any time of day. But at quieter times of the day, these servers will be underutilised, which is a waste of money.

Redundancy – Guaranteed Message Delivery

What would be the impact on your system / business if occasionally, in certain failure scenario(s), a message sent by one of your apps was never delivered to the target app(s)?

This will depend on the nature of your application. There are certain use-cases where loss of messages can be acceptable. For example, an app which uses messaging to monitor the reported status of agents or devices might very well be able to cope with a dropped message every now and then, as a subsequent message will contain the latest device status. However, in many systems, mission-critical or not, such (command or) data loss is unacceptable. In that case, you need to check what guarantees your shortlisted message brokers offer for message delivery.

In our case, guaranteed message delivery was an essential message broker feature.

For a message broker to guarantee that an acknowledged message will always be delivered to a queue, or topic subscriptions, it typically needs to provide the following features –

  • Persistent message channels – If a broker only holds unprocessed messages in local (node-specific) memory, those messages will be lost if the broker dies or is restarted at any point between messages being accepted and delivered to the consumer. (Delivery could be delayed if the message consumer is offline). A common solution is for brokers to offer persistent message channels, in which every message received for that channel is committed to a persistent data-store before it is acknowledged. This inevitably has a performance overhead (reduces throughput), so it’s useful to be able to disable it for those channels where data-loss is acceptable, if that’s supported by your broker.
  • Durable subscriptions – In the case of pub-sub messaging, a message broker needs to offer “durable” subscriptions if it is to guarantee that a message will be delivered to a consumer, even if it is offline when a message is received for a topic to which it is subscribed.

Amazon SQS does offer guaranteed message delivery. More specifically, it offers “at least once” message delivery. The significance of this is discussed in the later section on Message De-duplication.

Message Delivery Ordering

For many applications it’s imperative that messages are processed (consumed) in the order in which they were sent – if a message hasn’t been received then later messages either must not (due to business rules), or cannot (for reasons of referential state) be processed.

Many message brokers guarantee that they’ll deliver messages in the order they were sent, but not all do. Therefore, if you have such a requirement you should check whether your shortlisted message broker(s) supports it, and if not consider the implications on your apps.

If a message broker does not support delivering messages in the order they’re received then it will be necessary to handle that in your message consumer, which will add some complexity. There are two possible solutions. The feasibility of one of them depends on how strict your ordering requirements are.

If you only need to satisfy the foreign key constraints of the consuming application’s data schema, which requires that one type of entity, which is created by processing an earlier message, exists before another type of entity, which is created on processing a later message, then you don’t really have strict ordering requirements in a business rules sense. In this case it’s often possible to work around a message broker which doesn’t provide message delivery order guarantees, by designing your message consumer to support processing messages out of order. You modify your persistence logic to create the earlier entity in an initialised state, and populate it at a later date when the delayed message is received. You just have to be able to accept that your data may be inconsistent for however long a message is delayed, typically a few seconds or minutes.
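A minimal sketch of this workaround follows. The customer and order entities, the message fields and the CustomerOrderStore class are all hypothetical, and plain dictionaries stand in for the consuming app’s database. Whichever message arrives first, the parent record exists before the child is stored, and the final state is the same.

```python
class CustomerOrderStore:
    """Hypothetical consumer whose orders must reference an existing customer."""

    KNOWN_TYPES = ("customer_created", "order_created")

    def __init__(self):
        self.customers = {}   # customer id -> record (stand-in for a parent table)
        self.orders = {}      # order id -> record (stand-in for a child table)

    def handle(self, message):
        if message["type"] not in self.KNOWN_TYPES:
            raise ValueError(f"unknown message type: {message['type']}")
        customer_id = message["customer_id"]
        # Create the parent in an initialised (empty) state if it is not there yet,
        # so that the child record always has something to reference.
        customer = self.customers.setdefault(
            customer_id, {"name": None, "populated": False}
        )
        if message["type"] == "customer_created":
            # The (possibly delayed) earlier message fills in the placeholder.
            customer["name"] = message["name"]
            customer["populated"] = True
        else:
            self.orders[message["order_id"]] = {
                "customer_id": customer_id,
                "total": message["total"],
            }
```

Until the delayed customer message arrives, the placeholder record is marked as unpopulated, which is the window of inconsistency you have to be willing to accept.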

If you really must not or cannot process a message if an earlier message hasn’t been received then your application will have to assume responsibility for reordering the messages prior to consumption. This requires that your message producer (or broker if it supports it) includes a field in the message that the consumer can use to determine the order in which they were sent. This is typically a sequence number that’s included as a message header. The consuming application will then need to detect a gap in messages and either reject or buffer the processing of later messages. In addition to the extra complexity, the implementation of this additional logic has the potential to reduce the throughput of your message consumers.
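The buffering variant of this approach can be sketched as a small resequencer. The Resequencer class below is an illustration written for this post, and it assumes the producer stamps each message with a gap-free sequence number starting from a known value. Messages that arrive early are held in memory until the gap is filled.

```python
class Resequencer:
    """Hypothetical consumer-side buffer that releases messages in sequence order."""

    def __init__(self, first_sequence=1):
        self._next = first_sequence   # sequence number we are waiting for
        self._buffer = {}             # sequence number -> message held back

    def accept(self, sequence, message):
        """Take one received message; return the messages now ready, in order."""
        if sequence < self._next or sequence in self._buffer:
            return []  # already released or already buffered: ignore the repeat
        self._buffer[sequence] = message
        ready = []
        # Release the longest unbroken run starting at the expected number.
        while self._next in self._buffer:
            ready.append(self._buffer.pop(self._next))
            self._next += 1
        return ready
```

A real implementation would also need to persist the buffer and decide how long to wait for a missing message before raising an alert, neither of which is shown here.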

Amazon SQS only makes a “best effort” to deliver messages in the order they were sent, but it does not guarantee it – it cannot therefore be relied upon. This is one of the trade-offs that arises from SQS using a highly distributed architecture to achieve high scalability and redundancy. As a result, if you have a requirement for processing messages in strict order then SQS may not be the right choice of message broker for you.

In our case, this limitation in SQS was not ideal, but it wasn’t a showstopper. We didn’t have strict message ordering requirements, and were able to work around it using the approach described above.

Update – Amazon have announced (11/2016) that SQS has been enhanced to support an additional type of FIFO (First-In-First-Out) queue, which does guarantee that messages are delivered in the order they were sent. Use of FIFO queues has some trade-offs (mostly lower throughput) as compared to standard (classic) SQS queues, but the messaging guarantees they offer are compelling for certain use-cases. For more details see the SQS Dev Guide – What Type of Queue Do I Need?

Message De-duplication

In distributed systems integrated using messaging there’s a risk of the same message being delivered to your app more than once. This can occur in certain failure scenarios. For example, if a network error occurs after a message producer sends a message but before it receives the ack from the broker, the producer will resend the message. Processing the same message more than once will typically lead to either errors or inaccuracies in the consuming app. For this reason, being able to detect or handle the processing of a duplicate message is often essential.

If your chosen message broker supports detecting and removing duplicate messages it will make life much simpler for you, as it will avoid the need for you to handle them yourself in your consuming app.

Unfortunately, Amazon SQS does not currently offer any such support. (Note: this has now changed; see the update below.) One of the trade-offs of SQS being architected for high-availability and scalability is that it only offers at-least-once, rather than exactly-once, message delivery guarantees. What this means is that most of the time your app will only receive a message once, but in certain error scenarios, such as the one described above, it could receive a duplicate.

If you opt for a broker, such as SQS, which only offers at-least-once message delivery guarantees then, even though the risk of receiving duplicate messages may be small, you will need additional logic in your message consumer to handle them. Your message consumers need to be made idempotent: processing the same message more than once should leave the app in the same state as processing it once. Broadly speaking there are two possible solutions –

  • Duplicate message interception
  • Compensating (business & persistence) logic

Preemptive duplicate message interception

This solution relies on your consuming app intercepting received messages and checking whether they match a previously processed (by any other consumer) one before it’s processed.  There are a couple of strategies for matching a previously processed message. If your messages have a unique ID, then the simplest strategy is to store and query the IDs of all processed messages. Your queries will need to perform well to avoid negatively impacting message throughput. Using a distributed cache to store the message IDs will offer better performance but there’s a risk a duplicate won’t be detected if the cache is non-persistent and restarted, or the cache entry expires before the duplicate is received. The latter is unlikely if the cache is sized appropriately, as most duplicates are likely to be received very shortly after the original message. If your messages don’t have a unique ID, then you could generate and store a hash from other select fields (e.g. date/time, type, body) of the message, and match on that instead.

If you use a dedicated message ID rather than a hash to classify a duplicate message, it has the benefit of allowing you to distinguish system-generated duplicate messages (re-delivery of the same message) from user-generated duplicate messages (different messages with the same content), e.g. double-click submissions.
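The interception strategy might be sketched as follows. The DuplicateFilter class and the message field names (id, timestamp, type, body) are assumptions for illustration, and an in-memory set stands in for the shared database or distributed cache that would be needed for the check to work across multiple consumers.

```python
import hashlib


class DuplicateFilter:
    """Hypothetical interceptor that remembers which messages have been processed."""

    def __init__(self):
        # Stand-in for a shared store (database table or distributed cache).
        self._seen = set()

    @staticmethod
    def key_for(message):
        """Prefer a dedicated message ID; otherwise hash selected fields."""
        if message.get("id") is not None:
            return "id:" + str(message["id"])
        digest = hashlib.sha256()
        for field in ("timestamp", "type", "body"):
            digest.update(repr(message.get(field)).encode("utf-8"))
            digest.update(b"\x00")  # separator so adjacent fields cannot run together
        return "hash:" + digest.hexdigest()

    def first_time_seen(self, message):
        """Return True if the message should be processed, False for a duplicate."""
        key = self.key_for(message)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

The consumer would call first_time_seen before its normal processing and simply acknowledge, without processing, any message for which it returns False.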

Compensating logic

If you don’t intercept and remove a duplicate message, then you will need to design your message consumer logic so that if it should process the same message more than once it doesn’t change the persistent state of your application. This solution boils down to using information in the message – whether it be a dedicated message ID, or one or more fields containing business data in the message body – as a business key to a record that is persisted for each message. When a duplicate message is processed the data-store reports a key constraint violation on inserting the record. Your persistence or business logic needs to catch the resulting exception, classify it as a duplicate, and handle it appropriately. The handling will vary depending on your requirements. Processing of the duplicate might be skipped, or it might need to be applied as an update. Either way the message will be acknowledged to remove it from the system.
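Here is a minimal sketch of the compensating approach using Python’s built-in sqlite3 module as the data-store. The payments table, its columns and the message fields are hypothetical; the point is only that the message ID acts as the primary key, so inserting a duplicate raises an integrity error which the consumer classifies as a duplicate and skips.

```python
import sqlite3


def open_store():
    """Create an in-memory database with the message ID as the primary key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (message_id TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    return conn


def process(conn, message):
    """Persist the message's effect once.

    Returns 'processed' or 'duplicate'; in both cases the caller can go on to
    acknowledge (delete) the message.
    """
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO payments (message_id, amount) VALUES (?, ?)",
                (message["id"], message["amount"]),
            )
        return "processed"
    except sqlite3.IntegrityError:
        # Key constraint violation: this message has already been applied.
        return "duplicate"
```

In this sketch the duplicate is skipped; if your requirements call for applying it as an update instead, the except branch is where that logic would go.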

The duplicate message interception solution is pessimistic in nature, adding the overhead of checking for duplicates to every message processed, but it can be implemented as a cross-cutting concern, which avoids complicating your consumer logic. In contrast the compensating logic solution is optimistic in nature, only coming into play when a duplicate is detected. It requires slightly more complicated consumer logic, but is overall less complex, as there are fewer moving parts.

Update – Amazon have announced (11/2016) that SQS has been enhanced to support an additional type of FIFO (First-In-First-Out) queue, which does guarantee a message will be delivered exactly once (duplicates will be detected and removed by the broker). Use of FIFO queues has some trade-offs (mostly lower throughput) as compared to standard (classic) SQS queues, but the messaging guarantees they offer are compelling for certain use-cases. For more details see the SQS Dev Guide – What Type of Queue Do I Need?

Message Retention

Undelivered Messages

How long do you need the broker to be able to hold undelivered messages in a given message channel (a queue, or a topic with one or more durable subscriptions)? And are any limits on the size (number of messages) of a queue big enough for your needs? This will depend on the volume of messages you need to receive and the rate at which you can process them, both during normal operations and outages.

A message might remain undelivered (age) for an extended period during the upgrade of your message consumer, or in error scenarios: an issue with the consumer; an issue with the broker; a communications error preventing the consumer being able to access the broker.

Amazon SQS has no limit on the number of messages that can be stored on a queue. Its default retention period is 4 days, which can be increased to a maximum of 14 days. This is generous and affords you a lot of leeway for handling outages and processing any resulting backlog of messages.

Indefinite Storage & Replay of Messages

Do you require your message broker to permanently store your message after it has been delivered, and allow consumers to replay the delivery of messages from any point in time?

Traditionally, enterprise message brokers have not offered this feature. The majority of mature message broker products and services (including Amazon SQS) operate merely as messaging pipelines. Once a message has been delivered it is deleted from the broker and no longer accessible.

If you need this feature then you’ll need to focus on a very specific class of message broker. Support for this feature comes with another set of trade-offs to consider, beyond the scope of this post.

Message Filtering

Do you need your message broker to support the Selective Consumer pattern? This pattern allows your message consumers to filter the messages they receive from the message channel by instructing the broker to apply a set of selection criteria (typically to message headers). Where filtering is required, having the message broker perform it rather than the consumer is far more efficient.

Support for message filtering can be useful if you have an existing Topic that’s coarse-grained (broad) and one or more consuming applications are only interested in a subset of the topic’s messages. We had our own requirement for a point-to-point integration where support for message filtering would help. Certain messages delivered to a queue were deemed higher priority and needed to be processed more quickly than other messages.

All JMS compliant message brokers support the Selective Consumer pattern. The JMS API allows a message selector to be specified when a message consumer is created.

Amazon SQS does not support message filtering. Your message consumers cannot use the message broker to filter the messages. When you receive a message from a queue, SQS will deliver any message it wishes. This is an example of where the ‘simple’ in SQS applies, in terms of lacking some features of more sophisticated message brokers.

If your preferred message broker doesn’t support message filtering then often the most efficient alternative solution is to remove the need for filtering by redesigning your message channels, and putting enhanced routing logic in either your message sender or message broker. For example, to solve our own requirement we enhanced our message sender to route messages to different queues, using the Content Based Router pattern.
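A sender-side Content Based Router for the priority case can be very small. The sketch below is illustrative only: the queue names, the priority field and the send callback are assumptions, with the callback standing in for whatever your SDK uses to publish a message to a named queue.

```python
# Hypothetical queue names; use whatever naming scheme suits your system.
HIGH_PRIORITY_QUEUE = "tasks-high"
NORMAL_PRIORITY_QUEUE = "tasks-normal"


def route(message):
    """Choose the destination queue by inspecting the message content."""
    if message.get("priority") == "high":
        return HIGH_PRIORITY_QUEUE
    return NORMAL_PRIORITY_QUEUE


def send_all(messages, send):
    """Publish each message to its routed queue.

    `send` is a callable taking (queue_name, message); in a real sender it
    would wrap the broker client's publish call.
    """
    for message in messages:
        send(route(message), message)
```

Consumers of the high priority queue can then be scaled or polled more aggressively than those of the normal queue, which achieves the effect of filtering without any broker support.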

Secure Messaging

When sending and receiving messages between apps you will undoubtedly have some security requirements that need to be satisfied. Consider the extent to which your shortlisted brokers support those requirements. Common requirements for secure messaging include the following.

Secure network communications

It is typically advisable that all communications between your messaging apps and a remote message broker are secured, in order to verify the identity of the broker and prevent attacks such as eavesdropping and man-in-the-middle interception. This requires that your broker, and its client libraries / SDK, support running its application messaging protocol over a secure (TLS/SSL) connection.

Restrict access to messaging channels

It’s normally essential that access to messages be restricted to authenticated applications, both to prevent unauthorised access to data in the messages, and sometimes to maintain an audit trail of messaging operations (what operation was performed, on which message, when).  To support this requirement your message broker must be capable of mandating that applications authenticate themselves. In addition, the message broker should have an access control mechanism that allows you to restrict the operations an authenticated app is authorised to perform per message channel.

Secure undelivered messages

The two previous items are common, typically essential secure messaging requirements. In addition, if you’re exchanging particularly sensitive data via messages, you may have a requirement to ensure that undelivered messages stored in the broker (in memory, or on disk for a persistent channel) are not readable by any unauthorised persons or apps which may have access to the nodes on which the broker is running or to which it persists its messages. If this is the case then broker support for encrypting message payloads could be a valuable feature. (Ideally the broker’s support for encryption would be configurable, on a per channel basis, to avoid the overhead of encryption on those channels for which it isn’t needed). Were your chosen broker not to support this feature then your messaging apps would need to assume responsibility for encrypting and decrypting message payloads for selected channels, which adds complexity.

Amazon SQS has strong support for such secure messaging requirements.

  • Secure network communications – All communications with SQS are over HTTPS.
  • Restrict access to messaging channels – SQS integrates with AWS Identity & Access Management (IAM) to support authenticating client apps and access control. Client apps not running on EC2 use a set of access keys associated with a created IAM User as their credentials. (Apps which run on EC2 can instead use an IAM Instance Profile granted to the instance, avoiding the need for the user and credentials). In both cases the IAM User or Instance Profile is granted access rights on a per queue basis. In common with all other AWS APIs, SQS requests are signed using the secret access key, which not only serves to authenticate the request but also guarantees its integrity.
  • Secure undelivered messages – Amazon announced (04/2017) support for encrypting undelivered messages stored on SQS queues. See the SQS FAQ for more details.

Client Libraries / SDKs

To participate in a messaging system your apps need to integrate with the broker using its APIs. You should therefore consider how well each of your shortlisted message brokers supports your app development platforms / languages. Does the broker provide client libraries (SDKs) for each of your dev platforms, and how simple are they to use to access the remote broker?

Amazon provide client libraries (SDKs) for most of the popular enterprise dev platforms (Java, .NET, PHP, etc), and each SDK includes support for Amazon SQS. This has a number of compelling benefits that reduce the effort of integrating with SQS, making use of the SDK a ‘no-brainer’ –

  • Provision of a higher-level object-based API for processing message queues using your chosen programming language, that is far simpler than invoking the web APIs directly and having to deal with lower level concerns, such as HTTP, data serialisation/marshalling etc.
  • Transparent handling of authenticating your app with AWS when accessing the service. You configure the SDK with your public (access) and private (secret) key and the SDK takes care of digitally signing your API requests.
  • Features to increase the fault tolerance of your app and to help you meet your performance SLAs, such as configurable connection and request timeouts, with retry counts.
  • Recording of client-side metrics for SQS requests, including e.g. request count, latency, errors, throughput, etc.

Message Error Handling

When implementing message consumers you’ll need to handle the various errors that inevitably occur during message processing. To do so efficiently requires a level of support from your message broker. Essential broker support includes:

  • Handling transient errors – a temporary failure in a consumer’s ability to process a message caused by e.g. unplanned outages of services on which it depends, etc. This requires the broker to redeliver messages which fail to be acknowledged by consumers.
  • Resolving permanent errors – This type of error is caused by messages which consumers will never be able to process, no matter how many times they’re retried. This typically entails the broker supporting the configuration of a Dead Letter Queue (DLQ) for a message channel, along with a max no. of message delivery attempts. The consumer should be implemented with its own error handler that classifies a permanent error early and removes the causal bad apple message efficiently without the need for retry. However, if such a custom error handler doesn’t exist then the message broker should remove the bad apple from the messaging system itself, when the max no. of delivery attempts is exceeded.

Most popular message brokers do provide such error handling support, so it’s unlikely to be a major differentiator in your choice, but you should consider it in your evaluation nevertheless.

Amazon SQS provides the aforementioned support for handling transient and permanent errors. For more information, including an overview of how to write a custom error handler for efficient handling of permanent errors, see one of my earlier blog posts – Designing message consumer error handler for Amazon SQS.
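
As a rough illustration of the two kinds of support described above, the following Python sketch shows how they might look on SQS. It assumes a client object shaped like boto3's SQS client (only `send_message` and `delete_message` are called); `PermanentError`, `redrive_policy_attributes`, `handle_message` and the `process` callback are hypothetical names introduced for this example, not part of any SDK. The `RedrivePolicy` queue attribute is how SQS attaches a DLQ and a max no. of receives to a queue.

```python
import json


class PermanentError(Exception):
    """Hypothetical marker for errors that no amount of retrying will fix."""


def redrive_policy_attributes(dlq_arn, max_receive_count):
    """Queue attributes telling SQS to move a message to the DLQ once it
    has been received `max_receive_count` times without being deleted."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(policy)}


def handle_message(sqs, queue_url, dlq_url, message, process):
    """Process one received message, classifying any failure."""
    try:
        process(message["Body"])
    except PermanentError:
        # Bad apple: park it on the DLQ now instead of burning retries.
        sqs.send_message(QueueUrl=dlq_url, MessageBody=message["Body"])
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        return "dead-lettered"
    except Exception:
        # Assumed transient: do not delete. SQS redelivers the message once
        # its visibility timeout expires; the redrive policy is the backstop.
        return "retry"
    # Success: deleting the message is how a consumer acknowledges it.
    sqs.delete_message(
        QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
    )
    return "processed"
```

With boto3, the attributes returned by `redrive_policy_attributes` would be passed as `Attributes` to `set_queue_attributes` for the source queue. Treating every unclassified exception as transient is a deliberate simplification here; the redrive policy then catches anything that keeps failing.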

Cost

You’ll also want to consider the costs of running different message brokers before choosing one.

Operating Costs

The costs involved will vary significantly depending on whether you operate the message broker yourself, or get somebody else to do it for you by using a messaging service (SaaS) instead of a messaging product. This choice between product or SaaS is therefore a key decision to make early on, ideally before you even begin your evaluation.

Operating a message broker entails provisioning and maintaining the data-centre and hardware on which the broker runs; installing and patching the supporting systems and applications software; scaling the broker when needed; monitoring; housekeeping tasks such as backups, etc.

Before committing to operating a message broker yourself, you should review whether your existing infrastructure (data-centre) and staff (ops team) are capable of meeting your non-functional requirements, and if not, what it would cost to get there. Building and operating a highly-available and scalable message broker is a non-trivial task. Total costs include fault tolerant systems (redundant hardware), and the time of experienced operations staff.

Usage Costs

Having decided whether to operate the message broker yourself, and in doing so made the choice between a messaging product and a managed messaging service, you’re then in a position to consider other costs.

A message broker may have recurring license fees to be paid. Open-source message broker products (e.g. RabbitMQ) typically don’t, but other message brokers, whether operated yourself or delivered using a SaaS model, may charge a fee. This may be based on the no. and size of servers on which the broker runs. Other vendors may charge solely on usage, such as the no. of messages sent or received.

Support and Maintenance Costs

Do you need additional support to operate the message broker to satisfy the SLAs (e.g. availability) of your system? For example, specialist technical support in the event that the broker is not operating correctly (erring) or fails and cannot be restarted.

If you choose an open-source broker, do you need to pay someone to ensure that bugs you encounter are fixed in acceptable timescales?

If the answer to either of the above questions is yes, is this support available for your chosen message broker, and how much does it cost?

Amazon SQS

Let’s consider the costs of using SQS as your message broker.

Operating Costs – As previously noted, SQS is a fully-managed messaging service. Like other AWS services there are no upfront costs, you only pay for what you use, and there is no minimum charge. Your operating costs are therefore zero. Instead, as part of the usage fee (see below), Amazon deploy and operate the message broker for you, including monitoring and maintenance (patching, housekeeping, backup etc). You get a highly available, elastically scaling, redundant message broker. For most small to medium enterprises this is a major benefit, as it provides access to a very high quality service which they could not afford to design, build and operate themselves.

Usage Costs – SQS usage is billed based on two metrics:

  • the no. of requests you make to perform message operations (e.g. send, receive, delete)
  • the amount of data you transfer out of SQS (a factor of the no. and size of messages). (This charge is zero if you deploy your message consumers to EC2, in the same region.)

Currently, SQS has a free tier of 1 million requests per month, within which you’ll only be charged for the data you transfer out. Beyond that, you’re charged a fraction of a dollar for each additional million requests you make. The free tier also currently includes 1GB / month of free data transfer out. Beyond that, you’re charged for additional data at a rate that varies per region. For more details, and the latest costs, see Amazon SQS Pricing.

The existence of the free tier offers many SMEs fantastic value. They can often use SQS for nothing or a small monthly cost of a dollar or so. As an example, if you were to publish and consume 1.5 million messages per month using SQS, all with a 10KB payload, using a sub-optimal batch size of 1 message per batch, then based on current (08/2017) prices you’d be billed as follows:

  • Total requests = 1.5M Send Message requests + 1.5M Receive Message requests + 1.5M Delete Message requests = 4.5M
  • Total data transferred out = 1.5M Receive Message requests x 10KB = ~14.3GB
  • Billable requests = Total requests – AWS free-tier discount of 1M requests = 3.5M.
  • Billable data transferred out = Total data transferred out – AWS free-tier discount of 1 GB / month = 13.3GB. (Note – This assumes the worst case of your message consumers not being deployed to EC2).
  • Cost of billable requests = 3.5M * $0.40 per million = $1.40
  • Cost of billable data transferred out = 13.3GB * $0.09/GB (the rate for up to 10TB/month) = $1.20
  • Estimated bill = $2.60 / month
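
The same arithmetic can be expressed as a small Python sketch, which makes it easy to plug in your own volumes. The function name `estimate_monthly_bill` and its parameters are made up for this example, and the default prices and free-tier allowances are simply the 08/2017 figures quoted above, so treat them as assumptions to be replaced with current prices.

```python
KB_PER_GB = 1024 * 1024


def estimate_monthly_bill(
    messages,
    payload_kb,
    requests_per_message=3,          # send + receive + delete, batch size 1
    free_requests=1_000_000,         # free-tier requests per month
    free_gb_out=1.0,                 # free-tier data transfer out per month
    price_per_million_requests=0.40, # USD, 08/2017 figure from the text
    price_per_gb_out=0.09,           # USD, 08/2017 figure, up to 10TB/month
):
    """Estimate an SQS bill, assuming consumers run outside EC2 (worst case)."""
    total_requests = messages * requests_per_message
    data_out_gb = messages * payload_kb / KB_PER_GB

    billable_requests = max(0, total_requests - free_requests)
    billable_gb = max(0.0, data_out_gb - free_gb_out)

    request_cost = billable_requests / 1_000_000 * price_per_million_requests
    data_cost = billable_gb * price_per_gb_out
    return {
        "total_requests": total_requests,
        "data_out_gb": data_out_gb,
        "request_cost": request_cost,
        "data_cost": data_cost,
        "total": request_cost + data_cost,
    }


bill = estimate_monthly_bill(messages=1_500_000, payload_kb=10)
print(f"Estimated bill: ${bill['total']:.2f} / month")
```

Changing `requests_per_message` is a quick way to see the effect of batching: sending, receiving and deleting in full batches of 10 would correspond to 0.3 requests per message.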

To estimate your own costs using the latest prices use the AWS Simple Monthly Calculator.

The general downside of most SaaS services is that whilst there are no upfront fees / capital costs, like any ‘rental’ service you risk paying more over the long term. However with SQS the current pricing is so low that this doesn’t currently apply.

Message Admin

Whether in dev or production, sometimes you need to browse queues, inspect the contents of a message, move one or more messages from one queue to another (e.g. to or from the DLQ), or in rare cases delete a message. Whilst it may not be a big differentiator when evaluating different message brokers, you need to at least be sure that the tools you need to perform such admin tasks are available and easy to use.

Like all other AWS services, Amazon SQS has its own web console. This supports managing your queues, browsing the messages on a queue, viewing message details and deleting them.

One of the things that surprised me when I first started using SQS was that it does not support moving messages. If you do need to transfer messages between queues you’ll have to write a program which performs a copy and delete. More specifically, it receives the (source) message, creates a new message from the source message’s attributes and payload / body, sends the new message to the new queue, and deletes the original message from its queue. Although you can receive and process messages in batches (of up to 10), this is very inefficient in terms of API calls and bandwidth usage. Being non-atomic, it can also fail part-way through, leaving both the original message and its copy in existence, one on each queue.
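
A minimal Python sketch of such a copy-and-delete program is shown below. It assumes a client object shaped like boto3's SQS client (`receive_message`, `send_message`, `delete_message`); the function name `move_messages` and the `max_batches` safety limit are made up for this example.

```python
def move_messages(sqs, source_url, target_url, max_batches=100):
    """Copy messages from one queue to another, deleting each original
    only after its copy has been sent. Returns the no. of messages moved."""
    moved = 0
    for _ in range(max_batches):
        response = sqs.receive_message(
            QueueUrl=source_url,
            MaxNumberOfMessages=10,          # SQS batch limit
            MessageAttributeNames=["All"],   # fetch attributes to copy them
            WaitTimeSeconds=1,               # long poll to reduce empty reads
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # treated here as "source queue drained"
        for message in messages:
            params = {"QueueUrl": target_url, "MessageBody": message["Body"]}
            if message.get("MessageAttributes"):
                params["MessageAttributes"] = message["MessageAttributes"]
            sqs.send_message(**params)
            # If we crash between the send above and this delete, the
            # message exists on both queues: the move is not atomic.
            sqs.delete_message(
                QueueUrl=source_url, ReceiptHandle=message["ReceiptHandle"]
            )
            moved += 1
    return moved
```

Sending before deleting means a failure can duplicate a message but never lose one, which is usually the safer way round. With boto3 the function would be called as `move_messages(boto3.client("sqs"), dlq_url, queue_url)`, where both URLs are your own queue URLs.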

Summary

Before adopting a message broker think about which of the above requirements are important to you, and evaluate how well a broker supports them.

An important early decision to make is whether you’ll deploy and operate the broker on-premise, or use a (SaaS) messaging service externally hosted and operated by a third-party. Key factors in this decision relate to performance and cost. Can you accept the higher latency of a SaaS messaging service, and, if you need a highly available setup, do you have the in-house expertise or budget to create and operate it? Additionally, a message broker is often a critical part of your system architecture, so before opting for a SaaS messaging service you need to be confident that the people operating it are capable of meeting your needs.

Amazon SQS is a fully-managed, distributed messaging service. It was launched back in July 2006. As a messaging service, its operation and feature set are therefore well proven (battle-tested). Its main strengths include:

  • Availability.
  • Scalability (dynamic, responsive and cost-effective) – with no stated limit on the no. of messages stored on any given queue.
  • Redundancy – The fact that a copy of each queued message is stored on multiple SQS servers in different locations means that SQS is very unlikely to ever lose a message. This is one of the reasons it can offer Guaranteed Message Delivery.
  • Cost / Value for Money – Zero operating cost for a highly available, highly redundant service that reliably processes many millions of messages a day. A low pay per use charge, with a generous free tier – typically one or two dollars for several millions of messages.
  • Simple to use – The API largely boils down to 3 messaging operations – Send Message, Receive Message, Delete Message. Ease of use is improved further by the Client Libraries / SDKs which are available for many dev platforms.

SQS’ major weaknesses mostly stem from the design trade-offs which have been made to offer unlimited scalability and redundancy. The most notable limitations are –

  • Messaging Styles (lack of support for pub-sub).
  • Latency.
  • Messaging Feature Set – For example, lack of support for Message Filtering.
  • Message Delivery Ordering – Messages are not guaranteed to always be delivered in the order they were sent, only best effort. Update – As noted above, SQS has since been enhanced to support this.
  • Message De-duplication – Messages are not guaranteed to be received only once, only best effort. Update – As noted above, SQS has since been enhanced to support this.

The recent addition of support for FIFO queues is an indication that Amazon are still investing in the service, benefiting existing customers.

If it’s appropriate for your messaging use-cases (e.g. application integration, persistent work/task queue), and you can live with the above weaknesses, then SQS’ benefits are compelling, and you should seriously consider adopting it rather than deploying and operating your own message broker. We’ve been using SQS in production for around a year now, and our experience has been very good – we’ve successfully processed many millions of messages without any issues. We’re planning to use SQS more extensively in the future, including trialing its integration with SNS to add support for pub-sub messaging.

I hope you found this article useful. Best of luck with your future messaging projects.
