Multi-Tenancy in software architecture is like the cornerstone of an enterprise project. You may do it now or defer it, but it is always going to happen. Any enterprise project of decent scale needs a strategy to support multi-tenancy. If it has none, there is a high chance that someone else will build the same thing and wipe out the business. After all, in an enterprise strategy there is no point in multiple teams and multiple projects repeating the same work and reinventing the wheel. So, it raises the question…
Multi-Tenancy is one thing that will one day become the core of the project itself, if the project survives to that day. So how should you plan it?
Many people fall into the trap of believing that multi-tenancy is all about making the services accessible to all the clients via some identifier, usually the “Tenant ID”. There may also be some identifier needed for the database queries, so let’s add the “Tenant ID” there as well. To me, that is just the tip of the iceberg.
So what is it about?
I am a software architect by designation and a full stack developer at heart. In more than a decade of experience, I have worked on many projects, from small incubation projects to very large enterprise projects. So what do I have to say here? I am not going to tell you how to do, or not do, Multi-Tenancy. It is a very subjective question and there is certainly no magic pill, so this is not a quick guide to shortcuts. I am here just to share my view of why Multi-Tenancy is a multi-faceted architecture problem that deserves its fair share of attention!
Let’s start with the simplest and most obvious of things: “Access Patterns”. I think we can agree that the very first thing people explore in multi-tenancy is the access pattern. You know, “How exactly will you identify the tenants?”
On the surface, it seems to be the simplest of things, right? There are, after all, only one or two ways to do it, so how much can it matter? The answer could be pretty standardized, but it certainly has a lot of implications in the long run, especially when it entails an end-customer-centric view. Let’s go further…
When it comes to supporting multiple tenants for a web service or a web application, URL descriptors are a pretty standard approach. It only needs a new identifier added to the URLs, usually a unique “Tenant ID” descriptor, and you are done. Right? That’s how almost everyone does it. But wait, it must have some implications, doesn’t it?
One of the challenges with exposing a unique tenant descriptor is the possibility of information leakage. This might not be a big deal for your organization, but it certainly is in a lot of domains. Healthcare, for example, is very sensitive to this kind of leakage. It is not appropriate for users to see, with every request they make, an ID that uniquely identifies their tenant. Phishers and other black hat hackers are always on the lookout for information that can uniquely identify a set of users in an organization.
The inference is as simple as this: “I know User A is on the Humana Health Plan, and I know the access patterns of these users. User B seems to have the same descriptor. Can we assume that both users are enrolled in the same health plan group?” Well, there are techniques to handle that as well, specifically URL rewriting, but let’s save that for another day.
Also, how would we handle white labeling in these cases? Can we guarantee that no two tenants’ data ever passes through the same servers, for security? Can we place the services for two tenants in different geographies? Can we use a hybrid cloud approach for two tenants if required? There are several subtle questions like these which you, as an architect, need to figure out.
Then there is another popular choice: “Application Headers”. I am not sure why, but this approach has gained a lot of popularity. I have seen it used in plenty of projects, but I don’t see much real value in it.
Let’s imagine a service that expects a unique tenant header on every request sent to it. One of the challenges that I see with this approach is the lack of “Intelligent Proxying”. A lot of applications today, especially in the cloud domain, rely on internet gateways, load balancers, and some sort of edge caching. This means that some of these gateways also need access to the headers for intelligent routing, scaling, or caching.
Unfortunately, in an ideal setting you would encrypt the traffic using SSL/TLS. This means that the headers the edge device needs are now encrypted too. One way to handle this is to terminate SSL right at the edge. However, from a security standpoint, that only works if you trust the rest of the network enough to let your traffic travel unencrypted. It is certainly OK to do so in some cases, but it has started to become an anti-pattern because of the associated risks. The growing focus on zero-trust policies, cloud-native functionality, and truly serverless applications does not encourage it. You can also re-encrypt the traffic at the edge, but there is a cost associated with each request. And then there are the problems of maintaining multiple SSL certificates, renewing them yearly, and so on. You get it, there are implications!
But wait, what about unique domain names? It sounds great! Well, it is, but it comes with its own bag of issues. Using a unique domain name is costly from an operational perspective. Since each domain name has to be registered with an external authority, it imposes a certain set of challenges on operations. It is indeed a great way to enable white-labeling of the application and to give your clients the feel of their own service, but only if it makes business sense. A unique domain name would also require the traffic to be routed via an additional reverse proxy or load balancer like NGINX or Azure Application Gateway. It also brings back some of the encryption problems above, such as maintaining separate certificates or using a wildcard certificate, which has its own security weaknesses.
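To make the three access patterns concrete, here is a minimal Python sketch of how a service might resolve the tenant from a URL descriptor, an application header, or a domain name. The `/tenants/<id>/...` path layout, the `X-Tenant-ID` header name, and the domain table are all hypothetical choices made for illustration, not conventions of any particular product.

```python
from typing import Dict, Mapping, Optional
from urllib.parse import urlsplit

# Hypothetical conventions, assumed purely for illustration.
TENANT_HEADER = "X-Tenant-ID"
CUSTOM_DOMAINS: Dict[str, str] = {
    "portal.acme-health.test": "acme",
    "care.globex.test": "globex",
}


def tenant_from_path(url: str) -> Optional[str]:
    """URL descriptor: /tenants/<tenant-id>/... carries the tenant in the path."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "tenants":
        return parts[1]
    return None


def tenant_from_header(headers: Mapping[str, str]) -> Optional[str]:
    """Application header: HTTP header names are case-insensitive."""
    for name, value in headers.items():
        if name.lower() == TENANT_HEADER.lower():
            return value
    return None


def tenant_from_host(url: str) -> Optional[str]:
    """Domain name: each white-labeled domain maps to exactly one tenant."""
    host = urlsplit(url).hostname or ""
    return CUSTOM_DOMAINS.get(host)
```

Note that only the header variant hides the tenant from the URL, and only the path and host variants are visible to a proxy that does not inspect headers, which is exactly the trade-off discussed above.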
Scaling or Selective Scaling
The viability of a true multi-tenant product relies on how well it can scale to each new set of business users. If the product cannot take in new users without a significant impact on responsiveness, then it probably is not built for multi-tenancy. Like any software architecture, it demands the same attention and the same principles of scaling, but with some additional gotchas.
Why does it matter?
With a typical software architecture, we think of scaling in terms of a single set of performance numbers: “If my application can respond in under 100 ms for 2K users, what needs to be done to get the same response time with 20K users?” But it’s very unlikely that, in the long term, a multi-tenant solution would ask the same question.
In a tenancy-based platform, each tenant has a different scale of operations. While the above performance and scale metrics still apply, tenants have varied performance and scale needs. Each tenant comes with its own set of SLAs, SLOs, and access patterns.
What is even more important is that in a multi-tenant ecosystem, a tenant might be very sensitive to load from other tenants, which makes performance under increased load unpredictable. This is also known as the “Noisy Neighbors” problem.
It is also important to remember that the performance and scale of any solution are essentially bounded by one or more bottlenecks. A bottleneck could be any step in the execution pipeline. It could be a compute operation like a serialization or encryption step. It could also be one or more SQL queries, or a stateful WebSocket session, for example.
In a multi-tenant setup, however, there could be more than one such independent pipeline. This is primarily because each tenant has different access and usage patterns. A basic example could be drawn from a “Multi-Tenant Chat Platform”. In such a platform, one tenant could be using the chat application from a mobile phone over a custom WebSocket protocol, while other tenants could be using a Facebook channel over a Webhook protocol. From a layman’s perspective it seems trivial, and there should not be any problem. But if the edge channel layers are shared, each client is going to consume a very different set of resources. That may lead to very unpredictable performance and scale metrics for the tenant using the Facebook channel, even under a small load.
An architecture that cannot handle the individual demands of each tenant does not measure up as multi-tenant. In other words, if supporting Client A can put the services for Client B in a tough spot, that is a no-go.
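One common way to keep Client A from starving Client B at a shared edge is a per-tenant rate limit. The sketch below uses a token bucket per tenant; it is my own illustration rather than something prescribed above, and the class names, rates, and the explicit `now` parameter (used to keep the example deterministic) are assumptions.

```python
from typing import Dict


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = 0.0      # timestamp of the last refill

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantLimiter:
    """Keeps one independent bucket per tenant; unknown tenants are rejected."""

    def __init__(self, buckets: Dict[str, TokenBucket]) -> None:
        self._buckets = buckets

    def allow(self, tenant_id: str, now: float) -> bool:
        bucket = self._buckets.get(tenant_id)
        return bucket.allow(now) if bucket is not None else False
```

Because each tenant draws from its own bucket, a burst from one tenant exhausts only that tenant's budget. This protects the request rate only; it does nothing for bottlenecks deeper in the pipeline.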
There is an entire utopian literature on reactive, event-driven systems that allow applications to be elastic, performant, and scalable. But let’s face it, not even 1% of actual products are made reactive and event-driven end-to-end. And even if they are, a single client can still congest the entire ecosystem unless you partition everything. And by everything, I mean everything!
What about Databases and other database-like things?
A very similar set of problems exists for database systems. Any data processing system (caches, databases, message brokers, etc.) will face this issue. If all the tenants in the application use the same data processing systems, performance and scale issues are inevitable. One can partition the database for each tenant based on the “Tenant ID”, but that is still a half-baked solution.
I believe partitioning is one of the most misused terms in application architecture. Database partitioning, for example sharding, serves a very different purpose. A partitioning scheme allows data reads and writes to be done in a load-balanced fashion. What a database cannot do with a partitioning scheme is controlled resource allocation.
The database has no clever way to differentiate the resource consumption needs based on the partitioning scheme.
Then there is also the problem of uneven distribution. If two tenants have a similar data size, the data will be partitioned evenly among all nodes. But if the same two tenants have very different run-time loads, the busier tenant cannot scale beyond its partitions, no matter how much capacity is available elsewhere.
Let’s also look at shared services: can they be reused? It depends! If a service merely responds to requests over HTTP and does not incorporate any eventing mechanism, it’s hard to scale selectively. An event-driven architecture, on the other hand, will by default keep polling new events irrespective of who is sending them. A prominent way of handling this in an event-driven system is to partition the event source by tenant. The consuming service can then use priority-based polling or techniques like auto-scaling to serve tenants selectively.
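As a rough sketch of what partitioning the event source with priority-based polling could look like, the Python below keeps one in-memory queue per tenant and drains them in weighted rounds. A real system would sit on a message broker; the class name, the weights, and the in-memory queues are assumptions made to keep the example self-contained.

```python
from collections import deque
from typing import Deque, Dict, Iterator, Tuple


class PartitionedEventSource:
    """One queue per tenant, polled in weighted round-robin order."""

    def __init__(self, weights: Dict[str, int]) -> None:
        # A higher weight means more events served per polling round.
        self._weights = weights
        self._queues: Dict[str, Deque[str]] = {t: deque() for t in weights}

    def publish(self, tenant_id: str, event: str) -> None:
        self._queues[tenant_id].append(event)

    def poll(self) -> Iterator[Tuple[str, str]]:
        """Yield (tenant, event) pairs until every queue is empty."""
        while any(self._queues.values()):
            for tenant, weight in self._weights.items():
                queue = self._queues[tenant]
                # Serve at most `weight` events from this tenant per round.
                for _ in range(min(weight, len(queue))):
                    yield tenant, queue.popleft()
```

With weights of 1 for a noisy tenant and 2 for a premium one, the premium tenant's events are served within the first round even when the noisy tenant has a longer backlog.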
It is also worth looking at offering reserved isolation to a few or all tenants. A lot of products do that. Take a look at Azure Messaging Services, for instance. It has a pay-per-use model that partitions the data for different tenants for the sake of security, elasticity, scalability, and so on. It offers a fantastic SLA as well. But it does not guarantee internet-scale performance in case another tenant goes crazy. They do offer a premium tier with reserved capacity, though.
How did I miss this? Kubernetes, Swarm, OpenShift, and all the crazy new container orchestrators have solved all these problems, haven’t they? After all, they offer Linux cgroups partitioning of resources, Docker composability, dedicated performance bounds, and auto-scaling. What else do you need? If you are thinking yes, that is not the case, at least not entirely!
Billing and Cost
From an operational and business viability standpoint, the most important factor for a multi-tenant solution is cost savings. These can be counted as development cost, infrastructure cost, or opportunity cost. After all, there is no point in multiple teams reinventing the wheel and solving the same problem if it can be done just once. Unfortunately, how billing will happen is one of the last dimensions development teams explore.
Not every project starts with a cloud offering at hand.
Almost every project has a humble beginning. Planning for separate metrics and billing means that everyone has to think about every damn cost incurred and find a way to build a delicate chargeback model. Have you ever explored the AWS chargeback model? It is enough to scare away an intern.
Cost and billing are very tricky aspects and, like every other dimension, can be handled in many ways. A very simple and elegant solution is to have API metering and a billing system in place. It allows you to selectively query the workload incurred by each tenant. There are plenty of solutions already doing that. Take any major API gateway and you will find some sort of operational dashboard that lets you extract in-depth reports of API usage. But the million-dollar question is: is that something your business agrees with?
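At its simplest, API metering is a counter keyed by tenant and endpoint. The following is a minimal in-memory sketch under that assumption; the `UsageMeter` name and its methods are hypothetical, and a real gateway would persist and aggregate these numbers for you.

```python
from collections import Counter
from typing import Dict


class UsageMeter:
    """Counts billable units per (tenant, endpoint) pair."""

    def __init__(self) -> None:
        self._units: Counter = Counter()

    def record(self, tenant_id: str, endpoint: str, units: int = 1) -> None:
        self._units[(tenant_id, endpoint)] += units

    def usage_for(self, tenant_id: str) -> Dict[str, int]:
        """Per-endpoint usage for one tenant, e.g. for a billing report."""
        return {ep: n for (t, ep), n in self._units.items() if t == tenant_id}

    def total_for(self, tenant_id: str) -> int:
        return sum(self.usage_for(tenant_id).values())
```

The per-tenant query is the part that matters for billing: it lets you answer "what did this tenant consume?" without touching any other tenant's numbers.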
Take two clients, for example. Client A is a very humble client still trying to figure out its business model. All it is looking for is access to some of its data, say via a chatbot. Since its business has not kicked off yet, it wants something that works reliably. It is not interested in the tons of other bells and whistles attached to the product. In-depth performance monitoring, strict SLAs, elastic workloads, and dynamic serverless ecosystems are things it will want only if it ever needs them. Client B, on the other hand, is the one who needs it all. A typical API chargeback model might, unfortunately, have to flatten out the cost between the two.
Another popular choice is some sort of shared infrastructure cost. I think it is one of the simplest models products choose, at least from a tactical perspective. But, as you may guess, it does not play well in the long run. A multi-tenant product has to deal with different types of workloads from different tenants. Hence, some tenants will certainly be allocated more cost than they incur.
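A small worked example shows how a flat share over-allocates cost to light tenants. The two functions below are an illustrative sketch; the tenant names, the usage units, and the 1000.0 total are made-up numbers, not data from any real system.

```python
from typing import Dict, Iterable


def split_evenly(total_cost: float, tenants: Iterable[str]) -> Dict[str, float]:
    """Shared infrastructure model: every tenant pays the same share."""
    names = list(tenants)
    return {name: total_cost / len(names) for name in names}


def split_by_usage(total_cost: float, usage: Dict[str, float]) -> Dict[str, float]:
    """Chargeback model: cost is allocated in proportion to measured usage."""
    total_usage = sum(usage.values())
    if total_usage == 0:
        # Nothing was measured, so fall back to the flat split.
        return split_evenly(total_cost, usage)
    return {name: total_cost * used / total_usage for name, used in usage.items()}
```

With a total of 1000.0 and usage of 100 units for Client A and 900 for Client B, the even split charges each 500.0, while the usage-based split charges 100.0 and 900.0. Client A pays five times what it incurs under the shared model.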
Agility and Fragility
As an architect myself, I see agility and fragility as among the most ignored aspects. Modern software development is completely focused on Agile practices and the concept of agility. While I like this idea and am a complete fan of the approach, it often forces us into tactical decisions that do not hold up in the long run.
In my humble opinion, databases are the most fragile things to handle in software architecture, so why not start there? Unless you are making a Tic-Tac-Toe game, chances are you are building a data-driven application. A data-driven application is not a data engineering application; rather, it is any application that is enabled by the data it processes or that enables access to data.
I have seen a lot of application teams create hard coupling in the database, or make modeling choices which later break as functionality grows. It is quite reasonable to expect that a new tenant’s data or access patterns might differ from those of the existing tenants. Just adding a unique “Tenant ID” to the data is not enough. Forget about scaling, distribution, and sharding; data consistency itself might be at stake.
As part of the data modeling process, we are often forced to make choices about the data model, i.e. the schema. If we look at the debate between relational and document databases, there are two approaches: “Schema on Read” and “Schema on Write”. But a schema always exists! So, what is the way forward for a multi-tenant solution?
Very broadly, there are two schools of thought from the modeling perspective: aggregate or normalized. In the relational world, we tend to make the data granularity as small as possible. It allows us to enforce strict rules at “write time” and ensures that new features can be added safely. However, that safety often comes at the price of migrations. That is to say, the data is the center around which we create different dimensions of reads and writes when required. This allows the data model to grow beyond its existing dimensions and to be extended at a later time. But it also poses several limitations. What if we assumed a 1:1 mapping between two entities, and a new client expects a many-to-many mapping? Are we going to make hard migrations of the data that could break existing functionality, or at least put it at risk?
Then there are aggregate-oriented databases. I wonder how almost every other project ends up using MongoDB, irrespective of whether it suits the project or not. The term “schemaless” is, I guess, the key. But it is abused, to be honest! We will go into that some other day, but for the most part, using a document structure in a database does not make it immune to migration problems. It simply takes the problem away from the database and hands it over to the application. If the data is structured and de-normalized in a way that suits one or more clients, it might not suit the others.
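Here is what "handing the migration problem to the application" tends to look like in practice: a schema-on-read upgrade step that runs whenever a document is loaded. This is a sketch under assumed names; the `schema_version` field, the `contact` to `contacts` change, and the function names are hypothetical and not tied to any specific database driver.

```python
from typing import Any, Callable, Dict

Document = Dict[str, Any]
CURRENT_VERSION = 2


def _v1_to_v2(doc: Document) -> Document:
    """v1 assumed a single 'contact'; v2 allows many via a 'contacts' list."""
    upgraded = dict(doc)
    upgraded["contacts"] = [upgraded.pop("contact")] if "contact" in upgraded else []
    upgraded["schema_version"] = 2
    return upgraded


# One upgrade function per source version.
UPGRADES: Dict[int, Callable[[Document], Document]] = {1: _v1_to_v2}


def read_document(raw: Document) -> Document:
    """Upgrade a stored document to the current shape at read time."""
    doc = dict(raw)
    # Documents written before versioning existed are treated as v1.
    version = doc.get("schema_version", 1)
    while version < CURRENT_VERSION:
        doc = UPGRADES[version](doc)
        version = doc["schema_version"]
    return doc
```

The database accepted both shapes without complaint, yet every reader now has to know about every historical shape. The schema did not disappear, it moved into this function.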
API versioning is another misused concept when it comes to multi-tenancy. While it is extremely important, it is just part of the solution. API versioning ensures backward compatibility, especially when you need to support more than one tenant. It offers a cushion for backward compatibility in the case of incremental changes. What it does not do is protect against accidental breaking changes in the database schema.
Another important aspect of agility and fragility is the services in “Micro-Services”. One of the foundation stones of a micro-service architecture is the idea that services are self-contained and independently manageable, and hence follow a strict contract. A business transaction might have to pass through one or more services for fulfillment, which means any breaking change in between is catastrophic. This essentially forces one to think about versioning and contract changes not only for the end user but also for all the internal microservices.
Security and Insecurities
Multi-Tenancy might open a security gap in the entire ecosystem, which is worth considering, especially for high-risk domains like healthcare. Security and vulnerability assessment is not something that comes out of the box. It has multiple facets in a multi-tenant solution.
When people look at security, their focus is usually on the authentication and authorization of the tenants. It is probably obvious that this requires dedicated security credentials, security scopes, and unique identification strategies. It would certainly not be acceptable to suffer a security breach and be unable to identify which tenants were affected because we have no audit functionality. It would also be foolish to let security credentials, like the client credentials for B2B transactions, be shared, as that would compromise the entire system in case of mishandling. In cases like this, it is always best to follow the “least privilege principle”, even for a B2B token.
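A minimal sketch of least privilege for a B2B token, combined with an audit trail, could look like the following. The token shape, the scope strings such as `invoices:read`, and the `Authorizer` class are assumptions for illustration; a real system would verify signed tokens issued by an identity provider.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class ClientToken:
    client_id: str
    tenant_id: str             # the single tenant this credential belongs to
    scopes: FrozenSet[str]     # only the permissions this client needs


class Authorizer:
    def __init__(self) -> None:
        # Every decision is recorded so affected tenants can be traced later.
        self.audit_log: List[Tuple[str, str, str, bool]] = []

    def is_allowed(self, token: ClientToken, tenant_id: str, scope: str) -> bool:
        # Both the tenant boundary and the scope must match.
        allowed = token.tenant_id == tenant_id and scope in token.scopes
        self.audit_log.append((token.client_id, tenant_id, scope, allowed))
        return allowed
```

A token that can only read invoices for one tenant limits the blast radius if it is mishandled, and the audit log answers the "which tenants got affected" question.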
While the above are all essential, multi-tenancy has another important and often ignored aspect: data security, data sovereignty, and network security itself. One of the assumptions we often make while building applications is that once we are inside the application, we are all safe.
Unfortunately, that’s often not the case. Imagine running inside a Kubernetes cluster with a separate namespace for each tenant. Kubernetes, for example, does not offer hard multi-tenancy; that is to say, namespaces are just a logical segregation of tenants. One vulnerability in one of the tenants can compromise the entire network, be it data, encryption layers, security tokens, nodes, Docker, etc. Obviously, there are various ways to handle this, like an Istio proxy, a zero-trust policy, etc., but the point is that all the layers are potentially vulnerable.
Isolation and Sharing
Multi-Tenancy is fundamentally based on the idea that we can leverage a shared ecosystem for all the tenants. While that is good from a cost perspective, it does pose some challenges, like network and data security. Let’s also look at it from an infrastructure standpoint and at how different models have evolved.
The cloud ecosystem is one of the biggest innovations of the last decade. It has moved us from bare-metal machines to serverless. Over the last couple of years, many abstractions have been created: from hypervisors, to PaaS services like Azure App Service and Elastic Beanstalk, to serverless like Lambda and Functions, to container orchestration like Mesosphere and Kubernetes.
Now, the fundamental question is what works best for a multi-tenant solution. There is no one-size-fits-all answer, but the question deserves deeper thought.
I would like to break the problem into questions: “How much trust and segregation would different tenants like to have? How much zero-trust security can individual tenants handle? What regulatory compliance or data sovereignty constraints are applicable? Is any tenant extremely sensitive to even a minor drop in performance or responsiveness?” We can go on and on…
Over time, different organizations have grown into different levels of multi-tenancy. I believe the SaaS model is the most basic one, where we can start. It essentially does not offer any guarantees on performance or on spikes in load, nor does it guard against noisy-neighbor problems. But, without any doubt, it is the fundamental thing to get done. If all you are building is a lousy application with a lousy set of concurrency, scale, and security requirements, I would say don’t bother much.
But if the requirements are more stringent, like selective scaling and absolute protection of data from other tenants, the SaaS model does not work well. A PaaS-based tenancy model would be better suited. What you essentially offer is a platform on which different tenants’ services run. You can have a shared domain layer consumed by one or more of the services, but that might again become a bottleneck, as we discussed before. I would prefer to go with one of the existing “Platform as a Service” solutions like Azure App Service, AWS Elastic Beanstalk, Red Hat OpenShift, etc., because they offer strict boundaries between the entities of different tenants. What it means is that a large part of the network and infrastructure security and isolation is taken care of for you as a development team.
There is also an increasing love for container orchestration engines like “Kubernetes”, “Mesosphere”, “Azure Service Fabric”, etc., and I have tasted this wine too. To be honest, it’s not worth the effort unless you know what you are doing. The additional overhead of managing the infrastructure, adding security policies, defining policies for blue-green deployments, creating a service mesh, handling security realms and cluster manager service accounts, handling cross-region failovers, and the ton of other orthogonal operational work is humongous. Moreover, a lot of these orchestrators were not built from the ground up with multi-tenant security in mind. Take Kubernetes, for example: it does not have any concept of physical isolation in a multi-tenant setup. All this said, these orchestration engines are extremely good, and if you read some good case studies you will find production-grade clusters deployed with these engines spanning thousands of nodes. So, knowing what you are building and what the platform offers is essential to providing the isolation your tenants desire.
Lifecycle – Versioning, Development, and Release
A developer spends almost 90% of his time on support, maintenance, understanding others’ code, reviews, design, etc., and only 10% on development. A project lifecycle, as you would know, is mostly not development but a ton of other things that facilitate business continuity. So, which aspects of the project lifecycle does a multi-tenant solution touch, and how might it impact them?
I have seen several projects build a set of features that suddenly start getting a lot of traction. A business stakeholder’s job is to sell the technology and product to a lot of clients. The irony lies in the delivery expectations. If a thing is working for one client, it should just take a couple of days to spin it up for another, right?
The challenge with that approach is not thinking about multi-tenancy design in the first place and then trying to force-fit the architecture to multiple tenants. This problem fundamentally impacts all the dimensions, but for now let’s stick with a couple of gotchas.
First things first: if we end up maintaining different code bases for different tenants, it is not multi-tenancy. You can call it some sort of fancy model of consulting and business development, which is fine, but it is not a true multi-tenancy model. Second, we have also seen people bake in a lot of hard assumptions about functional features, database models, access patterns, etc. for different tenants, which might not hold as expected. The challenge with this approach is that we might see a short bootstrap time for new tenants, but it starts creating a lot of problems for existing tenants.
I believe, as the title says, that multi-tenancy itself needs its fair share of attention. But for those already in the process, a few things might help. API versioning, end-to-end automation, zero-touch onboarding, feature toggling, pluggability, domain-driven micro-service development, fully automated functional testing, externalized configurable modules, a modularized architecture, and tons of other best practices are your friends.
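Of the practices in that list, feature toggling is the easiest to sketch: a single code base with per-tenant overrides on top of global defaults. The class, flag names, and tenant names below are hypothetical and only illustrate the idea.

```python
from typing import Dict


class FeatureFlags:
    """Global defaults with optional per-tenant overrides."""

    def __init__(
        self,
        defaults: Dict[str, bool],
        overrides: Dict[str, Dict[str, bool]],
    ) -> None:
        self._defaults = defaults
        self._overrides = overrides  # tenant_id -> {feature: enabled}

    def is_enabled(self, feature: str, tenant_id: str) -> bool:
        tenant_flags = self._overrides.get(tenant_id, {})
        if feature in tenant_flags:
            return tenant_flags[feature]
        # Unknown features are treated as disabled.
        return self._defaults.get(feature, False)
```

A new tenant can then be onboarded with configuration rather than a forked code base, and a risky feature can be enabled for one tenant without touching the others.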
It also means that explicit attention should be paid to how these features will be rolled out. Blue-green deployments, deployment of multiple backported versions, full regression and acceptance testing, a strict backup and recovery process, multi-release rollback processes, etc. are some of the things that should be added. Needless to say, irrespective of best architectural practices, pleasing all of the clients all of the time is not possible. So, the capability to support multiple versions of the entire architecture and all services is your best bet.
Monitoring and Observability
We know that no product is great, it is the team that makes it great. For a team to be great, it needs its tools and the magic wand! What you need is the power to look deep into all the layers when you need it.
Observability, Monitoring, and Tracing are the magic wands that everyone needs in a project. When we add multi-tenancy to the problem, there is no replacement for them. Production support for complex systems that several tenants use at the same time is a nightmare. While one tenant might be complaining about performance, others might not have any issue at all.
End-to-end tracing tools have come a long way. Tools like New Relic, Dynatrace, Prometheus, Jaeger, AWS X-Ray, and Azure Application Insights have tons of features you can use out of the box. Some of these tools have gained so much traction that there is now a continuous effort to unify the standards via OpenTracing. While tools like these provide auto-instrumentation of the entire ecosystem, they certainly need additional legwork from a multi-tenancy perspective. To start with, all these tools are security-aware, i.e. they try not to capture the data that is passed. You would need to add tenant and domain identifiers to each request so that you can clearly distinguish between tenants in production. It might also be worth creating live reports segmented by tenant to identify the weak points per tenant.
A tracing tool is of little help if you cannot tie its output back to the application code at a line or module level. That means another set of tools is certainly required: centralized log aggregation. Splunk, ELK, Azure Log Analytics, and Amazon CloudWatch are some examples. From a multi-tenancy perspective, they also need a tenant identifier with each request to help figure out the bottlenecks. It is also best to add a tracing identifier to correlate logs with the trace information.
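Using only the Python standard library, stamping every log line with a tenant identifier and a tracing identifier can be sketched as below. The context variable names, the logger name, and the log format are assumptions; in a real service the two variables would be set once per request by middleware.

```python
import contextvars
import logging
from typing import Optional, TextIO

# Set once per request (for example in middleware); "-" means "not set".
tenant_var = contextvars.ContextVar("tenant_id", default="-")
trace_var = contextvars.ContextVar("trace_id", default="-")


class TenantContextFilter(logging.Filter):
    """Copies the current tenant and trace identifiers onto every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.tenant_id = tenant_var.get()
        record.trace_id = trace_var.get()
        return True


def build_logger(stream: Optional[TextIO] = None) -> logging.Logger:
    logger = logging.getLogger("multitenant.sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to sys.stderr
    handler.setFormatter(logging.Formatter(
        "%(levelname)s tenant=%(tenant_id)s trace=%(trace_id)s %(message)s"
    ))
    handler.addFilter(TenantContextFilter())
    logger.handlers = [handler]
    return logger
```

Once every line carries `tenant=` and `trace=`, the log aggregator can filter by tenant and join log lines to the matching trace without any change to the individual log calls.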
As I said, this is not a guide on how to do it right! Rather, it is just my view on “Why does Multi-Tenancy deserve a lot of attention?” I have also added a couple of thoughts on some dimensions that might be worth a look, but it is certainly far from everything. Feel free to let us know your thoughts!