If you've done any research on authorization systems, you've likely come across Google's 'consistent, global authorization system' known as Zanzibar. Zanzibar is an internal Relationship Based Access Control (ReBAC) service built at Google which manages permissions and authorization for major user-facing products such as Calendar, Cloud, Drive, Maps, Photos and YouTube. Zanzibar has scaled to trillions of access control lists (ACLs) and processes millions of requests per second for billions of users around the world.
In 2019, the Zanzibar team published a paper detailing the service's implementation, particularly how it achieved and continues to maintain its performance and reliability at scale over 5+ years of production use. Since then, a handful of companies including Carta and Airbnb have internally built similar systems to power authorization and access control across their own products. There are now multiple commercial and open source Zanzibar implementations (including Warrant) available on the market.
Authorization and access control, particularly in web applications, is not a new concept. Nearly every programming language or framework has built-in or third-party libraries to help with implementing authorization and access control. So why did Google decide to build Zanzibar from scratch?
Google's products are heavily ingrained in their users' lives (e.g. email, calendar, maps, photos), so privacy and accurately managing social graphs and access at all layers are central to protecting user data and ensuring a great user experience. For such consumer use cases, traditional, coarse-grained access control paradigms like role-based access control (RBAC) don't work. Expressing access control rules such as 'user:x can manage calendar-invite:eng-team-standup' (Calendar) or 'user:x can edit document:y' (Docs) requires resource or object-level fine-grained authorization (FGA). At Google scale, this means trillions of access rules, which calls for a system that can store and serve all of them.
From the outset, Google realized that it needed a centralized, fine-grained authorization service for its product teams to use. For one, the cost of implementing and maintaining the same authorization logic within each product (e.g. Maps, Docs, Photos) didn't make financial sense. More importantly, multiple implementations would dramatically increase the chances of bugs and security holes. For these reasons, Google opted to take on the challenge of building and scaling a centralized authorization system that each of its product teams could rely on for consistent authorization logic.
Building a stateful, highly-performant and highly-available authorization service is non-trivial. The Zanzibar paper does a great job of detailing Google's implementation and the many specific enhancements they have made over the years to get Zanzibar to scale to trillions of rules and millions of requests per second. But before jumping into these specifics, it's important to understand the high-level core concepts within Zanzibar.
Perhaps the most important concept to understand within Zanzibar is the 'relation tuple.' A relation tuple represents a single rule or ACL entry that Zanzibar consults when making authorization decisions. It is composed of three main parts: an object, a relation and a subject (a single object or a group of objects). An example of a relation tuple is 'user:x is a member of tenant:y'.
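To make the object/relation/subject structure concrete, here is a minimal sketch in Python. It is not Zanzibar's actual implementation: the `RelationTuple` class, the `check` function, the in-memory set, and the 'object#relation' string used to denote a group of subjects are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationTuple:
    obj: str       # the resource, e.g. "tenant:y"
    relation: str  # e.g. "member"
    subject: str   # one object ("user:x") or a group ("tenant:y#member")


# Hypothetical rules, held in a plain in-memory set for illustration.
TUPLES = {
    RelationTuple("tenant:y", "member", "user:x"),
    RelationTuple("document:readme", "viewer", "tenant:y#member"),
}


def check(obj, relation, subject, tuples=TUPLES, _seen=None):
    """Return True if `subject` has `relation` on `obj`, directly or via a group."""
    seen = _seen if _seen is not None else set()
    if (obj, relation) in seen:  # guard against cyclic group definitions
        return False
    seen.add((obj, relation))
    for t in tuples:
        if t.obj != obj or t.relation != relation:
            continue
        if t.subject == subject:  # direct match
            return True
        if "#" in t.subject:  # subject is a group: "<object>#<relation>"
            group_obj, group_rel = t.subject.split("#", 1)
            if check(group_obj, group_rel, subject, tuples, seen):
                return True
    return False
```

With these two tuples, `check("document:readme", "viewer", "user:x")` succeeds because user:x is a member of tenant:y and every member of tenant:y is a viewer of the document.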
If relation tuples express specific rules, Zanzibar's schema, segmented by 'namespaces', defines which rules can be created. A schema configuration may specify the valid object types and relations in a namespace. For example, 'namespace-1' might define a 'tenant' object type that supports 'member' and 'admin' relationships. With this schema, it's possible to create rules like 'user:x is a member of tenant:y'. If you're familiar with relational databases, you can think of the Zanzibar schema as a relational database's table schema, whereas relation tuples are like rows in that table adhering to the schema.
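As a rough illustration of how a schema constrains which tuples can be created, the sketch below validates a proposed rule against the 'namespace-1' example. The dictionary layout and the `is_valid_tuple` helper are hypothetical and do not reflect Zanzibar's actual namespace configuration format.

```python
# Hypothetical schema: namespace -> object type -> relations it supports.
SCHEMA = {
    "namespace-1": {
        "tenant": {"member", "admin"},
    },
}


def is_valid_tuple(namespace, obj, relation, schema=SCHEMA):
    """Return True if the schema allows `relation` on the type of `obj`."""
    # Objects are written as "<type>:<id>", e.g. "tenant:y".
    object_type, _, object_id = obj.partition(":")
    if not object_id:
        return False
    relations = schema.get(namespace, {}).get(object_type)
    return relations is not None and relation in relations
```

Here 'user:x is a member of tenant:y' passes validation, while a rule using an undefined relation such as 'owner', or an undefined object type such as 'document', is rejected before it is ever stored.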
Key to Zanzibar's viability and success at Google is its 'user-specifiable' consistency model. Each write within Zanzibar yields an opaque token known as a 'zookie' that encodes the timestamp of that write. Zookies can be thought of as timestamps, representing events on a linear timeline. On each 'read' (e.g. a check operation), clients can pass a zookie instructing the service to conduct the operation on data 'no older than' the timestamp it represents. This guarantees that checks are performed on data at least as fresh as the client requires. This model of user-provided consistency allows Zanzibar to make use of replicated data and caches wherever possible, while maintaining correctness for the user. Unlike most other eventually consistent systems, Zanzibar gives users the ability to trade off consistency and performance on a per-request basis.
Implementing a globally distributed authorization service with user-specified consistency is no easy task. Google's globally distributed Spanner database features heavily in the paper as Zanzibar's primary datastore. Spanner provides global sharding and replication, and its 'TrueTime' mechanism, which bounds clock uncertainty across data centers, lets it assign globally meaningful timestamps that enable the aforementioned zookie-based snapshot reads. In addition to Spanner, Zanzibar also employs multiple layered caches optimized to combat 'hot spots' and ensure a p95 of less than 10ms for check requests.
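The 'no older than' behavior can be sketched with a toy store that has one authoritative copy and one possibly stale replica. Everything here is an assumption for illustration: the `TupleStore` class is hypothetical, a simple integer counter stands in for a real timestamp, and real zookies are opaque tokens rather than plain integers.

```python
class TupleStore:
    """Toy store with a primary copy and a lagging replica."""

    def __init__(self):
        self.clock = 0       # logical timestamp, stands in for a real one
        self.primary = set() # authoritative set of tuples
        self.replica = set() # possibly stale copy of the primary
        self.replica_ts = 0  # replica reflects all writes up to this timestamp

    def write(self, tup):
        """Store a tuple and return a zookie (here, just the write timestamp)."""
        self.clock += 1
        self.primary.add(tup)
        return self.clock

    def sync_replica(self):
        """Bring the replica up to date with the primary."""
        self.replica = set(self.primary)
        self.replica_ts = self.clock

    def check(self, tup, zookie=0):
        """Return (allowed, source), reading data no older than `zookie`."""
        if self.replica_ts >= zookie:
            # The replica is fresh enough for this request: use the cheap path.
            return tup in self.replica, "replica"
        # Otherwise fall back to the authoritative copy.
        return tup in self.primary, "primary"


store = TupleStore()
rule = ("document:y", "editor", "user:x")
zookie = store.write(rule)

stale = store.check(rule)                   # no zookie: may read stale data
fresh = store.check(rule, zookie=zookie)    # zookie forces a fresh read
store.sync_replica()
replicated = store.check(rule, zookie=zookie)
```

A caller that omits the zookie gets the fast but possibly stale replica answer, while a caller that passes the zookie from its own write is routed to data that includes that write. Once the replica catches up, the same request can be served from the replica again.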
Zanzibar is not the first or only authorization technology out there. There are a number of approaches, ranging from home-grown, basic role-based access control (RBAC) libraries to more sophisticated rules and policy engines available on the market. In general, Zanzibar and other similar relationship-based access control (ReBAC) services are 'stateful' (i.e. they contain all the data needed to make an authorization decision), whereas most policy and rules engines such as Open Policy Agent (OPA) are 'stateless', requiring the caller to pass all contextual information needed to evaluate a policy.
Deciding which technology is better for your tech stack or application depends on various factors. Stateful services like Zanzibar are more natural fits for application-layer authorization (e.g. modeling hierarchies, ownership and access within applications), whereas stateless systems like OPA are more commonly used for infrastructure-layer authorization (e.g. ABAC or IP-range blocks). Stateful systems are fully self-contained and can make authorization decisions independently or serve as access-aware indexes, whereas stateless systems like OPA need to be provided domain-specific data for policy evaluation. There are tradeoffs to both, and a fully fledged authorization system will likely make use of elements of both approaches.
Upon reading the Google Zanzibar paper, you might be surprised to learn that Google never productized its Zanzibar service for consumers. To date, no Zanzibar service has come online or become available within Google Cloud. If you're looking to use a similar service, you'll need to either build your own or use a vendor service like Warrant.
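The difference between the two styles can be sketched as follows. This is only an illustration of the calling pattern: `StatefulAuthorizer` and `allow_internal_ip` are hypothetical names, the 10.0.0.0/8 policy is invented for the example, and neither reflects the actual API of Zanzibar or OPA.

```python
import ipaddress


class StatefulAuthorizer:
    """Stateful style: the service owns the relationship data."""

    def __init__(self):
        self._tuples = set()

    def write(self, obj, relation, subject):
        self._tuples.add((obj, relation, subject))

    def check(self, obj, relation, subject):
        # The caller passes only the question; the data lives in the service.
        return (obj, relation, subject) in self._tuples


def allow_internal_ip(request_context):
    """Stateless style: the caller supplies all context needed by the policy.

    Hypothetical policy: allow only requests from the 10.0.0.0/8 range.
    """
    source = ipaddress.ip_address(request_context["source_ip"])
    return source in ipaddress.ip_network("10.0.0.0/8")


authz = StatefulAuthorizer()
authz.write("document:y", "editor", "user:x")

can_edit = authz.check("document:y", "editor", "user:x")
from_inside = allow_internal_ip({"source_ip": "10.1.2.3"})
from_outside = allow_internal_ip({"source_ip": "203.0.113.7"})
```

In the stateful case the application only asks a question, since the rules were written to the service ahead of time. In the stateless case the function holds no data of its own and decides purely from the context it is handed on each call.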