Building an Authorization System

31 Jul 2024 ∼ 13 minutes / 2700 words

Some time around 2019¹, Cloudflare introduced its first version of API Tokens, capable of being scoped to granular levels of access for specific actions (e.g. listing, reading, deleting) against specific resources (e.g. accounts, zones, DNS records).

The underlying authorization system providing these granular access capabilities was mostly built out by yours truly over the years, from early 2018 until my move away from the Cloudflare IAM team in 2022; credit for the initial design, however, goes to Thomas Hill, who established the data model of subject/action/scope, described in more detail below.

Though the system has evolved in both its user-facing aspects and its scale, the basic principles and overarching design of the authorization system (as well as most of the core code) are still much the same as they were back in 2019; a testament to the robustness and flexibility of the design.

Since then, a number of other designs have also emerged, chief among which is Google’s much-vaunted Zanzibar; these systems are complex enough to merit their own separate posts, but suffice it to say that they approach their design quite differently to ours.

The Basics of Authorization

At the core of any authorization system lies a question, for example:

Can Alice update DNS Record X in Zone Y?

Each part of this sentence has some significance toward the (yes/no) answer we might receive, and each represents some part of our final data-model, though different systems find different ways of answering questions like this.

Let’s start from Alice, our actor, or subject (we’ll prefer using “subject” further on): any request for authorization is assumed to pertain to an action being taken against some protected system, and any action is assumed to originate somewhere, whether that’s a (human) user, or an (automated) system, or anything in between. Moreover, the subject is assumed to provably be who they say they are, that is, be authenticated, before any authorization request is even made, lest we allow subjects to impersonate one another.

Actions, such as “update” above, are typically performed against resources, such as “DNS record X”, and the two pertain to one another in some way; it makes no sense to try to “update a door”, as much as it doesn’t make sense to “open a DNS record”. Resources, in turn, are also identified uniquely (the “X” above) and can furthermore be qualified by context, or “scope”, the “zone Y” above.

A more generalized form of the question above would, then, be:

Can Subject perform Action against Resource with ID, under a specific Scope with ID?

The concepts of subject, action, resource, and scope form the majority of our data-model, with only a sprinkle of organizing parts in between. Before we look at each of these in turn, let’s examine a common thread that exists between them, the idea of a “resource taxonomy”.

What is What: The Resource Taxonomy

How do we determine what makes for a valid authorization request, and how do we ensure that authorization policies are consistent with the sorts of (valid) questions we expect to receive?

Subjects, actions, resources, and their scopes exist in a universe of (expanding) possibilities relevant to their use, and relate to one another intimately.

Different systems solve these issues in different ways, but a centralized system does benefit from solid definitions of what can and cannot be given access to; at Cloudflare, this culminated in a resource taxonomy, an organized hierarchy of possible “things” present in the system, driven by a reverse DNS naming convention, for instance:

com.cloudflare.api.account.zone.dns-record

Which represents an (abstract) DNS record placed under a zone.

Though names appear to contain their full hierarchies (and in some cases do, as DNS records belong to zones, which themselves belong to accounts, with predictable resource naming patterns), they have no real semantics beyond needing to be unique – we could’ve just as well used zone-dns-record as a unique name, though the reverse DNS convention does come in handy when expressing actions and resource identities.

In our case, the (partial) resource hierarchy of relevance looks like so:

com.cloudflare.api.account
- com.cloudflare.api.account.zone
  - com.cloudflare.api.account.zone.dns-record
com.cloudflare.api.user
- com.cloudflare.api.token

If resources have stable references, then actions against those resources also need some sort of stable reference – in our case, we just extend the existing naming convention, adding a verb suffix to resource names, for example:

com.cloudflare.api.account.zone.dns-record.update

Our naming convention is only, thus far, capable of referring to resources in the abstract, and requires a way of specifying which specific resource we’re attempting to give access to; this is accomplished, once again, by adding a suffix to the resource name, this time, an (opaque) unique identifier, e.g.:

com.cloudflare.api.account.zone.dns-record.5d32efec

The 5d32efec suffix refers to an identifier known by the source-of-truth² for DNS records, and is otherwise opaque to the authorization system – any unique sequence of characters would do.
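Since both action verbs and identifiers are plain suffixes on a resource name, checking a key against the taxonomy comes down to splitting off its last segment. Here is a minimal Python sketch of that idea; the `TAXONOMY` set and the `split_key` helper are hypothetical names made up for illustration, not part of the production system:

```python
# Hypothetical, hard-coded taxonomy mirroring the partial hierarchy above.
TAXONOMY = frozenset({
    "com.cloudflare.api.account",
    "com.cloudflare.api.account.zone",
    "com.cloudflare.api.account.zone.dns-record",
    "com.cloudflare.api.user",
    "com.cloudflare.api.token",
})


def split_key(key: str) -> tuple[str, str]:
    """Split a suffixed key into (resource name, suffix).

    The suffix is either an action verb ("update") or an opaque
    identifier ("5d32efec"); this sketch does not try to tell them apart.
    """
    name, sep, suffix = key.rpartition(".")
    if not sep or not suffix:
        raise ValueError(f"key has no suffix: {key!r}")
    if name not in TAXONOMY:
        raise ValueError(f"unknown resource name: {name!r}")
    return name, suffix


action = split_key("com.cloudflare.api.account.zone.dns-record.update")
resource = split_key("com.cloudflare.api.account.zone.dns-record.5d32efec")
```

The suffix alone doesn't say whether it is a verb or an identifier; in this sketch that is left to the caller, which knows whether it is handling an action or a resource.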

Putting this all together, you might be able to restate our perennial question above like so (omitting the com.cloudflare.api prefix for brevity):

Can user.3cf2e98a do account.zone.dns-record.update against account.zone.dns-record.845cf6a7, under scope account.zone.5ab65c35?

Phew, that’s a mouthful.
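As data, the question is just four fields. The sketch below is one possible shape for it in Python; the `AccessRequest` class and its field names are assumptions for illustration, not the actual API of the system:

```python
from dataclasses import dataclass

PREFIX = "com.cloudflare.api."


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical container for a single authorization question."""
    subject: str
    action: str
    resource: str
    scopes: tuple[str, ...] = ()

    def question(self) -> str:
        """Render the request as the sentence used in the text."""
        def short(key: str) -> str:
            return key.removeprefix(PREFIX)

        text = (f"Can {short(self.subject)} do {short(self.action)} "
                f"against {short(self.resource)}")
        if self.scopes:
            text += ", under scope " + ", ".join(short(s) for s in self.scopes)
        return text + "?"


request = AccessRequest(
    subject="com.cloudflare.api.user.3cf2e98a",
    action="com.cloudflare.api.account.zone.dns-record.update",
    resource="com.cloudflare.api.account.zone.dns-record.845cf6a7",
    scopes=("com.cloudflare.api.account.zone.5ab65c35",),
)
```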

How does Authorization Happen?

The point of making sure there’s common understanding on what kinds of things can be given access to relates to how the authorization system is designed as being fundamentally data-agnostic and completely independent from other systems asking questions of it. Two rules play into this determination:

- Authorization policies are the sole property of the authorization system; no other service has any access to them. All other systems can do is ask whether or not access is allowed for a given resource/action, with a yes/no answer.
- The authorization system cannot and does not ensure the correctness of authorization policies, beyond its own semantics. There is no guarantee that zone X belongs to account Y for a corresponding resource/scope relationship, nor is there any guarantee that DNS record 845cf6a7 is a valid ID for that resource; only the systems-of-record can ensure these invariants.

The assumption, then, is as follows: systems asking for access against a specific resource (e.g. a DNS record) are likely in a good place to ensure the IDs they provide are valid; furthermore, they’re also likely in a good place to know the hierarchy of resources, specifically their immediate parents (e.g. the zone and account) to provide as scopes.

It is this decision to decouple resource identity from taxonomy (which remains part of the authorization system, mainly for validation purposes) that has made the system as flexible and long-lasting as it has been.

Subjects and Authorization Policies

This business about resources and taxonomies doesn’t actually bring us much closer to answering questions about access in our authorization system; how do we do that?

Firstly, we need to look at the originators of actions, our so-called subjects. As alluded to in the example above, subjects in our system are identified by their unique, fully-qualified resource identifier, e.g.:

com.cloudflare.api.user.3cf2e98a

Which represents a (presumably human) user with ID 3cf2e98a. There’s nothing else our authorization system needs to know about the subject – remember, subjects are assumed to have already authenticated before asking about access, typically with a different system that produces a signed token of some kind (e.g. a JWT), which is then provided as context to requests made against the authorization system.

Access for subjects is expressed as a collection of “policies”, themselves collections of action and resource identifiers. An example pseudo-policy that would fulfill access for previous examples might look like this:

subject: com.cloudflare.api.user.3cf2e98a
policies:
- actions:
  - key: com.cloudflare.api.account.zone.dns-record.update
  resources:
  - key: com.cloudflare.api.account.zone.dns-record.845cf6a7
    scopes:
    - key: com.cloudflare.api.account.zone.5ab65c35

Resolving access then becomes a simple matter of traversing assigned policies, and applying the following criteria for each:

- Check if actions list contains the action requested.
- Check if resources list contains the resource requested. If a matching resource contains a list of scopes, check that the request contains matching scope names, ignoring any additional scopes given in the request.

If any policy in the list matches all criteria, then access is allowed.
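The criteria above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the production code: policies are plain dictionaries mirroring the pseudo-policy above, and `policy_matches` and `is_allowed` are hypothetical helper names.

```python
RECORD = "com.cloudflare.api.account.zone.dns-record.845cf6a7"
UPDATE = "com.cloudflare.api.account.zone.dns-record.update"
ZONE = "com.cloudflare.api.account.zone.5ab65c35"

# The policies assigned to a single subject, as in the example above.
POLICIES = [
    {
        "actions": [{"key": UPDATE}],
        "resources": [{"key": RECORD, "scopes": [{"key": ZONE}]}],
    },
]


def policy_matches(policy, action, resource, scopes):
    """Return True if one policy satisfies both criteria for a request."""
    # Criterion 1: the requested action is listed in the policy.
    if action not in {a["key"] for a in policy["actions"]}:
        return False
    # Criterion 2: the requested resource is listed, and every scope the
    # policy names is present in the request (extra scopes are ignored).
    for entry in policy["resources"]:
        if entry["key"] != resource:
            continue
        required = {s["key"] for s in entry.get("scopes", [])}
        if required <= set(scopes):
            return True
    return False


def is_allowed(policies, action, resource, scopes=()):
    """Access is allowed if any policy matches; the default is no access."""
    return any(policy_matches(p, action, resource, scopes) for p in policies)
```

A call such as `is_allowed(POLICIES, UPDATE, RECORD, [ZONE])` matches the example policy, while the same request without the zone scope does not.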

Extending and Generalizing Access

Allowing access to specific resources by ID is all fine and well, until you have to provide access to all DNS records across all zones in an account, including any future DNS records. Rather than putting the burden of updating policies onto humans (or worse, some automated system somewhere), we need a way to allow access to classes of things, all at once.

Turns out our reverse DNS naming convention fits this use-case well; rather than using a concrete identifier as a resource suffix, we can simply use an asterisk to denote a partial wildcard, for instance:

com.cloudflare.api.account.zone.dns-record.*

Use of wildcards here does not intend to denote any kind of lexical matching of IDs – that is to say, you couldn’t really use dns-record.5c* to match resources with IDs starting with 5c, as identifiers are opaque as far as the authorization system is concerned.

Rather, the use of an asterisk denotes access to all resources of that kind, but also introduces additional constraints on our policies, namely the mandatory use of a scope, lest we want to provide access to all DNS records anywhere.
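To make the distinction concrete, here is a small Python sketch of partial wildcard matching, together with a check that a wildcard entry carries a scope. Both `resource_matches` and `check_entry` are hypothetical helpers written for this post, and the entry layout (a dictionary with `key` and `scopes`) is an assumption:

```python
def resource_matches(pattern: str, requested: str) -> bool:
    """Match a policy resource key against a requested, concrete key."""
    if pattern == requested:
        return True  # direct match on a concrete identifier
    name, _, suffix = pattern.rpartition(".")
    if suffix != "*":
        return False  # "dns-record.5c*" is not a wildcard, just another key
    requested_name, _, requested_id = requested.rpartition(".")
    # The asterisk stands for "any resource of this kind", nothing more.
    return bool(name) and requested_name == name and bool(requested_id)


def check_entry(entry: dict) -> None:
    """Reject partial wildcards that are not constrained by a scope."""
    if entry["key"].endswith(".*") and not entry.get("scopes"):
        raise ValueError(f"wildcard resource {entry['key']!r} requires a scope")
```

Note that the match only compares resource names; the scope check is what keeps the wildcard from reaching DNS records in other zones.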

A modified policy giving access to all DNS records for our example zone would then look like so:

subject: com.cloudflare.api.user.3cf2e98a
policies:
- actions:
  - key: com.cloudflare.api.account.zone.dns-record.update
  resources:
  - key: com.cloudflare.api.account.zone.dns-record.*
    scopes:
    - key: com.cloudflare.api.account.zone.5ab65c35

Similar to partial wildcards, we can also specify a standalone asterisk (i.e. a * without a resource name) as a “catch-all wildcard”, denoting access to all resources under a scope.

As an extra rule, we might define that catch-all wildcard access includes its top-most scope as if it were the resource itself, if this is a fully-qualified or partial wildcard resource. That is, the following policy:

subject: com.cloudflare.api.user.3cf2e98a
policies:
- actions:
  - key: com.cloudflare.api.account.zone.read
  - key: com.cloudflare.api.account.zone.dns-record.update
  resources:
  - key: "*"
    scopes:
    - key: com.cloudflare.api.account.zone.5ab65c35

Will allow access for a check against action zone.read and resource zone.5ab65c35. Nevertheless, wildcard matching is only available at the resource level; scopes provided in policies are always assumed to match directly, with no wildcard semantics.
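One way to read the catch-all rule in code is sketched below in Python. It rests on two assumptions of mine: that the “top-most” scope is the first scope listed for the entry, and that a catch-all entry with no scopes at all should never match. The `catch_all_matches` name is hypothetical.

```python
def catch_all_matches(entry_scopes, resource, request_scopes):
    """Decide whether a "*" resource entry covers a request.

    entry_scopes: scope keys listed under the "*" entry, in policy order.
    resource: the fully-qualified resource key being requested.
    request_scopes: scope keys supplied with the request.
    """
    if not entry_scopes:
        return False  # assumption: a catch-all must be fenced in by a scope
    provided = set(request_scopes)
    # Regular case: any resource requested under all of the listed scopes.
    if set(entry_scopes) <= provided:
        return True
    # Extra rule: the first listed scope counts as a resource in its own
    # right, as long as any remaining scopes are present in the request.
    first, rest = entry_scopes[0], entry_scopes[1:]
    return resource == first and set(rest) <= provided
```

With the policy above, a check for zone.read against zone.5ab65c35 matches through the extra rule, even though the request does not (and could not sensibly) name the zone as its own scope.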

Excluding Access and Policy Prioritization

So far, our policies have been exclusively aimed at giving unequivocal access to resources. There are times, however, where we might want to throw a fence around things, lest we have the hoi polloi trample on our metaphorical petunias.

Doing so with our system as described is deceptively simple, though some complexity lurks underneath the surface. Let’s first see how we might, for example, give access to all DNS records in a zone, except for a single specific record:

subject: com.cloudflare.api.user.3cf2e98a
policies:
- access: allow
  actions:
  - key: com.cloudflare.api.account.zone.dns-record.update
  resources:
  - key: com.cloudflare.api.account.zone.dns-record.*
    scopes:
    - key: com.cloudflare.api.account.zone.5ab65c35
- access: deny
  actions:
  - key: com.cloudflare.api.account.zone.dns-record.update
  resources:
  - key: com.cloudflare.api.account.zone.dns-record.65caf35c
    scopes:
    - key: com.cloudflare.api.account.zone.5ab65c35

In case you missed it, the addition of an access field with allow or deny values denotes which stance a policy takes, and whether a full match will mean access is allowed or denied.

Herein the problems begin: requesting access for dns-record.65caf35c under zone.5ab65c35 will have both policies match, one to allow since we’re given access to all DNS records for the zone, and one to deny, since we’re denied access to the specific DNS record.

How do we resolve this conflict?

We could determine which policy “wins” by just taking the last decision made (i.e. the last policy in the list) as final; that, however, would put the onus of ensuring that policies are ordered the right way on users, with potentially catastrophic consequences if they are not.

We must therefore assume the intentions of our users – why would anyone take away access, only to give it back immediately? Clearly the opposite must always be true (especially since no access at all is the default): deny policies always trump allow policies, if the two overlap.

We can, and will, however, further elaborate on this rule, as the specificity of matching matters as well; direct matches against specific resources trump partial wildcard matches, which trump catch-all wildcard matches. It should, then, be possible to say the following:

Allow access to all resources under account X, but deny access to all resources under zone Y (including the zone itself), except for DNS records, but not including DNS record Z.

Which we might translate into the following policy representation:

subject: com.cloudflare.api.user.3cf2e98a
policies:
- access: allow
  actions:
  - key: com.cloudflare.api.account.zone.read
  - key: com.cloudflare.api.account.zone.dns-record.update
  resources:
  - key: "*"
    scopes:
    - key: com.cloudflare.api.account.9cfe45ac
- access: deny
  actions:
  - key: com.cloudflare.api.account.zone.read
  - key: com.cloudflare.api.account.zone.dns-record.update
  resources:
  - key: "*"
    scopes:
    - key: com.cloudflare.api.account.zone.5ab65c35
    - key: com.cloudflare.api.account.9cfe45ac
- access: allow
  actions:
  - key: com.cloudflare.api.account.zone.dns-record.update
  resources:
  - key: com.cloudflare.api.account.zone.dns-record.*
    scopes:
    - key: com.cloudflare.api.account.zone.5ab65c35
    - key: com.cloudflare.api.account.9cfe45ac
- access: deny
  actions:
  - key: com.cloudflare.api.account.zone.dns-record.update
  resources:
  - key: com.cloudflare.api.account.zone.dns-record.65caf35c
    scopes:
    - key: com.cloudflare.api.account.zone.5ab65c35
    - key: com.cloudflare.api.account.9cfe45ac

Of course, policies this complex are fairly rare, but they do exist, and catering to these requirements is important in a system that purports to be as flexible as possible.
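The precedence rules can be sketched in Python as a ranking problem: every matching policy gets a specificity level, the most specific match wins, and deny wins a tie. This is an interpretation of the rules described above rather than the production algorithm; the compact policy layout, the `specificity` and `decide` names, and the treatment of the first listed scope as the “top-most” one are all assumptions.

```python
ACCOUNT = "com.cloudflare.api.account.9cfe45ac"
ZONE = "com.cloudflare.api.account.zone.5ab65c35"
RECORD = "com.cloudflare.api.account.zone.dns-record"
READ = "com.cloudflare.api.account.zone.read"
UPDATE = RECORD + ".update"

# The four policies above, with resources as (key, scopes) pairs.
POLICIES = [
    {"access": "allow", "actions": [READ, UPDATE],
     "resources": [("*", [ACCOUNT])]},
    {"access": "deny", "actions": [READ, UPDATE],
     "resources": [("*", [ZONE, ACCOUNT])]},
    {"access": "allow", "actions": [UPDATE],
     "resources": [(RECORD + ".*", [ZONE, ACCOUNT])]},
    {"access": "deny", "actions": [UPDATE],
     "resources": [(RECORD + ".65caf35c", [ZONE, ACCOUNT])]},
]


def specificity(pattern, entry_scopes, resource, request_scopes):
    """Return 2 (direct), 1 (partial wildcard), 0 (catch-all) or None."""
    provided = set(request_scopes)
    scopes_ok = set(entry_scopes) <= provided
    if pattern == "*":
        if not entry_scopes:
            return None
        # The first listed scope also counts as a resource in its own right.
        scope_is_resource = (resource == entry_scopes[0]
                             and set(entry_scopes[1:]) <= provided)
        return 0 if scopes_ok or scope_is_resource else None
    if not scopes_ok:
        return None
    if pattern == resource:
        return 2
    name, _, suffix = pattern.rpartition(".")
    if suffix == "*" and name and resource.rpartition(".")[0] == name:
        return 1
    return None


def decide(policies, action, resource, request_scopes=()):
    """Return True if access is allowed; no matching policy means no access."""
    best = None  # (specificity level, is_deny)
    for policy in policies:
        if action not in policy["actions"]:
            continue
        for pattern, entry_scopes in policy["resources"]:
            level = specificity(pattern, entry_scopes, resource, request_scopes)
            if level is None:
                continue
            # Tuples compare by level first; at equal level, deny (True) wins.
            candidate = (level, policy["access"] == "deny")
            if best is None or candidate > best:
                best = candidate
    return best is not None and not best[1]
```

Because the outcome depends only on the best-ranked match, the order of policies in the list no longer matters, which is the property we were after.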

Scaling the System

So far, we’ve been providing lists of actions and resources as direct references, but doing so in real life would be incredibly onerous (especially given the large number of options available to us). The solution to this is – you guessed it – normalization, or in other words, grouping things under a unique name we can refer to.

For actions, we can form action groups, or as they’re sometimes (and perhaps confusingly) called, “roles”. These would represent collections of possible actions available to a subject in the abstract, not tied to any specific resource or scope. For instance, an example “DNS Administrator” action group might look like this:

id: 9aff84ac
name: DNS Administrator
actions:
- key: com.cloudflare.api.account.zone.dns-record.read
- key: com.cloudflare.api.account.zone.dns-record.create
- key: com.cloudflare.api.account.zone.dns-record.update
- key: com.cloudflare.api.account.zone.dns-record.delete

Similarly, resource definitions can benefit from being named and referenced separately, for instance:

id: fd25a5dd
name: Production Zones
resources:
- key: com.cloudflare.api.account.zone.2acf325f
  scopes:
  - key: com.cloudflare.api.account.6afe524a
- key: com.cloudflare.api.account.zone.33cfade6
  scopes:
  - key: com.cloudflare.api.account.6afe524a

One might then assign these action groups and resource groups to a policy under separate action_groups and resource_groups fields respectively, to the exclusion of the actions and resources fields for the policy, e.g.:

subject: com.cloudflare.api.user.3cf2e98a
policies:
- access: allow
  action_groups:
  - id: 9aff84ac # DNS Administrator
  resource_groups:
  - id: fd25a5dd # Production Zones

It is assumed that these action and resource groups are managed separately, and can be re-used as needed by the owning user (which implies that access to these resources is itself governed by the authorization system, which needs to control access to itself, a fun exercise in recursion).
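At evaluation time, a policy that refers to groups can be expanded back into the flat actions and resources form used earlier. The Python sketch below assumes groups live in simple in-memory dictionaries keyed by ID; the `expand_policy` helper and both lookup tables are hypothetical and stand in for whatever storage a real system would use.

```python
RECORD = "com.cloudflare.api.account.zone.dns-record"
ACCOUNT = "com.cloudflare.api.account.6afe524a"

# Hypothetical in-memory stores for the two example groups above.
ACTION_GROUPS = {
    "9aff84ac": {
        "name": "DNS Administrator",
        "actions": [{"key": f"{RECORD}.{verb}"}
                    for verb in ("read", "create", "update", "delete")],
    },
}
RESOURCE_GROUPS = {
    "fd25a5dd": {
        "name": "Production Zones",
        "resources": [
            {"key": f"com.cloudflare.api.account.zone.{zone_id}",
             "scopes": [{"key": ACCOUNT}]}
            for zone_id in ("2acf325f", "33cfade6")
        ],
    },
}


def expand_policy(policy, action_groups=ACTION_GROUPS,
                  resource_groups=RESOURCE_GROUPS):
    """Replace group references with the actions and resources they name.

    An unknown group ID raises KeyError rather than being silently skipped.
    """
    actions = [action
               for ref in policy.get("action_groups", [])
               for action in action_groups[ref["id"]]["actions"]]
    resources = [resource
                 for ref in policy.get("resource_groups", [])
                 for resource in resource_groups[ref["id"]]["resources"]]
    return {"access": policy.get("access", "allow"),
            "actions": actions,
            "resources": resources}


expanded = expand_policy({
    "access": "allow",
    "action_groups": [{"id": "9aff84ac"}],    # DNS Administrator
    "resource_groups": [{"id": "fd25a5dd"}],  # Production Zones
})
```

Expanding on read keeps the stored policy small and means that a change to a group takes effect for every policy that references it.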

Conclusion

This post is not so much a guide to the current (or really, past) state of the production system at Cloudflare; a number of omissions and simplifications have been made to keep it from becoming a rambling epic.

Rather, it is a high-level overview of the design thinking behind a production system that is part of every single request made to the public Cloudflare API (and a few more still), and which hopefully serves as a map for others looking to build similar systems, the core idea being: authorization is about ensuring the basic questions being asked can be answered as quickly and unambiguously as possible.

Typically, this means a lot of work goes into policy semantics, and for us, this meant building out a resource taxonomy in order to balance the data-agnostic nature of the system with the need to ensure that policies are always correctly formulated.

There is still a lot of ground left to cover, however, first and foremost ABAC (attribute-based access control). I’ll leave that for a future post.

¹ Specifically, August 2019, though the beta was opened a couple of months earlier.
² In other words, the service or component responsible for handling DNS records.