How Box Does Authorization
Damian Schenkelman: Welcome to Authorization in Software, the podcast that explores everything you need to know about authorization. I'm your host, Damian Schenkelman, and in each episode we'll dive deep into authorization with industry experts as they share their experiences and insights with you. If you're a software developer or just someone that's interested in the world of authorization in software in general, you are in the right place. Let's get started. My name is Damian Schenkelman and today I'm chatting about how Box approaches authorization with John Huffaker, distinguished engineer at Box. Hey, John. It's great to have you here.
John Huffaker: Thanks for having me, Damian.
Damian Schenkelman: It's my pleasure. Could you maybe give our listeners a brief overview of your background on what you're doing at Box nowadays?
John Huffaker: Yeah, for sure. My last job was at a faceted search company called Endeca in Boston. We powered sites like Newegg and things like that, like whenever you search for a part or a video card and then there's the price range and things down the side, we enabled a lot of that. We did a lot of analytics apps, and eventually were acquired by Oracle. I then moved over to Box a decade ago, and the first team I joined there was the team that was launching our public APIs for consumption by platform partners and third party applications, and one of my first big efforts there was on our metadata platform, so I launched our metadata facility which lets customers define types, templates, whatever you want to call them. Like say they want to store a contract with us, and so they could say the amount, the counterparty name, the expiration date, things like that. And once they fill out values, you can use that for, again, faceted search, but also workflow. It kind of goes in a lot of different places and we do this a little bit right now, but hopefully over time it also helps drive authorization decisions. Then during building that, so Box is a kind of pre-AWS company, so we had our own data centers and I saw how challenging it was to get, because we were evolving from a monolithic architecture towards a more service-based one, and I saw how hard it was to get a new service launched at Box, so I moved over to work on one of our internal platforms built on Kubernetes. We were super early adopters of Kubernetes. Interesting story. Box built the initial version of kubectl, which people use a lot today.
Damian Schenkelman: I did not know that. That's an interesting fact.
John Huffaker: Yeah. One of our co-founders actually ended up building that. We were just super bleeding edge. Like deployments didn't exist. We were manually managing replica sets. Super early on the Kube bandwagon and then went through all the different growing pains, but it was a great opportunity to interact with Joe Beda, Craig McLuckie, Brian Grant, Tim Hockin, all those people that started that project and work with them on how to use it and round out the ecosystem. We built a lot of early components there. It was definitely challenging. From a business standpoint, hopping on the bandwagon a little bit later probably might've been a better move, but super fun as an engineer to work on a project that early in its lifespan. Then after that got stable, fairly well adopted internally, I moved back to working with our core team, which is if you think about the backend slash platform for everything you think about when you think about Box, like files, folders, users, groups, enterprises, all the just sort of core high scale objects that you think about. That's where I'm working now, just across a very, very, very wide variety of efforts there, including still interfacing with our internal platform teams as we do big migrations as representation from the core team. I think that covers all of it. Hopefully your listeners aren't asleep yet, but.
Damian Schenkelman: It's interesting because you've been through a few different areas of a very large product and platform. In my case, I've done some similar things here at Auth0, now Okta, and that gives kind of a unique perspective whenever you're building authorization capability, platform capability in general, so I think that perspective will be very interesting for anyone listening to the episode because we'll be able to mix general product and technology perspective with a more security and authorization perspective.
John Huffaker: Totally. Yeah. Yeah. I've been here a decade, but I've had three very different jobs over that time period, and Box has big company problems, but sort of small company infrastructure, and so anytime we go to build something new or develop something new, we're always looking at it from a very basic level all the way down to the platform level to think about how we can best serve both our internal and external customers. Yeah. It's a fun place to work. I'm sure you probably have a lot of similar experiences at Okta, just working on very high impact basic stuff all the time.
Damian Schenkelman: Yeah. One of the great things I like to tell folks about my job is that we get to experiment and figure out in our case what the identity industry is going to look like maybe in the next five years, but we are the ones shaping it, considering the impact that we can have.
John Huffaker: Yeah. It's very fun as an engineer.
Damian Schenkelman: Yeah, yeah, definitely. So as always, I'd like to start with the why, right, and this is why I was saying I think your context, not just working on authorization but in general across the Box product and the inaudible platform will be useful. Why is authorization important at Box and what happens if authorization is not working properly?
John Huffaker: Yeah. I mean, there's lots of layers to this answer and we can talk through both, but the high level is Box, you have to think about it for a minute, but Box is secretly a security company. So storing files, managing the metadata, managing all that, like all the facilities to extract value out of files certainly isn't easy. But when customers come to us and they buy us, it's the big blocker for most people moving from their on premise shared SANs or shared drives is the security risk, right? Like you're putting your files on the internet. You don't want it to get out there. You don't want the various governments or state actors to get access to them. So the kind of key facility that we sell to our customers, and probably the biggest risk to our business is having a big security incident in terms of losing customer trust. So we spend an enormous amount of time as a company thinking through all the different security aspects. Our security organization tends to break down on two lines. There's the application security and then there's the infrastructure security and our application security team's just very focused on like, how do we ensure that the application that we have is not, you know, like somebody doesn't add a new line of code and make files accessible to anyone that has access to the system or various application concerns. How do we build the product securely, like running pen tests? We work with HackerOne a lot, which has been a really fruitful relationship in terms of getting early pen test results. But also there's the whole other side to that, which is we're basically a file storage, file productivity platform. How do we give tools to enterprises to help their users not expose things? Right? There's always some level of sharing that you want to do with a document either internally or externally and allowing those, your internal users that do that safely and securely is a big deal. 
All the way down from individual shared link expiration and things like that and passwords, all the way up to more advanced stuff like classification policies. We just launched information barriers, functionality that helps keep different parts of the enterprise from seeing each other's content. Email uploads, secure file request features, things like that. Yeah. Security is a very big deal to us and our customers and all of those are different types of authorization decisions. And then the more low level infrastructure oriented authorization decisions are, I'm sure your customers are fairly aware of all these issues of making sure that your inaudible roles are locked down, have least-privilege access to various resources. Ensuring that an attacker that somehow manages to get access to one node can't pivot and start moving laterally through the infrastructure is another place where we spend a lot of time, and we use things like microsegmentation using Calico. I know we've got some inaudible policies down at the infra layer validating that changes, that the person making a change should be able to make that change. Yeah, so just layers and layers of authorization decisions floating around here.
Damian Schenkelman: Yeah, I was going to say, right, you folks store a bunch of sensitive data on behalf of customers and everything else. Right? Security is all layers and layers and different ways of protecting things. One thing that you mentioned was that, again, there are a lot of things that you can do, right? You need to authorize users. You need authorization at the infra level. There are multiple ways in which you can file share. You can file share to a user. You can file share with temporary time limits. You can upload in an email, things like that. And then there's also, you mentioned the notion of at the beginning of having an API and working with partners, which means there's also kind of like app permissions, so a lot of things. At a high level, and we'll dig deeper across the show, but how does Box handle authorization? Can you maybe give us an overview of that?
John Huffaker: Yeah. We have lots, and this is something that's a little bit in flux, but we obviously, depending on the thing that you're asking about, there's lots of opportunities for authorization checks. But the bulk, I'm going to talk specifically here about the application's authorization decision, so like if I fire up my web browser and I browse to a folder and it shows all the items there and then I upload a file, there are individual authorization checks being done at each step in that UI. Can I view the folder? Can I view the folder's children? Can I view that file and can I upload a file into that folder are all different authorization checks that we need to do, and given our nature as kind of a web application, and given the complexity of the object model and authorization scheme, we repeat the same checks over and over again on each request. Yeah. Even if I'm seeing if I can see a user, we have to see can you have access to that user? Can you see that user through your collaboration graph? Are you in the same enterprise? Can I see a comment? Et cetera. And so the way this is orchestrated, and so this is something that's changing as we migrate, as we break down our monolith, but I'm just going to talk specifically about our monolith for a moment. Box has a classic LAMP stack from way back in the day, and within there we have a kind of active record model that was home built, and on each of those at any kind of point in the code you can say, does this user, does this context have the ability to see, does it have this permission, whatever the verb is on this object, and there's some that are built deep down into the data layer. So anytime you try to fetch the file's metadata, its system metadata like name or the size, that kind of like core object information, we check if you have view access. There's a few other places like that.
And within that, that "do I have view access" kicks off a call to our permissions policies, which are just more PHP code that basically break down into two flows like a grant phase and a revoke phase, and at the end we check, hey, did you get granted view and is it still there? And if yes, then we return true and we return the object. And those policies are just, it's probably the single most complex thing within the Box system. It inaudible access the number of tables. Very performance sensitive.
Damian Schenkelman: inaudible.
John Huffaker: It checks... Oh, go ahead.
Damian Schenkelman: No, I was going to ask a bit about that, you have kind of like these two ways of federating access for the positive and the negative, but since you can do it from anywhere on the code, inaudible of these things require state. From both code review perspective, performance perspective, security perspective, this might be a bit tough to do, right? Like knowing when you should call it and having to optimize things so that you minimize calls and so on.
John Huffaker: Yeah, and this is one of those things that's specifically interesting as we move from the monolith to the microservices universe. Honestly, a little bit, it's a tiny bit easier. There's a different set of challenges. In the monolith universe, the request is largely happening within the monolith, and so we cache multiple calls or memoize multiple calls to the function, and so adding superfluous checks to the same object isn't too much of a, isn't at all a perf... It basically short circuits out. But also on the monolith side it can be somewhat, yeah, you want to do a balancing act between calling it too much. But also, like the worst thing that can happen in Box at some level is that you don't call it, right, from going back to the previous thing there. So that is one of the challenges with our monolith is you've got a bunch of library... It's a normal monolith. You've got a bunch of library functions, some of which may call the permission check, some of which may not, and you need to ensure that the permission check is happening, but some of those library functions are used in permissions calculations themselves, and so you don't want to use... And some of them you don't want to be calling using permission. So it gets to be a real struggle reasoning about where the permission checks are done and where they aren't. On the services side of the world, we've landed on a really simple model. We have domain services, they have endpoints, and effectively every one of those endpoints has a permissions policy associated with it, so there's no question there about where it's applied, when it's applied. It's applied very rigorously, versus with the monolith, there's a lot of weight put on the testing to ensure that we're doing... I mean, obviously we need to do testing in both cases, but it's more error prone and so you have to really back yourself up with negative tests and ensure that all the permissions policies you expect are being enforced on an endpoint.
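The per-request memoization John describes for the monolith can be sketched roughly as follows. This is a minimal illustration in Python, not Box's code (which is PHP); the class name, the `slow_policy` stand-in, and the user and file identifiers are all assumptions made up for the example.

```python
from typing import Callable, Dict, Tuple


class RequestPermissionCache:
    """Memoizes permission checks for the lifetime of a single request."""

    def __init__(self, evaluate: Callable[[str, str, str], bool]) -> None:
        self._evaluate = evaluate  # the expensive policy evaluation
        self._results: Dict[Tuple[str, str, str], bool] = {}
        self.evaluations = 0  # counts real policy runs, for illustration only

    def can(self, user_id: str, action: str, object_id: str) -> bool:
        key = (user_id, action, object_id)
        if key not in self._results:
            # First time this question is asked in the request: run the policy.
            self.evaluations += 1
            self._results[key] = self._evaluate(user_id, action, object_id)
        # Repeated checks short-circuit to the stored answer.
        return self._results[key]


def slow_policy(user_id: str, action: str, object_id: str) -> bool:
    # Hypothetical stand-in for a policy that would read many tables.
    return (user_id, object_id) == ("alice", "file-1") and action == "view"


cache = RequestPermissionCache(slow_policy)
for _ in range(3):
    cache.can("alice", "view", "file-1")  # evaluated once, then served from cache
cache.can("bob", "view", "file-1")        # a different key, evaluated once
```

Because the cache lives only as long as one request, a superfluous check on the same object costs a dictionary lookup, while a missing check is still a missing check, which is the risk John calls out.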
Damian Schenkelman: It's pretty interesting that Box is, I'm checking things now, like 18 years old, right? Starting 2005, and only now you're starting to kind of go on the one hand towards microservices, and there's that whole industry debate always going on about what's the right time, it's not the right time, but also starting to extract policies away from code and into kind of more policy logic.
John Huffaker: Yeah. I'll dive back into what we kind of do in some of the policies in a second, but just to answer that, we've been pulling services out. Like we're kind of taking the strangler pattern slash peeling an onion with our service-based approach. Even when I started, we had some initial very infra centric services that were already being pulled out, and so at the places where it was easy to extract in the services we did, so a lot of the unstructured data handling is already pulled out. A lot of our user extensible metadata is pulled out. A lot of the clients have had a bunch of their logic pulled out. So it's just kind of like we're pulling off the top and bottom of this monolith slowly but surely, and what's left in the middle, which is what we've been tackling for a year-ish, but also it's very complicated and sensitive, so it takes longer, is the kind of core business logic. And so, yeah, it's just kind of a long-running process here. Box has a lot of stuff to do, so we go in and out of dedicating more or less resources to an effort like this. But, yeah, to dive back just for a second to the permissions policies themselves. So within any, like taking a file, for example, in the grant flow, we check, hey, is there a shared link in the request context? We call them collaborations, but you can think about it like an ACL. Are you collaborated on this file as a viewer, as an editor, as a whatever? Are you collaborated up the folder tree, because we do this thing called waterfall permissions, and that's kind of the bulk of the grant phase. There's a few other steps in there that we could talk about, but they're less interesting. And then the revoke phase is much more interesting. Is the thing deleted? Are there legal holds? Are you at quota? You can't upload more to this folder if you're at quota, and that gets a lot more complicated. Yeah, so that's kind of the high level of authorization.
I'm curious if there are places you'd love to talk about more or things that were unclear. I'm used to doing this with a whiteboard. It's a little bit harder just gesturing with my hands.
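The grant phase and revoke phase John outlines can be pictured with a small sketch. It is written in Python for brevity and is only an assumption about the shape of such a policy: the `Context` fields, the role-to-permission mapping, and the specific effect of each revoke condition (for example, that a legal hold removes delete) are hypothetical and not taken from Box.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Context:
    """Facts about one request; every field here is hypothetical."""
    collab_role: str = ""          # "viewer", "editor", or "" if not collaborated
    has_shared_link: bool = False
    is_deleted: bool = False
    under_legal_hold: bool = False
    at_quota: bool = False


# Assumed mapping from collaboration role to permissions.
ROLE_GRANTS = {
    "viewer": {"view"},
    "editor": {"view", "upload", "delete"},
}


def grant_phase(ctx: Context) -> Set[str]:
    """Collect every permission that some rule grants."""
    granted: Set[str] = set()
    if ctx.has_shared_link:
        granted.add("view")
    granted |= ROLE_GRANTS.get(ctx.collab_role, set())
    return granted


def revoke_phase(ctx: Context, granted: Set[str]) -> Set[str]:
    """Remove permissions that other conditions take away again."""
    remaining = set(granted)
    if ctx.is_deleted:
        remaining.clear()
    if ctx.under_legal_hold:
        remaining.discard("delete")
    if ctx.at_quota:
        remaining.discard("upload")
    return remaining


def can(ctx: Context, action: str) -> bool:
    # "Did you get granted it, and is it still there?"
    return action in revoke_phase(ctx, grant_phase(ctx))
```

For instance, `can(Context(collab_role="editor", at_quota=True), "upload")` is false while `"view"` on the same context is still allowed, mirroring the quota example from the conversation.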
Damian Schenkelman: Yeah, I get it. Maybe in the future season we can do another episode with video podcasting, which is also becoming a fad. I'm curious to learn more about what you're evolving towards, right? Like what technology is this service oriented approach going to use for policies, how you picked it?
John Huffaker: Yeah, and we've gone in and out on this decision and we've reconsidered it a few times, but for our services on our floor, we're using a XACML engine, open source XACML engine. I don't know how you actually pronounce that. XACLM?
Damian Schenkelman: Yeah, XACML. inaudible. Yeah.
John Huffaker: Yeah. We're using Balana, which we have a few developers that are ramped up on and are able to make local patches to in terms of multiattribute prefetch and things like that, so we have quite a few domain services on the floor that are using that for a lot of basic policy, and as of right now, we're in the middle of pulling the giant PHP permissions policy over to it, and that's been an interesting discussion. I mean, XACML's obviously separating the permissions policy from the rest of the code, which obviously has some really nice benefits in terms of, the one thing I previously mentioned is it's very clear where it's enforced and when it's enforced. One of the challenges in that monolith land, like I mentioned, is you've got some things that need to run, like the permissions itself needs to run in sort of a God mode, supervisor mode in order to fetch the data in order to see if you can even see the data, and so it creates a really clear boundary there where the permissions policies have a lot of access internally and we switch over to using, we call it a domain service token, but a version of our internal token that basically indicates that you should... Because we want to be able to reuse the data access logic from the service to fetch the files, but without inflicting the permissions calculation on, without creating an infinite loop, basically. So yes, that's one super benefit is it becomes very clear where the permissions are going to be enforced. They're going to be enforced on the endpoint and the permissions policies themselves have broader access to the data than the normal callers to the endpoint. We looked at a few things. I mean, obviously just like home built policy in Java was one thing, XACML was another thing, and we've revisited as OPA and Authzen and all those things have come out, whether those would be better choices.
For me, having stared at this problem for a while, it feels like some machine learned approach here that minimizes latency of the policy by using any features that you have available at request time, either the enterprise IDs or certain request features. Being able to find a fast path through that to the grant would be a wonderful thing. We looked at OPA and I was super excited about OPA for a while. I haven't looked recently at OPA, so I apologize if I'm misstating any of the changes in direction they made, but they seemed like they were really heavily focused on infra security at that moment, really focused on low data scale, almost like iptables level, like sub-millisecond response times, and we have a lot more complicated policy issues here where we touch 60 plus tables. It's a large amount. You couldn't fit it into the static small database they have, like very large data sets including the files, folders, their collaboration relationships and things like that, and so it wasn't quite a fit and I felt like we'd kind of be a use case that they would've liked but been way outside where they were getting a lot of success and traction at that time. But, yeah, there've been concerns about... We were handwriting XACML for a while, which obviously no developer enjoys, so we actually have a bunch of libraries written in Java that effectively generate XACML out the other side, so it's closer to a Java-like experience with the IDE and things like that. But, yeah, the big question as we move the items permissions policies over are going to be, and as we move to microservices, right? The repeated checks thing starts to become more of a concern. We have a parallel project trying to enable GraphQL for some of our internal clients, and so obviously naively that thing's going to try to hit the same endpoint over and over sometimes.
And so the permission recheck thing, figuring out how to do either a request level cache, which we had for a while, or some other type of caching in order to avoid the penalties there is going to become important. And then, yeah, just ensuring that the underlying policy ideally is faster than the thing we currently have, hand-tuned PHP, but at least at parity with the thing that we have is important. My secret hope as an architect here is the current policy is, like I said, hand tuned, which is a polite way of saying very hard to change, right? Like some of these policies have order dependencies on the other parts and you can't move them around without potentially creating a security issue, and so I would love to just get the policies into a place where they're naturally expressed the way you'd want to express them without thinking through the perf implications and then allowing the underlying platform to intelligently make decisions about evaluation order and things like that. Yeah.
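The "supervisor mode" idea, where the policy reuses the service's data access logic without triggering itself again, might look something like the sketch below. It is a hedged Python illustration only: `SERVICE_TOKEN`, `fetch_file`, `can_view`, and the in-memory `FILES` table are hypothetical names, and the owner-only rule is a placeholder for a real policy.

```python
from typing import Dict

# Hypothetical in-memory table standing in for the files store.
FILES: Dict[str, Dict[str, str]] = {
    "f1": {"name": "contract.pdf", "owner": "alice"},
}

# Hypothetical marker standing in for an internal "domain service token".
SERVICE_TOKEN = object()


def can_view(user_id: str, file_id: str) -> bool:
    """Policy: reads the record in supervisor mode, so no check re-runs."""
    record = fetch_file(file_id, caller=SERVICE_TOKEN)
    return record["owner"] == user_id


def fetch_file(file_id: str, caller: object) -> Dict[str, str]:
    """Shared data access; checks permissions unless the policy itself calls."""
    if caller is not SERVICE_TOKEN and not can_view(str(caller), file_id):
        raise PermissionError(f"{caller} may not view {file_id}")
    return FILES[file_id]


def get_file_name(user_id: str, file_id: str) -> str:
    """Endpoint: every external call goes through the policy exactly once."""
    return fetch_file(file_id, caller=user_id)["name"]
```

Without the token branch, `fetch_file` would call `can_view`, which would call `fetch_file` again, and so on; the token is what breaks that loop while keeping a single data access path.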
Damian Schenkelman: Yeah. No, that makes sense. I think there were a bunch of interesting nuggets there. I'm going to try to see if I got them. So because you already have a set of policies, when you started looking for alternatives, on the one hand, you talked about OPA and that being more infrastructure oriented, which I would tend to agree. I think Rego as a language allows you to express everything, but OPA and the tool set, it's more thought of for infrastructure when you think about how they handle things like cache and state in general. It seems to kind of go down that line. But at the same time, you already have policies, which means that you need to keep compatibility with that, and maybe you were used to expressing things in a certain way, which you probably want to maintain for security reasons, for simplicity reasons and so on. So again, XACML at least made the cut and then you were, okay, we're going to go with this implementation, which is Balana, and then from that point on, it's about, okay, how do we improve things? So again, you've got this higher level, maybe DSL, library in Java so you don't have to write the XACML by hand. You mentioned also that you're making changes internally to how the library works, which might be, again, you're making improvements to the caching logic and so on. But you kind of have a small fork of Balana that you're tailoring for Box, and that's kind of what you ended up doing as part of that migration.
John Huffaker: Yep. That sounds right. And we're always open, I think we're always open if something revolutionary shows up, we're always open to considering other options here.
Damian Schenkelman: Yeah. The authorization space is kind of fairly effervescent right now, and new things are coming up every day. Like I know for example AWS recently open sourced Cedar, their policy language, which is yet another thing and has a few interesting benefits from a static analysis perspective. We did an episode on that. There's a bunch of new SaaS companies kind of looking at the space, open source projects, so yeah. I imagine over the next few years you will be able to either replace or complement parts of what you're building with some of these solutions in this space.
John Huffaker: Yeah, that would be great. And the key goal right here is getting out of this thicket of permissions logic that we have. Like we have things that will short circuit for perf reasons, which creates a lack of commutativity in the policies, or we'll have things that will reject all permissions and things like that. So getting into XACML and having the stuff naturally expressed puts us in a good place to experiment with other technologies more easily than kind of the free-for-all PHP code base we have.
Damian Schenkelman: Yeah. No, I get it. That makes sense, and I'd like to kind of double click on that because you mentioned the notion of perfs, and earlier we were talking about all the ways in which you can share documents with folks. So again, this is a problem that I've been thinking about for a few years now, which is when you get into granular document sharing, like anyone can view, or not view, but specifically just one part. It's not the folder, it's not the files, but it's on any attribute or something. It's like they can either view or edit one file. When you get into performance, when you get into Box scale, you are kind of dealing with complex authorization requirements, so how does authorization work for file sharing? What happens when I share a file with another user and give them maybe read permissions or write permissions? How is that handled internally?
John Huffaker: Yeah, so I guess this is a click down in terms of detail from the previous answers. But basically, like if you imagine going to our preview page to look at a file and you've been collaborated, to use our term, what happens is we go in, dive into that endpoint and eventually hit the permissions logic, and what'll happen is we pull back... I mean, the first thing we do is we try to access the file, the file metadata, which triggers the permissions policy, which will look through, for that user, what are all the collaborations that it's a part of?
Damian Schenkelman: So the file metadata includes the collaborations for the file, which would be the ways in which the user can interact with the file?
John Huffaker: Not fully, quite. So we have three tables under the covers. We have a files table, a folders table, and a collabs table, collab being short for collaboration. In the collabs table, it's got the inviting user, the receiving user, as well as their role on that thing, and basically, because of the way things are done... You'd think like, oh, just look at the collabs table for that user ID that's trying to access the file and the file ID. Unfortunately or fortunately, we maintain things very normalized under the covers, so if you were collaborated at a parent folder somewhere, we have to give you access based on the fact that you've got access at that parent folder. It's referred to as waterfall permissions. So basically what ends up happening on each request where you're accessing a file or folder is we pull back all the things, all the roots or all the folders or all the files that you have access to, and then we quickly look through that list, and we keep that capped. We recommend customers stay under like 10,000 of those, so it does require some thought on the customer side about how they want to store their permissions in a normalized way. But we scan that list to make sure you have access to that file or any of its parents, and then if we do... We drive a lot of UI on the other side. Rarely do we just want to know can you view the file, right? We want to know if you can delete it so we can know if we need to show the delete icon or not. We want to know if you can comment on it. We want to know a bunch of things about what you can do with this thing, so actually in the policies we compute en masse, they largely use the same data, we compute en masse what permissions do you have on this file? Can you comment on it? Can you delete it? Et cetera. And then inaudible-
Damian Schenkelman: So it's not that you don't have to ask the same question multiple times. It's not like you're saying, hey, can the user view, then can the user inaudible, then can the user delete? It's more of a, hey, from all of these, tell me which ones the user can do. Something like that?
John Huffaker: That's basically it, and then that function that I was talking about earlier, can you view it? All that thing does is looks to see if you have that permission amongst that array of permissions. This is something that we call it bulk or batch permissions on the Balana side, but that we have to do something slightly interesting here to fit into the, you called get. We can just evaluate the get on the Balana side, but we do want to be able to bulk calculate the permission sometimes because we do return to our clients, can it be deleted, can it be changed, et cetera. Yeah.
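Waterfall permissions combined with the bulk calculation discussed above can be sketched as a walk up the folder tree that collects every permission at once. This Python sketch rests on assumptions: the `PARENT` tree, the `USER_COLLABS` mapping, the role names, and the choice to union permissions from all collaborated ancestors are illustrative, not a description of Box's actual rules.

```python
from typing import Dict, Optional, Set

# Hypothetical folder tree: item id -> parent id (None for the top).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "contracts": "root",
    "2023": "contracts",
    "deal.pdf": "2023",
}

# Hypothetical collaborations for one user: item id -> role.
USER_COLLABS: Dict[str, str] = {"contracts": "editor"}

# Assumed mapping from role to the permissions it carries.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": {"view", "comment"},
    "editor": {"view", "comment", "upload", "delete"},
}


def effective_permissions(item_id: str, collabs: Dict[str, str]) -> Set[str]:
    """Compute all permissions at once by checking the item and its ancestors."""
    permissions: Set[str] = set()
    current: Optional[str] = item_id
    while current is not None:
        role = collabs.get(current)
        if role is not None:
            # Access granted at a parent "waterfalls" down to this item.
            permissions |= ROLE_PERMISSIONS[role]
        current = PARENT[current]
    return permissions


def can(item_id: str, action: str, collabs: Dict[str, str]) -> bool:
    # A single-permission question is just a membership test on the bulk result.
    return action in effective_permissions(item_id, collabs)
```

Here `effective_permissions("deal.pdf", USER_COLLABS)` yields the editor permissions because of the collaboration two folders up, and a client could use that one result to decide which icons to show.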
Damian Schenkelman: Yeah, yeah. That makes sense. Let's try to mix things up a bit because you're saying, okay, you can have all of these sides, but you might have this cascading or waterfall permissions where you might have the ability to view a file, not because someone shared the file with you, but maybe they shared the file's parent or maybe even a grandparent folder, so this is kind of like everything is inaudible if you look at it. But again, you might have multiple ways in which you read the file, reach the file, and then you also have groups, right? There's the notion of, hey, I can be part of a group and read the group, has the collab.
John Huffaker: Yeah.
Damian Schenkelman: So it seems that there's a bunch of state that you have to put or have go through the policy so that the policy can arrive at the right decision. I'm curious about a couple of things. One of those is, again, you mentioned some recommendations for customers or keeping things under like 10k objects for roots, but I'm not sure if I follow, so I'd be curious how you limit that, and also I'd like to understand a bit more how you handle the groups with these kind of inaudible denomination like inaudible.
John Huffaker: Yep, yep. Yeah, so you're correct. It is a lot of data and groups is an interesting question in here. We make very extensive use of memcache and Redis along the way through a lot of these calculations, so that thing I mentioned around what collabs am I on and what groups am I in, those pull extensively from cache, and that's probably one of our larger expenses. So in terms of groups specifically, so groups is a fun one for us. In my mind, like when I joined Box, there was no groups and I'm like, this is quite crazy. Like as an enterprise software company that groups would be a key point of scale, like giving companies your customers, some of which are very large, a scalable way to manage access to content. We did add it a little way in. I wouldn't say it's necessarily a hundred percent where I'd like to see it. But, yeah, as of today, just in terms of that specific example we were talking about, we do pull all the groups that you're a member of and we pull all the collabs for the groups that you're a member of and we kind of mash that all in. And there's kind of a deep reason why, like if you stare at this use case and you think about it for a second and you're like, this seems absolutely ludicrous to be pulling all these things back each time, like pulling all the things you have access to when the destination is this one file. Obviously a database index would solve a lot of that for you pretty rapidly. The hidden thing in a lot of this is... So we have three or so and potentially more experiences, one obvious one being search where there's an implicit filter. When I type like, " show me the documents containing musk ox," there's an implicit filter there, right? Show me musk ox in all the documents that I have access to. People don't think about that I have access to. As part of that-
Damian Schenkelman: [inaudible] It's challenging.
John Huffaker: Yeah, and as part of that we basically, and we can talk about the low level details of how we index that, and we also have to do that for metadata query, which is another facility we support. But also, the main landing page at Box is this experience we call our all files page, and it's basically kind of a virtual view of everything you own, everything you have access to and a few other things, and so across these three different experiences, we basically end up having to look at everything you have, all the roots you have access to each time anyway, and so it ends up being... We end up just kind of relying on that facility for a lot of, well, basically all of our access control. Even though if you look at any particular use case, there'd probably be some way of optimizing the access there. And so, yeah, groups specifically, and more interestingly like your all enterprise group, which we've started supporting in certain limited cases. Yeah. We go through, we expand the group or we see which groups you're a member of. We start from you, walk out to groups you're a member of and then figure out the collab objects on the other side of that and feed that into that giant machine. My ideal hope over time here, like right now we don't really index the groups that have access to a file in our search indexes and in our metadata indexes. My hope over time is we would start doing that and then just start passing in the group IDs there in order to reduce some of those cardinality challenges, like in terms of that 10k limit that I talked about before. 
Basically any core system problem at Box is there's just this hard trade-off between storing it normalized and having to deal with an expansion and bumping up against that 10k limit of folder IDs that we pass to our search index, or denormalizing it and making it fast, but then potentially bloating out the amount of data we have to store or just having an operation that was fast before, like adding a collaborator, which was like O(1) before, is now potentially an O(N) operation where you have to re-index everything.
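A minimal sketch of the expansion John describes here: start from the user, walk out to the groups they belong to, collect the collab roots on the other side, and use those IDs as the implicit search filter. The dictionaries, function names, and the choice to raise an error at the limit are all hypothetical illustrations, not Box's implementation; only the 10k cap on folder IDs comes from the conversation.

```python
from typing import Dict, List, Set

# Hypothetical in-memory stand-ins for the cached lookups
# ("what groups am I in", "what collabs am I on").
GROUPS_BY_USER: Dict[str, Set[str]] = {"alice": {"eng", "all-enterprise"}}
COLLAB_ROOTS_BY_PRINCIPAL: Dict[str, Set[int]] = {
    "alice": {101},
    "eng": {202, 203},
    "all-enterprise": {203, 304},
}

# The 10k cap on folder IDs passed to the search index.
MAX_FILTER_IDS = 10_000


def accessible_roots(user: str) -> List[int]:
    """Start from the user, walk out to their groups, collect collab roots."""
    principals = {user} | GROUPS_BY_USER.get(user, set())
    roots: Set[int] = set()
    for principal in principals:
        roots |= COLLAB_ROOTS_BY_PRINCIPAL.get(principal, set())
    return sorted(roots)


def search_filter(user: str) -> List[int]:
    """Build the implicit 'that I have access to' filter for a query."""
    roots = accessible_roots(user)
    if len(roots) > MAX_FILTER_IDS:
        # Assumption: fail loudly. How the limit is actually enforced
        # is not stated in the conversation.
        raise ValueError(f"{len(roots)} roots exceeds {MAX_FILTER_IDS}")
    return roots
```

Passing group IDs to the index instead of expanded folder IDs, as John hopes to do over time, would shrink the list this function returns.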
Damian Schenkelman: Yeah, and then you get into, depending on how long that takes, if it's a few milliseconds, maybe it's fine. If it's going to be three minutes until the change is propagated, you might have issues from a security perspective, and different systems have different trade-offs there.
John Huffaker: Yeah. We're just constantly like, which one is it in this case and thinking through all the implications of the normalization versus denormalization.
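The normalization versus denormalization trade-off can be illustrated with a toy model. In this sketch, both classes and their method names are hypothetical: the normalized index records one collab per folder and expands at read time, while the denormalized index stamps the user onto every file, so adding a collaborator costs one write per file in the folder.

```python
from collections import defaultdict
from typing import Dict, List, Set


class NormalizedIndex:
    """One collab record per folder; expansion happens at read time."""

    def __init__(self, files_by_folder: Dict[int, List[int]]) -> None:
        self.files_by_folder = files_by_folder
        self.folders_by_user: Dict[str, Set[int]] = defaultdict(set)

    def add_collaborator(self, user: str, folder: int) -> int:
        self.folders_by_user[user].add(folder)
        return 1  # a single write, regardless of folder size

    def can_view(self, user: str, file_id: int) -> bool:
        # Read-time expansion: scan every folder the user can reach.
        folders = self.folders_by_user.get(user, set())
        return any(file_id in self.files_by_folder[f] for f in folders)


class DenormalizedIndex:
    """Allowed users stored on every file; reads are a direct lookup."""

    def __init__(self, files_by_folder: Dict[int, List[int]]) -> None:
        self.files_by_folder = files_by_folder
        self.users_by_file: Dict[int, Set[str]] = defaultdict(set)

    def add_collaborator(self, user: str, folder: int) -> int:
        writes = 0
        for file_id in self.files_by_folder[folder]:
            self.users_by_file[file_id].add(user)  # re-index each file
            writes += 1
        return writes  # grows with the number of files in the folder

    def can_view(self, user: str, file_id: int) -> bool:
        return user in self.users_by_file.get(file_id, set())
```

With a three-file folder, `add_collaborator` returns 1 for the normalized index and 3 for the denormalized one, while both answer `can_view` identically. The cost is shifted between write time and read time.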
Damian Schenkelman: Yeah, I get it. Let's go look at some kind of the numbers behind this. So first, like for each authorization decision, what data are you using? Like we talked about a lot of data. Can you give us an idea, and then what the performance of authorization decisions typically is?
John Huffaker: Yeah, absolutely. I mean, Box in general, and these numbers are going to be nonspecific because I didn't run them by anyone, but we have hundreds of billions of documents that we're storing, like nearly an exabyte in scale of the actual underlying content, hundreds of thousands of enterprise customers, and then the customer size tends to follow standard power law where we have some that are over a hundred thousand users and then a lot of really, a lot smaller ones. I think our average customer size is like 30,000 users or something. And then just in terms of performance, it's highly variable. You can imagine based on how I described the authorization decisions, it can be very highly variable. Like for me, like when I go to my all files page which is my initial landing page and I'm kind of at the harder end of the scale. To just get the basic items, which is mostly just a permissions check on the items, because fetching that read back's not that expensive, it's kind of close to a second, which compared to some of those numbers you see from OPA and Rego is probably two or three orders of magnitude higher. But if you're a smaller customer or you don't have as complicated of a permission scheme, you're going to see lower numbers, but still probably in the hundreds of milliseconds range is probably our minimum there. But we do have like, I mean, our CEO ironically, he's been there for, I don't know what you said, 18 years? He's been there for the 18 years.
Damian Schenkelman: 18 years.
John Huffaker: Has all that data accumulated, and so his all files page experience, which he loves dearly, it can take him upwards of like three to four seconds to load, which is far from what we'd like it to be, and we're working on a lot of projects to drive that number down, but at some level, some of this data that we have to process just takes time to access and run through. In terms of overall throughput, we're dealing with millions of requests per second.
Damian Schenkelman: That's a lot of requests [inaudible] but [inaudible] like when you compare to some of the [inaudible] metrics from our framework saying this has been a system that's in prod for a lot of years, [inaudible], so it's not an apples-to-apples comparison. Let's just put it like that.
John Huffaker: Yeah, of course. Yeah. I mean, when you're serving an end user app, and we haven't really, I think we're going to maybe talk about integrations in a minute, but we have a lot of things that integrate with us, not least of which are our desktop clients, which do basically try to maintain some replica of your view on the file system on the local system, and so we are pushing events out, they're making calls back, and so those drive a tremendous amount of load. We also have third party products like CASB, virus scanners, things like that, that are just constantly hitting the APIs, ensuring that the data's protected. So, yeah, we have a lot of things going on. And then I ran the numbers. We're doing hundreds of thousands of permissions checks per second as part of that. Like anything that's accessing a file, there's some amount of a permission check going on in there. Yeah, so it's quite large scale. Makes it fun. It makes some things feasible, like as we're extracting the core stuff, we're doing a lot of parity testing: taking the old policy and the new policy, running requests through both, comparing the results, and ensuring they're the same. The scale makes those techniques more viable and more fruitful.
Damian Schenkelman: Yeah. I can imagine, again, any change you make, you have to maintain the same backwards compatibility let's say from a result perspective while at the same time making improvements, so it must be an exhaustive set of checks from multiple perspectives.
John Huffaker: Yes.
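The parity testing mentioned above (run the same requests through the old policy and the new policy, then compare results) can be sketched as follows. The two policy functions, the `OWNERS` table, and the request shape are hypothetical stand-ins. The new policy is deliberately made to differ for one file so the harness has something to catch.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Request = Tuple[str, int]  # (user, file_id)
Policy = Callable[[str, int], FrozenSet[str]]

# Hypothetical ownership data used by both toy policies.
OWNERS: Dict[int, str] = {1: "alice", 2: "bob"}


def old_policy(user: str, file_id: int) -> FrozenSet[str]:
    """Toy legacy policy: owners get full permissions."""
    if OWNERS.get(file_id) == user:
        return frozenset({"view", "edit", "delete"})
    return frozenset()


def new_policy(user: str, file_id: int) -> FrozenSet[str]:
    """Toy rewrite with an intentional regression on file 2."""
    perms = set(old_policy(user, file_id))
    if file_id == 2:
        perms.discard("delete")
    return frozenset(perms)


def parity_check(
    old: Policy, new: Policy, requests: Iterable[Request]
) -> List[dict]:
    """Run each request through both policies and report differences."""
    mismatches = []
    for user, file_id in requests:
        expected, actual = old(user, file_id), new(user, file_id)
        if expected != actual:
            mismatches.append(
                {"request": (user, file_id), "old": expected, "new": actual}
            )
    return mismatches
```

At hundreds of thousands of checks per second, replaying live traffic through a harness like this would cover a wide range of permission shapes quickly, which is presumably why scale makes the technique more fruitful.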
Damian Schenkelman: You mentioned apps, which, yeah, like we discussed we would touch on as part of the episode, and we mentioned them at the beginning, right? You started working on APIs for some of your partners. I know Box integrates with a lot of other apps. I would say typically kind of like [inaudible]. How does authorization work with apps? How does a typical Box partner, the Box application that uses Box and depends on the Box API work from an authorization perspective?
John Huffaker: It's largely an added layer, so the API came later after all the permissions policies firmed up, and so the way this tends to work is if you're going to build a third party integration or even a first party integration, even our own teams use this. You provision, we call it a service, but application's a better name. You provision an application, you get your client ID, client secret, and from there as part of that creation process, you specify what types of... We do have a different mode where the apps actually become a user within a particular enterprise, and those are more limited to the enterprises. But for your standard kind of client use case where it's doing something for a user, you specify what of that user's underlying permissions you need. Are you just going to be, are you an admin app and you're just managing users for the enterprise or are you a content processing app and you need to be able to access their full tree, their full folder tree, or are you something that only needs access to a particular file or folder? And then we send you through... I mean, we obviously on the backend as part of that process, you can't just publish that to our application repository. It goes through a bunch of internal reviews and then when we publish it, it's able to be used by people outside of your own local enterprise. But basically what happens is it's a standard... We allow a different sets in [inaudible] flow, but we use OAuth 2, and as part of that... I know there are other ways, like people use scopes, and this might be the standard way, but when people use scopes, they'll effectively pack that whole precomputed permissions process into the scopes, and then it's just a matter of checking whether the scope has access to that underlying file.
Given how dynamic our permissions are, what we actually put in the scope are more of that client's permissions on the underlying user's permissions, so it's kind of an allow list on the underlying permission, so it says this service when it's accessing a file can read or write all of their files or their whole folder tree. And so we basically check, hey, you accessed file 12345, we compute the permissions on it, does this user even have access to file 12345? And then when it comes up through the API response handling, we check, does this application have the ability to read or write that user's files? And if yes, then we will return the result or allow that API client to make that change. I mean, this is all done much more generically than how I'm talking about it, but that's the basic flow.
Damian Schenkelman: Yeah, yeah. I get what you mean. Actually, the first episode we did of the podcast back in season one was with Vittorio Bertocci who works with us, and we talked about how, for scenarios like these where authorization is dynamic because resources and permissions change and there's lots of them, scopes and embedding things in access tokens is not the way to go, so you kind of like, in Box's case, you're saying, hey, we have scopes as a higher level notion, which might be, hey, we are going to be managing users, but you're not going to say which users or I'm going to be reading or writing some folders. And then when you actually have to make the specific call, you just make sure that the access that the user consented to encompasses the specific thing so that if you are changing a folder's name, it would be, hey, can they write this folder's name, and did we actually get access so that this token will be able to edit folder names or whatever that is.
John Huffaker: Exactly.
Damian Schenkelman: Yeah, that makes sense. And here, where is the authorization decision made for the API decisions?
John Huffaker: Yeah, it's all done as part of that underlying permissions policy, I believe. Yeah. It's all done in that underlying permissions policy, and so basically as the last phase, it computes all the permissions, like I said, that you have access to, like can you view, can you comment, can you delete, can you edit? And then as a last step, it will check, like if it was a get API, it's going to check do you have view, and it'll take the application scope and use that as kind of an allow list on the underlying permissions. And if the scope was like, let's say manage users but they're trying to access a file via that app, it won't have view on that underlying file even if the user has access because you didn't grant that access to that application and it's all just kind of all happening in the same place.
Damian Schenkelman: Okay. Yeah. That makes sense. So essentially you reduced the precomputed set of callouts as we're calling that the user has on the file and then saying, okay, this API call actually requires this collab, so going back to, hey, is the scope the correct one, and does the user have the collab? Makes sense.
John Huffaker: Yeah. It's just sort of two phases. We compute the user's natural permission on the object, and then we basically filter out anything, any access the user didn't grant to the application as part of the OAuth 2 flow, because as part of the OAuth 2 flow, right, we send the user to a screen that says, are you okay allowing this application to read or write your files or manage your users, and so there's part of that whole flow that we have to maintain.
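The two phases can be sketched as a set intersection: first compute the user's natural permissions on the object, then keep only what the consented application scopes allow. The scope names, the permission table, and the HTTP-method mapping below are hypothetical illustrations, not Box's actual scope vocabulary or API.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical mapping from coarse OAuth 2 scopes to the file
# permissions each scope allows (the "allow list").
SCOPE_ALLOW_LIST: Dict[str, FrozenSet[str]] = {
    "read_write_files": frozenset({"view", "comment", "edit", "delete"}),
    "manage_users": frozenset(),  # grants nothing on files
}

# Hypothetical stand-in for phase one: the user's computed
# permissions on each file.
USER_PERMISSIONS: Dict[Tuple[str, int], FrozenSet[str]] = {
    ("alice", 12345): frozenset({"view", "comment", "edit"}),
}

# Hypothetical mapping from API method to the permission it requires.
REQUIRED_BY_METHOD: Dict[str, str] = {
    "GET": "view",
    "PUT": "edit",
    "DELETE": "delete",
}


def effective_permissions(
    user: str, file_id: int, scopes: Iterable[str]
) -> FrozenSet[str]:
    """Phase 1: natural permissions. Phase 2: filter by app scopes."""
    natural = USER_PERMISSIONS.get((user, file_id), frozenset())
    allowed = frozenset().union(
        *(SCOPE_ALLOW_LIST.get(s, frozenset()) for s in scopes)
    )
    return natural & allowed


def authorize(
    user: str, file_id: int, scopes: Iterable[str], method: str
) -> bool:
    """Check that the required permission survives both phases."""
    required = REQUIRED_BY_METHOD[method]
    return required in effective_permissions(user, file_id, scopes)
```

In this sketch an app holding only `manage_users` is denied a `GET` on file 12345 even though the user can view it, which mirrors the example John gives: the user never granted file access to that application.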
Damian Schenkelman: Yeah. Yeah, the auth consent.
John Huffaker: Yes, exactly.
Damian Schenkelman: Yeah. Okay. John, it's been, it's amazing, man. We did a very deep dive on Box, which is both a complex and large system, and I'm sure the listeners will really appreciate learning how a large, battle-tested, long-lived production SaaS deals with authorization. I really appreciate your time.
John Huffaker: Yeah, Damian. Thanks for having me. And if there are any listeners out there that are working on authorization systems that are interested in more detail, we're always open to conversation with open source developers or companies about what they're building.
Damian Schenkelman: Excellent. Thanks for the ask, and again, as John was saying, you should reach out to him if you're building something that might be useful for Box. That's it for today's episode of Authorization in Software. Thanks for tuning in and listening to us. If you enjoy the show, be sure to subscribe to the podcast on your preferred platform so you'll never miss an episode, and if you have any feedback or suggestions for future episodes, feel free to reach out to us on social media. We love hearing from our listeners. Keep building secure software, and we'll catch you on the next episode of Authorization in Software.
DESCRIPTION
In this episode of Authorization in Software, Damian Schenkelman sits down with John Huffaker, Distinguished Engineer at Box. They discuss how Box, a major file-sharing and collaboration platform, approaches authorization.
The conversation touches upon:
- The importance of security for a platform like Box which handles sensitive data for countless users and businesses.
- A look into the different layers of security, including application and infrastructure security.
- The challenges and solutions to ensure that Box remains impenetrable.
- A detailed overview of the multiple layers involved in making different kinds of authorization decisions, from viewing files and folders to understanding user permissions and API accesses.
- And more...
Tune in to get an inside look at the ways Box keeps its customers' data safe and the authorization mechanisms it employs to achieve this.