System Design Interview: An Insider’s GuideAll rights reserved. This book or any portion thereof may not be reproduced or used in anymanner whatsoever without the express written permission of the publisher except for the useof brief quotations in a book review.About the author:Alex Xu is an experienced software engineer and entrepreneur. Previously, he worked atTwitter, Apple, Zynga and Oracle. He received his M.S. from Carnegie Mellon University.He has a passion for designing and implementing complex systems.Please subscribe to our email list if you want to be notified when new chapters are available: more information, contact [email protected]: Paul Solomon


FORWARDWe are delighted that you have decided to join us in learning the system design interviews.System design interview questions are the most difficult to tackle among all the technicalinterviews. The questions require the interviewees to design an architecture for a softwaresystem, which could be a news feed, Google search, chat system, etc. These questions areintimidating, and there is no certain pattern to follow. The questions are usually very bigscoped and vague. The processes are open-ended and unclear without a standard or correctanswer.Companies widely adopt system design interviews because the communication and problemsolving skills tested in these interviews are similar to those required by a software engineer’sdaily work. An interviewee is evaluated based on how she analyzes a vague problem and howshe solves the problem step by step. The abilities tested also involve how she explains theidea, discusses with others, and evaluates and optimizes the system. In English, using “she”flows better than “he or she” or jumping between the two. To make reading easier, we use thefeminine pronoun throughout this book. No disrespect is intended for male engineers.The system design questions are open-ended. Just like in the real world, there are manydifferences and variations in the system. The desired outcome is to come up with anarchitecture to achieve system design goals. The discussions could go in different waysdepending on the interviewer. Some interviewers may choose high-level architecture to coverall aspects; whereas some might choose one or more areas to focus on. Typically, systemrequirements, constraints and bottlenecks should be well understood to shape the direction ofboth the interviewer and interviewee.The objective of this book is to provide a reliable strategy to approach the system designquestions. The right strategy and knowledge are vital to the success of an interview.This book provides solid knowledge in building a scalable system. The more knowledgegained from reading this book, the better you are equipped in solving the system designquestions.This book also provides a step by step framework on how to tackle a system design question.It provides many examples to illustrate the systematic approach with detailed steps that youcan follow. With constant practice, you will be well-equipped to tackle system designinterview questions.

CHAPTER 1: SCALE FROM ZERO TO MILLIONS OFUSERSDesigning a system that supports millions of users is challenging, and it is a journey thatrequires continuous refinement and endless improvement. In this chapter, we build a systemthat supports a single user and gradually scale it up to serve millions of users. After readingthis chapter, you will master a handful of techniques that will help you to crack the systemdesign interview questions.

Single server setupA journey of a thousand miles begins with a single step, and building a complex system is nodifferent. To start with something simple, everything is running on a single server. Figure 1-1shows the illustration of a single server setup where everything is running on one server: webapp, database, cache, etc.To understand this setup, it is helpful to investigate the request flow and traffic source. Let usfirst look at the request flow (Figure 1-2).

1. Users access websites through domain names, such as Usually, theDomain Name System (DNS) is a paid service provided by 3rd parties and not hosted byour servers.2. Internet Protocol (IP) address is returned to the browser or mobile app. In the example,IP address is returned.3. Once the IP address is obtained, Hypertext Transfer Protocol (HTTP) [1] requests aresent directly to your web server.4. The web server returns HTML pages or JSON response for rendering.Next, let us examine the traffic source. The traffic to your web server comes from twosources: web application and mobile application. Web application: it uses a combination of server-side languages (Java, Python, etc.) tohandle business logic, storage, etc., and client-side languages (HTML and JavaScript) forpresentation. Mobile application: HTTP protocol is the communication protocol between the mobileapp and the web server. JavaScript Object Notation (JSON) is commonly used APIresponse format to transfer data due to its simplicity. An example of the API response inJSON format is shown below:GET /users/12 – Retrieve user object for id 12

DatabaseWith the growth of the user base, one server is not enough, and we need multiple servers: onefor web/mobile traffic, the other for the database (Figure 1-3). Separating web/mobile traffic(web tier) and database (data tier) servers allows them to be scaled independently.Which databases to use?You can choose between a traditional relational database and a non-relational database. Letus examine their differences.Relational databases are also called a relational database management system (RDBMS) orSQL database. The most popular ones are MySQL, Oracle database, PostgreSQL, etc.Relational databases represent and store data in tables and rows. You can perform joinoperations using SQL across different database tables.Non-Relational databases are also called NoSQL databases. Popular ones are CouchDB,Neo4j, Cassandra, HBase, Amazon DynamoDB, etc. [2]. These databases are grouped intofour categories: key-value stores, graph stores, column stores, and document stores. Joinoperations are generally not supported in non-relational databases.For most developers, relational databases are the best option because they have been aroundfor over 40 years and historically, they have worked well. However, if relational databasesare not suitable for your specific use cases, it is critical to explore beyond relationaldatabases. Non-relational databases might be the right choice if: Your application requires super-low latency. Your data are unstructured, or you do not have any relational data. You only need to serialize and deserialize data (JSON, XML, YAML, etc.). You need to store a massive amount of data.

Vertical scaling vs horizontal scalingVertical scaling, referred to as “scale up”, means the process of adding more power (CPU,RAM, etc.) to your servers. Horizontal scaling, referred to as “scale-out”, allows you to scaleby adding more servers into your pool of resources.When traffic is low, vertical scaling is a great option, and the simplicity of vertical scaling isits main advantage. Unfortunately, it comes with serious limitations. Vertical scaling has a hard limit. It is impossible to add unlimited CPU and memory to asingle server. Vertical scaling does not have failover and redundancy. If one server goes down, thewebsite/app goes down with it completely.Horizontal scaling is more desirable for large scale applications due to the limitations ofvertical scaling.In the previous design, users are connected to the web server directly. Users will unable toaccess the website if the web server is offline. In another scenario, if many users access theweb server simultaneously and it reaches the web server’s load limit, users generallyexperience slower response or fail to connect to the server. A load balancer is the besttechnique to address these problems.

Load balancerA load balancer evenly distributes incoming traffic among web servers that are defined in aload-balanced set. Figure 1-4 shows how a load balancer works.As shown in Figure 1-4, users connect to the public IP of the load balancer directly. With thissetup, web servers are unreachable directly by clients anymore. For better security, privateIPs are used for communication between servers. A private IP is an IP address reachable onlybetween servers in the same network; however, it is unreachable over the internet. The loadbalancer communicates with web servers through private IPs.In Figure 1-4, after a load balancer and a second web server are added, we successfullysolved no failover issue and improved the availability of the web tier. Details are explainedbelow: If server 1 goes offline, all the traffic will be routed to server 2. This prevents the websitefrom going offline. We will also add a new healthy web server to the server pool tobalance the load. If the website traffic grows rapidly, and two servers are not enough to handle the traffic,the load balancer can handle this problem gracefully. You only need to add more serversto the web server pool, and the load balancer automatically starts to send requests to them.Now the web tier looks good, what about the data tier? The current design has one database,

so it does not support failover and redundancy. Database replication is a common techniqueto address those problems. Let us take a look.

Database replicationQuoted from Wikipedia: “Database replication can be used in many database managementsystems, usually with a master/slave relationship between the original (master) and the copies(slaves)” [3].A master database generally only supports write operations. A slave database gets copies ofthe data from the master database and only supports read operations. All the data-modifyingcommands like insert, delete, or update must be sent to the master database. Mostapplications require a much higher ratio of reads to writes; thus, the number of slavedatabases in a system is usually larger than the number of master databases. Figure 1-5 showsa master database with multiple slave databases.Advantages of database replication: Better performance: In the master-slave model, all writes and updates happen in masternodes; whereas, read operations are distributed across slave nodes. This model improvesperformance because it allows more queries to be processed in parallel. Reliability: If one of your database servers is destroyed by a natural disaster, such as atyphoon or an earthquake, data is still preserved. You do not need to worry about data lossbecause data is replicated across multiple locations. High availability: By replicating data across different locations, your website remains in

operation even if a database is offline as you can access data stored in another databaseserver.In the previous section, we discussed how a load balancer helped to improve systemavailability. We ask the same question here: what if one of the databases goes offline? Thearchitectural design discussed in Figure 1-5 can handle this case: If only one slave database is available and it goes offline, read operations will be directedto the master database temporarily. As soon as the issue is found, a new slave databasewill replace the old one. In case multiple slave databases are available, read operations areredirected to other healthy slave databases. A new database server will replace the old one. If the master database goes offline, a slave database will be promoted to be the newmaster. All the database operations will be temporarily executed on the new masterdatabase. A new slave database will replace the old one for data replication immediately.In production systems, promoting a new master is more complicated as the data in a slavedatabase might not be up to date. The missing data needs to be updated by running datarecovery scripts. Although some other replication methods like multi-masters and circularreplication could help, those setups are more complicated; and their discussions arebeyond the scope of this book. Interested readers should refer to the listed referencematerials [4] [5].Figure 1-6 shows the system design after adding the load balancer and database replication.

Let us take a look at the design: A user gets the IP address of the load balancer from DNS. A user connects the load balancer with this IP address. The HTTP request is routed to either Server 1 or Server 2. A web server reads user data from a slave database. A web server routes any data-modifying operations to the master database. This includeswrite, update, and delete operations.Now, you have a solid understanding of the web and data tiers, it is time to improve theload/response time. This can be done by adding a cache layer and shifting static content(JavaScript/CSS/image/video files) to the content delivery network (CDN).

CacheA cache is a temporary storage area that stores the result of expensive responses or frequentlyaccessed data in memory so that subsequent requests are served more quickly. As illustratedin Figure 1-6, every time a new web page loads, one or more database calls are executed tofetch data. The application performance is greatly affected by calling the database repeatedly.The cache can mitigate this problem.Cache tierThe cache tier is a temporary data store layer, much faster than the database. The benefits ofhaving a separate cache tier include better system performance, ability to reduce databaseworkloads, and the ability to scale the cache tier independently. Figure 1-7 shows a possiblesetup of a cache server:After receiving a request, a web server first checks if the cache has the available response. Ifit has, it sends data back to the client. If not, it queries the database, stores the response incache, and sends it back to the client. This caching strategy is called a read-through cache.Other caching strategies are available depending on the data type, size, and access patterns. Aprevious study explains how different caching strategies work [6].Interacting with cache servers is simple because most cache servers provide APIs forcommon programming languages. The following code snippet shows typical MemcachedAPIs:Considerations for using cacheHere are a few considerations for using a cache system: Decide when to use cache. Consider using cache when data is read frequently butmodified infrequently. Since cached data is stored in volatile memory, a cache server isnot ideal for persisting data. For instance, if a cache server restarts, all the data in memoryis lost. Thus, important data should be saved in persistent data stores. Expiration policy. It is a good practice to implement an expiration policy. Once cacheddata is expired, it is removed from the cache. When there is no expiration policy, cacheddata will be stored in the memory permanently. It is advisable not to make the expirationdate too short as this will cause the system to reload data from the database too frequently.Meanwhile, it is advisable not to make the expiration date too long as the data can becomestale. Consistency: This involves keeping the data store and the cache in sync. Inconsistencycan happen because data-modifying operations on the data store and cache are not in asingle transaction. When scaling across multiple regions, maintaining consistency between

the data store and cache is challenging. For further details, refer to the paper titled“Scaling Memcache at Facebook” published by Facebook [7]. Mitigating failures: A single cache server represents a potential single point of failure(SPOF), defined in Wikipedia as follows: “A single point of failure (SPOF) is a part of asystem that, if it fails, will stop the entire system from working” [8]. As a result, multiplecache servers across different data centers are recommended to avoid SPOF. Anotherrecommended approach is to overprovision the required memory by certain percentages.This provides a buffer as the memory usage increases. Eviction Policy: Once the cache is full, any requests to add items to the cache mightcause existing items to be removed. This is called cache eviction. Least-recently-used(LRU) is the most popular cache eviction policy. Other eviction policies, such as the LeastFrequently Used (LFU) or First in First Out (FIFO), can be adopted to satisfy different usecases.

Content delivery network (CDN)A CDN is a network of geographically dispersed servers used to deliver static content. CDNservers cache static content like images, videos, CSS, JavaScript files, etc.Dynamic content caching is a relatively new concept and beyond the scope of this book. Itenables the caching of HTML pages that are based on request path, query strings, cookies,and request headers. Refer to the article mentioned in reference material [9] for more aboutthis. This book focuses on how to use CDN to cache static content.Here is how CDN works at the high-level: when a user visits a website, a CDN server closestto the user will deliver static content. Intuitively, the further users are from CDN servers, theslower the website loads. For example, if CDN servers are in San Francisco, users in LosAngeles will get content faster than users in Europe. Figure 1-9 is a great example that showshow CDN improves load time.Figure 1-10 demonstrates the CDN workflow.1. User A tries to get image.png by using an image URL. The URL’s domain is providedby the CDN provider. The following two image URLs are samples used to demonstratewhat image URLs look like on Amazon and Akamai CDNs: pg2. If the CDN server does not have image.png in the cache, the CDN server requests thefile from the origin, which can be a web server or online storage like Amazon S3.3. The origin returns image.png to the CDN server, which includes optional HTTP headerTime-to-Live (TTL) which describes how long the image is cached.

4. The CDN caches the image and returns it to User A. The image remains cached in theCDN until the TTL expires.5. User B sends a request to get the same image.6. The image is returned from the cache as long as the TTL has not expired.Considerations of using a CDN Cost: CDNs are run by third-party providers, and you are charged for data transfers inand out of the CDN. Caching infrequently used assets provides no significant benefits soyou should consider moving them out of the CDN. Setting an appropriate cache expiry: For time-sensitive content, setting a cache expirytime is important. The cache expiry time should neither be too long nor too short. If it istoo long, the content might no longer be fresh. If it is too short, it can cause repeatreloading of content from origin servers to the CDN. CDN fallback: You should consider how your website/application copes with CDNfailure. If there is a temporary CDN outage, clients should be able to detect the problemand request resources from the origin. Invalidating files: You can remove a file from the CDN before it expires by performingone of the following operations: Invalidate the CDN object using APIs provided by CDN vendors. Use object versioning to serve a different version of the object. To version an object,you can add a parameter to the URL, such as a version number. For example, versionnumber 2 is added to the query string: image.png?v 2.Figure 1-11 shows the design after the CDN and cache are added.

1. Static assets (JS, CSS, images, etc.,) are no longer served by web servers. They arefetched from the CDN for better performance.2. The database load is lightened by caching data.

Stateless web tierNow it is time to consider scaling the web tier horizontally. For this, we need to move state(for instance user session data) out of the web tier. A good practice is to store session data inthe persistent storage such as relational database or NoSQL. Each web server in the clustercan access state data from databases. This is called stateless web tier.Stateful architectureA stateful server and stateless server has some key differences. A stateful server remembersclient data (state) from one request to the next. A stateless server keeps no state information.Figure 1-12 shows an example of a stateful architecture.In Figure 1-12, user A’s session data and profile image are stored in Server 1. To authenticateUser A, HTTP requests must be routed to Server 1. If a request is sent to other servers likeServer 2, authentication would fail because Server 2 does not contain User A’s session data.Similarly, all HTTP requests from User B must be routed to Server 2; all requests from UserC must be sent to Server 3.The issue is that every request from the same client must be routed to the same server. Thiscan be done with sticky sessions in most load balancers [10]; however, this adds theoverhead. Adding or removing servers is much more difficult with this approach. It is alsochallenging to handle server failures.Stateless architectureFigure 1-13 shows the stateless architecture.

In this stateless architecture, HTTP requests from users can be sent to any web servers, whichfetch state data from a shared data store. State data is stored in a shared data store and keptout of web servers. A stateless system is simpler, more robust, and scalable.Figure 1-14 shows the updated design with a stateless web tier.

In Figure 1-14, we move the session data out of the web tier and store them in the persistentdata store. The shared data store could be a relational database, Memcached/Redis, NoSQL,etc. The NoSQL data store is chosen as it is easy to scale. Autoscaling means adding orremoving web servers automatically based on the traffic load. After the state data is removedout of web servers, auto-scaling of the web tier is easily achieved by adding or removingservers based on traffic load.Your website grows rapidly and attracts a significant number of users internationally. Toimprove availability and provide a better user experience across wider geographical areas,supporting multiple data centers is crucial.

Data centersFigure 1-15 shows an example setup with two data centers. In normal operation, users aregeoDNS-routed, also known as geo-routed, to the closest data center, with a split traffic ofx% in US-East and (100 – x)% in US-West. geoDNS is a DNS service that allows domainnames to be resolved to IP addresses based on the location of a user.In the event of any significant data center outage, we direct all traffic to a healthy data center.In Figure 1-16, data center 2 (US-West) is offline, and 100% of the traffic is routed to datacenter 1 (US-East).

Several technical challenges must be resolved to achieve multi-data center setup: Traffic redirection: Effective tools are needed to direct traffic to the correct data center.GeoDNS can be used to direct traffic to the nearest data center depending on where a useris located. Data synchronization: Users from different regions could use different local databases orcaches. In failover cases, traffic might be routed to a data center where data is unavailable.A common strategy is to replicate data across multiple data centers. A previous studyshows how Netflix implements asynchronous multi-data center replication [11]. Test and deployment: With multi-data center setup, it is important to test yourwebsite/application at different locations. Automated deployment tools are vital to keepservices consistent through all the data centers [11].To further scale our system, we need to decouple different components of the system so theycan be scaled independently. Messaging queue is a key strategy employed by many realworld distributed systems to solve this problem.

Message queueA message queue is a durable component, stored in memory, that supports asynchronouscommunication. It serves as a buffer and distributes asynchronous requests. The basicarchitecture of a message queue is simple. Input services, called producers/publishers, createmessages, and publish them to a message queue. Other services or servers, calledconsumers/subscribers, connect to the queue, and perform actions defined by the messages.The model is shown in Figure 1-17.Decoupling makes the message queue a preferred architecture for building a scalable andreliable application. With the message queue, the producer can post a message to the queuewhen the consumer is unavailable to process it. The consumer can read messages from thequeue even when the producer is unavailable.Consider the following use case: your application supports photo customization, includingcropping, sharpening, blurring, etc. Those customization tasks take time to complete. InFigure 1-18, web servers publish photo processing jobs to the message queue. Photoprocessing workers pick up jobs from the message queue and asynchronously perform photocustomization tasks. The producer and the consumer can be scaled independently. When thesize of the queue becomes large, more workers are added to reduce the processing time.However, if the queue is empty most of the time, the number of workers can be reduced.

Logging, metrics, automationWhen working with a small website that runs on a few servers, logging, metrics, andautomation support are good practices but not a necessity. However, now that your site hasgrown to serve a large business, investing in those tools is essential.Logging: Monitoring error logs is important because it helps to identify errors and problemsin the system. You can monitor error logs at per server level or use tools to aggregate them toa centralized service for easy search and viewing.Metrics: Collecting different types of metrics help us to gain business insights and understandthe health status of the system. Some of the following metrics are useful: Host level metrics: CPU, Memory, disk I/O, etc. Aggregated level metrics: for example, the performance of the entire database tier, cachetier, etc. Key business metrics: daily active users, retention, revenue, etc.Automation: When a system gets big and complex, we need to build or leverage automationtools to improve productivity. Continuous integration is a good practice, in which each codecheck-in is verified through automation, allowing teams to detect problems early. Besides,automating your build, test, deploy process, etc. could improve developer productivitysignificantly.Adding message queues and different toolsFigure 1-19 shows the updated design. Due to the space constraint, only one data center isshown in the figure.1. The design includes a message queue, which helps to make the system more looselycoupled and failure resilient.2. Logging, monitoring, metrics, and automation tools are included.

As the data grows every day, your database gets more overloaded. It is time to scale the datatier.

Database scalingThere are two broad approaches for database scaling: vertical scaling and horizontal scaling.Vertical scalingVertical scaling, also known as scaling up, is the scaling by adding more power (CPU, RAM,DISK, etc.) to an existing machine. There are some powerful database servers. According toAmazon Relational Database Service (RDS) [12], you can get a database server with 24 TBof RAM. This kind of powerful database server could store and handle lots of data. Forexample, in 2013 had over 10 million monthly unique visitors, but it onlyhad 1 master database [13]. However, vertical scaling comes with some serious drawbacks: You can add more CPU, RAM, etc. to your database server, but there are hardwarelimits. If you have a large user base, a single server is not enough. Greater risk of single point of failures. The overall cost of vertical scaling is high. Powerful servers are much more expensive.Horizontal scalingHorizontal scaling, also known as sharding, is the practice of adding more servers. Figure 120 compares vertical scaling with horizontal scaling.Sharding separates large databases into smaller, more easily managed parts called shards.Each shard shares the same schema, though the actual data on each shard is unique to theshard.Figure 1-21 shows an example of sharded databases. User data is allocated to a databaseserver based on user IDs. Anytime you access data, a hash function is used to find thecorresponding shard. In our example, user id % 4 is used as the hash function. If the result

equals to 0, shard 0 is used to store and fetch data. If the result equals to 1, shard 1 is used.The same logic applies to other shards.Figure 1-22 shows the user table in sharded databases.The most important factor to consider when implementing a sharding strategy is the choice ofthe sharding key. Sharding key (known as a partition key) consists of one or more columnsthat determine how data is distributed. As shown in Figure 1-22, “user id” is the shardingkey. A sharding key allows you to retrieve and modify data efficiently by routing databasequeries to the correct database. When choosing a sharding key, one of the most important

criteria is to choose a key that can evenly distributed data.Sharding is a great technique to scale the database but it is far from a perfect solution. Itintroduces complexities and

System design interview questions are the most difficult to tackle among all the technical interviews. The questions require the interviewees to design an architecture for a software system, which could be a news feed, Google search, chat system, etc. These questions are