25+ Best Distributed Systems Interview Questions & Answers


If you are looking for distributed systems interview questions and answers, look no further. Our team has put together the absolute best resource for you to prepare as a software engineer. As software organizations grow and serve larger and larger user bases, distributed computing becomes an important concept for designing systems that can scale with their requirements. Most highly scaled applications like Instagram, Facebook, YouTube, or Twitter rely heavily on distributed computing to make sure they are reliable, available, and performant. Because it is such an integral part of application system design, interview questions about distributed computing are central to many engineering and DevOps hiring processes, especially those at the Senior Engineer or Architect level.

What Is a Distributed System?

A distributed system is essentially a group of independent computers that are linked together by a single network. These groups of computers work together to appear as a single computer to the end user.

These computers have a shared state and operate concurrently. They can fail independently without affecting the entire system’s uptime, making distributed systems architecture a sort of failsafe.

Preparing For Your Interview With This Guide

The questions below are meant to assess a candidate’s practical understanding of the benefits distributed computing provides to large scale applications, and how some of these concepts are applied in a typical engineering organization.

27 Best Distributed Systems Interview Questions & Answers

1. What is distributed computing? Give a real-world example of distributed computing and why it qualifies.

Distributed computing is the use of distributed systems in computer science. A distributed system is any system made up of components that run on different computers and are networked together.

A blockchain system like the Bitcoin cryptocurrency is one example of a distributed system. In a traditional banking system, the ledger of all existing credits and debits exists on a centralized computer owned by the bank. A cryptocurrency uses a blockchain to maintain a distributed ledger that exists on many different computers at once. Whenever a credit or debit is written to the ledger, it is propagated to every copy of the ledger in the blockchain network.

2. What are microservices in an application? How is an application built using microservices different than a monolithic application?

Microservices are small applications, usually running on their own infrastructure or within a virtual machine, each of which is responsible for one segment of a larger application’s business requirements. A microservice may have its own processing, memory, data storage, and APIs.

Imagine the opposite: a monolithic application which exists on one centralized piece of infrastructure. Our imaginary monolithic application is responsible for all or most of the application’s business requirements. For example, looking at the database models of our imaginary application, a developer would likely see references to multiple entities in the system, e.g. Users, Organizations, Surveys, Subscriptions, and Teams. Our monolithic application exposes a large REST API, where clients can interact with any of these entities.

There are many ways to build a microservice architecture, but breaking such an application into microservices commonly means building a separate application for each entity, each of which may run on its own infrastructure. For example, a Users Application, Surveys Application, Subscriptions Application, and Teams Application could each expose its own smaller REST API for interacting with its entity.

3. How do components communicate in a distributed system?

Components in a distributed system communicate via message passing. Each component has a contract that defines which messages it can receive and interpret, and which messages it sends in response. Documenting these contracts is important because people working on one component can interact with the wider system simply by following them. Teams responsible for the internals of different components can each work “safely”, so long as they do not change the contract of their own microservice.
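To make the idea of a contract concrete, here is a minimal Python sketch. The names CreateUserRequest, CreateUserResponse, and UsersComponent are hypothetical and exist only for illustration. The two dataclasses play the role of the documented contract, while everything inside UsersComponent is an internal detail that its team could change freely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateUserRequest:
    """Message the (hypothetical) Users component accepts."""
    email: str
    display_name: str


@dataclass(frozen=True)
class CreateUserResponse:
    """Message the Users component promises to send back."""
    user_id: int
    email: str


class UsersComponent:
    """Internals may change as long as handle() keeps honoring the contract."""

    def __init__(self):
        self._next_id = 1
        self._emails = {}  # internal storage, not part of the contract

    def handle(self, message):
        # Reject anything that is not part of the documented contract.
        if not isinstance(message, CreateUserRequest):
            raise TypeError(f"unsupported message: {type(message).__name__}")
        user_id = self._next_id
        self._next_id += 1
        self._emails[user_id] = message.email
        return CreateUserResponse(user_id=user_id, email=message.email)
```

Other teams only need to know the two message types. Swapping the in-memory dictionary for a database would not affect them.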

4. What is a way that two components in a distributed system might pass messages?

A common way for components in a distributed application to communicate is via HTTP. HTTP is the protocol your web browser uses to send data to and from web pages on the internet. Within a distributed application, different servers can use HTTP in the same way to exchange information among themselves and bring a full application to life. RESTful APIs are one common convention for how message passing can be accomplished between components in a web application.

Other protocols can also achieve message passing between components. For example, Thrift and gRPC use compact binary encodings, which typically gives them higher throughput than JSON over plain HTTP. Thrift can run directly over TCP without HTTP, while gRPC is built on HTTP/2. This makes them potentially good options for applications where the amount of traffic and communication between components is expected to be very high.
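The HTTP style of message passing can be sketched with nothing but the Python standard library. This is an illustration under assumptions, not a production setup: the /users/1 path, the JSON payload, and the UsersHandler name are made up, and both "components" run in one process on the local machine.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class UsersHandler(BaseHTTPRequestHandler):
    """Stand-in for a Users component that answers GET /users/1 with JSON."""

    def do_GET(self):
        if self.path == "/users/1":
            body = json.dumps({"id": 1, "name": "Ada"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def fetch_user_over_http():
    """Start the server on a free local port, call it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), UsersHandler)  # port 0: OS picks a port
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # The "other component": a plain HTTP client sending a GET request.
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/users/1")
        response = conn.getresponse()
        data = json.loads(response.read().decode("utf-8"))
        conn.close()
        return data
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_user_over_http() returns the dictionary that the server encoded as JSON, which is the whole message exchange in miniature.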

5. Describe one of the primary differences between parallel computing and distributed computing?

A major difference between parallel computing and distributed computing is that in a parallel computing system each component of the system has access to shared memory. In distributed computing, there is no shared memory and each component must communicate information about its internal state via message passing.
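The contrast can be sketched in Python. As a simplifying assumption, both versions use threads inside one process: the first mimics the parallel model by updating a shared variable, and the second mimics the distributed model by letting workers share nothing and report results as messages. In a real distributed system the queue would be replaced by the network. The function names are hypothetical.

```python
import queue
import threading


def sum_with_shared_memory(chunks):
    """Parallel style: workers update one shared variable guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # shared state needs synchronization
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_with_message_passing(chunks):
    """Distributed style: workers share nothing and send results as messages."""
    inbox = queue.Queue()

    def worker(worker_id, chunk):
        # The only way to expose internal state is to send a message.
        inbox.put({"worker": worker_id, "partial_sum": sum(chunk)})

    threads = [
        threading.Thread(target=worker, args=(i, c)) for i, c in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # One message per worker is waiting in the inbox at this point.
    return sum(inbox.get()["partial_sum"] for _ in chunks)
```

Both functions compute the same total. What differs is how the workers coordinate.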

6. Imagine you have a microservices architecture where one microservice is responsible for Users and a separate microservice is responsible for Wallets. These microservices use REST to communicate. While working on a requirement, an engineer on your team recommends the use of transactions so that: 1) when a User is created, a corresponding Wallet must be created 2) if the creation of the Wallet fails, the creation of the User should be rolled back to prevent incomplete data. What is the main issue with this solution?

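For the User creation to roll back when the Wallet creation later fails, the User microservice would have to hold information about the state of the Wallet service while the operation is in flight. RESTful APIs are meant to be stateless, so each request should stand on its own, and a transaction that spans both services breaks that constraint. In addition, each microservice typically owns its own data store, so a single database transaction cannot cover both writes.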

7. In the scenario described in (6), what is one way you might accomplish a similar behavior?

The concept of eventual consistency could be useful for this problem. Eventual consistency means that after some amount of time N, the entire distributed system will be in its correct state. For example, if the Wallet creation were to fail after the User creation, at time (N-1) the User creation will not have been rolled back; however, at time N the creation will have been rolled back. To accomplish this, individual microservices can publish events to a message queue, and other services can monitor the message queue to invoke business logic for operations like User and Wallet creation. In our case, for example, one naive architecture might be that the User service publishes “User created” messages to a message queue that the Wallet service later reads to invoke its own creation process. The Wallet ←→ User relationship will be eventually consistent.
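The naive architecture described above can be simulated in a few lines of Python. Everything here is a hypothetical sketch: the in-memory MessageQueue stands in for a real broker, the event names are invented, and the fail_for parameter exists only to force a Wallet failure for the demonstration.

```python
from collections import deque


class MessageQueue:
    """In-memory stand-in for a real message broker."""

    def __init__(self):
        self._events = deque()

    def publish(self, event):
        self._events.append(event)

    def poll(self):
        return self._events.popleft() if self._events else None


class UserService:
    def __init__(self, mq):
        self.mq = mq
        self.users = {}

    def create_user(self, user_id, name):
        self.users[user_id] = name
        self.mq.publish({"type": "user_created", "user_id": user_id})

    def on_event(self, event):
        # Compensating action: undo the user if its wallet could not be created.
        if event["type"] == "wallet_creation_failed":
            self.users.pop(event["user_id"], None)


class WalletService:
    def __init__(self, mq, fail_for=()):
        self.mq = mq
        self.wallets = {}
        self.fail_for = set(fail_for)  # user ids whose wallet creation fails

    def on_event(self, event):
        if event["type"] != "user_created":
            return
        user_id = event["user_id"]
        if user_id in self.fail_for:
            self.mq.publish({"type": "wallet_creation_failed", "user_id": user_id})
        else:
            self.wallets[user_id] = 0  # new wallet with a zero balance


def drain(mq, services):
    """Deliver queued events to every service until the queue is empty."""
    event = mq.poll()
    while event is not None:
        for service in services:
            service.on_event(event)
        event = mq.poll()
```

Right after create_user returns, a user whose wallet will fail still exists, which is the state at time N-1. Once drain has delivered every event, that user has been removed and the two services agree again.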

8. What is a load balancer? Describe why load balancing is an important piece of many distributed systems.

A load balancer works to spread traffic across a number of different servers in order to make applications more responsive and available. For example, imagine 1 million users make requests to your application simultaneously. By putting a load balancer between your application and the users who are making requests, you can gain the benefit of having redundant versions of your application running on separate machines. If you had 5 machines running your application, the load balancer might send 200,000 of the 1,000,000 requests to each of the 5 machines, rather than having all of the requests processed by one machine.

A load balancer can also keep track of the status of the various machines running in your distributed system. For example, if a server fails and is not responding to requests (or perhaps it is responding with more errors than the other servers), the load balancer can stop sending requests to that particular server. Depending on the configuration, a load balancer might be responsible for spinning up new servers to handle more traffic, or for killing and replacing servers which seem to have failed.

Load balancing is important because it is an effective tool for keeping an application highly available. Even if there is a power outage and some computers in your distributed system are not available, a load balancer combined with duplicate instances of your application can ensure your application is still available for users.
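A round-robin strategy with basic health tracking is one simple way to get the behavior described in this answer. The sketch below is an assumption-laden illustration: RoundRobinLoadBalancer, mark_down, and mark_up are hypothetical names, server names are assumed to be unique, and real load balancers use richer health checks and routing policies.

```python
import itertools


class RoundRobinLoadBalancer:
    """Hands out servers in turn, skipping any that are marked unhealthy."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._cycle = itertools.cycle(self._servers)

    def mark_down(self, server):
        # Called when a health check fails; the server stops receiving traffic.
        self._healthy.discard(server)

    def mark_up(self, server):
        # Called when a known server recovers.
        if server in self._servers:
            self._healthy.add(server)

    def pick(self):
        """Return the next healthy server for an incoming request."""
        if not self._healthy:
            raise RuntimeError("no healthy servers available")
        # At least one server is healthy, so one full pass always finds it.
        for _ in range(len(self._servers)):
            server = next(self._cycle)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

With 5 servers and 1,000 requests (scaled down from the 1,000,000 in the example above), each server receives exactly 200 requests. After mark_down is called for one server, the remaining four absorb its share.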

10. How can distributed computing help large software teams be productive or organized?

Using distributed computing can be useful for large software teams, because the larger team can be organized into smaller units while reducing the risk that any given team introduces breaking changes to the wider system. Because distributed systems rely on message passing to communicate, and focus on high availability and performance of individual components, smaller teams can define a set of abstractions for only their part of the system. For example, the Surveys team and the Users team might define their own REST APIs for communicating with their respective components. Once these abstractions are defined, team members need to be less concerned about their changes impacting other pieces of the software. As long as the abstractions they have defined for their components do not change (e.g. you always retrieve a list of users from the GET /api/v2/users REST endpoint), the internal business logic can be changed with a low amount of risk.

11. What is fault tolerance in the context of a distributed system?

Fault tolerance describes the distributed system’s ability to continue to function in the event of a partial failure. Faults are most often caused by hardware or software problems, or by malicious actors. Analyzing and increasing the fault tolerance of a distributed system involves solving for these factors.

12. How can data replication help improve the fault tolerance of a distributed system?

Data replication is the storage of replica data sets at a number of locations that can be interchangeably accessed by the application. If one data source encounters a hardware or software problem, the replica data set can be accessed quickly, leading to little risk of downtime for users. This increases the availability of the application thereby making it more fault tolerant.
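A minimal Python sketch of this idea follows. The Replica and ReplicatedStore classes are hypothetical, the replicas are plain in-memory dictionaries, and writes are copied naively to every reachable replica. One limitation to keep in mind: a replica that was down during a write would hold stale data after it recovers, which real systems handle with resynchronization.

```python
class Replica:
    """One copy of the data set; `available` simulates a hardware or software fault."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.available = True

    def write(self, key, value):
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class ReplicatedStore:
    """Writes to every reachable replica and reads from the first one that answers."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, key, value):
        written = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                written += 1
            except ConnectionError:
                continue  # skip replicas that are currently down
        if written == 0:
            raise ConnectionError("no replica accepted the write")

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ConnectionError:
                continue  # fail over to the next replica
        raise ConnectionError("all replicas are unavailable")
```

If the first replica fails after a write, a read still succeeds because the store falls back to the second copy.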

13. What exactly is a distributed system?

A distributed system is a collection of independent computers linked together by a network. They appear to their end users as a single coherent system. In a distributed system, components located within the network communicate and organize their actions by passing messages.

14. What are the characteristics of a distributed system?

Distributed systems have a handful of core differentiating characteristics. In such a system, programs are executed concurrently, there is no global time, and components can fail independently without causing a full system failure or crash.
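Because there is no global time, components cannot order events by comparing wall clocks. One well-known workaround, which the answer above does not name, is a Lamport logical clock: each process keeps a counter and advances it on every local event and every message. The sketch below is a minimal illustration, and the class and method names are hypothetical.

```python
class LamportClock:
    """Logical clock that orders events without relying on a shared wall clock."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event advances the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending counts as an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, message_time):
        # Jump past both our own time and the sender's timestamp.
        self.time = max(self.time, message_time) + 1
        return self.time
```

The receiver's clock always ends up ahead of the timestamp on the message it received, so a send is always ordered before its matching receive.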

15. What are some examples of distributed systems used today?

Some examples include the Internet, intranets, and mobile or ubiquitous computing.

16. What are some disadvantages of distributed systems?

No system architecture is perfect, so distributed systems definitely have their downsides. The downsides of developing distributed software include networking problems and security problems, though a good team of technicians and developers can usually tackle common issues quickly and efficiently.

17. What are the main differences between mobile and ubiquitous computing?

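Mobile computing is about carrying out computing tasks while the user is on the move, using portable devices such as phones and laptops that can make use of nearby resources such as printers. Ubiquitous computing is tied to a single environment, such as a home or a hospital, where many small computing devices are built into the surroundings.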

18. Why do we need openness?

The degree to which a complex computer system can be extended and re-implemented depends on openness.

19. What are some security mechanisms that are used in distributed computing?

There are several security mechanisms used, including encryption, authentication (passwords, public key authentication, etc.), and authorization (access control lists).

20. How does one provide security through a distributed system?

Confidentiality means protection against disclosure to unauthorized individuals, and can be implemented with mechanisms such as access control lists that restrict sensitive information to authorized users. Integrity means protection against alteration or corruption of data. Availability means protection against interference with access to resources, for example by blocking denial of service (DoS) attacks.

Proof of sending and receiving information can be established through the use of digital signatures.
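Integrity protection can be illustrated with Python's standard hmac module. Note the assumption: an HMAC relies on a secret key shared by sender and receiver, so it detects alteration but is not a digital signature and does not by itself prove who sent a message. The helper names make_tag and is_authentic are hypothetical.

```python
import hashlib
import hmac


def make_tag(key, message):
    """Compute an HMAC-SHA256 tag that the sender attaches to the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_authentic(key, message, tag):
    """Recompute the tag on the receiving side and compare in constant time."""
    return hmac.compare_digest(make_tag(key, message), tag)
```

A receiver that holds the same key accepts the original message and rejects any copy that was changed in transit, because the recomputed tag no longer matches.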

21. What is scalability?

Scalability is the ability of a distributed system to work efficiently at a range of different scales, from a small intranet to the whole Internet. Designing scalable distributed systems brings several challenges: controlling the cost of physical resources, keeping that cost from growing faster than linearly with system size, and limiting performance loss as the system grows. These challenges can take some work to tackle and often require a team.

22. What are the different types of system models?

The main system models are the architectural model and the fundamental models. The fundamental models include the interaction model, the failure model, and the security model.

23. Why is Middleware used in a distributed system?

Middleware is essentially a layer of software with the sole purposes of masking heterogeneity and providing a convenient and usable programming model to application programmers and developers. Middleware is represented by processes and objects in a set of computers that interact with each other in order to implement communication and resource sharing support for distributed applications in the system.

24. What is a protocol?

By definition, a protocol is used to refer to a common set of rules and formats that are used for communication between processes in order to perform a specific task.

The definition of a protocol has two vital parts: a specification of the sequence of messages that must be exchanged, and a specification of the format of the data in the messages.
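Both parts of that definition can be shown with a toy protocol in Python. Everything about it is invented for illustration: the format is a 1-byte message type followed by a 4-byte big-endian payload length and the payload, and the sequence rule is that a HELLO message must arrive before any DATA message.

```python
import struct

# Format specification: 1-byte message type, 4-byte big-endian payload length.
HEADER = struct.Struct("!BI")
HELLO, DATA = 1, 2


def encode(msg_type, payload=b""):
    """Build a frame: header followed by the raw payload bytes."""
    return HEADER.pack(msg_type, len(payload)) + payload


def decode(frame):
    """Split a frame into (message type, payload), checking the declared length."""
    msg_type, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return msg_type, payload


class Receiver:
    """Sequence specification: one HELLO first, then any number of DATA frames."""

    def __init__(self):
        self.greeted = False
        self.received = []

    def handle(self, frame):
        msg_type, payload = decode(frame)
        if msg_type == HELLO and not self.greeted:
            self.greeted = True
        elif msg_type == DATA and self.greeted:
            self.received.append(payload)
        else:
            raise ValueError("message out of sequence")
```

A sender and receiver that agree on these two rules can interoperate without knowing anything else about each other.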

25. What is mobile and ubiquitous computing?

Mobile and ubiquitous computing are both examples of distributed systems. In mobile computing, the computing devices are portable and are carried around by their users. In ubiquitous computing, small computing devices are installed in fixed places throughout an environment.

26. What are some challenges developers and technicians will usually face when developing and implementing a distributed system?

Some challenges include heterogeneity, openness, security, scalability, failure handling (which is usually the first thing tackled as failure prevention is the main benefit of distributed systems), concurrency, and transparency.

27. What are the main advantages of a distributed system?

In the context of a distributed system for a business, some advantages include improved performance, distribution, reliability (such as fault tolerance), incremental growth over time, simple sharing of data and resources, and improved communication throughout the system.

author: patrick algrim
About the author

Patrick Algrim is an experienced executive who has spent a number of years in Silicon Valley hiring and coaching some of the world’s most valuable technology teams. Patrick has been a source for Human Resources and career related insights for Forbes, Glassdoor, Entrepreneur, Recruiter.com, SparkHire, and many more.

