The questions below are intended to help two groups: those looking to hire engineers, who can use these Hadoop interview questions as a starting point, and those looking to get hired, who can use them as a study sheet. These are fairly common yet important questions for testing your understanding of Hadoop. The answers are included with the questions, but try to answer each one in your own way first and then check your answer against the list.
1. What are the main differences between HDFS and NAS?
HDFS, which is short for Hadoop Distributed File System, is a distributed file system that uses commodity hardware to store data, whereas Network Attached Storage (NAS) is a file-level storage server connected to a computer network.
The data blocks stored in HDFS are distributed across all the machines in a cluster, whereas in NAS data is stored on dedicated hardware.
Storing data in NAS is expensive since it relies on high-end devices, while HDFS stores data on commodity hardware, making it less expensive.
HDFS works hand-in-hand with MapReduce, moving computation to the data, while in NAS data storage and computation occur separately and MapReduce is not involved in the process.
2. How does Hadoop MapReduce work?
Hadoop MapReduce processes data in two phases. Taking a word count job as the example, the map phase counts the words in each document and the reduce phase aggregates those counts across all the documents. During the map phase, the input data is divided into splits, which are analyzed by map tasks that run in parallel across the Hadoop cluster.
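The word count flow can be illustrated with a small, single-process sketch. This is plain Python rather than the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(split):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_pairs):
    """Shuffle: group the values by key and sort by key, as a reducer would receive them."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce task: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped}


def word_count(splits):
    mapped = []
    for split in splits:  # in Hadoop these map tasks would run in parallel
        mapped.extend(map_phase(split))
    return reduce_phase(shuffle(mapped))


# Two input splits, each a list of lines.
splits = [["the quick brown fox"], ["the lazy dog", "the end"]]
print(word_count(splits))
```

Each split is mapped independently, which is what allows a real cluster to run the map tasks on different nodes at the same time.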
3. What is Shuffling in MapReduce?
Shuffling is the process performed by the system to sort and transfer map outputs to the reducer.
4. In MapReduce Framework, what is referred to as distributed cache?
Distributed Cache is a feature of the MapReduce framework that helps in sharing files (such as executable JAR files or simple properties files) across a Hadoop cluster.
5. In Hadoop, what is NameNode?
NameNode is the node that stores all the file location information (metadata) for the Hadoop Distributed File System (HDFS). In other words, the NameNode is the nerve center of an HDFS file system.
6. What is a JobTracker in Hadoop?
JobTracker is a service in Hadoop that runs in its own JVM process and is used to submit and track MapReduce jobs.
7. What functions does JobTracker perform?
• Communicates with the NameNode to determine the location of data
• Locates TaskTracker nodes
• Monitors TaskTracker nodes
• Receives jobs after submission by the client through the application process
8. What is heartbeat in HDFS?
Heartbeat refers to the signal sent from a DataNode to the NameNode, or from a TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving these signals, it assumes that something is wrong with the DataNode or TaskTracker.
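As a rough illustration of heartbeat-based failure detection, the sketch below marks a node as dead when its last heartbeat is older than a timeout. The function name, the node names and the timeout value are assumptions made for the example and are not Hadoop settings.

```python
def find_dead_nodes(last_heartbeat, now, timeout_seconds):
    """Return the nodes whose most recent heartbeat is older than the timeout."""
    return sorted(
        node
        for node, timestamp in last_heartbeat.items()
        if now - timestamp > timeout_seconds
    )


# Timestamps (in seconds) of the last heartbeat received from each node.
last_heartbeat = {"datanode1": 100.0, "datanode2": 40.0}
print(find_dead_nodes(last_heartbeat, now=110.0, timeout_seconds=30))
```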
9. What are combiners and when should they be used in a MapReduce job?
Combiners are used to increase the efficiency of a MapReduce program. They aggregate the map output locally, which significantly reduces the amount of data that needs to be transferred to the reducers. If the reduce operation is commutative and associative, the reducer code can be used as the combiner.
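The saving a combiner provides can be seen in a small sketch. This is plain Python, not the Hadoop API, and the function names are hypothetical. Summation is commutative and associative, so the same logic could also serve as the reducer.

```python
from collections import Counter


def map_words(lines):
    """Map side: emit one (word, 1) pair per word."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts locally before anything is sent to a reducer."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


map_output = map_words(["a b a", "a b"])
combined = combine(map_output)
print(len(map_output), "pairs before combining,", len(combined), "after")
```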
10. What happens if a DataNode Fails?
When a DataNode fails, three things happen:
• The JobTracker and the NameNode first detect the failure
• All the tasks on the failed node are then rescheduled
• The NameNode replicates user data to other functioning nodes
11. What is speculative execution in Hadoop?
In Hadoop, speculative execution is the launching of several duplicate copies of a task. For example, with speculative execution, multiple copies of the same map or reduce task can be started on different slave nodes.
12. What are the basic parameters of a Mapper?
LongWritable and Text (input key and value)
Text and IntWritable (output key and value)
13. What role does the MapReduce partitioner play in Hadoop?
The partitioner controls which reducer receives each key of the map output. It ensures that all the values for a single key go to the same reducer, which also helps distribute the map output evenly over the reducers.
14. What is the difference between an Input Split and an HDFS Block?
An Input Split is the logical division of data while an HDFS Block is the physical division of data.
15. What happens in TextInputFormat?
Each line in the text file is treated as a record. The key is the byte offset of the line within the file and the value is the content of the line.
16. In order to run a MapReduce job, what configuration parameters should the user specify?
Job input location within the distributed file system
Job’s output location within the distributed file system
Input and output formats
Class containing map and reduce functions
JAR file containing the mapper, driver and reducer classes
17. In Hadoop, what is WebDAV?
It’s a set of extensions to HTTP that supports editing and updating of files.
18. What is Sqoop in Hadoop?
This is the tool that allows for the transfer of data between relational database management systems (RDBMS) and Hadoop HDFS.
19. How does the JobTracker schedule tasks?
Each TaskTracker sends heartbeat messages to the JobTracker every few seconds to confirm that the TaskTracker is still alive and functioning. These messages also report the number of available slots, so the JobTracker is able to keep track of where in the cluster work can be delegated.
20. What is SequenceFileInputFormat?
It is the input format used to read sequence files, a compressed binary file format that has been optimized for passing data between the output of one MapReduce job and the input of another.
21. What is the difference between Hadoop and an RDBMS?
Hadoop is a node-based flat structure while an RDBMS is a relational database management system.
Hadoop is used for analytical and data processing while RDBMS is used for OLTP processing.
Data needs to be preprocessed before it is stored in an RDBMS, while in Hadoop this is not a requirement.
22. What is the role of conf.setMapperClass?
conf.setMapperClass sets the mapper class for the job, and with it everything related to the map task, such as reading the data and generating key-value pairs.
23. What are the core components of Hadoop?
24. Which data components does Hadoop use?
25. What data storage components are used by Hadoop?
26. What are the most used input formats in Hadoop?
27. What is referred to as a sequence file in Hadoop?
A sequence file is used to store binary key/value pairs. Unlike regular files, a sequence file has the ability to support splitting even when the data inside the file has been compressed.
28. What is InputSplit?
An InputSplit is the logical chunk of input that is assigned to a single mapper for processing. The framework splits the input files and then assigns each split to a mapper.
29. How do you write a custom partitioner?
A custom partitioner is written by following the steps below:
• Create a new class that extends the Partitioner class
• Override the getPartition method
• Add the custom partitioner to the job, either programmatically with the setPartitionerClass method or through a configuration file
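The three steps can be mirrored in a small Python analogue. The real Hadoop API is Java; the Partitioner, FirstLetterPartitioner and Job classes below are hypothetical stand-ins that only imitate the shape of that API, and the routing rule is an arbitrary example.

```python
class Partitioner:
    """Hypothetical stand-in for Hadoop's Partitioner base class."""

    def get_partition(self, key, value, num_partitions):
        raise NotImplementedError


class FirstLetterPartitioner(Partitioner):
    """Steps 1 and 2: extend the base class and override get_partition."""

    def get_partition(self, key, value, num_partitions):
        if num_partitions == 1:
            return 0
        # Keys starting with a-m go to partition 0, all others to the last one.
        first = key[:1].lower()
        return 0 if "a" <= first <= "m" else num_partitions - 1


class Job:
    """Hypothetical job object that records the configured partitioner class."""

    def __init__(self, num_reducers):
        self.num_reducers = num_reducers
        self.partitioner_class = None

    def set_partitioner_class(self, cls):
        # Step 3: register the custom partitioner with the job.
        self.partitioner_class = cls

    def route(self, key, value):
        return self.partitioner_class().get_partition(key, value, self.num_reducers)


job = Job(num_reducers=3)
job.set_partitioner_class(FirstLetterPartitioner)
print(job.route("apple", 1), job.route("zebra", 1))
```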
30. How is indexing in HDFS done?
HDFS indexes data by storing, with the last part of the data, a pointer to where the next part of the data is located.
31. What are the different types of Channels in Flume?
MEMORY Channel: this channel reads events from the source into memory before passing them to the sink.
JDBC Channel: this channel stores events in an embedded database.
FILE Channel: this channel writes the events to a file after receiving them from the source. The file is deleted only after its contents have been successfully delivered to the sink.
In order to ensure that there is no data loss, which is the most reliable Flume Channel?
The FILE Channel is the most reliable of the three channel types.
32. What are the main configuration files of Hadoop?
33. Besides using jps, how else can you check whether NameNode is working?
By using /etc/init.d/hadoop-0.20-namenode status
34. In Hadoop, what is a “map” and what is a “reducer”?
A map is the phase of a Hadoop job in which data is read from an input location and key-value pairs are generated according to the input type.
A reducer collects the key-value pairs generated by the mappers, processes them and generates its own final output.
35. Which file in Hadoop controls reporting?
The hadoop-metrics.properties file
36. What are the network requirements for using Hadoop?
Password-less SSH connection
Secure Shell (SSH) for launching server processes
37. What is rack awareness?
This is the process by which the NameNode determines where to place blocks based on rack definitions
38. What is a Task Tracker in Hadoop?
This is a slave node daemon that receives tasks from the JobTracker. The TaskTracker also sends heartbeat messages to the JobTracker to confirm that the TaskTracker itself is still functioning.
39. Which daemons are run on the master node and which are run on the slave nodes?
The NameNode and JobTracker daemons run on the master node, while the TaskTracker and DataNode daemons run on the slave nodes.
40. How do you debug Hadoop code?
Using a web interface that is provided by the Hadoop framework
41. What is a storage node and a compute node?
A storage node is the machine where the file system stores the data to be processed, while a compute node is the machine where the actual business logic is executed.
42. What is Context Object used for in Hadoop?
It enables the mapper to interact with the rest of the system. The Context object contains the configuration data for the job and an interface that allows the mapper to emit output.
43. What is the next step after MapTask?
Sorting out and creating partitions for the output generated by the mapper.
44. In Hadoop, what is the number of default partitioner?
45. In Hadoop, what role does the RecordReader play?
It loads data from the source and converts it into (key, value) pairs that can be read by the MapTask.
46. When no custom partitioner has been defined in Hadoop, how is data partitioned before being sent to the reducer?
In the absence of a defined custom partitioner, the default partitioner computes a ‘Hash’ value for the key and subsequently assigns partitions based on the results.
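Hash-based partitioning can be sketched as follows. Hadoop's default HashPartitioner masks off the sign bit of the key's hash code and takes the remainder by the number of reduce tasks. The sketch uses a Java String.hashCode() style hash as a stand-in for the key's real hash, which is an assumption made for illustration, and the function names are hypothetical.

```python
def java_style_hash(text):
    """32-bit signed hash computed the way Java's String.hashCode() does it."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the value within 32 bits
    return h - 0x100000000 if h >= 0x80000000 else h


def default_partition(key, num_reducers):
    """Mask off the sign bit, then take the remainder by the reducer count."""
    return (java_style_hash(key) & 0x7FFFFFFF) % num_reducers


for key in ["hello", "ab", "hello"]:
    print(key, "-> reducer", default_partition(key, 4))
```

Because the partition depends only on the key, every occurrence of the same key is sent to the same reducer.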
46. What happens when Hadoop spawns 50 tasks for a job and one of the tasks fails?
When one of the tasks in a job fails, Hadoop restarts that task on a different TaskTracker.
47. Which is the best way to copy files between HDFS clusters?
When copying files between clusters, distCP should be used. It spreads the copy across multiple nodes so that the workload is shared.
48. What is the main difference between HDFS and NAS?
NAS data is stored only on dedicated hardware while HDFS datablocks are distributed across all local drives on all machines within a cluster.
49. How is Hadoop different from other data processing tools?
In Hadoop, you can increase or reduce the number of mappers without having to worry about the volume of data to be processed.
50. What is the role of conf class in Hadoop?
To separate different jobs within the same cluster
51. What is the Hadoop MapReduce API contract for a key and value class?
The contract is defined by two interfaces:
The value class must implement the org.apache.hadoop.io.Writable interface
The key class must implement the org.apache.hadoop.io.WritableComparable interface
52. Hadoop can be run in three different modes. Which are these modes?
Standalone (local) mode
Fully distributed mode
Pseudo distributed mode
53. What does the text input format do?
It treats each line of the input file as one record. The key is the byte offset of the line within the file and the value is the full text of the line. The mapper receives the value as a Text parameter and the key as a LongWritable parameter.
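In TextInputFormat, the key handed to the mapper is the byte offset at which a line starts and the value is the line itself. A minimal Python sketch of that record layout follows; text_input_records is a hypothetical helper, not a Hadoop function.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its terminator.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


for key, value in text_input_records(b"hello world\nfoo\nbar"):
    print(key, value)
```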
54. How many InputSplits does the Hadoop framework make?
Assuming a block size of 64 MB and one split per block:
1 split for a 64 KB file
2 splits for a 65 MB file
2 splits for a 127 MB file
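These counts follow from dividing the file size by the block size and rounding up. The sketch below assumes a 64 MB block size and exactly one split per block. That is a simplification: the real split calculation may let the final split run slightly larger than a block, so actual counts can differ for files just over a block boundary. The function name is hypothetical.

```python
MB = 1024 * 1024
KB = 1024


def count_splits(file_size_bytes, block_size_bytes=64 * MB):
    """Number of splits when every block-sized chunk becomes one split."""
    # Ceiling division using integers only.
    return -(-file_size_bytes // block_size_bytes)


for size in (64 * KB, 65 * MB, 127 * MB):
    print(size, "bytes ->", count_splits(size), "split(s)")
```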
55. In Hadoop, what is referred to as distributed cache?
This is a facility provided by MapReduce framework and is used to cache files during job execution. Before the execution of any task takes place at the node, the Framework first copies all the required files to the slave node.
56. What role does the Hadoop Classpath play in starting or stopping Hadoop daemons?
The Classpath consists of a list of directories that contain the jar files needed to start and stop the daemons.
57. What determines the number of input splits?
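The number of input splits is determined by the size of the input and the split size, which defaults to the HDFS block size. The number of input splits in turn determines the number of mappers.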
58. Can the number of mappers to be created be changed for a job in Hadoop?
59. What happens to the JobTracker when the NameNode is down?
When the NameNode is down, the whole cluster is unavailable, since the NameNode is the single point of failure in HDFS. The JobTracker therefore cannot run jobs until the NameNode is back.
60. What is Big Data in Hadoop?
In Hadoop, Big Data refers to symbols or characters on which operations are performed by a computer and are transmitted and stored in form of electronic signals.
61. What are the input formats contained in Hadoop (explain).
There are three main input formats in Hadoop:
Text input format: This is the default input format in Hadoop
Sequence File Input Format: used to read sequence files
Key Value Input Format: used for plain text files in which each line is split into a key and a value
62. What is YARN?
YARN stands for Yet Another Resource Negotiator. It is the framework in Hadoop that manages cluster resources and provides an execution environment for processes.
63. What is “Rack Awareness” in Hadoop?
In Hadoop, Rack Awareness is the algorithm that allows the NameNode to determine how blocks and their replicas are placed in a Hadoop cluster. Placement is based on rack definitions and aims to minimize network traffic between DataNodes in different racks. For example, the default replication factor is 3.
The “Replica Placement Policy” has it that two replica copies are stored in a single rack for every data block. However, the third copy is stored in a different rack.
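The placement rule as described here (two replicas on one rack, the third on a different rack) can be sketched as below. This is a deliberately simplified, deterministic model: the place_replicas function and the rack and node names are hypothetical, and real HDFS chooses nodes using additional criteria.

```python
def place_replicas(racks, local_rack):
    """Return three (rack, node) placements for one block.

    racks maps a rack name to its list of node names. The sketch assumes
    the local rack has at least two nodes and that a second rack exists.
    """
    remote_rack = next(name for name in sorted(racks) if name != local_rack)
    local_nodes = racks[local_rack]
    return [
        (local_rack, local_nodes[0]),          # first replica
        (local_rack, local_nodes[1]),          # second replica, same rack
        (remote_rack, racks[remote_rack][0]),  # third replica, different rack
    ]


racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
print(place_replicas(racks, "rack1"))
```

Keeping one replica on a separate rack means the block survives the loss of a whole rack, while keeping two on the same rack limits cross-rack traffic during writes.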
64. What is Speculative Execution?
Speculative execution is what happens when a task runs slowly on a node. The master node starts another instance of the same task on a different node. Whichever instance finishes first is accepted, and the other one is stopped.
65. What are some of the global companies that use Hadoop?
66. What are the main differences between RDBMS and Hadoop?
RDBMS is used for storing structured data while Hadoop stores any kind of data (structured, unstructured and semi-structured).
Hadoop is based on a “Schema on read” policy while RDBMS uses the “Schema on write” policy.
Hadoop is open-source software hence one does not incur any cost to acquire it while RDBMS is licensed software and one needs to pay for it.
Hadoop is used for data discovery, analytics and OLAP systems whereas RDBMS is used for Online Transaction Processing (OLTP).
67. What is the difference between Hadoop 1 and Hadoop 2?
Hadoop 1 has a single NameNode which is the single point of failure while Hadoop 2 has both an active and a passive NameNode. When the active NameNode fails, it is replaced by the passive NameNode.
68. What is the role of the two main types of NameNodes?
There are two types of NameNodes in a Hadoop structure:
• Active NameNode: It is the node that runs the Hadoop structure
• Passive NameNode: This is the standby node that stores the same data as the active NameNode.
In case the active NameNode fails, the passive NameNode takes charge, thereby ensuring that there is always a running NameNode within the cluster.
69. What are the key components of Apache HBase?
Region Server: The Region Server serves a group of regions to the clients.
HMaster: The HMaster manages and coordinates the Region Server.
ZooKeeper: ZooKeeper coordinates processes in the HBase distributed environment. It does this by maintaining server state inside the cluster and communicating through sessions.
70. What are the different types of schedulers in Hadoop?
COSHH: COSHH makes scheduling decisions by considering the workload, the cluster and heterogeneity.
FIFO Scheduler: The FIFO Scheduler orders the jobs in a queue based on their time of arrival, without considering heterogeneity.
Fair Sharing: Fair Sharing defines a pool for each user that contains a number of map and reduce slots on a resource. Each user is allowed to use their own pool for the execution of jobs.
71. Can NameNode and DataNode be commodity hardware?
DataNodes can be commodity hardware, as they only need to store data in much the same way that laptops and personal computers do. The NameNode, on the other hand, is the master node that stores metadata about all the blocks in HDFS. This means that it needs a large amount of memory (RAM).
72. During the deployment of Hadoop in a production environment, what important hardware should be taken into consideration?
Operating System: A 64-bit OS is recommended, as it avoids restrictions on the amount of memory that can be used on the worker nodes.
Storage: In order to achieve high performance and scalability, a Hadoop platform should have adequate data storage space.
Capacity: Large form factor disks cost relatively less and allow for more storage of data.
Network: In order to provide redundancy, two TOR (top-of-rack) switches per rack are recommended.
73. What should be taken into consideration when deploying a secondary NameNode?
A secondary NameNode should always be deployed on a standalone system in order to prevent it from interfering with the operations taking place at the primary node.
74. Why is it recommended that HDFS be used for application with large data sets but not those with multiple small files?
HDFS is more efficient when dealing with large data sets maintained in a single file than with the same data spread over many small files.
Because the NameNode stores the metadata for the file system in RAM, the amount of memory limits the number of files in an HDFS file system. More files generate more metadata, which in turn requires more memory. As a rule of thumb, the metadata for a block, file or directory takes about 150 bytes.
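The effect of many small files on NameNode memory can be estimated with simple arithmetic. The sketch below uses the 150-byte figure from the text and counts every file and every block as one metadata object. The function name and the example file counts are assumptions made for illustration.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure from the text


def namenode_metadata_bytes(num_files, blocks_per_file):
    """Estimate NameNode memory: one metadata object per file plus one per block."""
    objects = num_files + num_files * blocks_per_file
    return objects * BYTES_PER_OBJECT


# Ten million small files, each occupying a single block.
estimate = namenode_metadata_bytes(10_000_000, 1)
print(estimate, "bytes, roughly", round(estimate / 1024**3, 2), "GiB")
```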
75. What roles does the JobTracker play in Hadoop?
In Hadoop, the JobTracker:
• Performs resource management, tracks resource availability and manages the task life cycle
• Identifies data location by communicating with the NameNode
• Ensures the execution of tasks on the nodes by identifying the most suitable TaskTracker node
• Manages all the TaskTrackers individually and reports the overall job status back to the client
• Tracks the execution of MapReduce workloads local to the slave node
76. What is Hadoop MapReduce used for?
To process large data sets in a parallel manner within a Hadoop cluster.
77. What is the difference between HDFS Block and Input Split?
The HDFS Block refers to the physical division of data whereas an Input Split refers to the logical division of data.
78. What happens when two or more people access the same file in HDFS?
HDFS supports exclusive writes only. When the NameNode receives a request from the first person to open a file for writing, it grants that person a lease on the file. If a second person tries to open the same file for writing, the NameNode will establish that the lease for the file has already been granted and will reject the second request.
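The lease behaviour can be modelled with a small sketch. LeaseManager and its methods are hypothetical names used only to illustrate the idea of one lease holder per file, and they do not reflect the actual NameNode implementation.

```python
class LeaseManager:
    """Toy model of a NameNode granting one lease per file path."""

    def __init__(self):
        self._leases = {}  # path -> client currently holding the lease

    def request_lease(self, path, client):
        """Grant the lease if the path is free or already held by this client."""
        holder = self._leases.get(path)
        if holder is not None and holder != client:
            return False  # another client holds the lease, so the request is rejected
        self._leases[path] = client
        return True

    def release(self, path, client):
        """Give the lease back so that another client can obtain it."""
        if self._leases.get(path) == client:
            del self._leases[path]


manager = LeaseManager()
print(manager.request_lease("/data/log.txt", "alice"))
print(manager.request_lease("/data/log.txt", "bob"))
```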
79. What is the benefit of Checkpointing?
Checkpointing decreases the startup time of the NameNode.
80. What are some of the tools that enhance the performance of Big Data?
81. What is a SequenceFile?
SequenceFile is a flat file that contains binary key/value pairs. It is used in the input and output formats of MapReduce, and map outputs are stored internally as SequenceFiles. A SequenceFile can use one of the following formats:
Record compressed key/value records: in this format, only the values are compressed.
Uncompressed key/value records: in this format, neither the values nor the keys are compressed.
Block compressed key/value records: in this format, both the keys and the values are collected in blocks and then compressed.
82. In Copy operation, what Hadoop shell commands are used?
83. How can you establish whether the NameNode is working properly using the jps command?
Running jps lists the Java processes on the machine. If the NameNode is working, it appears in that list. The service status can also be checked with /etc/init.d/hadoop-0.20-namenode status
84. In Hadoop, what is commodity hardware?
Commodity hardware refers to inexpensive, widely available systems. These systems still need a reasonable amount of RAM, since some tasks have to be executed in memory, but Hadoop does not require high-end hardware or supercomputers to execute a job.
85. What are the port numbers for NameNode, Task Tracker and Job Tracker?
Job Tracker- 50030
86. How does the process of inter-cluster data copying take place?
Copying data from one Hadoop cluster to another is referred to as inter-cluster data copying. HDFS makes a distributed data copying facility available for this through distCP, which copies the data from the source to the destination. The source and destination need to have the same or a compatible version of Hadoop for distCP to operate.
87. What is indexing in HDFS?
This is the process through which HDFS stores, with the last part of a chunk of data (whose size depends on the block size), the address where the next chunk of data is stored.
88. What happens in case a NameNode has no data?
There is no NameNode that does not contain data. If there is no data, then it is not a NameNode.
89. What happens when a client submits a job in Hadoop?
When a client submits a job in Hadoop, the NameNode is asked to locate the requested data and returns the block information. From here, the JobTracker allocates resources for the Hadoop job so that it can be executed.
90. What is the difference between Sqoop and distCP?
DistCP is used to transfer data between different clusters while Sqoop is only used to transfer data between Hadoop and RDBMS.
91. What are the core components of Flume?
Event: a unit of data or a single log entry to be transported
Source: the component that allows data to enter the Flume workflows
Sink: transports data to the desired destination
Channel: the conduit between the Source and the Sink
Agent: the JVM running Flume
Client: transmits events to the Source
92. Is Flume 100% reliable when it comes to the flow of data?
Yes. Because of its transactional approach to data flow, Apache Flume provides end-to-end reliability.
93. In Hadoop, can files be searched using wildcards?
94. What are the core components of a Hadoop application?
Data Access Components: Pig and Hive
Data Storage Component: HBase
Data Integration Components: Apache Flume, Sqoop and Chukwa
Data Management & Monitoring Components: Ambari, Oozie and ZooKeeper
Data Serialization Components: Thrift and Avro
Data Intelligence Components: Apache Mahout and Drill
95. What functions does the ZooKeeper carry out in HBase architecture?
In HBase architecture, ZooKeeper is primarily a monitoring server that carries out, among other functions:
• Tracking server failure
• Maintaining information in the configuration
• Establishing communication between the client and the server
• Maintaining usability of the ephemeral nodes for them to be able to identify available servers within the cluster
96. Which versions of Hadoop are stable?
97. In Big Data, what do the four V’s denote?
Volume: refers to the scale of data
Velocity: refers to the analysis of streaming data
Variety: refers to the different types of data
Veracity: refers to the uncertainty of data
98. What is the difference between structured and unstructured data?
Data that can be stored in the rows and columns of a traditional database system is referred to as structured data. An example of structured data is an online transaction. Data that can be stored only partially in a traditional database system is referred to as semi-structured data. An example of semi-structured data is XML records. Unorganized data that can be categorized as neither structured nor semi-structured is referred to as unstructured data.
99. What are some of the properties for the best hardware configuration to run Hadoop?
Dual core machines and processors.
4GB or 8GB RAM that use ECC memory.
100. What is a block in HDFS?
A block is the smallest contiguous storage location on a hard drive. In HDFS, files are first divided into blocks, and the blocks are then stored as separate units distributed across the Hadoop cluster.
In general, you should also ask any candidate about projects where they have implemented Hadoop and about the issues they ran into while implementing it, either for themselves or for clients. This can be helpful in understanding the candidate's processes and experience level. Asking what they might change the next time they integrate Hadoop into a project or organization can also be insightful in making a great hiring decision. If you are being interviewed, be prepared with some case studies or examples that can help an employer understand your domain experience with Hadoop. I would also recommend being prepared to share your general interest in the technology and how it can benefit different departments inside an organization. Having that information prepared ahead of time can be very helpful. As for those conducting the interview, asking these questions lets you gain insight into the technology and learn about someone's true interest in building it out for your business.