Data analysis is becoming the core innovation method for many internet startups. Because so many of the products entering the market these days are similar, business operators have to think quantitatively about their approach. This is where business intelligence and data analysts come into play: they play a major role in understanding customers and designing better products for them, and they can also address broader market changes and competitive analysis to provide an invaluable lens for the executive team. These data analyst interview questions and answers should help you get started on your employment journey.
1. What is required to become a data analyst?
• Technical knowledge of database design, data models, mining and segmentation techniques
• Strong knowledge of statistical packages for analyzing large datasets, such as Excel, SPSS and SAS
• Strong skills in the analysis, organization, collection and dissemination of big data
2. What is the responsibility of the data analyst?
• Support data analysis and coordinate with clients and staff
• Resolve business-related issues for clients and perform audits on the data
• Analyze results, interpret data using statistical methods and provide ongoing reports
• Prioritize business needs and work with management on information needs
• Identify new processes or areas with opportunities for improvement
• Acquire data from primary or secondary data sources and maintain database systems
• Filter and clean the data and review computer reports
• Determine performance indicators, and locate and correct code issues
• Secure the database by developing access systems and determining user access levels
3. Explain the process of data analysis.
Data analysis covers the collection, inspection, cleansing, transformation and modeling of data in order to gain insights and support decision-making within the firm. The steps in the data analysis process include the following:
• Data exploration: upon identifying the business problem, the data analyst goes through the data provided by the client to analyze the root cause of the problem.
• Data preparation: this is a very important step in which data anomalies, such as missing values and outliers, have to be treated appropriately.
• Data modeling: modeling starts once the data has been prepared. It is an iterative process in which the model is run repeatedly for improvement, ensuring the best possible result is attained for the problem at hand.
• Data validation: in this step, the model provided by the client and the model developed by the data analyst are validated against each other to determine whether the developed model meets the requirements.
• Implementation and tracking: in the final step, the model is implemented in production and tested for both accuracy and efficiency.
4. How often should you retrain a data model?
A good data analyst is aware of how changing business dynamics will affect the efficiency of a predictive model. The analyst ought to be a valuable consultant who can use analytical skills and business acumen to find the root cause of business problems. The best way to answer this question is to say that you would work with the client to define a particular period. However, it is advisable to retrain a model when the firm enters a new market, goes through a merger or starts to face emerging competition.
5. How would a data analyst handle the QA process when coming up with a predictive model in forecasting the customer churn?
Data analysts need input from the business owners and a collaborative environment in which to operationalize the analytics. To create and deploy predictive models in production there has to be an efficient and repeatable process; without feedback from the business owner, the model would only be a one-and-done effort. The best approach is to partition the data into three sets: training, testing and validation. The results on the validation set are then presented to the business owner after eliminating the biases from the first two sets. The client's input will then indicate whether the model predicts customer churn accurately and gives the right results.
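The three-way partition described above can be sketched in plain Python. This is a minimal illustration, not a fixed standard: the 60/20/20 split, the `partition` helper and the toy customer list are all illustrative assumptions.

```python
import random

def partition(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle records and split them into training, testing and validation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # remainder becomes validation
    return train, test, validation

customers = list(range(100))  # stand-in for real customer records
train, test, validation = partition(customers)
print(len(train), len(test), len(validation))  # 60 20 20
```

The model would be fitted on `train`, tuned on `test`, and only the untouched `validation` results shown to the business owner, so the reported accuracy is not biased by the data the model has already seen.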
6. How is one to create a classification to identify the main customer trends when it comes to unstructured data?
A model holds no value if there are no actionable results, and an experienced data analyst will vary the strategy according to the type of data being analyzed. Sensitive customer data needs to be protected, so it is advisable to consult with the stakeholders to make sure you comply with the organization's regulations and with disclosure laws. First consult with the stakeholder to understand the objective of classifying this information. Then use an iterative process: pull data samples, modify the model accordingly and evaluate it for accuracy. You can mention that you would follow the process of mapping the data, initiating an algorithm, mining the data and visualizing it.
7. What is data cleansing?
Data cleansing, also known as data cleaning, entails finding and removing errors and inconsistencies from data in order to enhance its quality.
8. What are some of the best practices for data cleansing?
• Sorting the data according to different characteristics
• Larger datasets should be cleaned in steps, improving the data with each step until a good level of quality is achieved.
• Larger datasets ought to be broken down into smaller chunks; working with less data increases the speed of iteration.
• To handle common cleansing tasks, create a set of utility functions, scripts and tools. These may include remapping values based on a CSV file or SQL database, or a regex search-and-replace that blanks out all values that do not match the pattern.
• If there are data cleanliness issues, arrange them by estimated frequency and attack the most common issues first. Analyze the summary statistics for each column, such as the mean, standard deviation and number of missing values.
• Keep track of every data cleaning operation so that changes can be reviewed or operations altered if need be.
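Two of the utilities described above can be sketched with the standard library alone. The function names, the phone-number pattern and the sample data are illustrative assumptions, not part of any particular toolkit.

```python
import re
import statistics

def blank_non_matching(values, pattern):
    """Regex search-and-replace: blank out any value that does not match."""
    rx = re.compile(pattern)
    return [v if rx.fullmatch(v) else "" for v in values]

def column_summary(numbers):
    """Summary statistics for one numeric column, counting missing values."""
    present = [x for x in numbers if x is not None]
    return {
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
        "missing": len(numbers) - len(present),
    }

phones = ["555-1234", "n/a", "555-9876"]
print(blank_non_matching(phones, r"\d{3}-\d{4}"))  # ['555-1234', '', '555-9876']
print(column_summary([10, 12, None, 14]))
```

Collecting such helpers in one place means the same cleansing rules can be rerun on each step of an iterative cleaning pass.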
9. Why is data mining useful as a technique in big data analysis?
Hadoop big data systems use a clustered architecture in which large datasets have to be analyzed to identify unique patterns. These patterns help one understand the problem areas of the business in order to create a solution. Data mining is a very useful process for this task, which is why it is widely used in big data analysis.
10. Discuss data profiling?
Data profiling is a process that validates the data in an existing data source in order to understand whether it can be readily used for other purposes.
11. What is logistic regression?
This is a statistical method used by data analysts to examine datasets in which one or more independent variables determine a categorical outcome.
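A minimal sketch of logistic regression on one variable, fitted by gradient descent in pure Python. The toy data, learning rate and epoch count are illustrative assumptions; real work would use a statistics package as mentioned earlier.

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit w, b so that P(y=1|x) = sigmoid(w*x + b), by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            w -= lr * (p - y) * x                     # gradient step on weight
            b -= lr * (p - y)                         # gradient step on intercept
    return w, b

def predict(w, b, x):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# toy data: the binary outcome flips around x = 0
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
print(predict(w, b, -2.0) < 0.5, predict(w, b, 2.0) > 0.5)  # True True
```

The fitted model outputs a probability between 0 and 1, which is what makes logistic regression suitable for categorical outcomes where linear regression is not.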
12. What is the data screening process?
Data screening is part of the validation process in which the complete dataset is run through a number of validation algorithms to ascertain whether the data has any business-related problems.
13. What is the K-mean algorithm?
The k-means algorithm is used to partition data within a clustered architecture. Datasets are classified into different clusters: the objects are divided into k groups, with each object assigned to the cluster whose mean is nearest. The k-means algorithm assumes that:
• Clusters are roughly spherical, with the data points centered around the cluster mean.
• The spread, or variance, of each cluster is roughly similar.
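The assign-then-recompute loop of k-means can be sketched in a few lines for one-dimensional data. This is a bare illustration under simplifying assumptions (naive initialization from the first k points, a fixed iteration count, 1-D points only).

```python
def kmeans(points, k, iters=20):
    """Minimal 1-D k-means: assign points to the nearest centroid, recompute means."""
    centroids = points[:k]  # naive initialization: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centroids, clusters = kmeans(points, k=2)
print(sorted(round(c, 1) for c in centroids))  # [1.0, 9.5]
```

The two centroids settle on the two visible groups in the data, which is exactly the spherical, similar-variance case the algorithm assumes.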
14. What is the main difference between data profiling and data mining?
• Data mining focuses on cluster analysis, the detection of unusual records, sequence discovery and the relations that hold between different attributes.
• Data profiling concerns the instance analysis of individual attributes. It provides information on characteristics such as discrete values, value ranges and their frequencies, as well as the occurrence of null values.
15. What are the common problems that are faced by the data analyst?
• Duplicate entries
• Missing values
• Illegal values
• Varying value representations
• Common misspelling
• Identifying overlapping data
16. What is the name of the framework that was developed by Apache for processing of large data sets for an application within a distributed computing environment?
Hadoop, with its MapReduce programming model, is the framework developed by Apache for processing large datasets for applications in a distributed computing environment.
What are the most commonly observed missing-data patterns?
• Missing completely at random
• Missing at random
• Missing in a way that depends on the missing value itself
• Missing in a way that depends on an unobserved input variable
17. What is collaborative filtering?
This is a simple algorithm for creating a recommendation system based on user behavioral data. The most important elements of collaborative filtering are users, items and their interests. An example of collaborative filtering is the 'recommended for you' section on an online shopping platform, which gives suggestions based on your browsing history: it considers what you have been shopping for and then suggests further products that fit, by price and style.
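The user-item-interest idea can be sketched as user-based collaborative filtering with cosine similarity. The users, items and ratings below are invented toy data, and recommending only from the single nearest neighbour is a simplification.

```python
import math

# hypothetical user -> {item: rating} interest data
ratings = {
    "alice": {"shoes": 5, "bag": 3, "hat": 4},
    "bob":   {"shoes": 4, "bag": 3, "scarf": 5},
    "carol": {"hat": 2, "scarf": 4},
}

def cosine(u, v):
    """Cosine similarity between two users, over the items both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (math.sqrt(sum(x * x for x in u.values()))
                  * math.sqrt(sum(x * x for x in v.values())))

def recommend(user):
    """Suggest items rated by the most similar other user that this user lacks."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, best = max(others)
    return [item for item in ratings[best] if item not in ratings[user]]

print(recommend("alice"))  # ['scarf']
```

Alice's ratings overlap most strongly with Bob's, so the items Bob rated that Alice has not seen become her 'recommended for you' list.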
18. Discuss what is KPI, the 80/20 rule and design of experiments
• KPI: this stands for Key Performance Indicator, a metric that can involve any combination of reports, spreadsheets or charts about a business process.
• 80/20 rule: this means that roughly 80 percent of one's income comes from 20 percent of the clients.
• Design of experiments: this is the initial process used to split and sample the data and set it up for statistical analysis.
19. What is Map Reduce?
MapReduce is a framework for processing large datasets: it splits them into subsets, processes each subset on a different server, and then blends the results obtained from each.
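The split-process-blend flow can be illustrated with the classic word-count example, simulated on one machine. In a real cluster the map and reduce phases run on different servers; here each phase is just a function, and the documents are toy data.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(mapped):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: blend each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big insight", "big picture"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # {'big': 3, 'data': 1, 'insight': 1, 'picture': 1}
```

Because each document can be mapped independently and each key reduced independently, the same logic scales out across many servers, which is the point of the framework.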
20. What are some of the tools used in Big Data?
• Hadoop
• Hive
• Pig
• Flume
• Mahout
• Sqoop
21. Explain what is outlier?
An outlier is a term used by analysts for a value that appears distant and diverges from the overall pattern of the sample. There are two types of outliers: univariate and multivariate.
22. Explain the hierarchical clustering algorithm.
This is a process of combining and dividing existing data groups in order to build a hierarchical structure representing the order in which the groups are merged or divided.
23. What is time series analysis?
Time series analysis is the process of forecasting the output of a process by analyzing previous data with different statistical methods, such as log-linear regression and exponential smoothing. It can be done in two domains: the time domain and the frequency domain.
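Simple exponential smoothing, one of the methods mentioned above, can be written in a few lines. The sales series and the smoothing factor alpha = 0.5 are illustrative choices; in practice alpha is tuned to the data.

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = [series[0]]  # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [10.0, 12.0, 11.0, 15.0]
print(exponential_smoothing(sales))  # [10.0, 11.0, 11.0, 13.0]
```

Each smoothed value blends the newest observation with the running estimate, so recent data is weighted more heavily while noise is damped, and the final smoothed value serves as the one-step-ahead forecast.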
24. What is clustering when it comes to data analysis?
Clustering in data analysis refers to the process of grouping a particular set of objects according to pre-defined parameters. It is one of the industry-recognized data analysis approaches, used especially in big data analysis.
25. What is Correlogram Analysis?
A correlogram analysis is a form of spatial analysis in geography. It consists of a series of estimated autocorrelation coefficients calculated for different spatial relationships. It may be used to construct a correlogram for distance-based data, when the data is expressed as distances rather than as values at individual points.
26. What is a hash table?
In computing, a hash table is a map of keys to values. It is a data structure used to implement an associative array, and it uses a hash function to compute an index into an array of slots from which the desired value can be retrieved.
27. What are hash table collisions and how are they avoided?
A hash table collision occurs when two different keys hash to the same value. Two items cannot be stored in the same slot of the array, so there are two main techniques used to handle collisions:
• Separate chaining: this uses a secondary data structure to store the different items that hash to the same slot.
• Open addressing: this searches for other slots using a second function and stores the item in the first empty slot that is found.
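Separate chaining can be sketched as a tiny class where each slot holds a list of (key, value) pairs. The class name and the two-slot table (chosen to force collisions) are illustrative; production hash tables also resize and use better hash spreading.

```python
class ChainedHashTable:
    """Hash table resolving collisions by separate chaining (a list per slot)."""

    def __init__(self, slots=8):
        self.buckets = [[] for _ in range(slots)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key, or a collision: chain it

    def get(self, key):
        for k, v in self._bucket(key):   # walk the chain for this slot
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(slots=2)  # deliberately tiny, so keys must collide
table.put("alpha", 1)
table.put("beta", 2)
table.put("gamma", 3)
print(table.get("beta"))  # 2
```

With only two slots, at least two of the three keys share a bucket, yet lookups still succeed because each chain is searched for the exact key.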
28. Describe imputation and list the different types of imputation approaches there are.
Imputation is the replacement of missing data with substituted values. The two types of imputation are single and multiple. Single imputation approaches include:
• Cold deck imputation: this selects donor values from another dataset.
• Hot deck imputation: here a missing value is imputed from a randomly selected similar record within the same dataset (originally done with punch cards).
• Mean imputation: this entails replacing the missing value with the mean of that variable across the other cases.
• Regression imputation: this entails replacing the missing value with a value predicted from other variables.
• Stochastic regression imputation: this is the same as regression imputation, with the addition of the average regression variance to the predicted value.
As opposed to single imputation, multiple imputation estimates the missing values several times.
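Mean imputation, the simplest approach in the list above, can be sketched with the standard library. The `ages` column and the use of `None` for missing values are illustrative assumptions.

```python
import statistics

def mean_impute(column):
    """Single (mean) imputation: replace None with the mean of observed values."""
    observed = [x for x in column if x is not None]
    fill = statistics.mean(observed)           # mean of the non-missing cases
    return [fill if x is None else x for x in column]

ages = [25, None, 31, 28, None]
print(mean_impute(ages))  # [25, 28, 31, 28, 28]
```

Note that every missing case receives the same value, which is exactly why single imputation understates the uncertainty that multiple imputation tries to preserve.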
29. Which imputation method is the one, which is favorable?
Even though single imputation is widely used, it does not reflect the uncertainty created by data that is missing at random. Thus, multiple imputation is more favorable than single imputation when data is missing at random.
30. What is the course of action with missing or suspected data?
• Prepare a validation report giving information on the missing or suspected data, including details such as the validation failures with date and time stamps.
• The suspected data can then be further examined to validate its credibility.
• Invalid data ought to be replaced and assigned a validation code.
• Then apply the best data analysis approach for the missing data, such as the deletion method, model-based methods or single imputation.
31. How does one deal with the multi-source problems?
• Perform schema integration by restructuring the schemas.
• Identify and merge similar records into one record containing all of the relevant attributes, without redundancy.
32. What would be the optimal attributes of a good data model?
• It needs to scale to large data changes.
• It can be consumed easily.
• It should perform in a predictable manner.
• It can adapt if the requirements of the model were to change.
33. What are some of the statistical methods that are utilized for data analysis?
• Simplex algorithm
• Bayesian approach
• Markov chains
• Mathematical optimization
• Cluster and spatial processes
• Rank statistics
34. What are some of the data validation methods that were utilized in data analytics?
• Form level validation: validation is done once the user completes the form, before the information is saved.
• Field level validation: validation is done on each field as the user enters the data, to avoid errors caused by human interaction.
• Data saving validation: this validation is done while the actual file or database record is being saved. It is useful when there are multiple data entry forms.
• Search criteria validation: this validation matches what the user is searching for to a particular degree, to make certain that genuinely relevant results are returned.
35. What is the difference between the true positive rate and recall?
There is no difference; they are the same metric, with the formula: (true positives)/(true positives + false negatives)
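The formula above can be checked with a short helper. The label vectors are invented toy data, and encoding the positive class as 1 is an assumption.

```python
def recall(actual, predicted, positive=1):
    """Recall / true positive rate: TP / (TP + FN)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return tp / (tp + fn)

actual    = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 1]
print(recall(actual, predicted))  # 3 true positives, 1 false negative -> 0.75
```

Only the positive cases in `actual` matter for the denominator, which is why recall and the true positive rate are the same quantity.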
36. What would be the differences between logistic and linear regression
• Linear regression requires the dependent variable to be continuous, while logistic regression is used when the dependent variable is categorical, possibly with more than two categories.
• Linear regression is based on least squares estimation, while logistic regression is based on maximum likelihood estimation.
• Linear regression aims to find the best-fitting straight line, where the distances between the points and the regression line are the errors. Logistic regression is used to predict a binary outcome, and the resulting curve is S-shaped.
• Linear regression typically requires at least 5 cases per independent variable; logistic regression requires at least 10 events per independent variable.
37. What are the conditions where it would be best to use a t-test or a z-test?
The t-test is used when the sample size is less than 30, while the z-test is best used when the sample size is greater than 30 (or the population variance is known).
38. What are the main methods for the detection of outliers?
• Box plot method: a value is considered an outlier if it lies more than 1.5 times the interquartile range above the upper (third) quartile or below the lower (first) quartile.
• Standard deviation method: a value is considered an outlier if it lies more than three standard deviations above or below the mean.
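Both rules can be sketched with the standard library. The sample data is invented, and note that `statistics.quantiles` uses its own interpolation method, so the exact fences can differ slightly from other tools.

```python
import statistics

def iqr_outliers(values):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (box plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

def sigma_outliers(values, k=3):
    """Flag values more than k standard deviations from the mean."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > k * sd]

data = [10, 11, 12, 11, 10, 12, 11, 95]
print(iqr_outliers(data))  # [95]
print(sigma_outliers(data))
```

On this sample the IQR rule flags 95, while the three-sigma rule may not, because the extreme value itself inflates the standard deviation; this is a known weakness of the standard deviation method on small samples.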
39. What is the difference between the standardized and the non-standardized coefficients?
A standardized coefficient is interpreted in terms of standard deviations, while an unstandardized coefficient is measured in the actual units of the variables.
40. What is the difference between R-squared and adjusted R-squared?
• R-squared measures the proportion of the variation in the dependent variable that is explained by the independent variables. Adjusted R-squared accounts for the number of independent variables in the model, so it reflects only those variables that genuinely affect the dependent variable.
41. What are the main skills that are required for data analysts?
• Database knowledge: data management, blending, querying and manipulation.
• Predictive analytics: basic descriptive statistics, advanced analytics and predictive modeling.
• Presentation skills: insight presentation, report design and data visualization.
• Big data knowledge: machine learning, unstructured data analysis and big data analytics.
42. What is the KNN imputation method?
In KNN imputation, a missing attribute value is imputed using the values of the records that are most similar to the record whose value is missing. The similarity of two records is determined using a distance function.
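A minimal sketch of the idea in pure Python: find the nearest complete record(s) by Euclidean distance over the shared attributes and average their values. The `people` records, attribute names and k=1 choice are illustrative assumptions.

```python
def knn_impute(records, target_index, attr, k=1):
    """Fill records[target_index][attr] from the k most similar complete records.

    Similarity is Euclidean distance over the attributes both records share.
    """
    target = records[target_index]

    def distance(other):
        shared = [a for a in target
                  if a != attr and target[a] is not None and other.get(a) is not None]
        return sum((target[a] - other[a]) ** 2 for a in shared) ** 0.5

    # candidate donors: every other record that actually has the attribute
    donors = [r for i, r in enumerate(records)
              if i != target_index and r.get(attr) is not None]
    donors.sort(key=distance)
    neighbours = donors[:k]
    return sum(r[attr] for r in neighbours) / len(neighbours)

people = [
    {"age": 30, "income": 50},
    {"age": 55, "income": 90},
    {"age": 32, "income": None},   # the missing value to impute
]
print(knn_impute(people, 2, "income", k=1))  # 50.0: nearest neighbour by age
```

The 32-year-old's missing income is filled from the 30-year-old, the closest record on the shared `age` attribute.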
43. What is the difference between factor analysis and principal component analysis?
• The objective of principal component analysis is to explain as much of the total variance in the variables as possible, while the aim of factor analysis is to explain the covariances (correlations) between the variables.
• In principal component analysis, the components are calculated as linear combinations of the raw input variables. In factor analysis, the raw input variables are defined as linear combinations of the underlying factors.
• PCA is used when there is a need to reduce the number of variables, while FA is used when there is a need to group the variables according to common factors.
44. Why is ‘naïve Bayes’ naïve?
It is naïve because it assumes that all of the features in the dataset are independent of one another, which is rarely the case in a real-world scenario.
45. How does one statistically compare means between the groups?
• Use an independent t-test when there is a continuous variable and a categorical variable with two independent categories.
• Use a paired t-test when there is a continuous variable and a categorical variable with two dependent or paired categories.
• Use one-way ANOVA when there is a continuous variable and a categorical variable with more than two independent categories.
• Use GLM repeated measures when there is a continuous variable and a categorical variable with more than two dependent categories.
46. Give a definition for homoscedasticity?
In a linear regression model there ought to be homogeneity of variance in the residuals: the variance of the residuals is roughly the same for all predicted values of the dependent variable.
47. Give a difference between the mean, mode and the median.
The mean is calculated by summing every value in the list and dividing by the number of observations. The mode is the most frequently occurring value in the list, while the median is the middle value.
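All three measures are available directly in Python's standard library; the sample list below is illustrative.

```python
import statistics

values = [3, 7, 7, 2, 9, 7, 4]
print(round(statistics.mean(values), 2))  # sum 39 over 7 observations -> 5.57
print(statistics.median(values))          # middle of sorted [2,3,4,7,7,7,9] -> 7
print(statistics.mode(values))            # most frequent value -> 7
```

Notice that the single large value 9 pulls the mean above the median, a small preview of why the next question matters.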
48. Which types of data are appropriate for the median, mode or the mean?
The mean is best used for continuous data without outliers, since it is affected by extreme values. The mode is suitable for categorical data on nominal or ordinal scales. The median is best suited for continuous data with outliers, or for ordinal data.
49. What is the difference between stratified and cluster type of sampling?
The main difference is that in cluster sampling one selects clusters at random and then samples within each selected cluster (or takes a census within it); not all clusters need to be selected. In stratified sampling, every stratum must be sampled.
50. What is the p-value?
The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true; equivalently, it is the lowest level of significance at which one would reject the null hypothesis. If the p-value is less than 0.05, you would reject the null hypothesis at the 5 percent level of significance.
51. What are eigenvalues and eigenvectors?
• An eigenvalue is the variance explained by a principal component; the variances are the diagonal values of the covariance matrix. A component is typically retained when its eigenvalue is greater than 1: the average eigenvalue is 1, so a figure greater than 1 implies a higher-than-average contribution.
• An eigenvector contains the coefficients of the orthogonal transformation of the variables into principal components.
52. What is the default value of the last parameter in VLOOKUP?
The default is TRUE/1, which finds the closest (approximate) match and assumes the table is sorted in ascending order; FALSE/0 forces an exact match.
53. Does VLOOKUP refer the case-sensitive values?
It is not case sensitive. The text ‘can’ and ‘CAN’ are the same for VLOOKUP.
54. What is the main limitation of the VLOOKUP function?
The lookup value must be in the leftmost column of the table array: VLOOKUP can only look to the right, not from right to left.
55. What are the different types of sampling?
• Stratified sampling
• Cluster sampling
• Simple random sampling
• Systematic sampling