Exploratory data analysis for HR analytics

Juilee Talele
7 min read · Oct 25, 2019


A data scientist brings a unique blend of skills: unlocking the insights hidden in data and telling a compelling story with them. To do so, we first need to interrogate the data; as the saying goes, torture the data and it will confess to anything. In data science lingo, this process is called exploratory data analysis (EDA).

Exploratory Data Analysis (EDA) is the process of visualizing and analyzing data to extract insights from it. In other words, EDA is the process of summarizing the important characteristics of the data in order to gain a better understanding of the data set.

Steps of EDA:

1. Descriptive Statistics

2. Uni-variate Analysis and Bi-variate Analysis

3. Grouping the data

4. Missing value imputation

5. Correlation

You can download the data-set from this link:

https://www.kaggle.com/data/95997

This data set consists of employee records: their scores, salary and other information.

Objective: Analyse the effects of different parameters on the salary of the employees.

Descriptive Statistics:

Descriptive statistics help to describe the basic features of the dataset and give a brief summary of the data.

Let’s start by importing the libraries and reading the data.
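A minimal sketch of this step, assuming the data set is a CSV file. The column names and the in-memory stand-in for the file are hypothetical, since the article does not name them:

```python
import io

import pandas as pd

# Stand-in for the downloaded file. The real file name and column names
# are assumptions made for illustration only.
csv_text = io.StringIO(
    "Gender,CityTier,Graduation,Salary\n"
    "M,1,B.Tech/B.E.,6.5\n"
    "F,2,M.Tech/M.E.,7.2\n"
    "M,2,B.Tech/B.E.,5.1\n"
)

# With a real file this would be pd.read_csv("path/to/file.csv").
df = pd.read_csv(csv_text)
print(df.head())
```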

After reading the data, we need to know the number of rows and columns in our data set, which the DataFrame's shape attribute reports.

This shows we have 19148 rows and 20 columns (i.e., variables).
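On a toy frame with hypothetical columns, checking the dimensions might look like this:

```python
import pandas as pd

# Toy frame with made-up columns; the real data set is much larger.
df = pd.DataFrame({"Gender": ["M", "F", "M"], "Salary": [6.5, 7.2, 5.1]})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```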

To check the non-null counts and the datatype of every variable, we can use info().

The output shows the variable names, the number of records, the non-null counts and the datatype of each variable.
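As a sketch on a toy frame (the column names are assumptions), info() prints the overview, while isnull().sum() gives the number of missing values per column directly:

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing values.
df = pd.DataFrame({
    "CollegeTier": [1, 2, 2, 1],
    "Salary": [6.5, np.nan, 5.1, 7.0],
    "Gender": ["M", "F", None, "F"],
})

df.info()                        # column names, non-null counts, dtypes
null_counts = df.isnull().sum()  # missing values per column
print(null_counts)
```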

As you can see, the variables college tier and city tier have the int64 datatype because they hold the numeric values 1 and 2, but logically they are categorical variables. So we need to explicitly convert them to categorical form and recheck.
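One way to do the conversion, sketched with hypothetical column names, is astype with a per-column mapping:

```python
import pandas as pd

# Toy frame; "CollegeTier" and "CityTier" are assumed column names.
df = pd.DataFrame({
    "CollegeTier": [1, 2, 2, 1],
    "CityTier": [1, 1, 2, 2],
    "Salary": [6.5, 7.2, 5.1, 7.0],
})

# Convert only the tier columns; Salary stays numeric.
df = df.astype({"CollegeTier": "category", "CityTier": "category"})
print(df.dtypes)  # recheck the datatypes
```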

Now, let’s apply the describe() method to this dataset and see the results. It displays the count, mean, standard deviation, quartiles, and minimum and maximum values of each numeric variable.

From the resulting table we find the average salary and scores of an employee.

As our target variable is salary, it is important to know the average salary, which is 6.27 L.

For numeric variables we can use describe(), and for categorical data we can use value_counts().
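A small sketch of both calls on made-up data (the column names and values are assumptions, not taken from the data set):

```python
import pandas as pd

df = pd.DataFrame({
    "Salary": [4.0, 6.0, 8.0, 6.0],
    "Graduation": ["B.Tech/B.E.", "B.Tech/B.E.", "M.Tech/M.E.", "B.Tech/B.E."],
})

# describe() summarises the numeric columns by default.
summary = df.describe()
print(summary)

# value_counts() gives the frequency of each category.
counts = df["Graduation"].value_counts()
print(counts)
```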

We can get a further description of the data with unique(), which gives the unique values present in a variable.

From the output we can infer that we have 2 genders, 2 city tiers, 25 states, 3 graduation types, etc.
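As an illustration on a toy frame with assumed column names, unique() lists the distinct values of one column and nunique() counts them for every column at once:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "CityTier": [1, 2, 2, 1],
    "State": ["Maharashtra", "Karnataka", "Maharashtra", "Delhi"],
})

genders = df["Gender"].unique()  # distinct values in one column
unique_counts = df.nunique()     # number of distinct values per column
print(genders)
print(unique_counts)
```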

Univariate Analysis and Bivariate Analysis:

The data associated with each attribute includes a long list of values (both numeric and not), and having these values as a long series is not particularly useful yet; they don’t provide any standalone insight. In order to convert the raw data into information we can actually use, we need to summarize and then examine the variable’s distribution.

Univariate Analysis

Univariate distribution plots are graphs in which we plot histograms along with the estimated probability density function of the data. This is one of the simplest techniques: we consider a single variable and observe its spread and statistical properties. Univariate analysis differs for numerical and categorical attributes.

For univariate analysis we can use a histogram or box plot for numerical variables and a bar plot for categorical variables. We can also plot the distribution to observe its symmetry and skewness.

Below are a few examples.

Plot: Bar plot for the categorical variable Graduation

Inference: The figure shows that most employees pursued B.Tech/B.E.
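A bar plot like this can be sketched with matplotlib. The data below is made up, and the plotting choices are one option rather than the exact code behind the figure:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for an assumed "Graduation" column.
df = pd.DataFrame({
    "Graduation": ["B.Tech/B.E.", "B.Tech/B.E.", "M.Tech/M.E.", "B.Tech/B.E.", "MCA"],
})

counts = df["Graduation"].value_counts()

fig, ax = plt.subplots()
ax.bar(counts.index.tolist(), counts.to_numpy())  # one bar per category
ax.set_xlabel("Graduation")
ax.set_ylabel("Number of employees")
```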

We can do the same for the variable Branch.

Let’s observe the distribution of the variable salary.

Inference: The distribution has a tail towards the right, so the salary data is right-skewed (positively skewed). We also know the mean of the salary variable, which is 6.27.
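Skewness can also be checked numerically. In this sketch with made-up salaries, a positive skew() value and a mean above the median both point to a right tail:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up salaries with one large value that creates a right tail.
salary = pd.Series([3, 4, 4, 5, 5, 5, 6, 6, 7, 20], name="Salary", dtype=float)

fig, ax = plt.subplots()
ax.hist(salary, bins=8)
ax.axvline(salary.mean(), linestyle="--")  # mark the mean on the histogram
ax.set_xlabel("Salary")
ax.set_ylabel("Frequency")

skewness = salary.skew()  # positive for a right-skewed distribution
print(f"mean={salary.mean():.2f} median={salary.median():.2f} skew={skewness:.2f}")
```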

Next, let’s look at box plots, here for the variables Tenscore and Twscore.

A box plot shows us the median of the data, which is the middle data point. The upper and lower quartiles represent the 75th and 25th percentiles of the data respectively. The upper and lower extremes show us the extreme ends of the distribution of our data. Finally, the plot also shows outliers, which fall outside the upper and lower extremes.
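The quantities a box plot draws can be computed by hand. This sketch uses made-up scores and assumes the common 1.5 × IQR rule for the extremes, which is also matplotlib's default for whiskers:

```python
import pandas as pd

# Made-up scores with one obviously extreme value.
scores = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 300], dtype=float)

q1 = scores.quantile(0.25)   # lower quartile (25th percentile)
median = scores.median()     # middle of the data
q3 = scores.quantile(0.75)   # upper quartile (75th percentile)
iqr = q3 - q1                # interquartile range

# Points beyond these fences are drawn as outliers under the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
print(outliers.tolist())
```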

We can draw these plots for all variables to observe them and gain insight.

Grouping the data

For better analysis, we sometimes need to merge columns or group the data.

In our case, the company conducted 5 soft-skill tests: Score in conscientiousness, Score in agreeableness, Score in extraversion, Score in neuroticism and Score in openess_to_experience.

It makes little sense to analyse these variables separately, so we merge the 5 variables into one variable named score in soft skill.

Similarly, the company conducted 6 technical tests. We merge them to form a new variable named Technical score.
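The article does not say how the columns are combined. As one possible sketch, assuming a row-wise mean and hypothetical column names, the soft-skill scores could be merged like this (the same pattern would apply to the technical tests):

```python
import pandas as pd

# Hypothetical column names for the five soft-skill scores.
soft_cols = [
    "conscientiousness",
    "agreeableness",
    "extraversion",
    "neuroticism",
    "openess_to_experience",
]

df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [2.0, 2.0, 2.0, 2.0, 2.0]],
    columns=soft_cols,
)

# Assumption: combine the five scores by averaging them per employee.
df["SoftSkillScore"] = df[soft_cols].mean(axis=1)
print(df["SoftSkillScore"])
```

A sum or a weighted average would work the same way; which one is appropriate depends on how the scores are scaled.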

We observed that a few employees did not attend the Technical test, which leaves missing values. It makes little sense to impute those missing values. Instead, we can analyse separately those who attended the Technical test and those who did not.
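Splitting the data on the missing values can be sketched as follows, with a hypothetical "TechnicalScore" column where NaN marks an employee who did not take the test:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TechnicalScore": [410.0, np.nan, 520.0, np.nan, 365.0],
    "Salary": [6.0, 4.5, 8.0, 5.0, 5.5],
})

# Keep the two groups separate instead of imputing the missing scores.
attended = df[df["TechnicalScore"].notna()]
not_attended = df[df["TechnicalScore"].isna()]

print(len(attended), "attended;", len(not_attended), "did not")
```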

Bi-variate Analysis

The examination of two variables simultaneously is called bi-variate analysis.

Let’s try bi-variate analysis for the variables salary and graduation.

Inference: From the above plot we can see that employees who pursued B.Tech/B.E. (frequency above roughly 1400 in fig 1) reach the maximum salary, compared with employees who pursued M.Tech/M.E. (frequency above roughly 120).
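The same comparison can be made in numbers with a groupby. This is a sketch on made-up data with assumed column names, reporting the frequency, mean and maximum salary per graduation type:

```python
import pandas as pd

df = pd.DataFrame({
    "Graduation": ["B.Tech/B.E.", "B.Tech/B.E.", "B.Tech/B.E.", "M.Tech/M.E.", "M.Tech/M.E."],
    "Salary": [5.0, 6.0, 10.0, 6.0, 8.0],
})

# One row per graduation type: how many employees, and their salary summary.
by_graduation = df.groupby("Graduation")["Salary"].agg(["count", "mean", "max"])
print(by_graduation)
```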

Let’s try one more bivariate analysis, this time for city tier.

Here we plotted the salary frequency distribution for Tier 1 and Tier 2.

Inference: The plot shows there is not much difference between the salaries of employees from tier 1 and tier 2 cities.

For more detailed information we can use the crosstab() function.
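A sketch of pd.crosstab on made-up data: the salary is first split into two bands (the threshold of 6 and the column names are assumptions), then cross-tabulated against city tier:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CityTier": [1, 1, 2, 2, 2],
    "Salary": [5.0, 8.0, 4.0, 7.0, 9.0],
})

# Hypothetical banding: split salaries at an arbitrary threshold of 6.
df["SalaryBand"] = np.where(df["Salary"] > 6, "above 6", "6 or below")

table = pd.crosstab(df["CityTier"], df["SalaryBand"])                      # counts
shares = pd.crosstab(df["CityTier"], df["SalaryBand"], normalize="index")  # row shares
print(table)
print(shares)
```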

Box plots can also be used in bivariate analysis, with one box per category.
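A minimal sketch of a grouped box plot with matplotlib, drawing one salary box per city tier. The data and column names are made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "CityTier": [1, 1, 1, 2, 2, 2],
    "Salary": [5.0, 6.0, 7.0, 4.0, 6.0, 9.0],
})

# Collect the salary values of each tier as a separate group.
labels, groups = [], []
for tier, group in df.groupby("CityTier"):
    labels.append(f"Tier {tier}")
    groups.append(group["Salary"].to_numpy())

fig, ax = plt.subplots()
result = ax.boxplot(groups)  # one box per group
ax.set_xticklabels(labels)
ax.set_ylabel("Salary")
```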

Correlation

The correlation matrix

The correlation matrix is a table showing the correlation coefficients between sets of variables. A correlation coefficient measures how strong the relationship between two variables is. Each attribute of the dataset is compared with the other attributes to find the correlation coefficient. This analysis shows which pairs have the highest correlation. Highly correlated pairs represent much of the same variance in the dataset, so we can analyse them further to understand which attribute in each pair is most significant for building the model.

We can see the correlation between different variables using the corr() function. Then we can plot a heatmap over this output to visualize the results.

From the above heatmap, the English score and aptitude score are positively correlated with each other (coefficient 0.5), while the soft skill scores are weakly and negatively correlated with the other variables (dark shaded blocks).
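The corr() and heatmap steps can be sketched as follows. The data is made up, and plain matplotlib imshow is used here as an assumption; seaborn's heatmap function is a common alternative:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up scores with hypothetical column names.
df = pd.DataFrame({
    "English": [400, 450, 500, 550, 600],
    "Aptitude": [380, 470, 460, 560, 590],
    "SoftSkill": [0.9, 0.5, 0.6, 0.1, -0.2],
})

corr = df.corr()  # pairwise Pearson correlation coefficients

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(list(corr.columns), rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(list(corr.columns))
fig.colorbar(image, ax=ax)  # colour scale from -1 to 1
```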

A further figure shows how the academic scores relate to the test scores.

We can see that Tenth, Twelfth and graduation scores have little bearing on the Aptitude or Technical scores, which are required for getting employment in most IT companies.

