Univariate, Bivariate, and Multivariate Analysis Techniques for Exploratory Analysis
Exploratory Data Analysis (EDA) is an integral part of working with data: it helps us understand the trends, patterns, and relationships among the variables in a data set. EDA can be carried out using three main techniques: univariate, bivariate, and multivariate analysis. We will examine each of these approaches in this article.
I use the Titanic dataset in this article to illustrate exploratory analysis.
What is Univariate Analysis?
Univariate analysis is the simplest form of data analysis. “Uni” means “one”, so the analysis looks at only one variable at a time. The goal of univariate analysis is to describe and summarize the data and to find patterns within it. This is done by looking at the mean, median, mode, spread, variance, range, standard deviation, etc. Since univariate analysis considers a single variable, it does not deal with causes or relationships between variables.
Univariate analysis is conducted in many ways and can be categorized into the following groups:
1. Graphical
2. Summary Statistics
3. Frequency Distributions
Graphical analysis
Various types of graphs can be used to understand data. Standard graph types include bar plots, histograms, pie charts, and box plots.
Categorical data can be visualized using bar plots and pie charts.
1. Barplots
A bar chart is applicable only to categorical data. It displays a set of categories on one axis and the frequencies or percentages of a variable for those categories on the other axis.
2. Histograms
Histograms display the counts of numerical values falling in different class intervals (bins).
3. PieChart
Univariate categorical data can also be visualised using a pie chart, which represents the data as a circle: the whole circle stands for the total, and each slice represents the share of one category.
Numerical data can be visualized using histograms and box plots.
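Both a bar chart and a pie chart are built from the same numbers: how often each category occurs and what share of the total it makes up. The sketch below computes those numbers and prints a rough text version of a bar chart. The `embarked` list is a made-up sample for illustration, not rows taken from the real Titanic data, and the variable names are my own.

```python
from collections import Counter

# Hypothetical sample of an embarkation-port column (made-up values,
# not real Titanic records).
embarked = ["S", "C", "S", "Q", "S", "C", "S", "S", "Q", "S"]

counts = Counter(embarked)      # frequency of each category
total = sum(counts.values())

# Share of the whole for each category (what a pie slice encodes).
percentages = {port: 100 * n / total for port, n in counts.items()}

# Text rendering of a bar chart: one bar per category, bar length = count.
for port, n in counts.most_common():
    print(f"{port} | {'#' * n} {n} ({percentages[port]:.0f}%)")
```

A plotting library would draw the same counts as bars or slices; the counting step is the part that matters for the analysis.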
Histogram:
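The binning behind a histogram can be sketched in a few lines: each value is assigned to a class interval, and the values in each interval are counted. The ages below are made-up values rather than real Titanic records, and the bin width of 20 is an arbitrary choice for illustration.

```python
# Hypothetical passenger ages (made-up values, not real Titanic records).
ages = [2, 9, 14, 19, 22, 24, 27, 29, 31, 35, 38, 42, 47, 54, 61, 70]

bin_width = 20   # assumed class-interval width, chosen only for illustration

bin_counts = {}
for age in ages:
    lower = (age // bin_width) * bin_width   # left edge of the interval
    bin_counts[lower] = bin_counts.get(lower, 0) + 1

# Text rendering of the histogram: one row per interval [lower, upper).
for lower in sorted(bin_counts):
    upper = lower + bin_width
    n = bin_counts[lower]
    print(f"[{lower:>2}, {upper:>2}) {'#' * n} {n}")
```

Changing `bin_width` changes the shape of the histogram, so it is worth trying a few widths before drawing conclusions from one picture.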
Box Plot
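A box plot, also known as a box-and-whisker plot, displays a five-number summary: the “minimum”, first quartile [Q1], median, third quartile [Q3], and “maximum”. A box is drawn from the first quartile to the third quartile, and a line through the box marks the median. Whiskers extend from each end of the box toward the minimum and the maximum. Many plotting libraries stop the whiskers at the most extreme points within 1.5 times the interquartile range from the box and draw anything beyond them as individual outlier points.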
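The five numbers that a box plot draws can be computed directly. The sketch below uses Python's `statistics.quantiles` with the "inclusive" method; other tools may use slightly different quartile conventions, so the exact Q1 and Q3 can differ between libraries. The `fares` list is a made-up sample, not real Titanic data.

```python
import statistics

# Hypothetical ticket fares (made-up values, not real Titanic records).
fares = [7, 8, 8, 10, 13, 15, 21, 26, 30, 52, 80]

# Quartiles split the sorted data into four parts; "inclusive" treats the
# sample minimum and maximum as the 0th and 100th percentiles.
q1, median, q3 = statistics.quantiles(fares, n=4, method="inclusive")

five_number_summary = {
    "min": min(fares),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "max": max(fares),
}
iqr = q3 - q1   # height of the box

for name, value in five_number_summary.items():
    print(f"{name:>6}: {value}")
print(f"   IQR: {iqr}")
```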
Summary Statistics
The most common way to perform univariate analysis is to use summary statistics to describe a variable. Two types of summary statistics are frequently used: measures of central tendency and measures of spread, such as the range, variance, and standard deviation.
Measures of central tendency: these numbers describe where the center of a dataset is located. Examples include the mean, median, and mode.
Mean: the average value of the data, that is, the sum of the values divided by the number of values.
Median: the data’s 50th percentile, or the middle value of the sorted data, which divides the distribution into two halves.
Mode: the most frequent value of a variable in the data. It is the only measure of central tendency that works with categorical variables.
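As a quick illustration, the three measures can be computed with Python's built-in `statistics` module. The values below are made up for the example and are not taken from the real Titanic data.

```python
import statistics

# Hypothetical samples (made-up values, not real Titanic records).
ages = [22, 38, 26, 35, 35, 54, 2, 27]
embarked = ["S", "C", "S", "S", "Q"]

mean_age = statistics.mean(ages)        # sum of the values / number of values
median_age = statistics.median(ages)    # middle of the sorted values
mode_age = statistics.mode(ages)        # most frequent value

# The mode is also defined for categorical data, where mean and median are not.
mode_port = statistics.mode(embarked)

print("mean:", mean_age)
print("median:", median_age)
print("mode (age):", mode_age)
print("mode (port):", mode_port)
```

Note how the single very small age pulls the mean below the median; comparing the two is a simple first check for skew or outliers.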
Frequency Distributions
A frequency distribution is a tabular summary of data that describes how frequently different values occur in a dataset. Frequency distribution analysis can be used for both categorical (qualitative) and numerical (quantitative) data types.
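A frequency table can be sketched as follows: for each distinct value we record its count, its share of the total, and the running (cumulative) count. The `pclass` list is a made-up sample of passenger classes rather than real Titanic rows, and the column layout is just one possible choice.

```python
from collections import Counter

# Hypothetical passenger-class column (made-up values, not real Titanic records).
pclass = [3, 1, 3, 1, 3, 3, 2, 3, 2, 3]

counts = Counter(pclass)
total = len(pclass)

# Build rows of (value, count, relative frequency, cumulative count).
table = []
cumulative = 0
for value in sorted(counts):
    cumulative += counts[value]
    table.append((value, counts[value], counts[value] / total, cumulative))

print(f"{'class':>5} {'count':>5} {'share':>6} {'cum.':>5}")
for value, count, share, cum in table:
    print(f"{value:>5} {count:>5} {share:>6.0%} {cum:>5}")
```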
What is Bivariate Analysis?
Bivariate analysis simply implies analysing the relationship between two variables. These variables are usually denoted by X and Y. Bivariate studies are used to examine whether there is a statistical relationship between two variables, how strong that relationship is, and whether one variable can be predicted from another.
What are the types of bivariate analysis?
The kind of bivariate analysis depends on the types of the variables being analyzed. Generally, we have two types of data, numerical and categorical, and hence we can perform bivariate analysis on the following combinations:
1. Numerical and numerical: both variables have numerical values.
2. Categorical and categorical: both variables have categorical values.
3. Numerical and categorical: one variable is numerical, and the other is categorical.
Numerical and numerical data can be analysed using a scatter plot.
Scatter plot: dots are used to show the values of two different numerical variables. Each dot’s position on the horizontal and vertical axes represents one data point’s values for the two variables.
The main purpose of a scatter plot is to examine and display the correlation between two numerical variables.
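The pattern seen in a scatter plot can be put into a single number with a correlation coefficient. The sketch below uses the Pearson coefficient, which is my choice for illustration and measures only linear association. The helper `pearson_r` and the `age` and `fare` lists are hypothetical; the values are made up and are not real Titanic records.

```python
import math

# Hypothetical paired observations (made-up values, not real Titanic records).
age = [22, 38, 26, 35, 54, 2, 27, 14]
fare = [7, 71, 8, 53, 52, 21, 11, 30]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(age, fare)
print(f"Pearson r between age and fare: {r:.2f}")
```

A coefficient near 0 does not rule out a relationship; it only says there is little linear relationship, which is one reason to look at the scatter plot as well.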
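Numerical and categorical data can be analysed using a box plot.
Box plot
Box plots can be used in bivariate analysis to examine the relationship between a numerical and a categorical variable: one box is drawn per category, so the distributions of the numerical variable can be compared side by side.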
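What a grouped box plot compares can be sketched by splitting the numerical variable by category and summarizing each group. The `passengers` list of (class, fare) pairs is made up for illustration and is not real Titanic data; the summary here keeps only the minimum, median, and maximum of each group.

```python
import statistics
from collections import defaultdict

# Hypothetical (passenger class, fare) pairs; made-up values, not real records.
passengers = [
    ("first", 52), ("third", 7), ("second", 13), ("first", 80),
    ("third", 8), ("second", 26), ("first", 30), ("third", 8),
    ("second", 21), ("first", 71), ("third", 10),
]

# Group the numerical variable (fare) by the categorical variable (class).
groups = defaultdict(list)
for cls, fare in passengers:
    groups[cls].append(fare)

# One (min, median, max) triple per category, as a box plot would show per box.
summary = {
    cls: (min(fares), statistics.median(fares), max(fares))
    for cls, fares in groups.items()
}

for cls in ("first", "second", "third"):
    low, mid, high = summary[cls]
    print(f"{cls:>6}: min={low}, median={mid}, max={high}")
```

If the medians and ranges differ clearly between categories, as they do in this made-up sample, the categorical variable is related to the numerical one.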
Multivariate Analysis
Multivariate analysis simply implies analysing the relationships among several variables (more than two). We can use pair plots and heat maps to visualize more than two variables.
Pairplots
A pair plot draws a grid of plots, one for each pair of variables in the data, to show the relationships between them. The variables can be continuous or categorical.
Heat map
A heat map is a color-coded graphical representation of values in a grid. It often accompanies a pair plot: each cell shows the correlation coefficient of one pair of variables, which measures the strength of their linear relationship.
With a diverging red-blue color scale, the dark-red and dark-blue cells mark the pairs of variables that are most strongly correlated. Values near 1 indicate a strong positive linear relationship, whereas values near -1 indicate a strong negative linear relationship.
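The grid of numbers behind a correlation heat map can be sketched without any plotting library. The example below computes a Pearson correlation matrix for three columns and prints it as text. The `columns` dictionary holds made-up values, not real Titanic records, and `pearson_r` is a hypothetical helper written for this example.

```python
import math

# Hypothetical numerical columns (made-up values, not real Titanic records).
columns = {
    "age": [22, 38, 26, 35, 54, 2, 27, 14],
    "fare": [7, 71, 8, 53, 52, 21, 11, 30],
    "sibsp": [1, 1, 0, 1, 0, 3, 0, 1],
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


names = list(columns)

# matrix[a][b] is the correlation between column a and column b.
matrix = {
    a: {b: pearson_r(columns[a], columns[b]) for b in names}
    for a in names
}

# Print the grid; a heat map would color each cell by its value.
print(" " * 6 + "".join(f"{name:>7}" for name in names))
for a in names:
    print(f"{a:<6}" + "".join(f"{matrix[a][b]:>7.2f}" for b in names))
```

The diagonal is always 1 because every variable is perfectly correlated with itself, and the matrix is symmetric, so only the cells on one side of the diagonal carry new information.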