Tutorial on Data Science and Machine Learning
Types of Data
Discrete Numerical Data: Integer-based counts. Example: Population of cities
Continuous Numerical Data: Can take any of an infinite number of possible values within a range, i.e., it may contain fractions. Example: Height of a person
Categorical Data: Doesn’t have inherent numeric meaning. Example: Gender, Product category
Ordinal Data: A mixture of numerical and categorical data: categories that have a meaningful order. Example: Movie ratings on a scale of 1-5
Mean, Median, Mode
Mean: The average of all the sample values, i.e. Sum / Number of samples. Example: 2, 5, 3, 7, 4, 8, 4, 1, 9. Mean => (2+5+3+7+4+8+4+1+9)/9 = 43/9 = 4.78
Median: The middle value of the sorted samples. Example: 2, 5, 3, 7, 4, 8, 4, 1, 9. Sorted: 1, 2, 3, 4, 4, 5, 7, 8, 9. Median is 4. If the number of sample values is even, take the average of the two middle values. The median is considered better than the mean when there are outliers in the samples, as the mean would be skewed by them.
Mode: The sample value that occurs most often. Example: 2, 5, 3, 7, 4, 8, 4, 1, 9. Mode: 4, as it appears twice whereas every other sample value appears only once.
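The hand calculations above can be checked with Python's built-in statistics module. This is a minimal sketch; the extra value 100 appended as an outlier is made up purely for illustration and is not part of the original samples.

```python
import statistics

samples = [2, 5, 3, 7, 4, 8, 4, 1, 9]

mean = statistics.mean(samples)      # sum / count
median = statistics.median(samples)  # middle value of the sorted samples
mode = statistics.mode(samples)      # most frequent value

print("Mean:", round(mean, 2))
print("Median:", median)
print("Mode:", mode)

# Hypothetical outlier (not part of the original samples) to show its effect
with_outlier = samples + [100]
print("Mean with outlier:", round(statistics.mean(with_outlier), 2))
print("Median with outlier:", statistics.median(with_outlier))
```

The single outlier moves the mean from about 4.78 to 14.3, while the median only moves from 4 to 4.5.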
Mean, Median and Mode in Python
Program
import numpy as np
from scipy import stats as st
age = [20, 40, 30, 60, 85, 64, 23, 56, 78, 56, 34, 56, 78, 34, 67, 65]
print("Mean:", np.mean(age))
print("Median:", np.median(age))
print("Mode:", st.mode(age))
Output (the exact formatting of ModeResult depends on the SciPy version; recent versions print scalars instead of arrays)
Mean: 52.875
Median: 56.0
Mode: ModeResult(mode=array([56]), count=array([3]))
Histogram
In a set of values, count how often each value (or each range of values) occurs and plot the counts as a bar graph, with each bar representing the frequency of one value or range. Example: Below is the list of ages of the people attending an event:
25, 30, 45, 60, 25, 45, 34, 56, 25, 30, 30, 25, 45, 60, 25
You can make a table of age ranges and the number of values in each range:
Age Range: Frequency
20-29: 5
30-39: 4
40-49: 3
50-59: 1
60-69: 2
If you plot these values on a bar graph, with the age range on the X axis and the frequency on the Y axis, you get a histogram.
Program: Plot a Histogram
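Before plotting, the frequency table for the event ages can be reproduced in a few lines. This is a rough sketch using only the standard library; the bin width of 10 and the text-based bars are illustrative choices rather than anything a plotting library requires.

```python
from collections import Counter

ages = [25, 30, 45, 60, 25, 45, 34, 56, 25, 30, 30, 25, 45, 60, 25]

# Map each age to the lower edge of its 10-year range (e.g. 34 -> 30)
bin_width = 10
counts = Counter((age // bin_width) * bin_width for age in ages)

# Print one text "bar" per range, in ascending order
for start in sorted(counts):
    label = f"{start}-{start + bin_width - 1}"
    print(f"{label}: {'#' * counts[start]} ({counts[start]})")
```

With larger data sets the same counting is done by the plotting library, as in the program that follows.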
import numpy as np
import matplotlib.pyplot as plt
# Generate ages instead of hard coding them, so we get more meaningful values
# 40 is the center (mean) of the values
# 5 is the standard deviation
# 1000 is the number of values
ages = np.random.normal(40, 5, 1000)
# Plot a histogram with 50 bins
plt.hist(ages, 50)
plt.show()
Output: (histogram of the generated ages, roughly bell-shaped around 40)
Variance
According to Wikipedia, variance measures how far a set of (random) numbers is spread out from its average value. In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.
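To make the definition concrete, the sketch below computes the variance step by step for a small made-up sample (the five values are hypothetical, chosen only to keep the arithmetic short). It computes the population variance, which divides by the number of values; the sample variance divides by one less.

```python
import statistics

data = [1, 4, 5, 4, 8]  # hypothetical sample

mean = sum(data) / len(data)                    # average of the values
squared_devs = [(x - mean) ** 2 for x in data]  # squared distance from the mean
variance = sum(squared_devs) / len(data)        # average squared deviation

print("Mean:", mean)
print("Variance:", round(variance, 2))

# The standard library gives the same population variance
print("pvariance:", round(statistics.pvariance(data), 2))
```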
Standard Deviation
According to Wikipedia, in statistics, the standard deviation (SD, also represented by the lower case Greek letter sigma σ or the Latin letter s) is a measure used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. The SD is the square root of the variance.
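As a small illustration of low versus high standard deviation, the sketch below compares two made-up data sets that share the same mean. Both lists are hypothetical, and the population formulas from the standard library are used here as an assumption; NumPy's std() and var() default to the same population form.

```python
import math
import statistics

# Two hypothetical data sets with the same mean (50) but different spread
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

for name, values in (("tight", tight), ("wide", wide)):
    sd = statistics.pstdev(values)      # population standard deviation
    var = statistics.pvariance(values)  # population variance
    # SD is the square root of the variance
    assert math.isclose(sd, math.sqrt(var))
    print(name, "mean:", statistics.mean(values), "SD:", round(sd, 2))
```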
Program: Standard Deviation and Variance in Python
import numpy as np
# Generate ages instead of hard coding them, so we get more meaningful values
# 40 is the center (mean) of the values
# 5 is the standard deviation
# 1000 is the number of values
ages = np.random.normal(40, 5, 1000)
# ages is a NumPy array; std() and var() return its standard deviation and variance
print("Standard Deviation:", ages.std())
print("Variance:", ages.var())
Output (this will vary as we used random values)
Standard Deviation: 5.24306013408
Variance: 27.4896795696
Probability Density Function (PDF)
Referring to Wikipedia: For continuous data, the probability that a random variable takes any one specific value is 0. However, the probability of the random variable falling within a particular range of values can be positive. A Probability Density Function is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking any one value; that probability is the area under the PDF over the range.
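The difference between a density and a probability can be sketched with the standard library's NormalDist class. The standard normal distribution (mean 0, standard deviation 1) and the range -1 to 1 are assumed here purely as an example.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1), chosen as an example
dist = NormalDist(mu=0, sigma=1)

# The PDF value is a density (height of the curve), not a probability
density_at_zero = dist.pdf(0)

# Probability of falling in a range = area under the PDF = difference of CDFs
p_within_one_sd = dist.cdf(1) - dist.cdf(-1)

print("Density at 0:", round(density_at_zero, 4))
print("P(-1 <= X <= 1):", round(p_within_one_sd, 4))
```

About 68% of the values of a normal distribution fall within one standard deviation of the mean, which is what the second number shows.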
Probability Mass Function (PMF)
Referring to Wikipedia: This is used for a set of discrete values. A probability mass function gives the probability that a discrete random variable is exactly equal to some value.
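As a minimal sketch of a probability mass function, the code below uses a binomial distribution (the number of heads in 10 fair coin flips) as an assumed example. The helper binomial_pmf is a hypothetical name written for this illustration, not a library function.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # e.g. 10 fair coin flips (illustrative values)

# One probability per possible outcome 0..n
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print("P(X = 5):", pmf[5])
print("Sum over all outcomes:", sum(pmf))
```

Unlike a density, each PMF value is itself a probability, and the values over all possible outcomes add up to 1.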
Draw a Uniform Distribution Curve
Program
# Draw a uniform distribution
import numpy as np
import matplotlib.pyplot as plt
# Get an array of random values from the uniform distribution:
# start value, end value and number of points
values = np.random.uniform(-10.0, 20.0, 100000)
# Plot a histogram with 50 bars
plt.hist(values, 50)
# Show the histogram
plt.show()
Output: (histogram with bars of roughly equal height between -10 and 20)
Draw a Normal Distribution Curve: using Probability Density Function
Program
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
# Get evenly spaced values from -5 up to (but not including) 5 in steps of 0.1
x = np.arange(-5, 5, 0.1)
# Evaluate the normal probability density function at each x and plot the curve
plt.plot(x, norm.pdf(x))
# Show the curve
plt.show()
Output: (bell-shaped curve centered at 0)
Draw a Binomial Distribution Curve: using Probability Mass Function
Program
# Binomial Distribution
from scipy.stats import binom
import matplotlib.pyplot as plt
import numpy as np
n, p = 10, 0.5
# The PMF is only non-zero at whole numbers of successes, so evaluate it at 0..n
x = np.arange(0, n + 1)
plt.plot(x, binom.pmf(x, n, p), "o-")
plt.show()
Output: (points peaking at 5 successes)
Poisson Probability Mass Function
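Program
# A restaurant gets 200 guests on average per day.
# What is the probability of getting exactly 220 on a day?
from scipy.stats import poisson
import matplotlib.pyplot as plt
import numpy as np
mu = 200
# Guest counts are whole numbers, so evaluate the PMF at integer values only
x = np.arange(140, 270)
plt.plot(x, poisson.pmf(x, mu))
# Print the probability asked about above
print(poisson.pmf(220, mu))
plt.show()
Output: (curve peaking near 200)
Percentile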
A percentile (or a centile) is a measure used in statistics to indicate the value below which a given percentage of the values fall. For example: a student scoring at the 80th percentile in an exam means that 80% of the students scored below that student. The 50th percentile is equivalent to the median, that is, the middle value among all.
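The link between percentiles and the median can be checked on a small list. This is a sketch using the standard library; the exam scores are hypothetical, and the "inclusive" interpolation method is an assumed choice (other methods can give slightly different cut points).

```python
import statistics

# Hypothetical exam scores
scores = [55, 60, 62, 65, 70, 72, 75, 80, 85, 90, 95]

# quantiles(..., n=100) returns the 99 cut points for the 1st..99th percentiles
cuts = statistics.quantiles(scores, n=100, method="inclusive")
p10, p50, p90 = cuts[9], cuts[49], cuts[89]

print("10th percentile:", p10)
print("50th percentile:", p50, "median:", statistics.median(scores))
print("90th percentile:", p90)

# Share of scores strictly below a given score
below_80 = sum(s < 80 for s in scores) / len(scores)
print("Share of scores below 80:", round(100 * below_80, 1), "%")
```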
Percentile in Python
Program
import numpy as np
vals = np.random.normal(50, 4, 10000)
print("50th percentile:", np.percentile(vals, 50))
print("10th percentile:", np.percentile(vals, 10))
print("90th percentile:", np.percentile(vals, 90))
Output: (this may vary as we used random values)
50th percentile: 50.0183934715
10th percentile: 44.8104131231
90th percentile: 55.2289663083
Moments in Statistics
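1st Moment is the same as the Mean
2nd Moment (taken about the mean) is the Variance
3rd Moment (standardized) is the Skew
4th Moment (standardized) is the Kurtosis
Skew and Kurtosis indicate the shape and sharpness of the curve of a histogram.
Skew may be negative (longer tail on the left) or positive (longer tail on the right).
The higher the kurtosis, the sharper the peak (and the heavier the tails) of the curve.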
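The moments listed above can be computed directly from their definitions. The sketch below is one way to do it in plain Python; the helper names (central_moment, skewness, kurtosis) and both data sets are hypothetical and exist only for this illustration.

```python
def central_moment(values, k):
    """k-th moment about the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment scaled by the variance squared."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]      # hypothetical, evenly spread
right_tailed = [1, 1, 1, 2, 10]  # hypothetical, long tail to the right

print("Skew (symmetric):", skewness(symmetric))
print("Skew (right-tailed):", round(skewness(right_tailed), 2))
print("Kurtosis (symmetric):", kurtosis(symmetric))
```

Library functions may use a different convention: scipy.stats.kurtosis, for instance, reports excess kurtosis by default, which subtracts 3 so that a normal distribution scores 0.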