May 17, 2022

Quantitative Research, Descriptive Analysis II

Summary

This article uses the sample dataset "gss.sav" to show how to obtain descriptive statistics in SPSS, with notes explaining the output. Our primary focus is on the gender and income variables in the dataset: we interpret the mean, median, quartiles, range, variance, standard deviation, skewness, and kurtosis, and compare them where we can. SPSS offers a variety of commands for descriptive statistics; here we use the descriptive statistics menu for most of our results.

We open the dataset in SPSS, go to Analyze, Descriptive Statistics, and choose our designated fields (sex and rincdol). We select all the descriptive options we need for the report and then run the command. Figure 1 shows the first-run results, which we detail below. (Full report attached.) The sample is 54.3% female and 45.7% male (Table 1).

Respondent's sex

                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Male                      45.7            45.7                 45.7
        Female                    54.3            54.3                100.0
        Total                    100.0           100.0

Table 1, people’s report.

Mean, Median, Quartiles, Variance, Standard deviation, Skewness, and Kurtosis

The mean is the arithmetic average of our observations. It measures the central tendency of the data and is commonly called the "average." However, the mean is very sensitive to extremely large or small values. The median, by contrast, splits the distribution so that "half of all values are above this value, and half are below." The mean of a data set is found by "adding all numbers in the data set and then dividing by the number of values in the set" (Khan Academy, 2020, para. 1). The median is the middle value of a data set ordered from least to greatest.
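To make the difference concrete, here is a minimal Python sketch of the two measures. The income values are hypothetical and chosen only for illustration; they are not taken from gss.sav.

```python
from statistics import mean, median

# Hypothetical annual incomes in dollars (illustrative only, not from gss.sav).
incomes = [18_000, 22_000, 25_000, 27_000, 30_000]

base_mean = mean(incomes)      # sum of the values divided by their count
base_median = median(incomes)  # middle value of the sorted data

# Add a single extreme value and recompute both measures.
with_outlier = incomes + [250_000]
outlier_mean = mean(with_outlier)
outlier_median = median(with_outlier)

print("mean:  ", base_mean, "->", outlier_mean)
print("median:", base_median, "->", outlier_median)
```

In this sketch one extreme income more than doubles the mean, while the median barely moves, which is why the median is often preferred for income data.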

Figure 1 shows that income is not normally distributed: it is skewed to the left, with a long left tail. The skew is negative, and the mean lies to the left of the peak. As the report shows, the mean, median, and mode all fall on the left side of the histogram. One reason for the left skew is the lower boundary in our dataset combined with the high incomes. Another is a start-up effect in SEX vs. INCOME: there are many high incomes at the start of our data. The gender report, however, suggests a normal distribution (Figure 5).

Figure 1, income report.

 

Quartiles are three cut points: the lower quartile (Q1), the middle quartile (Q2, which is the median), and the upper quartile (Q3). A related value is the interquartile range (IQR), defined as Q3 – Q1. So, "the quartile measures the spread of values above and below the mean by dividing the distribution into four groups" (Quartile, 2021, p. 1).
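The quartiles and the interquartile range can be sketched in Python as follows. The data are hypothetical, and the "inclusive" interpolation method is an assumption made for this example; statistical packages use several quartile conventions, so SPSS output may differ slightly for the same data.

```python
from statistics import quantiles

# Hypothetical annual incomes in dollars (illustrative only, not from gss.sav).
incomes = [12_000, 18_000, 22_000, 25_000, 27_000,
           30_000, 34_000, 41_000, 58_000]

# n=4 splits the data into four groups and returns the three cut points.
# method="inclusive" is one common convention, assumed here for illustration.
q1, q2, q3 = quantiles(incomes, n=4, method="inclusive")

iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)
print("IQR:", iqr)
```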

Variance is "a measure of variability" that captures the "amount of dispersion in a dataset" (Frost, 2022). It is the sum of the squared distances of the data values from the mean, divided by the variance divisor (N for a population, N - 1 for a sample). The standard deviation is the "spread of a group of numbers from the Mean" and is the square root of the variance. Figure 7 shows the density of normal distributions with the same mean but different variability. On the difference between the two: "We don't generally use Variance as an index of spread because it is in squared units. Instead, we use standard deviation." In our experiment, the standard deviation is affected by the strong left skew, so the typical range based on the mean and standard deviation does not describe incomes well. A very small number of people with high incomes raised the mean, even though a large group of people made less than $40,000 per year.

The same small group of people, with incomes significantly higher than those of most respondents, also increased the standard deviation. As a result, the mean and standard deviation give a misleading impression of the population's real income in our experiment.
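The effect of a few very high incomes on the variance and standard deviation can be illustrated with a short Python sketch. The numbers are hypothetical and are not taken from gss.sav.

```python
from statistics import mean, pvariance, stdev, variance

# Hypothetical annual incomes in dollars (illustrative only, not from gss.sav).
incomes = [20_000, 24_000, 28_000, 32_000, 36_000]

pop_var = pvariance(incomes)   # squared deviations divided by N
sample_var = variance(incomes) # squared deviations divided by N - 1
sample_sd = stdev(incomes)     # square root of the sample variance

# One very high income inflates both the mean and the standard deviation.
with_outlier = incomes + [300_000]
outlier_mean = mean(with_outlier)
outlier_sd = stdev(with_outlier)

print("population variance:", pop_var)
print("sample variance:    ", sample_var)
print("sample SD:          ", round(sample_sd, 2))
print("mean with outlier:  ", round(outlier_mean, 2))
print("SD with outlier:    ", round(outlier_sd, 2))
```

Because the variance is in squared dollars, the standard deviation is usually the easier figure to read next to the mean.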

Skewness is the "degree and direction of asymmetry" (NIST, 2020); it measures "symmetry, or more precisely, the lack of symmetry." Kurtosis is a "measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution" (NIST, 2020). Our reports list the skewness and kurtosis along with their standard errors. They show that skewness and kurtosis are both very close to zero for SEX but exceptionally high for INCOME, because the INCOME values in the dataset have heavy tails and outliers to the left.
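As a rough illustration, skewness and excess kurtosis can be computed from the central moments of the data. The sketch below uses the simple moment-based formulas and hypothetical values; statistical packages often apply small-sample corrections, so the figures in an SPSS report may differ somewhat.

```python
from statistics import fmean

def central_moment(values, k):
    """Average of (x - mean)**k over all values."""
    m = fmean(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, sign gives tail direction."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical data sets (illustrative only).
symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 2, 3, 4, 50]
long_left_tail = [-50, 1, 2, 3, 4]

print(skewness(symmetric))        # zero: no asymmetry
print(skewness(long_right_tail))  # positive: tail to the right
print(skewness(long_left_tail))   # negative: tail to the left
print(excess_kurtosis(symmetric))
```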

Box plots and stem-and-leaf plots

We created a full report, including histograms with the normal curve, box plots, and stem-and-leaf plots for each group, and attached it to the project submission. Figure 2 and Figure 3 show the respondent’s SEX and INCOME reports. The charts and data show that the data are not normally distributed and that we have extremely high incomes at points 15, 16, and 18. We also made a Correlations report on SEX and INCOME to see how one variable (INCOME) changes with the other (SEX). We got the report from Analyze, Correlate, Bivariate (Figure 8).

Figure 8. Correlation report #1.
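For readers who want to see what a bivariate correlation computes, here is a minimal Python sketch of the Pearson coefficient. The function name pearson_r, the 1/2 coding of SEX, and the income values are all assumptions made for this example, not values from gss.sav.

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    mx, my = fmean(xs), fmean(ys)
    co_dev = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mx) ** 2 for x in xs))
    dev_y = sqrt(sum((y - my) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

# Hypothetical records: SEX coded 1 = male, 2 = female (coding assumed here).
sex = [1, 1, 1, 2, 2, 2]
income = [42_000, 38_000, 50_000, 30_000, 35_000, 28_000]

r = pearson_r(sex, income)
print(round(r, 3))  # the sign shows which group has the higher incomes
```

With a two-category variable such as SEX, the coefficient mainly reflects the difference between the two group means, so it is best read alongside a comparison of means.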

Our reports analyze the categorical variable SEX against the scale variable INCOME. In SPSS, this kind of analysis is done with the Means procedure. The aim is to compare summary statistics (mean, standard deviation, and variance) for the scale variable across the categories of the categorical variable.

To compare the means and deviations, we made another report from Analyze, Compare Means, Means. We added SEX and INCOME to the report using the dependent list and layers. In another run, we produced a cross-tabulation of SEX and INCOME, which gave us a detailed breakdown of people's income by sex.
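The idea behind comparing means across categories can be sketched in a few lines of Python: group the scale values by category, then summarize each group. The records below are hypothetical and are not taken from gss.sav.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (sex, income) records (illustrative only, not from gss.sav).
records = [
    ("Male", 40_000), ("Male", 44_000), ("Male", 48_000),
    ("Female", 30_000), ("Female", 34_000), ("Female", 38_000),
]

# Collect the incomes that belong to each category.
groups = defaultdict(list)
for sex, income in records:
    groups[sex].append(income)

# Summarize the scale variable within each category.
summary = {
    sex: {"n": len(values), "mean": mean(values), "sd": stdev(values)}
    for sex, values in groups.items()
}

for sex, stats in summary.items():
    print(sex, stats)
```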


Figure 9. Cross-tabulation report on Respondent’s SEX vs. INCOME.

This cross-tabulation of SEX and INCOME (Figure 9) gave an interesting result: it shows a clear relationship between the two variables.
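A cross-tabulation simply counts how many cases fall into each combination of categories. The Python sketch below shows the idea; the income bands, the helper income_band, and the records are assumptions made for this example and do not reproduce the categories in the SPSS report.

```python
from collections import Counter

def income_band(income):
    """Assign an income to a band (band limits are assumed for illustration)."""
    if income < 25_000:
        return "Under $25,000"
    if income < 40_000:
        return "$25,000 to $39,999"
    return "$40,000 and over"

# Hypothetical (sex, income) records (illustrative only, not from gss.sav).
records = [
    ("Male", 22_000), ("Male", 41_000), ("Male", 55_000), ("Male", 38_000),
    ("Female", 19_000), ("Female", 24_000), ("Female", 31_000), ("Female", 45_000),
]

# Count the cases in each (sex, income band) cell of the table.
crosstab = Counter((sex, income_band(income)) for sex, income in records)

for (sex, band), count in sorted(crosstab.items()):
    print(f"{sex:<8}{band:<22}{count}")
```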

 

References

Khan Academy. (2020). Statistics intro: Mean, median, & mode. Khan Academy. Retrieved 2022, from https://www.khanacademy.org/math/cc-sixth-grade-math/cc-6th-data-statistics/mean-and-median/v/statistics-intro-mean-median-and-mode

Frost, J. (2022, March 13). Measures of Variability: Range, Interquartile Range, Variance, and Standard Deviation. Statistics By Jim. Retrieved 2022, from https://statisticsbyjim.com/basics/variability-range-interquartile-variance-standard-deviation/

NIST. (2020). 1.3.5.11. Measures of Skewness and Kurtosis. NIST. Retrieved 2022, from https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

Quartile. (2021, October 7). Investopedia. Retrieved 2022, from https://www.investopedia.com/terms/q/quartile.asp

  

Diagrams, Tables, and Definitions

Figure 2, the first-run results. (Full report attached.)

 

 
