Wednesday, December 27, 2017

Sign Test for Small samples


Why Sign test?

In statistics we usually deal with parametric tests. To conduct a parametric test, certain assumptions have to be satisfied, such as the normality assumption. But sometimes these assumptions are violated. In that case we use the corresponding non-parametric alternative.

In fact, non-parametric tests can be used in place of parametric tests at any time. But because parametric tests are more popular, we usually turn to non-parametric tests only when the assumptions of the parametric tests are violated.

The sign test is the non-parametric alternative to the one sample t test. In the one sample t test, we test for the population mean. In the sign test, we test for the population median.

When to use the Sign test?
When the sample size is large (greater than 30), we can use the one sample t test, because by the central limit theorem the sample mean is approximately normally distributed. To use the one sample t test when the sample size is small, the population should be normally distributed. But what if the population that the sample was taken from is not normally distributed?

In this situation, the assumptions of the one sample t test are violated, so we can use the sign test instead.

Consider the following example. The monthly incomes of 6 randomly chosen students are $1200, $750, $1250, $950, $1050 and $1450 respectively. Is there evidence that the median income of students is more than $950?

The following steps should be followed in the sign test.

Step 1: Identify the null and alternative hypotheses.

H0: population median = $950
H1: population median > $950

Step 2: Calculate the test statistic.

The test statistic C is the number of values greater than the hypothesized median (the + values). To make the counting easier, first arrange the data in either ascending or descending order. In this example the ascending order is $750, $950, $1050, $1200, $1250 and $1450.

The values which are greater than the hypothesized median are + values, and the values which are less than it are – values. The values that are exactly equal to the hypothesized median should be ignored. The sample size n* is the number of observations that remain after excluding those equal to the hypothesized median.

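In this example, one value ($950) is equal to the hypothesized median, so n* = 6 - 1 = 5. Four values ($1050, $1200, $1250 and $1450) are greater than $950, so C = 4.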
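
The counting in this step can be sketched in Python as follows. This is only a minimal illustration using the six incomes from the example; the function name sign_test_counts is made up for this post and is not part of any library.

```python
def sign_test_counts(values, hypothesized_median):
    """Return (n_star, c) for the sign test.

    n_star: number of observations not equal to the hypothesized median
    c:      number of observations greater than it (the + values)
    """
    plus = sum(1 for v in values if v > hypothesized_median)
    minus = sum(1 for v in values if v < hypothesized_median)
    # Observations equal to the hypothesized median are ignored.
    return plus + minus, plus


incomes = [1200, 750, 1250, 950, 1050, 1450]
n_star, c = sign_test_counts(incomes, 950)
print(n_star, c)  # 5 4
```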

The test statistic C has the following distribution under the null hypothesis.

C ~ Bin (n*, 0.5)

Step 3: Calculate the P value.

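P value = P(C >= 4).
So P value = P(C = 4) + P(C = 5). To calculate this, you can use a binomial table or calculate each probability manually.

P(C = 4) = 5C4 x (0.5)^5 = 5/32 = 0.15625
P(C = 5) = 5C5 x (0.5)^5 = 1/32 = 0.03125

So the P value is 0.15625 + 0.03125 = 0.1875.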

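The P value (0.1875) is greater than 0.05, so at the 5% significance level the null hypothesis is not rejected. There is not enough evidence that the median income of students is more than $950.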
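
If you prefer to check the P value with a few lines of code instead of a binomial table, here is a small Python sketch. It assumes the upper-tailed test from this example, where C ~ Bin(n*, 0.5) under the null hypothesis; the function name sign_test_p_value is just a name chosen for this post.

```python
from math import comb


def sign_test_p_value(c, n_star):
    """Upper-tail P value P(C >= c) when C ~ Bin(n_star, 0.5)."""
    # Each outcome has probability 0.5 ** n_star, so count the favourable ones.
    favourable = sum(comb(n_star, k) for k in range(c, n_star + 1))
    return favourable / 2 ** n_star


p_value = sign_test_p_value(4, 5)
print(p_value)          # 0.1875
print(p_value > 0.05)   # True, so the null hypothesis is not rejected
```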

This is how we can conduct the sign test manually. In the next post we will discuss how to do this using Minitab.



Tuesday, December 26, 2017

How to tackle a Type II error homework Question

Type II Error


In hypothesis testing, there are two types of errors that can occur: Type I error and Type II error. A Type I error is erroneously rejecting the null hypothesis, that is, rejecting the null hypothesis when it is true. A Type II error is erroneously failing to reject the null hypothesis, that is, not rejecting the null hypothesis when it is false.

Here is an example of how to tackle a question that deals with Type II error.

(Question) A study is done to see if the average age a ‘child’ moves permanently out of his parents’ home in the United States is at most 23. 43 U.S. adults, all age 40, were surveyed. The sample average age was 24.2 with a standard deviation of 3.7. Which is the Type II error?

  • Conclude that the average age is greater than 23, when it is at most 23.
  • Conclude that the average age is greater than 23, when it is 24.2.
  • Conclude that the average age is at most 24.2, when it is at most 24.2.
  • Conclude that the average age is at most 23, when it is greater than 23.



(Explanation) A Type II error is not rejecting the null hypothesis when it is false. Here the null hypothesis is that the average age a ‘child’ moves permanently out of his parents’ home in the United States is at most 23. So the Type II error would be concluding that the average age is at most 23 (not rejecting the null hypothesis) when it is actually greater than 23 (the null hypothesis is false). So the correct answer is the last option.
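
The same reasoning can be written as a tiny Python sketch. This is only an illustration of the definitions; the function classify_decision and its arguments are names made up for this post.

```python
def classify_decision(reject_h0, h0_is_true):
    """Label the outcome of a hypothesis test decision."""
    if reject_h0 and h0_is_true:
        return "Type I error"    # rejected a true null hypothesis
    if not reject_h0 and not h0_is_true:
        return "Type II error"   # failed to reject a false null hypothesis
    return "correct decision"


# H0: the average age is at most 23.
# Last option: conclude "at most 23" (do not reject H0) when it is greater than 23 (H0 is false).
print(classify_decision(reject_h0=False, h0_is_true=False))  # Type II error
# First option: conclude "greater than 23" (reject H0) when it is at most 23 (H0 is true).
print(classify_decision(reject_h0=True, h0_is_true=True))    # Type I error
```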






Monday, December 18, 2017

Dealing with a simple linear regression question using Minitab

Today I am going to discuss the steps that you need to follow for a standard linear regression question using MINITAB.
This will be a very good guide, as most elementary statistics classes teach linear regression. Using Minitab, you can solve questions very quickly and save time. Here I am using MINITAB 17. I believe the menus are much the same in every version. If you have trouble finding something, please send me a quick message and I will help you. The question is as follows.

Ex: The accompanying table shows the total square footages (in billions) of retailing space at shopping centers, the numbers (in thousands) of shopping centers, and the sales (in billions of dollars) for shopping centers for eight years.







Identify the independent and dependent variables

This is the first step. Most questions specify which variables are independent and which is dependent. If that is not the case, you should identify them yourself. For example, in the above question you can clearly see that the sales will depend on the total square footage and the number of shopping centers.

Fitting the regression line

After identifying the dependent and independent variables, the next step is to fit the regression model. Since you are using computer software (in this case Minitab), this is relatively easy. But you should know the steps.

First import the data into Minitab. You can do this very easily. If you are using websites like "pearson mystatlab", you can easily copy the data to Minitab.

Then go to Stat > Regression > Regression > Fit Regression Model

You will then get a window like the one below.

Under Responses, drag and drop the dependent variable. Under Continuous predictors, put the independent variables, as in the above window.

Since you are fitting a basic model, you don't need to change anything else. (So for the questions in Pearson Mystatlab, you don't need to do anything else.) Then press OK.

Then, in the MINITAB Session window, you will see an output like the one below.


This output contains everything that you need to answer the questions. Let's analyze the output.


  • In the Analysis of Variance section you will see the partition of the sums of squares.
  • The overall F statistic for the regression model is 350.98 and it is significant, as the p value is 0.000 (to three decimal places).
  • The sum of squares of regression, SSR (773075, with 2 degrees of freedom), can be partitioned into two parts, one for each of the two independent variables. The degrees of freedom of the SSR equals the number of independent variables (in this case 2).
  • The sum of squares of error, SSE, is 8810 with 8 degrees of freedom. The formula for the degrees of freedom of SSE is n-p, where n is the number of observations and p is the number of parameters.


        Number of parameters = Number of independent variables + 1

So in this example, it is 3, and the degrees of freedom of SSE is 11 - 3 = 8.

  • The total sum of squares, SST, is 781885.
  • The s in the output refers to the standard error of the regression model, which is equal to the square root of the MSE.
  • The R squared and adjusted R squared values of the model are 98.87% and 98.59% respectively. According to the R squared, about 98.87% of the total variation in y can be explained by the fitted regression model.
  • Using the coefficients table, the significance of each coefficient can be determined. This is equivalent to the partial F test in the analysis of variance table. Here we have to check the p value of each variable. We can see that the p value of the x2 variable is large, so it is not significant to the model.
  • The final model is:
        Sales = -224.0 + 50.5 * Square Footage + 20.14 * Shopping centers
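
If you want to double-check how the numbers in the output fit together, the short Python sketch below recomputes them from the sums of squares quoted above. It is only a rough check: the variable names and the predict_sales function are made up for this post, and because the sums of squares are rounded in the output, the recomputed F statistic can differ slightly from 350.98 in the last decimal places.

```python
from math import sqrt

# Values read from the analysis of variance discussed above.
ssr, df_regression = 773075, 2   # regression sum of squares and its df
sse, df_error = 8810, 8          # error sum of squares and its df (n - p = 11 - 3)
sst = 781885                     # total sum of squares
df_total = df_regression + df_error  # n - 1 = 10

msr = ssr / df_regression        # mean square of regression
mse = sse / df_error             # mean square of error
f_stat = msr / mse               # overall F statistic, close to 350.98
s = sqrt(mse)                    # standard error of the regression
r_sq = ssr / sst                 # R squared
r_sq_adj = 1 - mse / (sst / df_total)  # adjusted R squared

print(round(f_stat, 2))
print(round(s, 2))
print(round(100 * r_sq, 2))      # 98.87
print(round(100 * r_sq_adj, 2))  # 98.59


def predict_sales(square_footage, shopping_centers):
    """Predicted sales (billions of dollars) from the fitted model above."""
    return -224.0 + 50.5 * square_footage + 20.14 * shopping_centers


# The inputs here are hypothetical values, chosen only to show the call.
print(predict_sales(5.0, 40.0))
```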