# Transformations. Non-parametric tests



Transformations. Non-parametric tests

Jon Michael Gran, Department of Biostatistics, UiO

MF9130 Introductory course in statistics, Monday 06.06.2011

1 / 40

Overview

Transformations. Non-parametric methods (Aalen chapter 8.8; Kirkwood and Sterne chapters 13 and 30.2)

Logarithmic transformation. Choice of transformation. Non-parametric methods based on ranks (Wilcoxon signed rank test and Wilcoxon rank sum test)

2 / 40

Transformations

Motivation: Non-normality means that the standard methods cannot be used

This includes linear regression, which is often a main tool of analysis, so non-normality is a major problem

For data that are skewed to the right, one can often use log-transformations

This does not only give normally distributed data; it may also give equal variances in different groups

3 / 40

Logarithmic transformation

So what do we mean by doing a log transformation? Take the logarithm of all your data values xi, and do your analysis on the new dataset of ui's, where ui = ln(xi)

4 / 40

Example: Ln transformation

Skew distribution. Example of observations: 0.40, 0.96, 11.0

Ln transformed distribution: ln(0.4) = −0.92, ln(0.96) = −0.04, ln(11) = 2.40

Do the analysis on the ln transformed data. In SPSS: Transform → Compute

5 / 40
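The SPSS step above (Transform → Compute) can be mirrored in a few lines of code. A minimal sketch using NumPy (an assumption on our part, since the course itself uses SPSS):

```python
import numpy as np

# The three example observations from the slide
x = np.array([0.40, 0.96, 11.0])

# Natural-log transform: u_i = ln(x_i)
u = np.log(x)
print(np.round(u, 2))  # approximately [-0.92, -0.04, 2.40]
```

Any subsequent analysis (t-tests, regression) is then run on `u` instead of `x`.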


Choice of transformations. The logarithmic transformation is by far the most frequently applied

Appropriate for removing positive skewness (data skewed to the right)

There are other types of transformation for data that are more or less strongly skewed, or data skewed to the left

6 / 40

Other transformations

Skewed to the right:

- Lognormal: Logarithmic (u = ln x)
- More skewed than lognormal: Reciprocal (u = 1/x)
- Less skewed than lognormal: Square root (u = √x)

Skewed to the left:

- Moderately skewed: Square (u = x²)
- More skewed: Cubic (u = x³)

Non-linear relationship: Transform only one of the two variables

7 / 40
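To illustrate how the choices on this slide act on right-skewed data, here is a sketch (assuming NumPy and SciPy are available) comparing the sample skewness before and after the square-root and log transforms on simulated lognormal data:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # strongly right-skewed

# The log removes the skew of lognormal data (almost) entirely;
# the weaker square-root transform only reduces it
print(f"raw  skewness: {skew(x):.2f}")
print(f"sqrt skewness: {skew(np.sqrt(x)):.2f}")
print(f"log  skewness: {skew(np.log(x)):.2f}")
```

The same comparison, run on your own data, is a quick way to pick between the candidate transformations.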

Non-parametric statistics

Motivation: So what if transforming our data doesn't help, and it is still not normally distributed?

Use a non-parametric test

8 / 40

Non-parametric tests. In the tests we have done so far, the null hypothesis has always been a stochastic model with a few parameters. Models based on, for example:

- Normal distribution
- T-distribution
- Binomial distribution

In non-parametric tests, the null hypothesis is not a parametric distribution, but rather a much larger class of possible distributions

9 / 40

Parametric methods we have seen

Estimation:

- Confidence interval for μ
- Confidence interval for μ1 − μ2

Testing:

- One sample T-test
- Two sample T-test

The methods are based on the assumption of normally distributed data (or a normally distributed mean)

10 / 40

Typical assumptions for parametric methods

1. Independence: All observations are independent. Achieved by taking random samples of individuals; for the paired t-test, independence is achieved by using the difference between measurements

2. Normally distributed data (check: histograms, tests for normal distribution, Q-Q plots)

3. Equal variances or standard deviations in the groups

11 / 40

How to test for normality

Visual methods, like histograms and Q-Q plots, are very useful

There are also several statistical tests for normality:

- Kolmogorov-Smirnov test
- Shapiro-Wilk test

However, with enough data, visual methods are more useful!

12 / 40
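Outside SPSS, both tests are available in SciPy. A sketch on simulated data (not the lung-function data discussed in this lecture):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(loc=100, scale=15, size=200)
skewed_data = rng.lognormal(size=200)

# Shapiro-Wilk: H0 is that the data come from a normal distribution
for name, data in [("normal sample", normal_data),
                   ("lognormal sample", skewed_data)]:
    w, p = stats.shapiro(data)
    print(f"{name}: W = {w:.3f}, p = {p:.4f}")
```

One caveat worth knowing: `scipy.stats.kstest(data, "norm")` assumes the mean and standard deviation are specified in advance; if they are estimated from the same data, the plain Kolmogorov-Smirnov p-value is not valid without a correction (the Lilliefors variant).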

Tests for normality

Example of SPSS output from a Kolmogorov-Smirnov test:

(Measurements of lung function, separated by gender.) According to this test, if the p-value is less than 0.05 the data cannot be considered normally distributed

Note, though, that these tests are a bit problematic. They're not very effective at discovering departures from normality. The power is low, so even if you don't get a significant result it's a bit of a leap to assume normality

13 / 40


Q-Q plots

Graphical way of comparing two distributions, plotting their quantiles against each other

Q is for quantile. If the distributions are similar, the Q-Q plot should be close to a straight line

[Figure: Q-Q plots for a heavy-tailed population and for a skewed population]

14 / 40
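The straight-line check can also be made numerically: `scipy.stats.probplot` returns the points of the Q-Q plot together with the correlation `r` of the least-squares line through them. A sketch on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=300)

# (theoretical quantiles, ordered sample values), plus the fitted line
(theo_q, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(f"Q-Q correlation r = {r:.3f}")  # close to 1 for normal data
```

Passing `plot=plt` (with matplotlib) draws the actual Q-Q plot shown on the following slides.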

Histogram and Q-Q plot for Gender = 1

15 / 40

Histogram and Q-Q plot for Gender = 2

16 / 40

Non-parametric statistics

The null hypothesis is for example that the median of thedistribution is zero

A test statistic can be formulated so that:

- it has a known distribution under this hypothesis
- it has more extreme values under alternative hypotheses

17 / 40

Non-parametric methods

Estimation:

- Confidence interval for the median

Testing:

- Paired data: Sign test and Wilcoxon signed rank test
- Two independent samples: Mann-Whitney test / Wilcoxon rank-sum test

These make (almost) no assumptions regarding the distribution

18 / 40

Example: Confidence interval for the median

Beta-endorphin concentrations (pmol/l) in 11 individuals who collapsed during a half-marathon (in increasing order): 66.0 71.2 83.0 83.6 101.0 107.6 122.0 143.0 160.0 177.0 414.0

We find that the median is 107.6. What is the 95% confidence interval?

19 / 40

Confidence interval for the median in SPSS

Use Ratio statistics, which is meant for the ratio between two variables. Make a variable that has value 1 for all data (call it "unit")

Analyze → Descriptive statistics → Ratios. Numerator: betae; Denominator: unit

Click Statistics, and under Central tendency check Median and Confidence Intervals

20 / 40

SPSS output

95% confidence interval for the median is (71.2, 177.0)

21 / 40
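The interval SPSS reports can be reproduced from the binomial distribution of the order statistics. A sketch (assuming SciPy) of the conventional rank-based confidence interval for the median:

```python
from scipy.stats import binom

# Beta-endorphin concentrations (pmol/l), sorted
x = [66.0, 71.2, 83.0, 83.6, 101.0, 107.6, 122.0,
     143.0, 160.0, 177.0, 414.0]
n = len(x)

# Find the largest rank r with P(X <= r - 1) <= 0.025 for X ~ Bin(n, 0.5);
# the (at least) 95% CI for the median is then (x_(r), x_(n-r+1))
r = 1
while binom.cdf(r, n, 0.5) <= 0.025:
    r += 1
low, high = x[r - 1], x[n - r]
print(low, high)  # 71.2 177.0
```

For n = 11 this gives r = 2, i.e. the interval runs from the 2nd to the 10th ordered observation, matching the SPSS output (71.2, 177.0).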

Non-parametric tests

Most tests are based on rank sums, and not the observed values

The rank sums are assumed to be approximately normally distributed, so we can use a normal approximation for the tests

If two or more values are equal, the tests use the mean rank

22 / 40
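The mean-rank handling of ties can be seen directly with `scipy.stats.rankdata`, whose default method assigns tied values the average of the ranks they span. A small sketch:

```python
from scipy.stats import rankdata

values = [3.1, 4.5, 4.5, 7.2]
# The two tied values share ranks 2 and 3, so each gets (2 + 3) / 2 = 2.5
ranks = rankdata(values)
print(ranks)  # ranks 1, 2.5, 2.5, 4
```

These ranks (not the raw values) are what the Wilcoxon-type tests below sum up.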

The sign test

Assume the null hypothesis is that the median is zero. Given a sample from the distribution, there should be roughly the same number of positive and negative values

More precisely, the number of positive values should follow a binomial distribution with probability 0.5

When the sample is large, the binomial distribution can be approximated with a normal distribution

The sign test assumes independent observations, but makes no assumptions about the distribution

SPSS: Analyze → Nonparametric tests → Two related samples. Choose Sign under Test type

23 / 40

Example: Energy intake kJ

We want to test whether energy intake is different before and after menstruation.

H0: Median difference = 0
H1: Median difference ≠ 0

All differences are positive

24 / 40

Using the sign test

The number of positive differences should follow a binomial distribution with p = 0.5 under H0: the median difference between premenst and postmenst is 0

What is the probability of observing 11 positive differences? Let X count the number of positive signs. X is Bin(n = 11, p = 0.5). The p-value of the test is the probability of observing 11 positives or something more extreme if H0 is true, hence

P(X ≥ 11) = P(X = 11) = (11! / (11! · 0!)) · 0.5¹¹ · (1 − 0.5)⁰ ≈ 0.0005

However, because the test is two-sided, the p-value is 0.0005 · 2 = 0.001

Clear rejection of H0 at significance level 0.05

25 / 40
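The computation on this slide can be checked with SciPy's exact binomial test. A sketch (`binomtest` is the current name of the function in `scipy.stats`):

```python
from math import comb
from scipy.stats import binomtest

n_pos, n = 11, 11  # all 11 differences were positive

# One-sided tail: P(X >= 11) = C(11, 11) * 0.5^11 * (1 - 0.5)^0
p_one_sided = comb(n, n_pos) * 0.5 ** n
print(f"one-sided p = {p_one_sided:.4f}")  # 0.0005

# Two-sided exact sign test
res = binomtest(n_pos, n, p=0.5, alternative="two-sided")
print(f"two-sided p = {res.pvalue:.4f}")   # 0.0010
```

Since p = 0.5 makes the binomial symmetric, doubling the one-sided tail gives exactly the two-sided p-value, matching the slide's 0.001.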
