Introduction to Statistical Inference
Statistics is a chancy business.
------------------------------------------------
The key word, here, is uncertainty -- not uncertainty in asking the question, but rather the uncertainty in answering it. Uncertainty is why we have statistics.
The point is, we encounter uncertainty every moment of our lives. There's always doubt that we have accurate knowledge of things around us and that we can base actions on what we know.
What's more, something we call randomness seems to go with uncertainty. Not only are we uncertain of our knowledge, but some things appear to be capricious. There are times when random events occur seemingly for a reason, with something rational or causal behind the randomness. But other times we feel there is absolutely no rhyme or reason for happenings, that almost anything else might occur, that in fact the events are purely arbitrary. So we admit to both uncertainty and pure randomness.
This is the stuff of statistics. The fact is, statistics is a collection of symbolic tools dreamed up by concerned citizens to try to deal with the uncertainty. And randomness is the fundamental concept underlying use of the tools. Without the thought of a random event, there would be no statistics. Statistics is created when pure chance is added to rational relationships to try to estimate the state of things, present or future.
-----------------------------------------------
Study the history of human thought and you'll realize there are two widely different views on the ways of the world, on how or why things happen. One extreme says that everything that occurs is completely predictable, or at least that it would be if we had all the data and were privy to all the knowledge of the world. Nothing is left to chance. The fact that we don't know everything is irrelevant. It only says we're ignorant -- not that the world isn't knowable.
The other extreme claims that nothing in the world is knowable, that things occur on a purely chance basis. Not because we are finite creatures, incapable of knowing, but because things are inherently random. You can read all the books you want, run all the experiments you can think of. It doesn't matter. You will never acquire any meaningful data. You won't because the whole business is purely random, absolutely unknowable, and unpredictable. Thinking we can know anything is to live in a fool's paradise.
These days not many people fall in either camp. We tend to cluster in middle ground, if we think about it at all. We believe there are things we can and do know, at least with a certain degree of conviction, and we also think some occurrences are arbitrary and essentially unknowable and unpredictable.
Statisticians, for the most part, are believers both in determinism and chance, at least where knowledge is concerned. They have to be. Indeed, without some element of determinism, chance would rule the world and there would be no basis for testing hypotheses about it. We could have no confidence, to any degree, in any knowledge we might claim to have. Statistics would then cease to have any utility.
---------------------------------------------
To understand statistics you have to understand the difference between a population and a sample of the population. It's a matter of differentiating between the whole and a part, or between all of something and only some of it. It's about real world phenomena and how we come to learn about the phenomena. And it's your choice what to study. It's also about self-referencing.
For instance, you might wish to know the average height of the people in your town or village. To find out, you could go from house to house and measure the height of each and every occupant. Imagine doing that for a whole town! By the time you finished, assuming you could catch everyone at home and they were all willing and able to accommodate you, many of the heights would have changed, especially those for the young children, who grow like weeds. It wouldn't be feasible, not to mention it would cost a fortune. To make the plan work, you have to resort to other stratagems -- slick tricks of the trade. One trick is to use samples and then infer from the sample data what the average is for the whole population.
A convenient sample of the population of heights in the town might be the heights of all of the people on your block. There might be thirty people, or there might be fifty, or more. In other words, samples can come in different sizes. But no matter how big or small, they are still only part of the whole population. What you want is a sample that reflects the true state of the population.
Using the sample data from your block you can now calculate the average of the sample heights and then use that sample average to "guesstimate" the average height of the population.
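Here is a rough sketch of that guesstimate in Python. Everything in it is an assumption made for illustration: the town of 5,000 people, the made-up heights (drawn around 66 inches), and the sample of 40. It also draws the sample at random from the whole town, which is kinder than real life, since the people on one block may not reflect the whole population.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch gives the same numbers each run

# Hypothetical town: 5,000 made-up heights in inches (not real data).
population = [random.gauss(66, 4) for _ in range(5000)]
population_mean = statistics.mean(population)

# A random sample of 40 people stands in for "the people on your block".
sample = random.sample(population, 40)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f} in")
print(f"sample mean:     {sample_mean:.2f} in")
print(f"difference:      {abs(sample_mean - population_mean):.2f} in")
```

Change the sample size from 40 to 400 and the sample average will usually land closer to the population average. It remains an estimate, not the true value.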
------------------------------------------------
Once you admit chance into your scheme of things, you can include probability, because it provides a measure for the chance occurrences. Not all events happen with the same degree of chance -- some things are more likely to occur than others. Getting a "heads" in the toss of a coin has one probability, drawing the ace of diamonds from a deck of cards has another. That's the underlying idea. And that's why card games are so much fun. Poker, especially. Different situations involve different probabilities and call for different betting strategies. Probability theory is the aggregate of the statements you can make regarding chance, and it tells you just how likely something is to occur, or not. It helps decide when and how much to bet.
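To put numbers on the coin and the card, here is a small sketch. It assumes a fair coin and a standard, well-shuffled 52-card deck, so that every outcome is equally likely. The helper name `probability` is made up for this example.

```python
from fractions import Fraction

def probability(favourable, total):
    """Chance of an event when all `total` outcomes are equally likely."""
    return Fraction(favourable, total)

# Assumption: a fair coin has 2 equally likely faces, 1 of them heads.
p_heads = probability(1, 2)

# Assumption: a standard deck has 52 cards, 1 of them the ace of diamonds.
p_ace_of_diamonds = probability(1, 52)

print("heads:          ", p_heads, "=", float(p_heads))
print("ace of diamonds:", p_ace_of_diamonds, "=", float(p_ace_of_diamonds))
```

Heads is 26 times as likely as the ace of diamonds, which is the kind of difference a betting strategy has to respect.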
Probability and statistics are therefore intimately related. In statistics you draw samples from populations whose members are distributed in certain ways. For instance, in your town there may be more Fords than Pontiacs and more Pontiacs than Cads. The cars would therefore appear in different proportions of the total number of cars in the town. Thus they would have different frequencies, or different probabilities of occurring. Other populations would likely have different probability distributions over their membership.
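The car example can be sketched in a few lines. The counts below are invented for illustration (500 Fords, 300 Pontiacs, 200 Cadillacs). The point is only that each make's share of the total is its probability of turning up when you pick a car at random.

```python
from collections import Counter

# Hypothetical census of cars in the town (made-up counts).
cars = ["Ford"] * 500 + ["Pontiac"] * 300 + ["Cadillac"] * 200

counts = Counter(cars)
total = sum(counts.values())

# Relative frequency of each make = count / total number of cars.
probabilities = {make: n / total for make, n in counts.items()}

for make, p in probabilities.items():
    print(f"{make:<9} {counts[make]:>4} cars  probability {p:.2f}")
```

A different town, with different counts, would give a different distribution over the same three makes.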
Sometimes you know what the distribution is, sometimes you don't. Or you know only certain aspects of it. In the population of heights of people in your town, for example, the heights might range from small amounts to fairly large amounts -- say from about 20 inches, for babies, to about 80 inches for the tallest adults.
Generally there won't be many very short individuals, babies or not, and usually there won't be many very tall persons, basketball players or not. Most of us fall in between, perhaps in the range 60 to 72 inches. Another way of saying this is that a small percentage of people will be short, another small percentage will be tall, and the largest percentage will be average. The frequencies will therefore have a certain distribution over the population.
It's the ratios or the percentages that determine the probabilities. A small percentage gives a small probability. A large percentage leads to a high probability. Since we're always dealing with percentages, there has to be a lower limit to the value of the probability, and there also has to be an upper limit. The lower limit is clearly zero. You can't have less than zero percent of something. But you also can't have more than one hundred percent of something. The hundred percent refers to the whole of something -- to its unity. So it corresponds to a probability of one.
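The step from percentages to probabilities, with its limits of zero and one, might be sketched like this. The function name and the three shares (10, 80 and 10 percent) are assumptions made up for the example, not figures from any real town.

```python
def percentage_to_probability(percent):
    """Convert a share of a whole (0 to 100 percent) to a probability (0 to 1)."""
    if not 0 <= percent <= 100:
        raise ValueError("a share of a whole must lie between 0 and 100 percent")
    return percent / 100

# Hypothetical shares of the town's population by height group.
shares = {"short": 10, "average": 80, "tall": 10}

probabilities = {group: percentage_to_probability(p) for group, p in shares.items()}

for group, p in probabilities.items():
    print(f"{group:<8} probability {p:.2f}")

# The shares cover the whole population, so the probabilities add up to one.
print("total:", round(sum(probabilities.values()), 10))
```

Asking for the probability of a 120 percent share raises an error, which is the code's way of saying you can't have more than the whole.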
------------------------------------------------
It's in the nature of statistics, and of statisticians, to draw inferences about the world when chance is involved -- to gain knowledge about matters of interest, at least to a specified level of confidence.
Inferences are usually drawn by first stating a theory about some aspect of the world, then basing conclusions on statistical analysis. To say you have a theory about how some things come to pass is to claim there's a causal basis for the happenings. The things occur, not by chance, but rather by some physical or mechanical determination. The man fell because he lost his balance and gravity pulled him down. Or the water came out of the tap because of water pressure. Or the car rolled up the hill because four big guys pushed it. These things didn't happen by chance -- they were caused to happen.
Unfortunately, you can't always get clean data about the possible happenings. Randomness gets mixed up with causality. So you have to separate the random from the causal. When testing your theories you separate the purely chance aspects from the deterministic aspect, if you can.
In hypothesis testing, the idea is to construct a statistical statement contradicting your theory. You do this by denying the theory. You assume the theory is false. Your theory might be that more cars pass your house on weekdays than on the weekend -- implying there's a reason for the difference and it's not just chance. In this case you assume there is no causal basis and state in your statistical hypothesis -- the so-called null hypothesis -- that, on average, the number of cars passing your house on weekdays is the same as the number passing on weekends. The hypothesis states, in effect, that there is no reason for the difference in the occurrence, that it's only a chance thing.
So to support your theory you have to show that the null hypothesis is unlikely to be true. In statistics you can only do this up to a certain level of confidence, which you specify before running the experiment. You then check whether your observations -- your samples -- are consistent with the null hypothesis at that level. Should your sample yield data that falls outside the constructed confidence interval (meaning that such data would have a low probability of occurring if the null hypothesis were true), you can argue that a sample like yours is too unlikely to be the product of chance alone. You therefore reject the null hypothesis, which lends support to your theory. You could be wrong, with the small probability that the result really was due to chance, but you will more likely be right. That's the confidence factor.
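One way to try the weekday and weekend example on a computer is a permutation test. That is a different technique from the confidence interval described above, swapped in here because it needs nothing beyond shuffling. The idea: if the null hypothesis is true, the labels "weekday" and "weekend" mean nothing, so shuffling them should often produce a difference as big as the one observed. The daily counts, the function name and the 5 percent cutoff are all assumptions made up for this sketch.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate how often chance alone gives a difference in means at least
    as large as the observed one (one-sided: group_a larger than group_b)."""
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the day labels are meaningless
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Hypothetical counts of cars passing the house per day (made-up data).
weekdays = [52, 48, 55, 50, 53, 49, 51, 54, 47, 56]
weekends = [41, 38, 44, 40]

alpha = 0.05  # assumed cutoff, i.e. a 95 percent level of confidence
p_value = permutation_p_value(weekdays, weekends)

print(f"estimated p-value: {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the difference is unlikely to be chance alone.")
else:
    print("Cannot reject the null hypothesis: chance could explain the difference.")
```

A small p-value says that shuffled labels rarely reproduce the observed gap. It does not say why more cars pass on weekdays, only that chance alone is a poor explanation.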
---------------------------------------------