The main goal of statistical inference is learning from data. However, data we want to learn from are not always available/easy to handle. Imagine we want to know the average income of American women: it might be unfeasible or highly expensive to collect all American women’s income data. Besides, even in a scenario when this data collection was feasible, there might be imperfect compliance among interviewed women, since some of them might be not willing to share their data.

So the idea of statistical inference is to draw a sample from the population of our interest (in the above example, American women) and then, observing this sample, estimate population features (which are unknown parameters) using sample data. The resulting estimates of the true, unseen parameters are called ‘statistics’ and, in order to be reliable, they need to have some robustness’ characteristics:

- they have to be unbiased: the bias is defined as the difference between the expected value of the statistics and the true parameter. It might be sufficient to reach just the asymptotic unbiasedness, that means, having bias = 0 as n (sample size) tends towards infinite.
- they have to have the lowest variance possible. Indeed, between two unbiased estimators, the most efficient will be that with lowest variance. Thanks to the Cramer-Rao theorem, we can set a lower bound for the variance of any estimator so that, if our estimator is unbiased and has the variance equal to that ratio, we can conclude that this is the best one.

In this article, I’m going to provide an intuitive introduction to the Maximum Likelihood Estimation of a population parameter, showing that the final estimate is both (asymptotically) unbiased and the most efficient one (with lowest variance).

Let’s consider an iid (independent and identically distributed) sample **X **= (X1, X2, …, Xn) drawn from a population characterized by a vector of parameters **θ **and probability density function f(**X**; **θ**). Note that each component of our sample **X **is itself a random variable. Then, once the sample is realized, we are able to observe a set of values of that sample, which will be x1, x2,…,xn.

Now, the idea of the MLE is that the best estimate of **θ **will be that value of **θ** such that the probability of observing that specific realization of our sample is maximized. In other words, we want a value for **θ **such that the joint probability of **X **given **θ **is maximized. So, calling L(**θ**) the maximum likelihood function, we have that:

Since our sample is iid, we can work out the joint probability simply as the product of f(xi, θ) for all i=1,…,n:

To get the very intuition of this procedure, let’s consider an example with a Bernoulli trial. Recall that a random variable X with a Bernoulli distribution with parameter p takes value=1 with probability p, while value=0 with probability (1-p). Hence, the probability of success (indicated with X=1) is equal to:

Which takes value p if X=1, (1-p) if X=0.

Imagine we have our random variable X ~ Bernoulli(p), with p=0.3. We then compute the probability of having 1 success out of 10 trials and, using the Binomial distribution, we obtain:

Now imagine another scenario, where we do not know the value of the parameter p, but we *observe *a sample realization of 5 successes out of 10 trials. Now the question is: given that 5 out of 10 trials were successful, which is the value of p which makes this situation most likely? The question is the result of the maximum likely estimation of p, given by the joint distribution of our single Bernoulli trial:

Let’s visualize the latter function on R, considering 10 trials and 5 observed successes:

```
L = function(p, n, s){
return((p^s)*(1-p)^(n-s))
}
p = seq(0, 1, 0.01)
plot(p, L(p, n=10, s=8), ylab = 'MLE function', type='l')
```

Now let’s retrieve the value of p corresponding to the maximum of L(p):

```
abline(v=p[which.max(L(p, n=10, s=5))], col="red")
```

The value of p which makes 5 successes out of 10 trials more likely is 0.5, and it makes intuitive sense. What if, instead of 5, we had observed 7 successes?

```
p = seq(0, 1, 0.01)
plot(p, L(p, n=10, s=7), ylab = 'MLE function', type='l')
abline(v=p[which.max(L(p, n=10, s=7))], col="red")
```

Now let’s see the different shapes of our L(p) function, keeping n fixed and changing the number of observed successes:

```
plot(p, L(p, n=10, s=2), ylab = 'MLE function', type='l', col="red")
lines(p, L(p, n=10, s=4), ylab = 'MLE function', type='l', col="blue")
lines(p, L(p, n=10, s=6), ylab = 'MLE function', type='l', col="green")
lines(p, L(p, n=10, s=8), ylab = 'MLE function', type='l', col="yellow")
legend(0, 0.006, legend=c("s=2", "s=4", "s=6", "s=8"),col=c("red", "blue", "green", "yellow"), lty=1, cex=0.8)
```

As you can see, the values of the parameter p have been worked out ‘starting from the end’, that means, observing data and making guesses about the value of p which best describes the actual situation.

Finally, as anticipated, the MLE estimate of a given parameter θ, θ _mle, is asymptotically unbiased. Indeed, as n tends towards infinite, θ_mle converges in distribution to a Normal with mean θ and variance 1/In( θ ). The denominator of the variance has a powerful meaning, as it being the Fisher information, which is nothing but the denominator of the Cramér-Rao ratio. Hence, for an unbiased (at least asymptotically) estimator, the value 1/In( θ ) is equal to the Cramér-Rao ratio, hence it is the lowest possible variance.

So, not only our MLE estimator is asymptotically unbiased, but also it is the most efficient among all the unbiased estimator, since it has the lowest possible variance.

Hey Valentina!

I’m the editor at Data Driven Investor. I came across your blog because you published a few pieces with us on Medium. I would love to give you a feature on our website and promote your content to our 25,000 followers. Drop me an email and I will help you with onboarding! Looking forward to hearing from you.

LikeLiked by 1 person