# 5.4 Estimator, Bias and Variance¶

## Point estimator¶

Denote a point estimate of a parameter $$\theta$$ by $$\hat{\theta}$$. Let $$\{x^{1}, \dots, x^{m}\}$$ be a set of m independent and identically distributed (i.i.d.) data points. A point estimator or statistic is any function of the data:

$\hat{\theta}_m = g(x^{1}, x^{2}, \dots, x^{m})$

The frequentist perspective on statistics assumes the true parameter value $$\theta$$ is fixed but unknown, while the point estimate is a function of the data. Since the data are drawn from a random process, any function of the data is itself random. Therefore $$\hat{\theta}$$ is a random variable.
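A minimal NumPy sketch of this idea (the Gaussian data-generating process and the sample-mean estimator are illustrative assumptions): each independent draw of the dataset yields a different value of the estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0  # hypothetical true parameter: the mean of the data distribution

# A point estimator is just a function of the data; here, the sample mean.
def g(x):
    return x.mean()

# Each independent draw of the dataset gives a different estimate,
# which is why theta_hat is itself a random variable.
for _ in range(3):
    x = rng.normal(loc=theta, scale=1.0, size=100)
    print(g(x))
```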

### Bias¶

$bias(\hat{\theta}_m) = \mathbb{E}(\hat{\theta}_m) - \theta$

The expectation is over the data. $$\theta$$ is the true underlying value used to define the data-generating distribution. The estimator $$\hat{\theta}_m$$ is unbiased when $$bias(\hat{\theta}_m) = 0$$ and asymptotically unbiased when $$\lim_{m\rightarrow \infty} bias(\hat{\theta}_m) = 0$$.

See the examples on pp. 121–124 for estimating the mean, square, …
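As a sketch of one of those examples (assuming N(0, 1) data, so the true variance is 1), a Monte Carlo estimate of $$\mathbb{E}(\hat{\theta}_m)$$ shows that the variance estimator dividing by m is biased while the one dividing by m − 1 is not:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 10, 100_000
sigma2 = 1.0  # true variance of the N(0, 1) data-generating distribution

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(size=m)
    ss = ((x - x.mean()) ** 2).sum()
    biased.append(ss / m)          # divides by m: biased estimator
    unbiased.append(ss / (m - 1))  # divides by m - 1: unbiased estimator

# E[theta_hat] is approximated by averaging over many resampled datasets.
print("bias of 1/m estimator:     ", np.mean(biased) - sigma2)    # close to -sigma2/m
print("bias of 1/(m-1) estimator: ", np.mean(unbiased) - sigma2)  # close to 0
```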

### Variance¶

Variance of the estimator:

$var(\hat{\theta})$

Standard error:

$\sqrt{var(\hat{\theta})}$

The variance provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data-generating process.

The standard error of the mean, $$SE(\hat{\mu}_m)=\sqrt{Var\left(\frac{1}{m} \sum_{i} x^{i}\right)} = \frac{\sigma}{\sqrt{m}}$$, is very useful because we often estimate the generalization error by computing the sample mean of the error on the test set. See the Bernoulli distribution example on p. 125.
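A sketch of that use case (the 0/1 test losses are simulated here as Bernoulli draws with a hypothetical error rate p = 0.2): compute the sample mean of the per-example errors and report a 95% confidence interval of ±1.96 standard errors under a normal approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test set: per-example 0/1 losses, Bernoulli with p = 0.2.
p, m = 0.2, 400
errors = rng.binomial(1, p, size=m)

mu_hat = errors.mean()                # sample mean of the test error
se = errors.std(ddof=1) / np.sqrt(m)  # standard error of the mean

# 95% confidence interval for the generalization error (normal approximation).
print(f"test error = {mu_hat:.3f} +/- {1.96 * se:.3f}")
```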

### Trading off Bias and Variance to Minimize Mean Squared Error¶

• Bias measures the expected deviation of the estimator from the true value of the function or parameter.
• Variance provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

How do we compare a model with large bias to a model with large variance? The most common way is to use cross-validation. Alternatively, we can compare their mean squared error (MSE):

$MSE = \mathbb{E}[(\hat{\theta}_m - \theta)^2] = Bias(\hat{\theta}_m)^2 + Var(\hat{\theta}_m)$

To prove it, expand the right-hand side and work from right to left.
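A sketch of that proof, using $$Var(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m^2] - \mathbb{E}[\hat{\theta}_m]^2$$ and the fact that $$\theta$$ is a fixed constant:

$Bias(\hat{\theta}_m)^2 + Var(\hat{\theta}_m) = (\mathbb{E}[\hat{\theta}_m] - \theta)^2 + \mathbb{E}[\hat{\theta}_m^2] - \mathbb{E}[\hat{\theta}_m]^2 = \mathbb{E}[\hat{\theta}_m^2] - 2\theta\,\mathbb{E}[\hat{\theta}_m] + \theta^2 = \mathbb{E}[(\hat{\theta}_m - \theta)^2] = MSE$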

When generalization error (defined in 5.2, p. 107, as the expected value of the error on a new input) is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias.
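A minimal Monte Carlo sketch of that trade-off (the sinusoidal ground truth, the noise level, and the polynomial degrees standing in for capacity are all illustrative assumptions): fit polynomials of increasing degree to independently resampled datasets and measure the squared bias and variance of the prediction at a fixed test point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)  # hypothetical ground-truth function

x_test, m, trials = 0.25, 30, 2000

for degree in [1, 3, 9]:  # increasing model capacity
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 1, size=m)
        y = true_f(x) + rng.normal(scale=0.3, size=m)
        coef = np.polyfit(x, y, degree)         # least-squares polynomial fit
        preds.append(np.polyval(coef, x_test))  # prediction at the test point
    preds = np.array(preds)
    bias2 = (preds.mean() - true_f(x_test)) ** 2  # squared bias at x_test
    var = preds.var()                             # variance across resampled datasets
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

Low-degree fits cannot represent the sinusoid (high bias, low variance), while high-degree fits track the noise in each particular sample (low bias, high variance).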