@maple2015

One more time: what do you mean by "the best statistical model"?

Mac Dude gave you some hints here but, in essence, there isn't much more to say ...

Let's keep things simple: if you have N observations (17 in your case), then there exists an infinity of models with 17 parameters that fit the data perfectly. In some sense they are all "best statistical models".
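To see this concretely, here is a quick sketch in Python (the thread uses Maple, but the idea is language-independent; the sample data here are made up): the unique degree-16 polynomial through 17 points is one such "perfect" model, and it shows why zero representation error is not a virtue in itself.

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

rng = np.random.default_rng(0)
N = 17
x = np.linspace(-1.0, 1.0, N)
z = np.sin(np.pi * x) + rng.normal(scale=0.1, size=N)  # noisy observations

# The unique degree-(N-1) interpolating polynomial: N parameters, N observations.
model = BarycentricInterpolator(x, z)

# Representation error is zero at every sampling point ...
print(np.max(np.abs(model(x) - z)))  # essentially 0

# ... but between the points the model typically oscillates wildly
# (Runge-type behavior), i.e. its generalization error can be huge.
xs = np.linspace(-1.0, 1.0, 1001)
print(np.max(np.abs(model(xs))))  # typically far larger than max |z|
```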

One says that the "representation error" of all these models is 0.

Suppose you take one of them, M : (X, Y) --> Z.

For all n in {1..N} you have M(X[n], Y[n]) = Z[n] ... but what is the prediction error of M at some other point (X', Y') not in {(X[n], Y[n]), n=1..N}?

This last error is also termed the "generalization error" and describes the capability of the model M to predict correctly outside of the sampling points (at least in some neighborhood of their convex hull).

The most common practice is to accept as "a best model" a model M that realizes some balance between the representation error (RE) and the generalization error (GE). Models with a high number of parameters generally give small values of RE and high values of GE: they are called over-fitted (or over-learned) models and are systematically discarded.

One generally prefers good prediction capabilities (low GE) over a good representation of the data (which means nothing when the data are uncertain values, for instance measurement values).

The **simplest** way (I do not say it is the **best**) to assess GE is to use the Leave-One-Out strategy.

- Let S be your original sample and S(-n) this sample with the n-th observation (X[n], Y[n], Z[n]) dropped out.
- Build the model M(-n) over S(-n) (just use Statistics:-Fit as you did).
- Compute M(-n)(X[n], Y[n]): a quick estimation of GE is GE(-n) = M(-n)(X[n], Y[n]) - Z[n].

Repeating this sequence N times will give the point-wise estimators GE(-1), ..., GE(-N).

The variance of these N estimators is an estimator of the generalization (prediction) error of the model M fitted over the whole sample S.
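A minimal Python sketch of this Leave-One-Out loop (the model here is a hypothetical linear one, p0 + p1*X + p2*Y, fitted by plain least squares standing in for Maple's Statistics:-Fit; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 17
X = rng.uniform(0.0, 1.0, N)
Y = rng.uniform(0.0, 1.0, N)
Z = 2.0 + 3.0 * X - 1.5 * Y + rng.normal(scale=0.2, size=N)

def fit(X, Y, Z):
    """Least-squares fit of the (hypothetical) model Z ~ p0 + p1*X + p2*Y."""
    A = np.column_stack([np.ones_like(X), X, Y])
    p, *_ = np.linalg.lstsq(A, Z, rcond=None)
    return p

ge = np.empty(N)
for n in range(N):
    keep = np.arange(N) != n                            # S(-n): drop observation n
    p = fit(X[keep], Y[keep], Z[keep])                  # M(-n), fitted over S(-n)
    ge[n] = p[0] + p[1] * X[n] + p[2] * Y[n] - Z[n]     # GE(-n)

# The spread of the point-wise errors estimates the prediction error of M.
print(np.var(ge, ddof=1))
```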

More sophisticated approaches are Leave-K-Out, the Bootstrap (look at the Statistics package for more details: let me know if you are interested in bootstrapping a statistical model), Cross-Validation ... (the meaning of all these strategies can easily be found on Wikipedia).

Last but not least ...

Let's take a simpler model with only one regressor X and a single dependent variable Z.

When the experimental observation of Z is subject to some measurement/observation error U, the statistical approach begins by writing some a priori model Z = M[P](X) + U, where P is the set of parameters of M

(it's a little bit more subtle than that, because you should write that the conditional expectation of Z given X = x is equal to M[P](x) and the conditional variance of Z given X = x is equal to the variance of U at X = x).

Fitting the parameters P by means of the least-squares method, for instance, means that you are looking for some specific value p* of P with a particular property (here, minimizing the sum of the squared residuals).

Then M[p*] is the model that minimizes the representation error RE over the class M[P] of models.

But this doesn't make M[p*] the best model in many other senses.

Consider this example:

- The sample S is made of the 2 points (X[1], Z[1]) and (X[2], Z[2])
- M[P] : X --> a*X + b
- Obviously the best "least squares" model satisfies a* = (Z[2]-Z[1])/(X[2]-X[1]) and b* = (X[2]*Z[1]-X[1]*Z[2])/(X[2]-X[1])
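You can check these closed forms numerically; here is a small Python sketch (the two sample points are made up, and note that b* is just Z[1] - a*X[1]):

```python
import numpy as np

X = np.array([1.0, 3.0])
Z = np.array([2.0, 8.0])

# Closed form for the least-squares line through two points
a_star = (Z[1] - Z[0]) / (X[1] - X[0])
b_star = (X[1] * Z[0] - X[0] * Z[1]) / (X[1] - X[0])
print(a_star, b_star)  # 3.0 -1.0

# Check against a generic least-squares fit
a_fit, b_fit = np.polyfit(X, Z, deg=1)
print(np.isclose(a_fit, a_star), np.isclose(b_fit, b_star))  # True True
```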

For many people this model is the best, because the couple (a*, b*) maximizes some likelihood function over S.

But this is true only when the likelihood function is a (decreasing) function of (M[a, b](X[1])-Z[1])^2 + (M[a, b](X[2])-Z[2])^2; that is, when U has a stationary Gaussian distribution.

Consider the counterexample where U is uniform on the interval [-h, +h].

Then all the couples (a, b) such that the graph (a straight line) of M[a, b](x) passes through the two "boxes" (x = X[n], z in [Z[n]-h, Z[n]+h]) have exactly the same value of the likelihood function.

For instance, from the "maximization of the likelihood" point of view, the models

a = ((Z[2]+h)-(Z[1]-h))/(X[2]-X[1]) and b = (X[2]*(Z[1]-h)-X[1]*(Z[2]+h))/(X[2]-X[1])

a = ((Z[2]-h)-(Z[1]-h))/(X[2]-X[1]) and b = (X[2]*(Z[1]-h)-X[1]*(Z[2]-h))/(X[2]-X[1])

a = ((Z[2]+h)-(Z[1]+h))/(X[2]-X[1]) and b = (X[2]*(Z[1]+h)-X[1]*(Z[2]+h))/(X[2]-X[1])

....

are just as good as the model (a*, b*) is.
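Here is a small Python sketch of that uniform-noise situation (the numbers are made up): any line crossing both boxes attains the same, maximal, likelihood, while a line missing a box has likelihood zero.

```python
import numpy as np

X = np.array([1.0, 3.0])
Z = np.array([2.0, 8.0])
h = 0.5

def log_likelihood(a, b):
    """Log-likelihood of the line z = a*x + b when Z[n] = a*X[n] + b + U, U ~ Uniform(-h, h)."""
    r = Z - (a * X + b)
    # Each residual has density 1/(2h) inside [-h, h] and 0 outside.
    return -np.inf if np.any(np.abs(r) > h) else -len(X) * np.log(2 * h)

# The least-squares line through the two points:
a_star = (Z[1] - Z[0]) / (X[1] - X[0])
b_star = (X[1] * Z[0] - X[0] * Z[1]) / (X[1] - X[0])

# Lines through corners of the two "boxes" (cf. the text), plus the LS line:
candidates = [
    (a_star, b_star),
    ((Z[1] + h - (Z[0] - h)) / (X[1] - X[0]),
     (X[1] * (Z[0] - h) - X[0] * (Z[1] + h)) / (X[1] - X[0])),
    ((Z[1] - h - (Z[0] - h)) / (X[1] - X[0]),
     (X[1] * (Z[0] - h) - X[0] * (Z[1] - h)) / (X[1] - X[0])),
]
vals = [log_likelihood(a, b) for a, b in candidates]
print(vals)                      # all three values are identical
print(log_likelihood(0.0, 0.0))  # a line missing the boxes: -inf
```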