## Nov 1, 2013

### The nonlinear least square fit

The nonlinear least square fit (regression) minimizes the target action: $S(\theta) = \sum_i w_i (y_i - f(x_i; \theta))^2$ with positive weights $w_i \ge 0$, for a given set of data $\{ (x_1, y_1), (x_2,y_2),\cdots \}$ and a form function $f(x; \theta) \equiv f(x;\theta_1,\theta_2,\cdots \theta_p)$.

The minima lies in the extrema: $0 = \frac{\partial S}{\partial \theta_r} = -2 \sum_i w_i (y_i - f(x_i;\theta))\frac{\partial f(x_i;\theta)}{\partial \theta_r}. \quad r = 1,2,\cdots p$

### Gauss-Newton algorithm

The Newton's method for solving for a local root of a function $\varphi(x)$ is, $x_{n+1} = x_n - \frac{\varphi'(x_n)}{\varphi(x_n)}$ In this problem, $\varphi(\theta) = S'(\theta)$. Therefore, the Newton's method can be realized as $\theta^{(k+1)} = \theta^{(k)} - (H S(\theta^{(k)}))^{-1} \nabla S(\theta^{(k)})$ where $[H S(\theta)]_{rs} = {\partial^2 S}{\partial \theta_r \partial \theta_s}$ is the Hesse matrix (Jacob matrix) of $S$. It is easy to show $\left[ H S \right]_{rs} = 2 \sum_i w_i \frac{\partial f_i}{\partial\theta_r}\frac{\partial f_i}{\partial \theta_s} - 2 \sum_i w_i (y_i - f(x_i)) \frac{\partial^2 f_i}{\partial\theta_r\partial\theta_s}$ Ignore the second term, as the residue and the second derivative should be small. Then the Gauss-Newton algorithm gives the iteration procedure: $\theta^{(k+1)} = \theta^{(k)} + \frac{\mathbf{J}^T\cdot \mathbf{W}\cdot (\mathbf{y}-f(\mathbf{x};\theta^{(k)}) )}{\mathbf{J}^T\cdot\mathbf{W}\cdot\mathbf{J}}$

### Levenberg-Marquardt algorithm

Levenberg-Marquardt algorithm is one of the most important algorithm based on the Gauss-Newton algorithm. Let $\theta_r, r = 1,2,\cdots p$ be the fitting parameters such as $a,b,c$. Define the Jacobian $J_{i,r} = \frac{\partial f(x_i;\theta)}{\partial \theta_r}$ Define $W_{ij} = w_i \delta_{ij}$. The iterative procedure is: $\mathbf{\theta}^{(k+1)} = \mathbf{\theta}^{(k)} + \alpha \frac{\mathbf{J}^T \cdot \mathbf{W} \cdot (\mathbf{y} - f(\mathbf{x};\mathbf{\theta}^{(k)}))}{\lambda \mathbf{\Lambda} + \mathbf{J}^T \cdot \mathbf{W} \cdot \mathbf{J}}$ where $\mathbf{\Lambda}$ is a diagonal matrix, either chosen as the identity matrix or the diagonal term of $\mathbf{J}^T \cdot \mathbf{W} \cdot \mathbf{J}$, $0<\alpha<1$ is a small positive number.

Therefore, the solution $\bar\theta$ satisfies equation $\mathbf{J}^T_{\bar\theta} \cdot \mathbf{W} \cdot (\mathbf{y} - f(\mathbf{x};\bar{\mathbf{\theta}})) = 0$ or $\sum_i w_i \frac{\partial f(x_i; \bar\theta)}{\partial \theta_r} (y_i - f(x_i; \bar\theta)) = 0$ The choice of $\lambda$ is very important. Nonzero $\lambda$ can avoid local minimum but large $\lambda$ is likely to slow down the convergence. A useful strategy is to to choose a nonzero $\lambda$ to start with and decrease $\lambda$ as the iteration converges.

### Parameter Error Estimation

It is important to attach estimated error bars/bands to the fitted function (or the fitted parameters). Let's write the data as $y_i = f(x_i) + \xi_i$, where $\xi_i$ are the residuals. If we assume $\xi_i, i=1,2,3,\cdots n$ are $n$ independent random variables in Normal distribution. The likelihood of the fitted parameters can be written as $L(\theta_r, \sigma) = N_\sigma \exp\left[ -\frac{1}{2\sigma^2}\sum_i w_i (y_i-f(x_i;\theta_r))^2\right] \\ \simeq N'_\sigma \exp\left[-\frac{1}{2\sigma^2} \sum_{rs} (\theta_r-\bar\theta_r) A_{rs} (\theta_s-\bar\theta_s) \right]$ where $A_{rs} = \sum_i w_i \frac{\partial f(x_i;\bar\theta)}{\partial\theta_r} \frac{\partial f(x_i;\bar\theta)}{\partial\theta_s} = \left[ \mathbf{J}^T \cdot \mathbf{W} \cdot \mathbf{J} \right]_{rs}$  and $N_\sigma, N'_\sigma$ are normalization factors.

The variance $\sigma^2$ can be estimated by the estimator $\hat{\sigma}^2 = \frac{1}{n-p}\sum_i w_i (y_i - f(x_i; \bar\theta))^2 = \frac{S({\bar\theta})}{n-p}.$ The covariance matrix of the parameters is, $\text{Cov}(\theta) = \hat\sigma^2 A^{-1}$ It's also useful to find the prediction band. The width of the band is the variance of function $f(x; \theta)$. The "Delta Method" gives an approximative way to calculate $\text{Var}\left[ f(x; \theta) \right]$ from $\text{Cov}[ \theta ]$. $\sigma_f(x) = \text{Var}\left[ f(x; \bar\theta) \right] = \frac{\partial f^T(x;\bar\theta)}{\partial\mathbf{\theta}}\cdot\text{Cov}( \mathbf\theta) \cdot \frac{\partial f(x;\bar\theta)}{\partial\mathbf{\theta}} \\ = \hat\sigma^2 \sum_{r,s} \frac{\partial f(x;\bar\theta)}{\partial\theta_r}\left[ A^{-1} \right]_{rs}\frac{\partial f(x;\bar\theta)}{\partial\theta_s}$
Finally,the distribution of the residuals can be test by the standard $\chi^2$-test.

### Example: a power-law fitting

In our example, we will explore fitting some data to the power-law function $f(x) = a + b/x^c$.

 Fig. 1: (Color Lines) Black, the theoretical curve; Red, fit with uniform weights; Green fit with weights proportional to $x$

The corresponding covariance matrices for the parameters $a,b,c$.

The diagonal terms gives the variance of the parameters.