ATM S 591: Predictability and Data Assimilation

Winter Quarter 2009.

Instructor: Greg Hakim

Lecture log

11 March 2009 (18): I gave a very brief overview of the different kinds of model error. In terms of correcting model error, one established technique is parameter estimation. I discussed the "augmented state vector" approach to parameter estimation, and explained how this works in traditional Kalman filtering or in ensemble Kalman filtering. I presented an example for the Lorenz (1963) attractor, where one of the three parameters was estimated from observations using an EnKF. The technique works very well when the model is perfect aside from the unknown parameter, but if a second parameter is wrong, the estimated parameter does not converge and drifts around to accommodate the error from the other parameter.
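
A minimal sketch of this kind of demo (my own reconstruction, not the code shown in class): a perturbed-observation EnKF on the Lorenz (1963) model, with the parameter rho appended to the state vector so that the ensemble covariance spreads observation information into the parameter. The ensemble size, priors, and observation settings are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(s, rho, sigma=10.0, beta=8.0/3.0):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

    def advance(s, rho, dt=0.01, nsteps=25):
        # RK4 integration of the Lorenz (1963) equations
        for _ in range(nsteps):
            k1 = f(s, rho); k2 = f(s + 0.5*dt*k1, rho)
            k3 = f(s + 0.5*dt*k2, rho); k4 = f(s + dt*k3, rho)
            s = s + dt * (k1 + 2*k2 + 2*k3 + k4) / 6.0
        return s

    ne, rho_true, oerr = 50, 28.0, 1.0
    H = np.hstack([np.eye(3), np.zeros((3, 1))])   # observe x, y, z; never rho

    truth = np.array([1.0, 1.0, 20.0])
    ens = np.empty((4, ne))                        # augmented state [x, y, z, rho]
    ens[:3] = truth[:, None] + rng.normal(0.0, 1.0, (3, ne))
    ens[3] = rng.normal(25.0, 3.0, ne)             # uncertain prior for rho

    for _ in range(200):
        truth = advance(truth, rho_true)
        for j in range(ne):                        # each member runs with its own rho;
            ens[:3, j] = advance(ens[:3, j], ens[3, j])   # the parameter itself persists
        y = truth + rng.normal(0.0, oerr, 3)
        Xp = ens - ens.mean(axis=1, keepdims=True)
        B = Xp @ Xp.T / (ne - 1)                   # includes state-parameter covariances
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + oerr**2 * np.eye(3))
        for j in range(ne):                        # perturbed-observation update
            ens[:, j] += K @ (y + rng.normal(0.0, oerr, 3) - H @ ens[:, j])
        # no inflation here; a real implementation would usually inflate the
        # parameter spread to keep it from collapsing

    print("estimated rho: %.2f +/- %.2f (truth %.1f)"
          % (ens[3].mean(), ens[3].std(), rho_true))

With the rest of the model perfect, the estimate typically settles near the true value; breaking a second parameter (e.g., beta) reproduces the drifting behavior described above.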

9 March 2009 (17): Adjoint sensitivity; targeting; network design.

4 March 2009 (16): Observation impact; ensemble sensitivity.

2 March 2009 (15): We discussed the power of adjoint sensitivity analysis in the context of the sensitivity of forecasts to initial conditions. Calculating and storing the full propagator matrix is impractical for high-dimensional problems. Since we are often interested in an aspect (functional) of the forecast state, we defined a metric, J, to measure sensitivity. With a single integration of the adjoint model, one can determine the sensitivity of J to all degrees of freedom in the initial condition. I argued that adjoint sensitivity can be used to understand how 4DVAR works, and that singular vectors (eigenvectors of M^T M) orthogonally decompose the sensitivity field when J is defined as forecast variance. The first application of adjoint sensitivity we discussed was dynamical sensitivity, where one wishes to understand the structures in the initial state that strongly influence a metric of the forecast state. I showed examples for a case of idealized extratropical cyclogenesis from Langland et al. (1994). The second application of adjoint sensitivity is observation impact, where one wishes to assess the impact of observations on a future forecast. We derived the observation impact equation by substituting for the initial-condition perturbation from the Kalman update equation. Examples were shown from Langland and Baker (2004).
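
A toy illustration of the single-adjoint-integration claim (my own example, not the Langland et al. case): for a linear model the adjoint is just the transpose of the propagator, and one backward sweep gives dJ/dx_0 for every component at once, where n separate forward runs would otherwise be needed.

    import numpy as np

    rng = np.random.default_rng(1)
    n, N = 6, 10                      # state dimension, number of steps
    M = 0.95 * rng.standard_normal((n, n)) / np.sqrt(n)   # stable-ish toy dynamics
    j = rng.standard_normal(n)        # defines the forecast metric J = j^T x_N

    # adjoint sweep: start from dJ/dx_N = j and apply M^T backward in time
    sens = j.copy()
    for _ in range(N):
        sens = M.T @ sens             # dJ/dx_k = M^T dJ/dx_{k+1}

    # brute-force check: perturb each component of x_0 separately
    x0 = rng.standard_normal(n)
    def run(x):
        for _ in range(N):
            x = M @ x
        return j @ x
    eps = 1e-6
    fd = np.array([(run(x0 + eps * np.eye(n)[i]) - run(x0)) / eps
                   for i in range(n)])
    print(np.allclose(sens, fd, atol=1e-4))   # True: one adjoint sweep vs n forward runs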

25 February 2009 (14): Kalman smoother day. I started by arguing that, in an ensemble framework, if we know the observation estimates for asynchronous observations, then we have all of the information we need to update the state at the analysis time based on these observations. I formalized this heuristic approach by defining a cost function that included a prior and observations at a second time, and derived the gain for the new observations. We discussed the many flavors of smoothers, including fixed-point, fixed-interval, and fixed-lag smoothers.
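
A toy sketch of the heuristic (illustrative numbers, not a lecture example): the ensemble carries the covariance between the analysis-time state and the estimated future observation, which is all the smoother gain needs.

    import numpy as np

    rng = np.random.default_rng(2)
    ne = 1000
    M = np.array([[1.0, 0.1], [0.0, 0.9]])    # toy linear dynamics
    r = 0.5**2                                 # obs error variance

    X0 = rng.standard_normal((2, ne))          # prior ensemble at the analysis time
    X1 = M @ X0                                # ensemble at the later obs time
    yhat = X1[0, :]                            # obs estimates: first component at t1

    y = 1.3                                    # the asynchronous observation
    A0 = X0 - X0.mean(axis=1, keepdims=True)
    d = yhat - yhat.mean()
    cov_xy = A0 @ d / (ne - 1)                 # cov(state at t0, obs estimate at t1)
    var_y = d @ d / (ne - 1)
    K = cov_xy / (var_y + r)                   # smoother gain, one observation at a time
    X0a = X0 + np.outer(K, (y + rng.normal(0, np.sqrt(r), ne)) - yhat)
    print(X0.mean(axis=1), "->", X0a.mean(axis=1))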

23 February 2009 (13): 4DVAR day. Using 3DVAR as motivation, I discussed the notion of extending the cost function to include a fit to observations over time. The time interval typically extends before and after the analysis time of interest; for atmospheric large scales the fit often covers 12 hours, ranging from 9 hours before the analysis time to 3 hours after. I derived the cost-function gradient, and listed a basic algorithm: (1) a nonlinear forward integration provides the source for observation estimates; (2) an adjoint model integration is started from the end of the time interval with zero perturbation and integrated backward in time, with "nudged" forcing from normalized innovations at observation times; (3) the resulting initial-time perturbation gives the cost-function gradient, which may be scaled and added to the initial condition; (4) return to (1). Once the cost-function minimum is found, the trajectory starting from the initial time passes through the analysis time, and continues on to future forecast times. A toy version of the algorithm is sketched below.
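
The sketch assumes a linear model, so the adjoint is simply {\bf M}^T; the dimensions, observation times, and fixed step size are illustrative choices rather than anything from lecture.

    import numpy as np

    rng = np.random.default_rng(3)
    n, nsteps = 4, 6
    M = np.linalg.qr(rng.standard_normal((n, n)))[0]   # toy linear dynamics
    H = np.eye(n)[:2]                                  # observe the first 2 components
    Binv = np.eye(n)
    Rinv = np.eye(2) / 0.3**2

    xtrue = rng.standard_normal(n)
    xb = xtrue + 0.5 * rng.standard_normal(n)          # background
    obs = {}
    xt = xtrue.copy()
    for k in range(1, nsteps + 1):                     # observations at k = 2, 4, 6
        xt = M @ xt
        if k % 2 == 0:
            obs[k] = H @ xt + 0.3 * rng.standard_normal(2)

    x0 = xb.copy()
    for _ in range(200):
        # (1) forward run from the current initial condition, storing the trajectory
        traj = [x0.copy()]
        for k in range(nsteps):
            traj.append(M @ traj[-1])
        # (2) adjoint sweep backward from zero, forced by normalized innovations
        lam = np.zeros(n)
        for k in range(nsteps, 0, -1):
            if k in obs:
                lam += H.T @ Rinv @ (H @ traj[k] - obs[k])
            lam = M.T @ lam
        # (3) full cost-function gradient; take a scaled descent step
        grad = Binv @ (x0 - xb) + lam
        x0 -= 0.05 * grad
        # (4) repeat

    print("background error:", np.linalg.norm(xb - xtrue))
    print("analysis error:  ", np.linalg.norm(x0 - xtrue))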

18 February 2009 (12): I started by talking about ensemble filters and sampling error. I reviewed commonly used remedies for sampling error, including covariance localization and inflation. A number of other techniques were briefly discussed. We discussed 3DVAR by returning to the cost function for Gaussian statistics. The notion of a search for the cost function minimum was introduced, along with the possibility of being fooled by local minima. Common search algorithms step in the direction of steepest descent, which requires knowing the cost function gradient. After deriving the gradient, we outlined an algorithm, along with some possibilities for making it more efficient ("incremental" 3DVAR).
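
A sketch of the descent idea under the Gaussian assumptions above, with toy dimensions and a fixed step in place of a proper line search; the closing check compares the descent result against the exact analysis. All settings are illustrative.

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 5, 3
    B = np.eye(n)
    R = 0.25 * np.eye(p)
    H = rng.standard_normal((p, n))
    xb = rng.standard_normal(n)                  # background
    y = H @ rng.standard_normal(n)               # observations

    Binv, Rinv = np.linalg.inv(B), np.linalg.inv(R)
    def gradJ(x):
        # gradient of J = 1/2 (x-xb)^T B^{-1} (x-xb) + 1/2 (y-Hx)^T R^{-1} (y-Hx)
        return Binv @ (x - xb) - H.T @ Rinv @ (y - H @ x)

    x = xb.copy()
    for _ in range(1000):
        x -= 0.02 * gradJ(x)                     # steepest descent, fixed step

    K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)      # exact 3DVAR/Kalman analysis
    print(np.allclose(x, xb + K @ (y - H @ xb), atol=1e-3))   # True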

11 February 2009 (11): Ensemble filters, part II: square-root filters. As an alternative to perturbed observations, we considered the problem of solving for the analysis error covariance from the ensemble. This led us to square-root filters, where one updates the square root, which for the ensemble approach is the prior ensemble. We showed that the square root is not unique, even for full-rank ensembles. One path leads to a transformation of the ensemble by a right matrix multiplication, and the other to a left multiplication. The latter is of the EAKF flavor, and the former is of the ETKF flavor. The ETKF transformation matrix has the property of being square in the size of the ensemble, and I argued it provides an inexpensive means of generating ensemble initial conditions (the ensemble mean may be taken from an operational analysis).
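
A minimal ETKF-flavored sketch (toy dimensions assumed): the transform matrix is square in the ensemble size, and the final check verifies that the transformed perturbations reproduce the Kalman analysis covariance (I - KH)B computed from the same ensemble.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p, ne = 8, 3, 40
    H = rng.standard_normal((p, n))
    R = 0.5 * np.eye(p)

    X = rng.standard_normal((n, ne))                 # prior ensemble
    Xp = X - X.mean(axis=1, keepdims=True)           # prior perturbations
    Yb = H @ Xp                                      # obs-space perturbations

    # symmetric square root of [I + Yb^T R^{-1} Yb/(ne-1)]^{-1}
    C = np.eye(ne) + Yb.T @ np.linalg.inv(R) @ Yb / (ne - 1)
    w, V = np.linalg.eigh(C)
    T = V @ np.diag(w**-0.5) @ V.T                   # ne x ne transform
    Xa = Xp @ T                                      # analysis perturbations

    # compare with the Kalman analysis covariance (I - KH) B
    B = Xp @ Xp.T / (ne - 1)
    K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
    print(np.allclose(Xa @ Xa.T / (ne - 1), (np.eye(n) - K @ H) @ B))   # True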

9 February 2009 (10): I started off with an interpretation of the Kalman update equation and the Kalman gain. BH^T gives the covariance between the estimated observations, Hx, and the state variables; that is, the prior covariance "spreads" the influence of an observation spatially and to other variables. HBH^T is the model-prior counterpart to R, the observation error covariance matrix. Together, HBH^T + R gives the innovation covariance and, taking the innovation as the independent variable, we can interpret the analysis increment as a linear regression of the innovation onto the state variables.

I also derived the analysis error covariance, and defined the Kalman filter algorithm: (1) given estimates of the prior expected value and covariance, update the expected value by x^a = x^b + K ( y - Hx^b) and the covariance matrix by A = (I - K H) B; (2) forecast the future expected value and covariance matrix with the linear model; (3) return to (1).
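
A toy implementation of this loop, for an assumed two-variable linear model with observations of the first component only; all matrices here are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(6)
    M = np.array([[1.0, 0.1], [-0.1, 0.95]])   # linear dynamics
    Q = 0.01 * np.eye(2)                        # model error covariance
    H = np.array([[1.0, 0.0]])                  # observe the first component
    R = np.array([[0.25]])

    xt = np.array([1.0, 0.0])                   # truth
    x = np.zeros(2)                             # prior mean  x^b
    B = np.eye(2)                               # prior covariance

    for k in range(100):
        # (1) update with the new observation
        y = H @ xt + rng.normal(0.0, 0.5, 1)
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
        x = x + K @ (y - H @ x)                 # x^a = x^b + K (y - H x^b)
        A = (np.eye(2) - K @ H) @ B             # A  = (I - K H) B
        # (2) forecast the mean and covariance with the linear model
        xt = M @ xt + rng.multivariate_normal(np.zeros(2), Q)
        x = M @ x
        B = M @ A @ M.T + Q
        # (3) return to (1)

    print("final forecast error:", np.linalg.norm(x - xt))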

For high-dimensional problems, even the Kalman filter computation is too expensive, which motivates an ensemble approach to the Kalman filter. I reviewed the basics of constructing ensemble state matrices and estimating means and covariance matrices. We discussed the problem with giving all ensemble members the same observations, and causes of filter divergence. The first remedy for this problem led us to the perturbed-observations approach to the ensemble Kalman filter.
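
A scalar illustration of why identical observations cause trouble (made-up numbers): the updated ensemble variance falls below the correct Kalman value A = (1 - K)B unless the observations are perturbed.

    import numpy as np

    rng = np.random.default_rng(7)
    ne, b, r = 100000, 1.0, 0.5            # ensemble size, prior var, obs error var
    K = b / (b + r)
    xb = rng.normal(0.0, np.sqrt(b), ne)   # prior ensemble
    y = 0.7                                # a single observation

    same = xb + K * (y - xb)                                   # same obs for all members
    pert = xb + K * (y + rng.normal(0, np.sqrt(r), ne) - xb)   # perturbed observations

    print("correct A = (1-K) B :", (1 - K) * b)    # 0.333...
    print("same obs            :", same.var())     # (1-K)^2 B, too small
    print("perturbed obs       :", pert.var())     # matches A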

4 February 2009 (9): Kalman filter day. For high-dimensional problems, the nonlinear, non-Gaussian approach described in the previous lecture is computationally intractable. Making the Gaussian approximation for the statistics of the observations and prior yields the update equation for the Kalman filter. Using methods from the first few lectures, we found the Kalman gain by setting the derivative of the cost function to zero and solving for the posterior mean.
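
For reference, the Gaussian cost function and the gain that follows from setting its derivative to zero, in the same notation as the 9 February entry above (a standard result, recorded here for completeness):

    J(x) = 1/2 (x - x^b)^T B^{-1} (x - x^b) + 1/2 (y - Hx)^T R^{-1} (y - Hx)
    dJ/dx = B^{-1} (x - x^b) - H^T R^{-1} (y - Hx) = 0
    =>  x^a = x^b + K (y - Hx^b),   K = B H^T (H B H^T + R)^{-1}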

2 February 2009 (8): Ensemble day. I started with the idea of sampling from a distribution and posed the question of why we can get away with such small ensembles in the atmospheric sciences. I gave a highlight review of ensemble verification (Murphy 1988). For a single ensemble, the mean and variance need not reflect the mean and variance of the true distribution. Treating the ensemble mean and variance as random variables shows that, if the model is perfect, the expected values of the ensemble mean and variance equal the mean and variance of the true distribution. Defining forecast error as the squared difference between the ensemble mean and the true state, and treating the ensemble mean as a random variable, yields two main results: (1) for large ensembles, the mean-squared error of the ensemble mean equals the mean ensemble variance; (2) for large ensembles, the mean-squared error of the ensemble mean is one-half that of a single deterministic forecast. These results formally apply to expected values taken over many ensemble realizations at each instant in time, and then averaged over the entire attractor.
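
A quick Monte Carlo check of results (1) and (2), under the perfect-model assumption that truth and members are drawn from the same distribution; the N(0,1) setup is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(8)
    ntrials, ne = 100000, 50
    truth = rng.standard_normal(ntrials)
    ens = rng.standard_normal((ntrials, ne))

    err_mean   = ((ens.mean(axis=1) - truth)**2).mean()
    mean_var   = ens.var(axis=1, ddof=1).mean()
    err_single = ((ens[:, 0] - truth)**2).mean()

    print(err_mean, "~", mean_var)        # (1) error of the mean ~ ensemble variance
    print(err_mean, "~", err_single / 2)  # (2) ~ half a single forecast's error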

We started the state-estimation portion of the class. I began with definitions for sequential and distributed (in time) methods, and for "direct solve" and "variational" solution methods. Picking up from our Bayesian approach from earlier lectures, we quickly arrived at a recursive algorithm involving normalized products of probability densities and forecasts from the Liouville equation. I showed example plots for a scalar variable.
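
A tiny grid-based sketch of the recursive idea for a scalar variable (made-up numbers, and with the Liouville forecast step between observations omitted for brevity):

    import numpy as np

    x = np.linspace(-5.0, 5.0, 1001)
    dx = x[1] - x[0]
    p = np.exp(-0.5 * x**2)                    # prior density, N(0,1) unnormalized
    p /= p.sum() * dx
    for y in (1.0, 1.4):                       # two observations, obs variance 0.5
        like = np.exp(-0.5 * (y - x)**2 / 0.5)
        p = p * like                           # product of densities ...
        p /= p.sum() * dx                      # ... then renormalize
    print("posterior mean:", (x * p).sum() * dx)   # exact Gaussian answer is 0.96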

28 January 2009 (7): I started with a methodology for computing Lyapunov vectors and exponents, the key aspects of which involve orthogonalization and normalization. For orthogonalization, I reviewed Gram-Schmidt, SVD, and QR. We concluded that QR is best because it preserves the direction of the leading vector, and is more efficient than Gram-Schmidt; SVD doesn't work because direction is not preserved. I reviewed results for the Hénon attractor, showing convergence of the Lyapunov exponents, and the effect of orthogonalization on the evolved vectors.
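
A minimal version of the QR recipe for the Henon map, using the standard parameters a = 1.4 and b = 0.3 (the step count and initial point are arbitrary choices): tangent vectors are evolved with the Jacobian and re-orthonormalized by QR, and the exponents are time-averaged logs of the diagonal of R.

    import numpy as np

    a, b = 1.4, 0.3
    x, y = 0.1, 0.1
    Q = np.eye(2)                      # tangent-space basis
    lyap = np.zeros(2)
    nsteps = 100000

    for _ in range(nsteps):
        J = np.array([[-2.0 * a * x, 1.0], [b, 0.0]])   # Jacobian of the map
        x, y = 1.0 - a * x**2 + y, b * x                # advance the trajectory
        Q, R = np.linalg.qr(J @ Q)                      # evolve and re-orthogonalize
        lyap += np.log(np.abs(np.diag(R)))              # accumulate growth rates

    print(lyap / nsteps)   # approximately [0.42, -1.62] for these parameters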

Our conclusion based on the stability analysis is that if the system is chaotic, then deterministic prediction is fundamentally flawed. This leads to the notion of evolving a probability density on the attractor, rather than an individual state. The Liouville equation formalizes this notion. It is essentially mass continuity applied to probability density, with the state-space velocity, dx/dt, playing the role of fluid velocity in the mass continuity equation. Probability density increases (decreases) where state-space trajectories converge (diverge). Since it is linear in the probability density, the Liouville equation is analytically solvable using the method of characteristics. Solutions show that, for chaotic attractors, probability density decreases exponentially along characteristic curves at a rate set by the leading Lyapunov exponent. I showed a one-dimensional example from Ehrendorfer (2007).
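
In symbols, with \rho the probability density and \dot{x} = dx/dt the state-space velocity, the Liouville equation and its solution along a trajectory x(t) are

    \partial\rho/\partial t + \nabla\cdot(\rho\,\dot{x}) = 0,
    \rho(x(t), t) = \rho(x(0), 0) \exp( -\int_0^t \nabla\cdot\dot{x}\; dt' ).

The exponent makes the convergence statement explicit: density grows where \nabla\cdot\dot{x} < 0 (trajectories converge) and decays where it is positive.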

For high-dimensional problems, the curse of dimensionality renders solution of the Liouville equation impractical. This leads to the notion of ensembles, and sampling. I showed a Liouville-type calculation based on a large ensemble for a tiling of the Hénon attractor. An initially concentrated blob of probability diffuses across the attractor until it becomes stationary in time ("climatology").
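
A rough sketch of such a calculation (all settings illustrative): iterate a tight cloud of points under the Henon map and watch its spread grow and then saturate at the attractor ("climatological") value.

    import numpy as np

    a, b = 1.4, 0.3
    rng = np.random.default_rng(9)
    pts = np.array([0.1, 0.1]) + 1e-4 * rng.standard_normal((100000, 2))

    for k in range(31):
        if k in (0, 5, 10, 30):
            print("step %2d: std(x) = %.4f" % (k, pts[:, 0].std()))
        pts = np.column_stack((1.0 - a * pts[:, 0]**2 + pts[:, 1],
                               b * pts[:, 0]))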

26 January 2009 (6): I reviewed stability theory from last week (linear, time-independent systems), and continued to more general problems involving time dependence and nonlinearity. We defined Lyapunov exponents and vectors. The leading Lyapunov vector is independent of norm, whereas all others depend on the norm. For the fully nonlinear time-dependent problem, we showed that we could get back to known territory by making a leading-order Taylor approximation to the system dynamics. The Jacobian matrix defines the propagator, which depends on time and the trajectory about which we linearize.

21 January 2009 (5): Stability theory was extended to consider "optimal stability" over finite time intervals. A mathematical view of the problem was given by comparing the similarity transformation with the singular value decomposition for the propagator. By defining a measure of amplification in terms of a ratio of norms, one at time t and one at the initial time, we found that stability was determined by the eigenvectors and eigenvalues of {\bf M}^T {\bf M}; these are the singular vectors and singular values of {\bf M}, respectively. The results of this analysis will in general differ from the asymptotic (eigenvector) stability results, since the singular vectors depend on the norm and on the length of the time interval.
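
A small numerical illustration with an assumed toy propagator: both eigenvalues of {\bf M} lie inside the unit circle, yet the leading singular value exceeds one, so suitably chosen initial perturbations amplify over the interval even though every mode decays asymptotically.

    import numpy as np

    M = np.array([[0.9, 5.0],
                  [0.0, 0.8]])          # stable eigenvalues, strongly non-normal

    eigvals = np.linalg.eigvals(M)
    U, s, Vt = np.linalg.svd(M)

    print("eigenvalue magnitudes:", np.abs(eigvals))   # both < 1 (modal decay)
    print("optimal amplification:", s[0])              # > 1 (transient growth)
    print("optimal initial structure:", Vt[0])         # leading right singular vector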

14 January 2009 (4): I finished the information theory segment by talking about metrics and information channels. I defined relative entropy, which provides a norm for comparing the "distance" between two distributions, and the mutual information, which is an application of relative entropy for measuring the entropy reduction in one random variable from knowing another. I defined an information channel as the conditional probability linking inputs and outputs, and discussed how this relates directly to the problems of prediction (initial value problem) and estimation. More generally, the channel sits in a communication system linking a message source with a message receiver. The source messages are first processed by an encoder, then sent through the channel, where they are then decoded, producing an estimate of the message. We considered the simplest channel having noise (error) in communication: the binary symmetric channel (BSC), whose messages of zero or one flip in transmission with probability p. Channel capacity was defined and used in the statement of Shannon's noisy channel coding theorem, which says that messages can be sent over a noisy channel with an arbitrarily low error rate, provided that the entropy of the source is below the channel capacity. This theorem says nothing about how to achieve this transmission rate, only that it is possible. What makes it possible is the source coding scheme, which must build in error-correction measures to counteract the noise. Since these measures add to the length of the code words, which consumes capacity, the challenge is to find highly efficient schemes that correct errors without unduly increasing the length of the code words.
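
For the BSC, the capacity has the closed form C = 1 - H(p), with H the binary entropy; a quick numerical sketch (the p values are arbitrary):

    import numpy as np

    def H2(p):
        """Binary entropy in bits."""
        return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

    for p in (0.0, 0.01, 0.11, 0.5):
        print("flip prob %.2f -> capacity %.3f bits/use" % (p, 1.0 - H2(p)))

At p = 0.5 the capacity is zero (the output is independent of the input), and at p = 0 the channel carries a full bit per use.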

We started a new segment on stability of dynamical systems. After first defining the general problem, in which the time tendency of the state vector is given by a vector-valued function (neglecting control variables), we solved the case of linear dynamics with a time-independent system matrix.
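
A minimal check for an assumed toy system matrix: since solutions of dx/dt = Ax are combinations of e^{lambda t} modes, stability can be read off from the real parts of the eigenvalues of A.

    import numpy as np

    A = np.array([[-0.5,  2.0],
                  [ 0.0, -0.1]])
    lam, E = np.linalg.eig(A)
    print("eigenvalues:", lam)    # all real parts < 0 -> asymptotically stable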

12 January 2009 (3): The Bayesian estimation approach was applied to the same problem solved in least squares: one observation and one parameter. The statement of the problem is subtly, but importantly, different in that p(x|y) implies that we have prior knowledge of the distribution for x. Assuming Gaussian statistics, Bayes' theorem gave a product of Gaussians, which is Gaussian. We showed that the solution is exactly that of least squares, for two observations, provided that we interpret the estimate of the parameter from the first observation as the prior. I showed that the results for both can be written in terms of the prior and a gain, K, which weights the difference between the prior and the observation. Furthermore, the posterior variance is reduced by a factor of (1 - K). Extending the Bayesian approach to three variables (the parameter, an observation at the current time, and an observation at an earlier time) led to the general, recursive, state-estimation formula we will revisit more formally later in the quarter.
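
In symbols, with \sigma_b^2 the prior variance and \sigma_o^2 the observation error variance (standard Gaussian results, recorded here in the notation of this entry):

    p(x|y) \propto \exp[-(x - x^b)^2 / 2\sigma_b^2] \exp[-(y - x)^2 / 2\sigma_o^2],
    x^a = x^b + K (y - x^b),   K = \sigma_b^2 / (\sigma_b^2 + \sigma_o^2),
    \sigma_a^2 = (1 - K) \sigma_b^2.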

I started the information theory segment by pointing out that the definitions are all applications of probability concepts, applied to a specific function of the probability (mass or density) function: the entropy. The entropy of a random variable is maximal for a uniform probability distribution. I then defined joint and conditional entropy, in much the same way I did for joint and conditional probability, noting that the log function allows products to be written as sums.
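
A quick numerical check that the uniform distribution maximizes the entropy (the example distributions are arbitrary):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # 0 log 0 = 0 by convention
        return -(p * np.log(p)).sum()

    print(entropy([0.25, 0.25, 0.25, 0.25]))   # log(4) ~ 1.386, the maximum
    print(entropy([0.7, 0.1, 0.1, 0.1]))       # smaller
    print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0: no uncertainty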

7 January 2009 (2): Review of the least squares solution and the relationship between the solution derived from a "constraint" based on a metric, J, and one where we simply invert H. This led to a review of the singular value decomposition and the pseudo-inverse for non-square and rank-deficient matrices. I reviewed the basics of probability theory including: fundamental definitions and axioms, joint probability, conditional probability, and likelihood. I showed how joint probability can be "factored" using conditional probabilities, and then used the results to derive Bayes' Theorem for two and three random variables.
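
A small sketch of the SVD pseudo-inverse for an overdetermined toy system (dimensions and noise level are arbitrary choices); it reproduces the least squares solution:

    import numpy as np

    rng = np.random.default_rng(10)
    H = rng.standard_normal((10, 3))       # more observations than unknowns
    y = H @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(10)

    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    x_svd = Vt.T @ ((U.T @ y) / s)         # pseudo-inverse solution
    x_ls = np.linalg.lstsq(H, y, rcond=None)[0]
    print(np.allclose(x_svd, x_ls))        # True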

5 January 2009 (1): Overview of class and motivation. If you use models and/or observations in your research, this class provides a generic framework for analyzing and combining those estimates of the system of interest. After a review of the most important near-term topics in the linear algebra primer, I derived some basic solutions to the least squares problem. We defined a linear model of noisy observations, and discussed that for the typical over-determined problem (more observations/equations than unknowns), a constraint must be introduced. The first constraint, or metric, we considered was the error variance of the observations; setting the gradient of the metric with respect to the unknowns to zero determined a solution. We showed that choosing other inner-product-based metrics was equivalent to changing coordinates or the covariance matrix of the observation error. In both cases, a change of variables transformed the problem back to the original, for which we have a solution. An interpretation was given for one and two observations of a single scalar. In the case of one observation, we find only that x = y. In the case of two observations, we find that the best estimate of the state, x, is given by a linear combination of the observations, weighted by their error variances.
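
In symbols, for two observations y_1 and y_2 of the scalar x, with error variances \sigma_1^2 and \sigma_2^2, minimizing the error-variance metric gives

    x^a = ( \sigma_2^2 y_1 + \sigma_1^2 y_2 ) / ( \sigma_1^2 + \sigma_2^2 ),

so each observation is weighted by the inverse of its own error variance: the more accurate observation counts more.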