This is a non-exhaustive list, meant to be a supplement to your own study notes! You should be very familiar with everything in the homework and quizzes and everything in the lecture slides/notes, class discussion and board work.
----------------------------
linear algebra basics
know how to do matrix addition/multiplication (including recognizing when the dimensions are such that you cannot do it)
know what it means to be an eigenvector of a matrix (understand and know the implications of the equation A v = lambda v)
-- You do NOT have to be able to compute eigenvectors/eigenvalues by hand
-------------------
matlab
be familiar with all the commands you have used in homework:  ;  '  :  find  .^2  ...  length  plot  repmat  for
know how to use vectorized commands, e.g. sin([1 2 3]) = [0.8415  0.9093  0.1411]
know how to use them to eliminate loops
know how to write a matlab function, e.g.
  function [x,y,z] = cylinder(r,n)
  ...
----------------
probability
Know the 1-D Gaussian equation (and what the terms mean -- how to read off the mean and variance)
Know the relationship between variance and standard deviation (std dev is the square root of the variance)
Know that the integral under a valid pdf is 1
Know the equation for (and implication of and how to use) the expected value -- know that this is the mean
Know the equation for (and implication of and how to use) the variance
Be able to do the Gaussian integrals using your knowledge of pdfs, expected value and variance
Know the probability axioms and the equations for independence, conditional probability and Bayes rule (and how to use them)
-----------
PCA
understand covariance matrices (how to go from the matrix to an idea of what the scatter cloud of data would look like). How to compute them. The expectation equation Covariance = E[(X - mu)(X - mu)']
understand why the sample covariance matrix for a dataset is (1/n) (X - Mu)(X - Mu)', where X is the matrix with the data as columns and Mu is the matrix with the mean as columns (repeated numDataPts times -- know how to use repmat to get this matrix); the matrix product carries out the sum over data points
understand graphically how to extract principal component directions
understand the operation of rotating the axes, removing the directions of least variance and reconstructing the input using fewer principal components
understand why one would use PCA
understand the algorithm and MATLAB code for running PCA (especially your code from the homework -- including the transpose trick when the covariance matrix is very large; a sketch of the trick follows the code below)
subtract the mean
  m = mean(hw3data, 2);
  centeredData = hw3data - repmat(m, 1, numSamples);
compute the covariance matrix
  C = (1/numSamples) * centeredData * centeredData';
compute the eigenvectors of the covariance matrix (and sort by decreasing eigenvalue)
  [V, D] = eig(C);
  [V, D] = eigsort(V, D);
understand the algorithm and MATLAB code for transforming new points
Components in PCA space are V' * (mean-subtracted data)
  projectedData = V' * centeredData;
(we have also called the projected data c)
Understand how to reconstruct with fewer principal components (and why one would do this)
  reconstructedData = V(:,1:N) * projectedData(1:N,:) + repmat(m, 1, numSamples);
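A minimal sketch of the transpose trick mentioned above, under the homework's conventions (hw3data holds one data point per column, with dimensionality much larger than the number of samples): eigenvectors of the small numSamples x numSamples matrix X'X are mapped through X to give eigenvectors of the large covariance matrix; columns corresponding to zero eigenvalues can be discarded.

  % transpose trick sketch (assumes hw3data is loaded, one data point per column)
  numSamples = size(hw3data, 2);
  m = mean(hw3data, 2);
  X = hw3data - repmat(m, 1, numSamples);          % centered data (one column per sample)
  smallC = (1/numSamples) * (X' * X);              % small numSamples x numSamples matrix
  [U, D] = eig(smallC);                            % same nonzero eigenvalues as the big covariance
  V = X * U;                                       % map back: columns are principal directions
  V = V ./ repmat(sqrt(sum(V.^2)), size(V,1), 1);  % normalize each column to unit length
  % sort by decreasing eigenvalue as in the homework, e.g. [V, D] = eigsort(V, D);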
----------------------------------
---- START OF MIDTERM 2 MATERIAL -------------------------
K-means
know the algorithm, be able to use it, and know when you would use it (to cluster compact circular clusters)
-----------
Least squares linear regression
know how to solve simple sets of linear equations by hand
understand the \ operator: A\B is the matrix division of A into B, which is roughly the same as inv(A)*B
know how to convert sets of linear equations to the form A*mb = y and solve with mb = A\y
understand that the "minimum square error" solution is the one that minimizes \sum_{(x,y) pairs} (y - m*x - b)^2
Understand how to extend linear regression to non-linear regression
Understand how to set up the problem to solve y = a*x + b*x^2 + ...
Understand how to plot solutions
  xp = 0:.1:5;
  yp = mnb(1)*xp + mnb(2)*xp.^2 + mnb(3);
  plot(xp, yp, 'g')
-------
Overfitting
understand the problem (too few data points to constrain a complex model)
understand methods to estimate overfitting (leaving out data points to approximate how well future data points will be fit)
--------
Nelder-Mead algorithm
Understand how to use it and how it works
The method works with a number of rules. The starting point is used to construct a simplex, a shape with m+1 points, where m is the number of parameters. Thus for a two-parameter problem there are three points, a triangle. The program calculates the WSS at each point of the simplex on the WSS surface.
The Rules
* Reflect the point with the highest SSE (x(n+1) in the figure in nonlinnotes.html) through the centroid (center) of the simplex (using alpha) (this gives r in the figure)
* If this is just a good point (intermediate between the other SSE values), accept the reflected point as the new simplex point and terminate this iteration.
* If this produces the new lowest SSE (best point), test a point that expands the simplex further (using beta; s in the figure) -- if this point is worse than the original reflected point, accept the original reflected point (r), otherwise accept the new expanded point (s). In either case terminate this iteration.
* If this is the highest SSE (worst point), test a point that contracts the simplex and reflects closer (using gamma)
* * For contraction, if the new point (r) is worse than the old point that is being reflected (x(n+1)), then contract back from the old point (contract back to cc), otherwise contract back from the new point (contract back to c).
* * If the new contracted point (c or cc) is worse than the reflected point (r), undo the whole step (from the reflection) and do the shrink operation instead
* * * In the shrink operation all but the best-SSE point are brought down towards the best (lowest SSE) point (e.g. x(n+1) moves to v(n+1) in the figure in nonlinnotes.html; each of the other points other than x(1) also moves towards x(1)). Terminate this iteration.
These rules are repeated until the convergence criteria are met.
The simplex moves over the WSS surface and should contract around the minimum.
Know the compression rule -- if the new point is worse than the old point, compress from the old point; otherwise compress from the new point.
Understand how fminsearch is used to find a good fit (we want to minimize the mean square error of the fit)
Understand how to set up the hybrid approach to fit functions such as y = a*x + b*x^2 + c*exp(-d*x) + e, where the function is linear in most of the parameters
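One way to sketch the hybrid setup for the example above (the data x, y and the starting guess for d are made up for illustration): fminsearch runs Nelder-Mead over the single nonlinear parameter d only, and for every candidate d the linear parameters a, b, c, e are obtained with the \ operator.

  % hybrid fit sketch for y = a*x + b*x.^2 + c*exp(-d*x) + e
  x = (0:0.1:5)';                                      % made-up inputs (column vector)
  y = 2*x - 0.3*x.^2 + 4*exp(-1.5*x) + 1;              % made-up data
  sse = @(d) sum((y - [x, x.^2, exp(-d*x), ones(size(x))] * ...
                      ([x, x.^2, exp(-d*x), ones(size(x))] \ y)).^2);
  dbest = fminsearch(sse, 1);                          % Nelder-Mead over d only
  abce  = [x, x.^2, exp(-dbest*x), ones(size(x))] \ y; % a, b, c, e by least squares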
--------------------
Neural networks
understand how to compute activations (understand what an activation function (e.g. sigmoid, linear, threshold) is and how this affects the output)
w_{ij} = weight from input j to output i
understand why simple one-layer networks can compute planar decision boundaries
Decision boundary -- the boundary between where the neuron outputs 0 and where it outputs 1 (for threshold units); it crosses through 0.5 for linear/sigmoid units
Understand why the weight vector is perpendicular to the decision boundary given by the network w_1 x_1 + w_2 x_2 >= threshold
And if we use the convention w_1 x_1 + w_2 x_2 + b >= 0, then the weight vector points in the direction where output 1 is given
replacing the threshold with a bias weight: threshold = -b
the bias weight can be dealt with as a weight from a unit with activation always 1
Perceptron learning rule
know it
know why it makes sense
know when it converges
know that it doesn't converge for non-linearly separable problems
For inputs that are classified correctly -- no change
For inputs that have output 0 and should be 1 -- move the weight vector towards the input (by adding the input vector to the weight vector)
For inputs that have output 1 and should be 0 -- move the weight vector away from the input (by subtracting the input vector)
w_1^new = w_1^old + (target - output) x_1
w_2^new = w_2^old + (target - output) x_2
b^new = b^old + (target - output)
Note that algebraically the single update w_1^new = w_1^old + (target - output) x_1 covers all three cases, since (target - output) is 0, +1 or -1
Perceptron Convergence Theorem -- for any data set which is linearly separable, the PLA is guaranteed to find a solution in a finite number of steps
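A minimal sketch of the perceptron learning rule matching the update equations above; it assumes X is a 2 x P matrix of input patterns (one column each), t is a 1 x P vector of 0/1 targets, and the data are linearly separable.

  % perceptron learning rule sketch (X: 2 x P inputs, t: 1 x P targets of 0/1)
  w = [0; 0];  b = 0;
  for epoch = 1:100                        % in practice, stop when an epoch makes no changes
      for p = 1:size(X, 2)
          out = (w' * X(:,p) + b) >= 0;    % threshold unit; bias acts as weight from an always-1 unit
          w = w + (t(p) - out) * X(:,p);   % no change if correct, +/- input otherwise
          b = b + (t(p) - out);
      end
  end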
What about non-linearly separable problems?
-- the PLA will not converge (it continually bounces around)
pocket algorithm -- a variant on the PLA -- stores the best solution so far and only stores new weights if the solution is better (does the best it can)
To solve non-linearly separable problems we need to have multi-layer perceptrons with non-linear activation units (know why)
--END OF MIDTERM 2 MATERIAL
-------------
In order to perform gradient descent we need differentiable activation functions on the hidden units
A commonly used activation function is the sigmoid function f(x) = 1/(1 + e^{-x}) = sigma(x)
the gradient vector is just the vector of partial derivatives (with respect to each parameter): Gradient(f(x_1, x_2)) = [df/dx_1  df/dx_2]'
the gradient vector points in the uphill direction
Gradient descent algorithm: x_1(new) = x_1(old) - eta (df/dx_1)
the derivative of the sigmoid function is sigma(x)(1 - sigma(x))
know how to find weights for a neural network using Nelder-Mead (know the disadvantages of this)
----------------
Multi-layer perceptrons
Know the reason for the back-propagation algorithm and what is being back-propagated (deltas)
(understand that it is just the chain rule applied to the square error function, but done efficiently by storing the deltas)
delta_j = -dE/d net_j, where net_j is the net input to unit j
delta_j = f'(net_j) \sum_k w_{kj} delta_k   (remember w_{kj} is the connection from neuron j to neuron k)
understand the above equation and what it means for the efficiency of backprop (deltas are back-propagated just like activations are forward-propagated)
Back-prop
  for j = 1 to epochs
    for i = 1 to patterns   (the online algorithm would randomize the order)
      compute forward pass
      compute error
      compute deltas (back-propagate)
      compute weight changes eta * delta_j * y_i
      (the batch algorithm would close the i loop here)
      update weights
      (the online algorithm would close the i loop here)
  end of j loop
online weight update: w_{ji}(new) = w_{ji}(old) + \eta \delta_j y_i, where y_i is the currently present activation of the ith unit
batch weight update: w_{ji}(new) = w_{ji}(old) + \eta \sum_p \delta_j(p) y_i(p), where y_i(p) is the activation of the ith unit during presentation/propagation of pattern p
-------------
Faster alternatives to Gradient Descent
Know why gradient descent is slow (and the problems with picking the learning rate)
Gradient descent with momentum: delta_w(t) = -\eta * dE/dw + mu * delta_w(t-1)
Variable learning rate
Conjugate Gradient Algorithms
for an initial weight vector w:
1. find the steepest descent direction d_1 = -g_1
2. At step j, perform a line search in direction d_j (minimize E(w + \alpha d_j) with respect to alpha)
3. Test for the stopping criterion
4. Compute the steepest descent direction g_{j+1}
5. Compute the conjugate direction d_{j+1} using one of several formulas
6. j = j+1, go to step 2
Conjugate gradient algorithms choose the new search direction not to be the negative gradient direction, but one that has the component of the gradient parallel to the last search direction equal to zero (to first order).
understand how to determine this direction graphically
Understand how a golden section line search is performed (and what improvements you could add)
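A minimal sketch of a golden section line search for the 1-D minimization in step 2 above, assuming the function being minimized (a made-up quadratic here) is unimodal on the bracket [a, b]; reusing one of the two interior evaluations per step, or finishing with a parabolic-interpolation step, are improvements that could be added.

  % golden section line search sketch: minimize E(alpha) over the bracket [a, b]
  E = @(alpha) (alpha - 0.7).^2 + 1;    % made-up unimodal function of the step size
  a = 0;  b = 2;
  phi = (sqrt(5) - 1) / 2;              % approx 0.618
  c = b - phi*(b - a);                  % two interior points of the bracket
  d = a + phi*(b - a);
  while (b - a) > 1e-6
      if E(c) < E(d)
          b = d;                        % the minimum must lie in [a, d]
      else
          a = c;                        % the minimum must lie in [c, b]
      end
      c = b - phi*(b - a);              % (re-evaluating both points wastes one
      d = a + phi*(b - a);              %  E evaluation per step -- an easy improvement)
  end
  alphaBest = (a + b) / 2;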
------------------
Methods for preventing overfitting
regularization -- adding a penalty to the usual error function to encourage smoothness
weight decay
Enew = E + \nu * Omega
nu is the regularization parameter and Omega is the smoothness penalty
Weight decay sets Omega = 1/2 \sum_i w_i^2
this means that -dEnew/dw_i has an additional -\nu w_i term, so each gradient step shrinks the weights toward zero (hence the name weight decay; a short sketch of this update is at the end of these notes)
Bayesian Regularization
The Bayesian neural network formalism of David MacKay and Radford Neal considers neural networks not as single networks but as distributions over weights (and biases). The output of a trained network is thus not the result of applying one set of weights but an average over the outputs from the distribution.
Early Stopping
understand how and why to use it
It depends on neural networks being less non-linear earlier in training, due to the sigmoids being in their linear range.
Requires the validation data to be a suitable proxy for the test data
---------------------------------------------------
overall -- how do the different optimization methods compare (strengths and weaknesses)?
when is a neural network a better thing for function fitting than explicit function fitting (linear or non-linear) using N-M or a non-linear regression method?
a neural network is good when you don't know the functional form
If you want to fit a specific form, it is best to use non-linear regression if you can; otherwise you might want to use N-M or a gradient method on the specific functional form (if gradients are cheap to compute).
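Tying the weight-decay formulas above to a single gradient-descent step (a sketch with made-up values; w is the weight vector, dEdw the gradient of the unregularized error E, eta the learning rate and nu the regularization parameter):

  % one gradient-descent step with weight decay: Enew = E + nu * (1/2) * sum(w.^2)
  eta = 0.1;  nu = 0.01;
  w = [0.5; -1.2];                   % example weights
  dEdw = [0.2; 0.3];                 % example gradient of the unregularized error
  w = w - eta * (dEdw + nu * w);     % the extra -eta*nu*w term decays the weights toward zero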