Multivariate Data Analysis
Previously known as multivariate statistical analysis


There are essentially only four steps here:

  1. Low-pass filtration
  2. Alignment in two dimensions
  3. Dimension-reduction -- expression of an m x n image using only a few terms, i.e., its coefficients along a few eigenvectors
  4. Classification

The low-pass filtration is optional, but if you plan to look at individual particles, this step will help by suppressing high-frequency noise.

For the classification below to be sensible, the images will need to have been aligned. The alignment step here is optional if the images have been aligned already.
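To illustrate the idea behind the pairwise alignment listed under the Procedure, here is a toy sketch in Python. It is not what pairwise.msa does internally: it assumes 1-D signals and circular shifts in place of 2-D images with rotations and translations, and every function name is invented for this example.

```python
def shift(signal, s):
    """Circularly shift a 1-D signal to the right by s samples."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]

def best_shift(ref, signal):
    """Return the circular shift of `signal` that best correlates with `ref`."""
    n = len(ref)
    def score(s):
        return sum(ref[i] * signal[(i - s) % n] for i in range(n))
    return max(range(n), key=score)

def align_and_average(a, b):
    """Align b to a, then average the two."""
    b_aligned = shift(b, best_shift(a, b))
    return [(x + y) / 2 for x, y in zip(a, b_aligned)]

def pairwise_alignment(signals):
    """Align and average pairs, then pairs of averages, until one remains."""
    level = [list(s) for s in signals]
    while len(level) > 1:
        nxt = [align_and_average(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd one out is carried to the next round
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Hypothetical test data: one template, seen at four different shifts.
template = [0, 0, 1, 3, 1, 0, 0, 0]
particles = [shift(template, s) for s in (0, 2, 5, 7)]
average = pairwise_alignment(particles)
print(average)
```

With noise-free copies, the final average reproduces the template in the frame of the first particle. Without the alignment step, the plain average of the four shifted copies would smear the peak, which is why classification of unaligned images is not sensible. Note that the simple carry-over of an odd image gives it more weight than the others; a real procedure would need to handle that.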

The dimension-reduction step is also optional, in principle: one could classify the raw images directly (which is what SPIDER operation 'AP C' does). In the example here, I use correspondence analysis for the dimension-reduction. A similar method is principal-component analysis (PCA); to run PCA, change an option under SPIDER operation 'CA S' (here, in the batch file ca-pca.msa).
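As a rough picture of what dimension-reduction buys you, the following Python sketch reduces each image to a single coefficient along the first eigenvector. It is plain PCA by power iteration, not correspondence analysis and not the algorithm inside 'CA S'; the tiny 4-pixel "images" and all function names are assumptions made for the example.

```python
import math

def mean_center(rows):
    """Subtract the average image from every image (one row per image)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Pixel-by-pixel covariance matrix of mean-centered images."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and renormalize a start vector."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]   # arbitrary non-symmetric start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:                      # all images identical
            return v
        v = [x / norm for x in w]
    return v

def project(rows, vec):
    """Coefficient of each image along the given eigenvector."""
    return [sum(x * c for x, c in zip(row, vec)) for row in rows]

# Hypothetical data: two images bright on the left, two bright on the right.
images = [[5, 5, 0, 0], [4, 6, 0, 1], [0, 0, 5, 5], [1, 0, 6, 4]]
centered = mean_center(images)
axis = leading_eigenvector(covariance(centered))
scores = project(centered, axis)
print(scores)
```

Each 4-pixel image is now described by one number, and the sign of that number already separates the two kinds of image. The overall sign of an eigenvector is arbitrary, so only the relative signs of the scores are meaningful. Further factors would require removing the first component and repeating, which is omitted here.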

For classification, there are three methods illustrated here: Diday's method, Ward's method, and K-means. The individual classification operations are described in more depth in the classification tutorial.
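For orientation, here is a minimal Python sketch of hierarchical clustering with Ward's criterion, in which each step merges the two clusters whose union increases the within-cluster variance the least. It is a naive textbook version, not the implementation behind 'CL HC'; the 2-D points stand in for eigenfactor coordinates, and the function names are invented for the example.

```python
def centroid(points):
    """Average position of a list of points."""
    n = len(points)
    return [sum(c) / n for c in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2

def ward_clusters(points, n_classes):
    """Merge clusters bottom-up until n_classes remain.

    Returns the clusters (lists of point indices) and the merge history,
    which is the information a dendrogram displays.
    """
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_classes:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost([points[k] for k in clusters[i]],
                                 [points[k] for k in clusters[j]])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        merges.append((clusters[i], clusters[j], cost))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical factor coordinates for five particles.
coords = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
three, history = ward_clusters(coords, 3)
two, _ = ward_clusters(coords, 2)
print(sorted(sorted(c) for c in three))
print(sorted(sorted(c) for c in two))
```

Stopping the merging at a chosen number of clusters corresponds to truncating the dendrogram at that number of classes. The merge costs grow as more dissimilar groups are joined, which is what the branch heights in a dendrogram convey.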


Getting started


Procedure

  1. Low-pass filtration

  2. Reference-free alignment -- choose one of these two options:
    1. Using 'AP SR'
      • BATCH FILE: apsr4class.msa
      • INPUT PARAMETER: object diameter (in pixels, after decimation)
      • INPUTS: unaligned particles, selection file
      • OUTPUTS: aligned particles, averages

        There may be a memory limit in 'AP SR'. If you get a core dump, truncate the selection file and try again.

    2. Using pairwise alignment
      • BATCH FILE: pairwise.msa
      • INPUT PARAMETER: object diameter (in pixels, after decimation)
      • INPUTS: unaligned particles, selection file
      • OUTPUTS: aligned particles, averages

        Conceptually, this alignment first aligns pairs of images and averages them. Then, it aligns pairs of averages of those pairs and averages them, and so forth. This type of alignment appears to be less random than does 'AP SR', which chooses seed images as alignment references.

        Reference: Marco S, Chagoyen M, de la Fraga LG, Carazo JM, Carrascosa JL (1996) Ultramicroscopy 66: 5-10.

  3. Dimension-reduction
  4. Classification -- choose one of three options:
    1. Diday's method, using 'CL CLA' -- I hear that this method works exceedingly well. In practice though, I find that I have limited control over the number of classes, which may or may not be a problem depending on the application. Also, I sometimes get errors with large data sets with this method.
      • BATCH FILE: cluster.msa
        • INPUT PARAMETER: number of eigenfactors to use
        • OUTPUT: dendrograms (PostScript and SPIDER formats)

          After running, decide how many classes to include, using WEB/JWEB (Commands -> Dendrogram) and clicking on Show averaged images.


        [Figure: dendrogram, PostScript format]

      • BATCH FILE: classavg.msa
        • INPUT PARAMETER: desired number of classes
        • OUTPUT: class averages

    2. Ward's method, using 'CL HC' -- The advantage is that, unlike Diday's method above, the dendrogram branches to any desired number of classes, all the way down to individual particles. The disadvantage is that the dendrogram is unreadable if there are too many branches. You can truncate the dendrogram in WEB/JWEB as described below.
      • BATCH FILE: hierarchical.msa
        • INPUT PARAMETER: number of eigenfactors to use
        • OUTPUT: dendrograms (PostScript and SPIDER formats)

          After running, decide how many classes to use. The PostScript file may be highly branched, and nodes may be unreadable.


          [Figure: untruncated dendrogram]

          The SPIDER-format dendrogram document can be viewed with WEB/JWEB and truncated. In WEB, go to Commands -> Dendrogram. In JWEB, go to File -> Open SPIDER Document File.


          [Figure: dendrogram in X-Window WEB]

      • BATCH FILE: classavg.msa
        • INPUT PARAMETER: desired number of classes
        • OUTPUT: class averages

    3. K-means classification, using 'CL KM' -- The primary input is the number of classes to divide the particles into.
      • BATCH FILE: kmeans.msa
      • INPUT PARAMETERS: number of eigenfactors, number of classes
      • OUTPUT: class averages

        It can be informative to look at the individual particles from a class. You can use WEB/JWEB, or montagefromdoc.py.
        Usage:
        ./montagefromdoc.py   KM/docclass001.dat
        If you have requested too many classes, there will be similar-looking class averages. If you have requested too few, there will be dissimilar particles within a class.
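The trade-off in the number of classes can be seen in a small Python sketch of generic K-means (Lloyd's algorithm). This is an illustration only, not the code behind 'CL KM' or kmeans.msa; the 2-D points stand in for eigenfactor coordinates, the evenly spaced choice of starting centers is an assumption made to keep the example deterministic, and all names are invented.

```python
def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, n_classes, max_iterations=100):
    """Return (classes, centers); classes are lists of point indices."""
    # Assumed initialization: evenly spaced points from the input list.
    step = len(points) / n_classes
    centers = [list(points[int(i * step)]) for i in range(n_classes)]
    dims = len(points[0])
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        classes = [[] for _ in range(n_classes)]
        for index, point in enumerate(points):
            nearest = min(range(n_classes),
                          key=lambda c: squared_distance(point, centers[c]))
            classes[nearest].append(index)
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c, members in enumerate(classes):
            if members:
                new_centers.append([sum(points[i][d] for i in members)
                                    / len(members) for d in range(dims)])
            else:
                new_centers.append(centers[c])   # keep an empty class in place
        if new_centers == centers:               # converged
            break
        centers = new_centers
    return classes, centers

def within_class_spread(points, classes, centers):
    """Total squared distance of the points from their class centers."""
    return sum(squared_distance(points[i], centers[c])
               for c, members in enumerate(classes) for i in members)

# Hypothetical factor coordinates: two well-separated groups of particles.
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
classes2, centers2 = kmeans(coords, 2)
classes1, centers1 = kmeans(coords, 1)
print(classes2)
print(within_class_spread(coords, classes1, centers1),
      within_class_spread(coords, classes2, centers2))
```

With too few classes (one, here), the within-class spread is large because dissimilar particles share a class. Requesting more classes than there are distinct groups would split a group and give class centers that lie close together, which corresponds to similar-looking class averages.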


Miscellaneous tools:


Recent modifications:


Source: techs/MSA/index.html     Page updated: 2014/02/05     Tanvir Shaikh