Principal Component Analysis (PCA) is a statistical technique used in data analysis and for building predictive models. The technique involves transforming a dataset into a new basis whereby the transformed data is uncorrelated. The transformed basis, which can be represented by an orthogonal matrix, defines the Principal Components of the original dataset. These basis vectors are usually ordered so that the first principal component is the one that accounts for the largest variance in the data, and the last component accounts for the least variance. This article introduces PCA theory and illustrates and example using the D3X Morpheus library.

Principal Component Analysis (PCA) is a statistical tool used in data analysis and also for building predictive models. The technique involves transforming a dataset into a new basis whereby the transformed data is uncorrelated. The transformed basis, which can be represented by an orthogonal matrix, defines thePrincipal Components of the original dataset. These basis vectors are usually ordered so that the first principal component is the one that accounts for the largest variance in the data, and the last component accounts for the least variance.

PCA is also often referred to as a dimensional reduction technique in the sense that by dropping components that account for a negligible amount of variance in the original data, we can linearly map the data into a lower dimensional space without loosing material information. The assumption here is that the variability in the data represents the essential dynamics we are trying to understand, so dropping dimensions with negligible variance results in minimal loss of information.

The following sections introduces some PCA theory, and then proceeds to illustrate an example of how to use the Morpheus APIto perform PCA on a dataset. Only a superficial overview of the theory is covered below, so for a more detailed treatment of the topic I would suggest a Google search, or perhaps this tutorial as a primer.

As a data analysis technique, PCA begins with the definition of our data which in general can be described in two dimensions, namely the number of observations and the number of measurements. Such data can be represented by an **nxp** matrix where n represents the number of observations for each measurement, and p represents the number of measurements being recorded.

The dimensions in which we record the observations that constitute our data are assumed to be a naive basis, since we do not actually understand the true dynamics of the system we are investigating (hence the analysis). Having said that, our hope is that while our measurements may be recorded in a naive basis, they are informative enough so that we can compute a new basis that maximizes the signal to noise ratio and removes any redundancy in the data, enabling us to better understand its true dynamics.

With this in mind, let us assume that there exists an orthogonal matrix \(V\) of dimensions **pxp** that can transform our data \(X\) into \(Y\) such that the covariance matrix of \(Y\) (denoted by \(\Sigma_{y}\)) is diagonal.

$$ X V = Y $$

The question then is how to find \(V\)? We can tackle this by working backwards based on our desire that the transformed data \(Y\) has a diagonal covariance matrix, given our motivation to remove noise and redundancy from our dataset. Assuming that \(Y\) is **centered** or **demeaned**, we can define the covariance of \(Y\) as follows:

$$ \Sigma_{Y} = \frac{1}{n-1} Y^T Y $$

Given our transform \(X V = Y \) we can express the covariance of \(Y\) in terms \(V\) and \(X\) as follows:

$$ \begin{align}

\Sigma_{Y} &= \frac{1}{n-1} Y^T Y \\\\

&= \frac{1}{n-1}(XV)^T(XV) \\\\

&= \frac{1}{n-1}V^T X^T X V \\\\

\end{align} $$

Let us define a new matrix \(A\), which by definition is a **pxp** symmetric matrix as:

$$ A = \frac{1}{n-1} X^T X $$

We can therefore re-write our earlier expression for \(\Sigma_{Y}\) in terms of \(A\) as:

$$ \Sigma_{Y} = V^T A V $$

In the next section, we illustrate how we can choose \(V\) such that we diagonalize \(\Sigma_{Y}\).

Based on our earlier discussion, we know that we need to choose a transform matrix V such that \(V^TAV = D\) where \(D\) is a diagonal matrix. Given that \(A\) is symmetric, we know that we can factorize it using an Eigen decomposition into an orthogonal matrix of its eigenvectors and a diagonal matrix of its eigenvalues:

$$ \Sigma_{Y} = V^T A V $$

In this factorization the column vectors of \(Q\) are the eigenvectors of \(A\) and the diagonal elements of \(\Lambda\) are the eigenvalues of \(A\). Recall that an eigenvector of a matrix is one which is only scaled and not rotated when operated on by that matrix. This is otherwise stated as \(Av = \lambda v\) where \(v\) is a px1 eigenvector of \(A\) and \(\lambda\) is the corresponding eigenvalue, which is a scalar. We can therefore expand this to say that \(A Q = Q \Lambda\) which then implies that \(A = Q \Lambda Q^{-1}\).

If we choose our matrix \(V = Q\) and noting that \(Q\) is an orthogonal matrix such that \(Q^{-1} = Q^T \) we can plug these back into the expression for the covariance matrix of \(Y\) to yield the following:

$$ \begin{align}

\Sigma_{Y} &= V^T A V \\\\

&= V^T (V \Lambda V^{-1}) V \\\\

&= (V^TV)\Lambda(V^{-1}V) \\\\

&= (V^{-1}V)\Lambda(V^{-1}V) \\\\

&= \Lambda \\\\

\end{align} $$

By choosing the transform matrix \(V\) to be the eigenvectors of \(A\) (which in essence is the covariance matrix of our data on the assumption that \(X\) is centered), we end up diagonalizing the covariance matrix of the transformed dataset, which is the ultimate objective of our PCA. These eigenvectors essentially define the new basis along which variance in the data is maximized, and the corresponding eigenvalues define the magnitude of this variance. As noted earlier, the eigenvectors or principal components are usually ordered so that the first component accounts for the largest variance (ie the largest eigenvalue), and the last component accounts for the smallest variance (ie the smallest eigenvalue).

The previous section demonstrated that an eigen decomposition of the covariance matrix of \(X\) yields the set of eigenvectors \(V\) that define the principal axes of our data. When we transform our data using V the resulting dataset has a diagonal covariance matrix which reflects that our new basis maximizes the signal to noise ratio and/or removes any redundancy.

The Morpheus library supports solving PCA in this way, but by default, it performs a Singular Value Decomposition (SVD) of the **centered** or **demeaned** data in \(X\), as this generally offers better numerical stability and also tends to be faster then an eigen decomposition of the covariance matrix of \(X\). An SVD of \(X\) on the assumption that \(X\) is centered yields

$$ X = U S V^T $$

where \(S\) is a diagonal matrix of singular values, and the columns of \(U\) and \(V\) are called the left-singular vectors and right-singular vectors of \(X\), respectively. If we substitute the SVD decomposition into the expression for the covariance of \(X\) as below, we can see that the solution is essentially equivalent to above, however in this case the eigenvalues are equivalent to the square of the singular values divided by \(n-1\).

$$ \begin{align}

\Sigma_{X} &= \frac{1}{n-1} X^T X \\\\

&= \frac{1}{n-1} (U S V^T)^T(U S V^T) \\\\

&= \frac{1}{n-1} VSU^TUSV^T \\\\

&= \frac{1}{n-1} VS^2V^T \\\\

\end{align} $$

In this example, we demonstrate the use of Principal Component Analysis as a dimensional reduction technique, and in particular we apply it to the problem of image compression. Consider the photo below of my dog who is called Poppet, which is 504 pixels wide and 360 pixels high. The pixels that make up this image can be thought of as a 360x504 matrix, where the elements represent the color of each pixel. It is most common in computer graphics to represent such an image using the RGBA Color Space where each pixel is defined by a 32-bit integer which has encoded within it four 8-bit values representing its red, green, blue and alpha intensity. Since each component within the RGBA value is represented by an8-bit sequence, they have a range between 0 and 255 in base-10.

We can load the target image into a Morpheus DataFrame of RGBA values using the code below. Here we initialize a frame of integer values with the row and column count based on the image dimensions.

Given that each pixel is represented by a 32-bit RGBA value, we can decompose the 360x504 matrix into a 360x504x4 cube of data, or a 360x504x3 cube if we ignore the alpha channel on the assumption that each pixel is 100% opaque (which is a reasonable assumption in this case). In order to decompose the matrix into the red, green and blue components we need to perform some bitwise operations.In this example, we load the target image using java.awt.image.BufferedImage which exposes the 32-bit RGB values in the form illustrated below, where the first 8 most significant bits represent the alpha channel, the next 8 represent the red value, followed by green and then blue.

To extract the 8-bit value representing the red intensity, we first need to shift our string of bits 16 places to the right so that the 8-bits representing the value of red appear in bit positions 0-7 (which before the shift represented the blue intensity). With our bit string now in this form, we can bitwise AND it with a base-10 value of 255 or 0xFF in hexadecimal so that all bits in positions8-31 become zero, leaving only the value of our red intensity. Similarly, to extract the value of green, we right shift our RGBA bit string by 8, and then bitwise AND with 0xFF. In the case of extracting blue, no bit shifting is required and we simply bitwise AND with 0xFF. The code below generates 3 separate frames to capture the red, green and blue intensities by performing the bitwise operations just described.

Now that we know how to decompose our image into three 360x504 frames representing the red, green and blue intensity of our image pixels, we can perform Principal Component Analysis on each DataFrame, and assess the results. Since PCA is all about transforming data into a new basis in which the variance is maximized, it is often useful to get a sense of how much of the variance in the data is explained by the first N components. The code below uses the rgb Frame initialized above, extracts the red, green and blue components, performs PCA on each of these frames, and then collects the percent of variance explained by the respective principal components. This data is then trimmed to only include the first 10 components, which is then plotted using a bar chart. Note that in this example we **transpose **the DataFrame before calling the pca() method as the Morpheus library assumes that the data is an **nxp **matrix where n >= p.

We can see from this chart that in the case of the red frame, the first principal component explains around 45% of the variance, for green it is just under 35% and for blue it is just over 25%. The percentage of the variance explained by subsequent components drops off fairly monotonically, and by the time we get to the fifth component, only about 5% of the variance is captured for each of the colors.

As per the earlier discussion on PCA theory, we established that the eigenvectors of the covariance matrix of our data are the principal axes, and when combined as the columns of a matrix, serve as a transformation into the new basis. If we let \(X_i\) represent our **nxp** input dataset of either red, green or blue intensities, and \(V_i\) our matrix of **pxp** eigenvectors of the covariance matrix of \(X_i\), we can write the transform as follows (where i is either red, green or blue).

$$ X_i V_i = Y_i $$

This projection of the original data onto the new basis are called the **principal component scores**, and are directly accessible from the Morpheus interface named DataFramePCA.Model via the getScores() method. The following code demonstrates how to access these scores, and here we assert our expectation of the dimensions of these scores being **nxp** or 504x360 in this case (since we transpose the image).

Since we know that \(V_i\) is an orthogonal matrix by design, once we have transformed to the new basis of \(Y_i\) we can transform back to the original basis by taking the dot product of \(Y_i\) with \(V_i^T\) as follows:

$$ X_i = Y_i V_i^T $$

The eigenvectors that constitute the columns of \(V_i\) are arranged so that the first column is associated with the highest eigenvalue, and the last column with the lowest eigenvalue (recall that the eigenvalue represents the variance in the direction of the corresponding eigenvector). Given that much of the variance in the data will be explained by the leading eigenvectors, we consider truncating \(V_i\) by only retaining the first N columns. In this case, we can re-write the above expression using a tilde over \(V_i\) and \(Y_i\) to indicate that some information has been lost in this new transform due to the truncation of \(V_i\).

$$ X_i \tilde{V_i} = \tilde{Y_i} $$

The Morpheus API provides an over-loaded getScores() method on DataFramePCA.Model where we can generate \(\tilde{Y_i}\) by selecting only the first j columns of \(V_i\) as shown below. In this case we assert that the expected dimensions of the scores is **nxk **rather than **nxp**, where k is the number of components to include (below we use k=10).

Given the truncated scores in the form of \(\tilde{Y_i}\), it turns out that we can project these scores back onto the original basis using \(\tilde{V_i}^T\) much in the same way as described earlier. If we right multiply \(\tilde{Y_i}\) by \(\tilde{V_i}^T\) we get back **nxp** data since \(\tilde{Y_i}\) is nxk and ~ViT is **kxp**, and yields an estimate of our original data we now call \(\tilde{X_i}\)

$$ \tilde{X_i} = \tilde{Y_i} \tilde{V_i}^T = X_i \tilde{V_i} \tilde{V_i}^T $$

The Morpheus API provides a convenient API to generate \(\tilde{X_i}\) based on a specified number of components, k. The following code shows how to access the projection of our original data using only the first k=10 components, and we assert that the dimensions of this data matches our original image, namely 504x360 (since we transpose the image for reasons discussed earlier).

We have established that it is possible to reconstitute an estimate of our original data \(X_i\), which we called \(\tilde{X_i}\), from the principal component scores and a subset of the principal axes associated with the highest variance. The next question is how many columns of \(V_i\) need to be retained to ensure \(\tilde{X_i}\) is a reasonable representation of the original data? There is no hard rule in this regard, however a common rule of thumb is to select enough components to ensure that 90% of the variance is captured. Having said that, each problem will be unique, and it will often be useful to generate an eigen spectrum plot as shown above to draw some conclusion as to a reasonable initial estimate.

In the case of our image of Poppet, we will demonstrate the effect of retaining an increasing number of principal axes in \(V_i\) to compute principal component scores, and then to project this back onto the original basis. The images below show a range of scenarios where we project the image using only 5 components all the way through to 70 components. Note that this is still a small subset of the total number of components, namely 360 in this case, but it is clear that once we include up to 50 components, the transformed image is almost indistinguishable from the original, at least to the human eye.

The final image in the table above is essentially the same as the original since we retain all components and so \(V_i {V_i}^T = I \) given that we know \(V_i\) is an orthogonal matrix by design. The code to generate this array of images is shown below. Here we load the original image into a Morpheus DataFrame, and then proceed to decompose it into red, green and blue components, perform PCA on each color, project the image as described above using only a subset of the principal axes associated with highest variance, and then record the resulting projection back out as an image file.

The dimensions of our original input data \(X_i\) is 504x360 which generates a covariance matrix with dimensions 360x360.Performing PCA on this data yields a set of 360 eigenvectors each of the same length, implying that the non-truncated version of \(V_i\) is also 360x360. If we decide to only keep the first k columns of \(V_i\) to create \(\tilde{V_i}\), then the resulting dimensions of \(\tilde{Y_i}\) will be 504xk. If k can be significantly smaller than 360 (the height of the original image) then \(\tilde{Y_i}\) would require much less storage space. We obviously also need to store \(\tilde{V_i}\) so that we can reconstitute our estimate of the data \(\tilde{X_i}\), but the expectation is that k can be small enough so that the storage required for two smaller matrices is less than that required for the original image.

In our example, the original image requires 181,440 32-bit RGBA values given it has dimensions of 504x360 pixels. The table below summarizes the total number of elements required to store \(\tilde{Y_i}\) and \(\tilde{V_i}\) for various values of k ranging from 5 through 60. We need to multiply this by 3 since we need to store a red, green and blue version of these matrices. The final column indicates the percent reduction on the original number of elements we achieve, and since the image reconstituted by retaining only the first 45 components is almost indistinguishable from the original, we can achieve 35% compression in that case.

Thank you for your interest! We will be in touch shortly.