
Sparse Coding


December 15, 2013 by earlbellinger

As the era of big data dawns upon us, we are faced with problems where not only the number of records is massive, but the number of features per record can be massive too. The problems that arise from data that is “too wide” are collectively referred to as the curse of dimensionality. It can therefore be vital to reduce the dimensionality of the data set while still preserving (the majority of) the information it holds.

Sparse coding is a technique for discovering a small number of “representatives” that can be used to reconstruct the original high-dimensional signal. In the linear generative model of sparse coding, we are given a set of input vectors in a (possibly high-dimensional) real space, and the goal is to find a set of basis vectors along with a sparse weight vector for each input, such that the linear combination of the basis vectors weighted by that vector closely approximates the input. (Note: there are other techniques for performing sparse coding.)
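To make that concrete, here is a minimal numpy sketch (the dictionary D and weights w below are made up purely for illustration): if we stack the basis vectors as the rows of a matrix D, each input vector is approximated as a linear combination of those rows, where most of the weights are zero.

import numpy as np
D = np.random.randn(8, 100)   # 8 hypothetical basis vectors in 100 dimensions
w = np.zeros(8)               # a sparse weight vector...
w[[1, 5]] = [2.3, -0.7]       # ...with only 2 nonzero entries
x_approx = w.dot(D)           # reconstruction: a combination of just 2 basis vectors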

A feature distinguishing sparse coding from other dimensionality reduction procedures (e.g. principal components analysis, singular value decomposition, etc.) is that the number of bases can exceed the input dimension. It is argued in the literature that this may make sparse coding more biologically realistic, as there is some evidence that the primary visual cortex acts in this manner (see e.g. Honglak Lee et al.’s 2006 Advances in Neural Information Processing Systems paper titled “Efficient sparse coding algorithms”).

Example

For this example we’ll use the Python package scikit-learn and its implementation of sparse coding. To make this work we also need a way to learn the dictionary, so for that we’ll use its implementation of mini-batch dictionary learning.

Let’s start by randomly generating some data – let’s say a 100×100 matrix.

from numpy.random import randn
from sklearn.decomposition import sparse_encode
n = m = 100 # dimensions of our input
input_x = randn(n, m)

Now let’s learn a dictionary for this data. I’ve arbitrarily chosen to recast this data into 8 dimensions. (OK – it’s not so arbitrary, it’s so that it will print nicely and fit within the box below. So sue me, right?)

new_dimensions = 8
from sklearn.decomposition import MiniBatchDictionaryLearning
mbdl = MiniBatchDictionaryLearning(new_dimensions)
mbdl.fit(input_x)

And finally let’s discover its sparse coding.

code = sparse_encode(input_x, mbdl.components_)
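(If you want to control how sparse the code ends up, sparse_encode also accepts optional arguments such as the encoding algorithm and a regularization weight. The values below are just an illustration, and the defaults may differ across scikit-learn versions.)

sparser_code = sparse_encode(input_x, mbdl.components_,
                             algorithm='lasso_lars', alpha=2.0)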

Let’s take a look at what our sparse coding looks like.

import numpy
numpy.set_printoptions(precision=3, suppress=True)
print(code)
[[ 1.136  1.642  0.     0.     2.288  0.     0.    -1.312]
 [ 5.403  0.    -0.23   0.    -0.643  0.     0.     0.   ]
 [ 0.    -4.954  0.     0.     0.     0.367  0.     0.   ]
 [ 0.    -0.399  3.109  3.88   0.     0.    -0.142  0.   ]
 [ 0.     0.     0.    -3.539  0.     0.     2.277  0.   ]
 [ 0.     0.     4.852  0.     0.     0.     0.     0.   ]
 [ 2.608  0.     0.     0.     0.     0.     1.212  0.   ]
 [ 0.    -2.732  0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.     0.     0.     0.     0.    -0.057 -1.109]
 [-0.067  1.324  0.     0.     2.688  0.     0.361 -1.757]
 [-0.428  0.     0.     0.    -2.794  0.     0.     0.   ]
 [ 0.     0.232 -2.303  0.     0.     0.     0.     0.   ]
 [ 0.     0.     0.     0.    -3.827  0.     2.814  0.   ]
 [ 0.07   0.     0.     2.722  0.553  0.    -0.268  0.   ]
 [ 0.    -0.014  0.     1.753  0.     0.     3.343  0.686]
 [ 0.     0.     0.     0.    -0.012  0.     0.     0.   ]
 [-2.562  0.    -0.893 -0.077  1.07   0.333  0.     1.435]
 [ 0.074  0.073  0.     0.     0.     0.    -0.157  0.   ]
 [ 0.    -0.431  0.    -1.514  2.793 -2.94   0.     0.   ]
 [ 0.     0.     0.    -0.021  0.     0.     0.     0.   ]
 [ 0.     0.     0.    -1.784  0.     0.     0.     0.   ]
 [ 0.     2.051  0.    -3.072 -2.106  0.    -3.602  0.   ]
 [ 0.     2.17   0.     1.984  0.     0.    -0.166 -0.217]
 [ 0.508  0.    -1.956  0.     0.     0.     0.     0.   ]
 [ 0.     1.207  2.932  0.     0.     0.     0.    -6.021]
 [ 0.     0.017  0.767  0.     0.    -4.851 -1.222  0.   ]
 [ 0.     0.     0.     0.    -0.332 -0.152  0.114  0.075]
 [-2.37   0.417  0.     0.     0.     0.     0.     0.621]
 [ 0.     0.     0.     0.     0.     0.     0.     0.   ]
 [ 2.245  0.     0.     0.1    0.     0.    -1.792  0.   ]
 [ 0.     0.     0.     0.     0.    -0.118  0.074  0.   ]
 [ 0.     0.     1.897 -0.046  0.288  0.     0.     0.171]
 [ 0.     0.027  0.     1.629  0.    -0.454  2.477  0.   ]
 [ 0.     0.     0.048  0.     0.    -0.44   2.823  0.   ]
 [-4.122  0.     0.     0.     0.     0.626  0.     0.   ]
 [ 0.     0.     0.     1.029 -1.49   0.     0.     0.   ]
 [ 0.     0.     0.     0.     0.    -3.95   0.     1.914]
 [ 2.513  0.     0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.     2.314  0.     0.     1.611  0.     0.   ]
 [ 2.954  0.     0.     0.     0.     0.     0.     0.978]
 [ 0.     0.     0.    -3.803 -2.581  0.     0.     0.   ]
 [ 1.329  0.     3.121 -1.166  0.     2.77   0.     0.   ]
 [ 0.     0.     3.783  0.     0.    -1.005  0.     0.   ]
 [ 0.     0.    -0.355  0.     0.     2.63  -2.786  0.   ]
 [-1.091  0.     0.     0.     0.     0.    -1.651  0.   ]
 [ 0.     0.     0.     0.     0.     0.     0.    -2.495]
 [ 0.     0.     0.    -0.754  0.     0.     1.318  0.   ]
 [ 0.     5.346  0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.161  0.     0.     0.     0.     1.948  0.   ]
 [-2.29   0.     0.     0.     0.    -3.28   0.    -3.554]
 [ 0.     2.913  0.     0.    -0.786  0.746  0.     0.   ]
 [-0.377  0.     0.     0.     0.     0.009  0.     1.536]
 [ 0.     0.     0.     0.     0.    -2.109  0.     0.   ]
 [ 0.    -1.724 -0.489  0.    -3.839  0.     0.     0.   ]
 [ 0.     0.    -0.565  0.    -0.008  0.     1.134  0.   ]
 [ 0.     0.     0.     0.     0.    -3.122  0.     0.   ]
 [-1.537  0.     0.     0.    -4.171  0.     0.     0.   ]
 [-0.377  0.     0.     0.     1.792  0.    -2.11   0.   ]
 [ 0.     0.    -3.186  0.     0.     0.    -0.933  0.   ]
 [ 0.    -0.686 -0.17   0.     0.    -0.394  0.     0.   ]
 [ 0.     0.     0.     0.     0.     0.     0.    -2.493]
 [ 0.    -5.188  1.288 -2.773  0.    -1.354  0.     0.   ]
 [ 0.     0.     0.     4.39   0.     0.047  0.     0.   ]
 [ 0.     0.     0.     0.     0.198 -0.899  2.473 -0.197]
 [ 0.    -0.513  0.     0.    -0.284  0.    -0.896  0.   ]
 [ 4.382  0.     0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.     0.     0.     0.    -1.381 -1.8    0.   ]
 [-0.544  0.     0.     0.335  0.     0.     0.    -3.601]
 [ 0.007  0.     0.     1.815  0.     0.     0.     2.465]
 [-0.259  0.    -0.986  0.     0.     0.     1.006  0.   ]
 [ 0.     0.     0.     0.     0.     0.    -3.265  0.   ]
 [-0.638  0.     0.     0.     0.    -0.167  0.     0.   ]
 [ 0.     0.     0.     0.     0.    -1.611  0.     0.891]
 [ 0.     0.     0.     0.     0.     0.     0.    -0.05 ]
 [ 0.    -0.91   0.     4.087  0.    -1.783  0.     0.   ]
 [ 0.     1.626  2.917  0.     0.     0.     0.    -0.062]
 [-0.433  0.     2.14   0.1    0.297  0.    -1.942  1.76 ]
 [ 0.    -1.001  0.     0.    -0.076  2.827 -0.385  0.   ]
 [ 0.364  0.     0.     0.     0.    -5.764  0.     0.   ]
 [ 0.     1.986  0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.     0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.    -0.609  3.693  0.    -0.108  0.     0.   ]
 [ 0.    -2.211 -0.451  0.     0.     0.     0.     0.   ]
 [ 0.     0.     1.973  0.     0.     0.994  0.     2.599]
 [ 0.     0.     0.     0.     0.     0.     1.821  0.   ]
 [ 0.285  0.     0.     0.     0.     0.     0.278  0.   ]
 [ 0.     0.    -1.752 -1.165  0.     0.     0.    -0.153]
 [-0.217  0.     0.     0.895  0.     0.     2.225 -0.446]
 [-0.69   0.     0.     0.     3.864  0.    -0.028  0.   ]
 [ 0.     2.361  0.     0.     0.    -0.58  -1.289  1.004]
 [ 3.161  0.     0.146  0.     0.     0.     0.     0.   ]
 [ 0.     0.     0.     2.021  0.     0.     0.     0.   ]
 [ 2.525  0.     0.     0.     0.     0.     0.     0.   ]
 [ 0.    -4.034  0.     0.     0.     0.     0.    -3.498]
 [ 0.     2.128  0.     0.    -0.369  0.     4.337  0.   ]
 [ 0.     0.    -3.112  0.177  0.     1.31   0.     0.   ]
 [ 0.     0.941  0.889  0.    -0.317  0.     1.874  1.533]
 [-1.625  0.     0.528  0.    -3.392  0.     0.     0.   ]
 [ 0.     0.     0.     0.     2.966  0.     2.564  3.434]
 [ 0.     0.     0.     0.    -0.542  0.     0.737  0.   ]]
 

As you can see, we got out a large sparse matrix that we could then use to reconstruct our original data. Since our original data is random, there’s not a lot of fancy wizardry to show off at this point. But it’s easy to see how this could be useful on real problems. This kind of technique is used extensively for things like image processing, where the input is a high-dimensional matrix, and a reconstruction with a decently high degree of accuracy still yields something quite recognizable to people.
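If you do want that reconstruction, it’s just the matrix product of the code with the dictionary. A rough sketch, continuing from the variables above:

reconstruction = numpy.dot(code, mbdl.components_)
print(((input_x - reconstruction) ** 2).mean())  # mean squared reconstruction error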

Painless, right?


2 thoughts on “Sparse Coding”

  1. Shazia says:

    Hi, thanks for the short and concrete explanation of sparse coding. Can we use sparse coding to upscale the feature dimension? I have tried your code on my data, which is a normalized version of image data of size (56084, 450). As in your example, I used dimension = 8 for my data. In the result, it generated (56084, 8). But all data after the 6th column becomes all zeros, as in the example below. Which parameters should I tweak to prevent the last few columns from being all zeros?

    data= [[ 0.309 0.305 0.306 …, 0.301 0.301 0.301]
    [ 0.374 0.347 0.352 …, 0.301 0.301 0.301]
    [ 0.302 0.303 0.303 …, 0.301 0.301 0.301]
    …,
    [ 0.317 0.322 0.312 …, 0.301 0.301 0.301]
    [ 0.385 0.349 0.367 …, 0.301 0.301 0.301]
    [ 0.359 0.387 0.39 …, 0.301 0.301 0.301]]

    newcode= [[ 5.84   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
    [ 6.     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
    [ 5.908  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
    [ 5.897  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
    [ 5.918  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
    [ 5.93   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
    [ 5.934  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
    [ 5.841  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
    [ 5.914  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]

    Thanks.
    Shazia.

  2. Gunjan NAik says:

    Nice tutorial, Earl. But how do we reconstruct the original data (or a compressed version of it) using the sparse codes?
