Representation Learning A Review and New Perspectives

更新时间:2023-05-02 01:24:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

Representation Learning:

A Review and New Perspectives

Yoshua Bengio,Aaron Courville,and Pascal Vincent

Abstract—The success of machine learning algorithms generally depends on data representation,and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data.

Although specific domain knowledge can be used to help design representations,learning with generic priors can also be used,and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors.This paper reviews recent work in the area of unsupervised feature learning and deep learning,covering advances in probabilistic models,autoencoders, manifold learning,and deep networks.This motivates longer term unanswered questions about the appropriate objectives for learning good representations,for computing representations(i.e.,inference),and the geometrical connections between representation learning,density estimation,and manifold learning.

Index Terms—Deep learning,representation learning,feature learning,unsupervised learning,Boltzmann machine,autoencoder, neural nets

?

1I NTRODUCTION

T HE performance of machine learning methods is heavily dependent on the choice of data representation(or features)on which they are applied.For that reason,much of the actual effort in deploying machine learning algo-rithms goes into the design of preprocessing pipelines and data transformations that result in a representation of the data that can support effective machine learning.Such feature engineering is important but labor intensive and highlights the weakness of current learning algorithms: Their inability to extract and organize the discriminative information from the data.Feature engineering is a way to take advantage of human ingenuity and prior knowledge to compensate for that weakness.To expand the scope and ease of applicability of machine learning,it would be highly desirable to make learning algorithms less dependent on feature engineering so that novel applications could be constructed faster,and more importantly,to make progress toward artificial intelligence(AI).An AI must fundamen-tally understand the world around us,and we argue that this can only be achieved if it can learn to identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data.

This paper is about representation learning,i.e.,learning representations of the data that make it easier to extract useful information when building classifiers or other predictors.In the case of probabilistic models,a good representation is often one that captures the posterior distribution of the underlying explanatory factors for the observed input.A good representation is also one that is useful as input to a supervised predictor.Among the various ways of learning representations,this paper focuses on deep learning methods:those that are formed by the composition of multiple nonlinear transformations with the goal of yielding more abstract—and ultimately more useful—representations.Here,we survey this rapidly developing area with special emphasis on recent progress. We consider some of the fundamental questions that have been driving research in this area.Specifically,what makes one representation better than another?Given an example, how should we compute its representation,i.e.,perform feature extraction?Also,what are appropriate objectives for learning good representations?

2W HY S HOULD W E C ARE ABOUT L EARNING R EPRESENTATIONS?

Representation learning has become a field in itself in the machine learning community,with regular workshops at the leading conferences such as NIPS and ICML,and a new conference dedicated to it,ICLR,1sometimes under the header of Deep Learning or Feature Learning.Although depth is an important part of the story,many other priors are interesting and can be conveniently captured when the problem is cast as one of learning a representation,as discussed in the next section.The rapid increase in scientific activity on representation learning has been accompanied and nourished by a remarkable string of empirical successes both in academia and in industry.Below,we briefly highlight some of these high points.

2.1Speech Recognition and Signal Processing Speech was one of the early applications of neural networks,in particular convolutional(or time-delay)neural

1798IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST2013 .The authors are with the Department of Computer Science and Operations

Research,Universite′de Montre′al,PO Box6128,Succ.Centre-Ville,

Montreal,Quebec H3C3J7,Canada.

Manuscript received9Apr.2012;revised17Oct.2012;accepted24Feb.2013;

published online28Feb.2013.

Recommended for acceptance by S.Bengio,L.Deng,f61efec4998fcc22bcd10d6frochelle,H.Lee,and

R.Salakhutdinov.

For information on obtaining reprints of this article,please send e-mail to:

tpami@f61efec4998fcc22bcd10d6f,and reference IEEECS Log Number

TPAMISI-2012-04-0260.

Digital Object Identifier no.10.1109/TPAMI.2013.50. 1.International Conference on Learning Representations.

0162-8828/13/$31.00?2013IEEE Published by the IEEE Computer Society

networks.2The recent revival of interest in neural networks, deep learning,and representation learning has had a strong impact in the area of speech recognition,with breakthrough results[54],[56],[183],[148],[55],[86]obtained by several academics as well as researchers at industrial labs bringing these algorithms to a larger scale and into products.For example,Microsoft released in2012a new version of their Microsoft Audio Video Indexing Service speech system based on deep learning[183].These authors managed to reduce the word error rate on four major benchmarks by about30percent(e.g.,from27.4to18.5percent on RT03S) compared to state-of-the-art models based on Gaussian mixtures for the acoustic modeling and trained on the same amount of data(309hours of speech).The relative improvement in error rate obtained by Dahl et al.[55]on a smaller large-vocabulary speech recognition benchmark (Bing mobile business search dataset,with40hours of

speech)is between16and23percent.

Representation-learning algorithms have also been ap-plied to music,substantially beating the state of the art in polyphonic transcription[34],with relative error improve-ment between5and30percent on a standard benchmark of four datasets.Deep learning also helped to win MIREX (music information retrieval)competitions,for example,in 2011on audio tagging[81].

2.2Object Recognition

The beginnings of deep learning in2006focused on the MNIST digit image classification problem[94],[23],break-ing the supremacy of SVMs(1.4percent error)on this dataset.3The latest records are still held by deep networks: Ciresan et al.[46]currently claim the title of state of the art for the unconstrained version of the task(e.g.,using a convolutional architecture),with0.27percent error,and Rifai et al.[169]is state of the art for the knowledge-free version of MNIST,with0.81percent error.

In the last few years,deep learning has moved from digits to object recognition in natural images,and the latest breakthrough has been achieved on the ImageNet dataset,4 bringing down the state-of-the-art error rate from26.1to 15.3percent[117].

2.3Natural Language Processing(NLP)

Besides speech recognition,there are many other NLP applications of representation learning.Distributed represen-tations for symbolic data were introduced by Hinton[87], and first developed in the context of statistical language modeling by Bengio et al.[19]in the so-called neural net language models[10].They are all based on learning a distributed representation for each word,called a word embedding.Adding a convolutional architecture,Collobert et al.[51]developed the SENNA system5that shares representations across the tasks of language modeling, part-of-speech tagging,chunking,named entity recognition, semantic role labeling,and syntactic parsing.SENNA approaches or surpasses the state of the art on these tasks, but is simpler and much faster than traditional predictors. Learning word embeddings can be combined with learning image representations in a way that allows associating text and images.This approach has been used successfully to build Google’s image search,exploiting huge quantities of data to map images and queries in the same space[218], and it has recently been extended to deeper multimodal representations[194].

The neural net language model was also improved by adding recurrence to the hidden layers[146],allowing it to beat the state of the art(smoothed n-gram models)not only in terms of perplexity(exponential of the average negative log likelihood of predicting the right next word, going down from140to102),but also in terms of word error rate in speech recognition(since the language model is an important component of a speech recognition system),decreasing it from17.2percent(KN5baseline) or16.9percent(discriminative language model)to 14.4percent on the Wall Street Journal benchmark task. Similar models have been applied in statistical machine translation[182],[121],improving perplexity and BLEU scores.Recursive autoencoders(which generalize recurrent networks)have also been used to beat the state of the art in full sentence paraphrase detection[192],almost dou-bling the F1score for paraphrase detection.Representation learning can also be used to perform word sense disambiguation[33],bringing up the accuracy from67.8 to70.2percent on the subset of Senseval-3where the system could be applied(with subject-verb-object sen-tences).Finally,it has also been successfully used to surpass the state of the art in sentiment analysis[70],[193].

2.4Multitask and Transfer Learning,Domain

Adaptation

Transfer learning is the ability of a learning algorithm to exploit commonalities between different learning tasks to share statistical strength and transfer knowledge across tasks. As discussed below,we hypothesize that representation learning algorithms have an advantage for such tasks because they learn representations that capture underlying factors,a subset of which may be relevant for each particular task,as illustrated in Fig. 1.This hypothesis seems confirmed by a number of empirical results showing

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES1799

Fig. 1.Illustration of representation-learning discovering explanatory

factors(middle hidden layer,in red),some explaining the input(semi-

supervised setting),and some explaining target for each task.Because

these subsets overlap,sharing of statistical strength helps generalization.

2.See[9]for a review of early work in this area.

3.For the knowledge-free version of the task where no image-specific

prior is used,such as image deformations or convolutions.

4.The1,000-class ImageNet benchmark,whose results are detailed here:

f61efec4998fcc22bcd10d6f/challenges/LSVRC/2012/results.

5.Downloadable from f61efec4998fcc22bcd10d6f/senna/.

the strengths of representation learning algorithms in transfer learning scenarios.

Most impressive are the two transfer learning challenges held in2011and won by representation learning algo-rithms.First,the transfer learning challenge,presented at an ICML2011workshop of the same name,was won using unsupervised layerwise pretraining[12],[145].A second transfer learning challenge was held the same year and won by Goodfellow et al.[72].Results were presented at NIPS2011’s Challenges in Learning Hierarchical Models Workshop.In the related domain adaptation setup,the target remains the same but the input distribution changes[70], [43].In the multitask learning setup,representation learning has also been found to be advantageous[117],[51]because of shared factors across tasks.

3W HAT M AKES A R EPRESENTATION G OOD?

3.1Priors for Representation Learning in AI

In[16],one of us introduced the notion of AI tasks,which are challenging for current machine learning algorithms and involve complex but highly structured dependencies. One reason why explicitly dealing with representations is interesting is because they can be convenient to express many general priors about the world around us,i.e.,priors that are not task specific but would be likely to be useful for a learning machine to solve AI tasks.Examples of such general-purpose priors are the following:

.Smoothness:Assumes that the function to be learned

f is s.t.x%y generally implies fexT%feyT.This

most basic prior is present in most machine learning,

but is insufficient to get around the curse of

dimensionality;see Section3.2.

.Multiple explanatory factors:The data generating distribution is generated by different underlying

factors,and for the most part,what one learns about

one factor generalizes in many configurations of the

other factors.The objective to recover or at least

disentangle these underlying factors of variation is

discussed in Section3.5.This assumption is behind

the idea of distributed representations,discussed in

Section3.3.

.A hierarchical organization of explanatory factors:The concepts that are useful for describing the world

around us can be defined in terms of other concepts,

in a hierarchy,with more abstract concepts higher in

the hierarchy,defined in terms of less abstract ones.

This assumption is exploited with deep representa-

tions,elaborated in Section3.4.

.Semi-supervised learning:With inputs X and target Y to predict,a subset of the factors explaining X’s

distribution explain much of Y,given X.Hence,

representations that are useful for PeXTtend to be

useful when learning PeY j XT,allowing sharing of

statistical strength between the unsupervised and

supervised learning tasks,see Section4.

.Shared factors across tasks:With many Y s of interest or many learning tasks in general,tasks(e.g.,the

corresponding PeY j X;taskT)are explained by

factors that are shared with other tasks,allowing

sharing of statistical strengths across tasks,as

discussed in the previous section(multitask and

transfer learning,domain adaptation).

.Manifolds:Probability mass concentrates near re-gions that have a much smaller dimensionality than

the original space where the data live.This is

explicitly exploited in some of the autoencoder

algorithms and other manifold-inspired algorithms

described,respectively,in Sections7.2and8.

.Natural clustering:Different values of categorical variables such as object classes are associated with

separate manifolds.More precisely,the local varia-

tions on the manifold tend to preserve the value of a

category,and a linear interpolation between exam-

ples of different classes in general involves going

through a low-density region,i.e.,PeX j Y?iTfor

different i tend to be well separated and not overlap

much.For example,this is exploited in the manifold

tangent classifier(MTC)discussed in Section8.3.

This hypothesis is consistent with the idea that

humans have named categories and classes because

of such statistical structure(discovered by their

brain and propagated by their culture),and machine

learning tasks often involve predicting such catego-

rical variables.

.Temporal and spatial coherence:Consecutive(from a sequence)or spatially nearby observations tend to be

associated with the same value of relevant catego-

rical concepts or result in a small move on the

surface of the high-density manifold.More gener-

ally,different factors change at different temporal

and spatial scales,and many categorical concepts of

interest change slowly.When attempting to capture

such categorical variables,this prior can be enforced

by making the associated representations slowly

changing,i.e.,penalizing changes in values over

time or space.This prior was introduced in[6]and is

discussed in Section11.3.

.Sparsity:For any given observation x,only a small fraction of the possible factors are relevant.In terms

of representation,this could be represented by

features that are often zero(as initially proposed

by Olshausen and Field[155]),or by the fact that

most of the extracted features are insensitive to small

variations of x.This can be achieved with certain

forms of priors on latent variables(peaked at0),or

by using a nonlinearity whose value is often flat at0

(i.e.,0and with a0derivative),or simply by

penalizing the magnitude of the Jacobian matrix(of

derivatives)of the function mapping input to

representation.This is discussed in Sections6.1.1

and7.2.

.Simplicity of factor dependencies:In good high-level representations,the factors are related to each other

through simple,typically linear dependencies.This

can be seen in many laws of physics and is assumed

when plugging a linear predictor on top of a learned

representation.

We can view many of the above priors as ways to help the learner discover and disentangle some of the underlying(and a priori unknown)factors of variation that the data may reveal.This idea is pursued further in Sections3.5and11.4.

1800IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST2013

3.2Smoothness and the Curse of Dimensionality For AI tasks such as vision and NLP,it seems hopeless to rely only on simple parametric models(such as linear models)because they cannot capture enough of the complexity of interest unless provided with the appropriate feature space.Conversely,machine learning researchers have sought flexibility in local6nonparametric learners such as kernel machines with a fixed generic local-response kernel (such as the Gaussian kernel).Unfortunately,as argued at length by Bengio and Monperrus[17],Bengio et al.[21], Bengio and LeCun[16],Bengio[11],and Bengio et al.[25], most of these algorithms only exploit the principle of local generalization,i.e.,the assumption that the target function(to be learned)is smooth enough,so they rely on examples to explicitly map out the wrinkles of the target function.General-ization is mostly achieved by a form of local interpolation between neighboring training examples.Although smooth-ness can be a useful assumption,it is insufficient to deal with the curse of dimensionality because the number of such wrinkles(ups and downs of the target function)may grow exponentially with the number of relevant interacting factors when the data are represented in raw input space. We advocate learning algorithms that are flexible and nonparametric,7but do not rely exclusively on the smooth-ness assumption.Instead,we propose to incorporate generic priors such as those enumerated above into representation-learning algorithms.Smoothness-based learners(such as kernel machines)and linear models can still be useful on top of such learned representations.In fact,the combination of learning a representation and kernel machine is equivalent to learning the kernel,i.e.,the feature space.Kernel machines are useful,but they depend on a prior definition of a suitable similarity metric or a feature space in which naive similarity metrics suffice.We would like to use the data,along with very generic priors,to discover those features or,equiva-lently,a similarity function.

3.3Distributed Representations

Good representations are expressive,meaning that a reason-ably sized learned representation can capture a huge number of possible input configurations.A simple counting argument helps us to assess the expressiveness of a model producing a representation:How many parameters does it require compared to the number of input regions(or configurations)it can distinguish?Learners of one-hot representations,such as traditional clustering algorithms, Gaussian mixtures,nearest-neighbor algorithms,decision trees,or Gaussian SVMs,all require OeNTparameters(and/ or OeNTexamples)to distinguish OeNTinput regions.One could naively believe that one cannot do better.However, restricted Boltzmann machines(RBMs),sparse coding, autoencoders,or multilayer neural networks can all represent up to Oe2kTinput regions using only OeNTparameters(with k the number of nonzero elements in a sparse representation,and k?N in nonsparse RBMs and other dense representations).These are all distributed8or sparse9representations.The generalization of clustering to distributed representations is multiclustering,where either several clusterings take place in parallel or the same clustering is applied on different parts of the input,such as in the very popular hierarchical feature extraction for object recognition based on a histogram of cluster categories detected in different patches of an image[120],[48].The exponential gain from distributed or sparse representations is discussed further in[11,Section3.2,Fig.3.2].It comes about because each parameter(e.g.,the parameters of one of the units in a sparse code or one of the units in an RBM)can be reused in many examples that are not simply near neighbors of each other,whereas with local generalization, different regions in input space are basically associated with their own private set of parameters,for example,as in decision trees,nearest neighbors,Gaussian SVMs,and so on.In a distributed representation,an exponentially large number of possible subsets of features or hidden units can be activated in response to a given input.In a single-layer model,each feature is typically associated with a preferred input direction,corresponding to a hyperplane in input space,and the code or representation associated with that input is precisely the pattern of activation(which features respond to the input,and how much).This is in contrast with a nondistributed representation such as the one learned by most clustering algorithms,for example, k-means,in which the representation of a given input vector is a one-hot code identifying which one of a small number of cluster centroids best represents the input.10 3.4Depth and Abstraction

Depth is a key aspect to the representation learning strategies we consider in this paper.As we will discuss,deep architectures are often challenging to train effectively,and this has been the subject of much recent research and progress.However,despite these challenges,they carry two significant advantages that motivate our long-term interest in discovering successful training strategies for deep architec-tures.These advantages are:1)deep architectures promote the reuse of features,and2)deep architectures can potentially lead to progressively more abstract features at higher layers of representations(more removed from the data).

Feature reuse.The notion of reuse,which explains the power of distributed representations,is also at the heart of the theoretical advantages behind deep learning,i.e.,

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES1801

6.Local in the sense that the value of the learned function at x depends mostly on training examples xetTs close to x.

7.We understand nonparametric as including all learning algorithms whose capacity can be increased appropriately as the amount of data and its complexity demands it,for example,including mixture models and neural networks where the number of parameters is a data-selected hyperparameter.

8.Distributed representations,where k out of N representation elements or feature values can be independently varied;for example, they are not mutually exclusive.Each concept is represented by having k features being turned on or active,while each feature is involved in representing many concepts.

9.Sparse representations:distributed representations where only a few of the elements can be varied at a time,i.e.,k

10.As discussed in[11],things are only slightly better when allowing continuous-valued membership values,for example,in ordinary mixture models(with separate parameters for each mixture component),but the difference in representational power is still exponential[149].The situation may also seem better with a decision tree,where each given input is associated with a one-hot code over the tree leaves,which deterministically selects associated ancestors(the path from root to node).Unfortunately,the number of different regions represented(equal to the number of leaves of the tree)still only grows linearly with the number of parameters used to specify it[15].

constructing multiple levels of representation or learning a hierarchy of features.The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit.The crucial property of a deep circuit is that its number of paths,i.e.,ways to reuse different parts,can grow exponentially with its depth.Formally,one can change the depth of a given circuit by changing the definition of what each node can compute,but only by a constant factor.The typical computations we allow in each node include:weighted sum,product,artificial neuron model(such as a monotone nonlinearity on top of an affine transformation),computation of a kernel,or logic gates.Theoretical results clearly show families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep[82],[83],[21], [16],[15].If the same family of functions can be represented with fewer parameters(or more precisely with a smaller VC dimension),learning theory would suggest that it can be learned with fewer examples, yielding improvements in both computational efficiency (less nodes to visit)and statistical efficiency(fewer parameters to learn,and reuse of these parameters over many different kinds of inputs).

Abstraction and invariance.Deep architectures can lead to abstract representations because more abstract concepts can often be constructed in terms of less abstract ones.In some cases,such as in the convolutional neural network[133],we explicitly build this abstraction in via a pooling mechanism (see Section11.2).More abstract concepts are generally invariant to most local changes of the input.That makes the representations that capture these concepts generally highly nonlinear functions of the raw input.This is obviously true of categorical concepts,where more abstract representations detect categories that cover more varied phenomena(e.g., larger manifolds with more wrinkles),and thus they potentially have greater predictive power.Abstraction can also appear in high-level continuous-valued attributes that are only sensitive to some very specific types of changes in the input.Learning these sorts of invariant features has been a long-standing goal in pattern recognition.

3.5Disentangling Factors of Variation

Beyond being distributed and invariant,we would like our representations to disentangle the factors of variation.Different explanatory factors of the data tend to change indepen-dently of each other in the input distribution,and only a few at a time tend to change when one considers a sequence of consecutive real-world inputs.

Complex data arise from the rich interaction of many sources.These factors interact in a complex web that can complicate AI-related tasks such as object classification.For example,an image is composed of the interaction between one or more light sources,the object shapes,and the material properties of the various surfaces present in the image.Shadows from objects in the scene can fall on each other in complex patterns,creating the illusion of object boundaries where there are none,and can dramatically effect the perceived object shape.How can we cope with these complex interactions?How can we disentangle the objects and their shadows?Ultimately,we believe the approach we adopt for overcoming these challenges must leverage the data itself,using vast quantities of unlabeled examples,to learn representations that separate the various explanatory sources.Doing so should give rise to a representation significantly more robust to the complex and richly structured variations extant in natural data sources for AI-related tasks.

It is important to distinguish between the related but distinct goals of learning invariant features and learning to disentangle explanatory factors.The central difference is the preservation of information.Invariant features,by definition,have reduced sensitivity in the direction of invariance.This is the goal of building features that are insensitive to variation in the data that are uninformative to the task at hand.Unfortunately,it is often difficult to determine a priori which set of features and variations will ultimately be relevant to the task at hand.Further,as is often the case in the context of deep learning methods,the feature set being trained may be destined to be used in multiple tasks that may have distinct subsets of relevant features.Considerations such as these lead us to the conclusion that the most robust approach to feature learning is to disentangle as many factors as possible,discarding as little information about the data as is practical.If some form of dimensionality reduction is desirable,then we hypothe-size that the local directions of variation least represented in the training data should be first to be pruned out(as in principal components analysis(PCA),for example,which does it globally instead of around each example).

3.6Good Criteria for Learning Representations? One of the challenges of representation learning that distinguishes it from other machine learning tasks such as classification is the difficulty in establishing a clear objective or target for training.In the case of classification, the objective is(at least conceptually)obvious;we want to minimize the number of misclassifications on the training dataset.In the case of representation learning,our objective is far removed from the ultimate objective,which typically is learning a classifier or some other predictor.Our problem is reminiscent of the credit assignment problem encountered in reinforcement learning.We have proposed that a good representation is one that disentangles the underlying factors of variation,but how do we translate that into appropriate training criteria?Is it even necessary to do anything but maximize likelihood under a good model or can we introduce priors such as those enumer-ated above(possibly data-dependent ones)that help the representation better do this disentangling?This question remains clearly open but is discussed in more detail in Sections3.5and11.

4.

4B UILDING D EEP R EPRESENTATIONS

In2006,a breakthrough in feature learning and deep learning was initiated by Geoff Hinton and quickly followed up in the same year[94],[23],[161],and soon after by Lee et al.[134]and many more later.It has been extensively reviewed and discussed in[11].A central idea, referred to as greedy layerwise unsupervised pretraining,was to learn a hierarchy of features one level at a time,using unsupervised feature learning to learn a new transforma-tion at each level to be composed with the previously learned transformations;essentially,each iteration of

1802IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST2013

unsupervised feature learning adds one layer of weights to a deep neural network.Finally,the set of layers could be combined to initialize a deep supervised predictor,such as a neural network classifier,or a deep generative model, such as a deep Boltzmann machine(DBM)[176].

This paper is mostly about feature learning algorithms that can be used to form deep architectures.In particular,it was empirically observed that layerwise stacking of feature extraction often yielded better representations,for example, in terms of classification error[119],[65],quality of the samples generated by a probabilistic model[176],or in terms of the invariance properties of the learned features [71].Whereas this section focuses on the idea of stacking single-layer models,Section10follows up with a discussion on joint training of all the layers.

After greedy layerwise unsupervised pretraining,the resulting deep features can be used either as input to a standard supervised machine learning predictor(such as an SVM)or as initialization for a deep supervised neural network(e.g.,by appending a logistic regression layer or purely supervised layers of a multilayer neural network). The layerwise procedure can also be applied in a purely supervised setting,called the greedy layerwise supervised pretraining[23].For example,after the first one-hidden-layer multilayer perceptrons(MLP)is trained,its output layer is discarded and another one-hidden-layer MLP can be stacked on top of it,and so on.Although results reported in[23]were not as good as for unsupervised pretraining, they were nonetheless better than without pretraining at all. Alternatively,the outputs of the previous layer can be fed as extra inputs for the next layer(in addition to the raw input), as successfully done in[221].Another variant[184] pretrains in a supervised way all the previously added layers at each step of the iteration,and in their experiments, this discriminant variant yielded better results than un-supervised pretraining.

Whereas combining single layers into a supervised model is straightforward,it is less clear how layers pretrained by unsupervised learning should be combined to form a better unsupervised model.We cover here some of the approaches to do so,but no clear winner emerges,and much work has to be done to validate existing proposals or improve them.

The first proposal was to stack pretrained RBMs into a deep belief network[94]or DBN,where the top layer is interpreted as an RBM and the lower layers as a directed sigmoid belief network.However,it is not clear how to approximate maximum likelihood training to further optimize this generative model.One option is the wake-sleep algorithm[94],but more work should be done to assess the efficiency of this procedure in terms of improving the generative model.

The second approach that has been put forward is to combine the RBM parameters into a DBM by basically halving the RBM weights to obtain the DBM weights[176]. The DBM can then be trained by approximate maximum likelihood as discussed in more detail later(Section10.2). This joint training has brought substantial improvements, both in terms of likelihood and in terms of classification performance of the resulting deep feature learner[176].

Another early approach was to stack RBMs or autoenco-ders into a deep autoencoder[92].If we have a series of encoder-decoder pairsefeiTeáT;geiTeáTT,then the overall encoder is the composition of the encoders,feNTe...fe2Tefe1TeáTTT,and the overall decoder is its“transpose”(often with transposed weight matrices as well),ge1Tege2Te...feNTeáTTT.The deep autoencoder(or its regularized version,as discussed in Section7.2)can then be jointly trained,with all the parameters optimized with respect to a global reconstruction error criterion.More work on this avenue clearly needs to be done,and it was probably avoided by fear of the challenges in training deep feedforward networks,discussed in Section10 along with very encouraging recent results.

Yet another recently proposed approach to training deep architectures[154]is to consider the iterative construction of a free energy function(i.e.,with no explicit latent variables, except possibly for a top-level layer of hidden units)for a deep architecture as the composition of transformations associated with lower layers,followed by top-level hidden units.The question is then how to train a model defined by an arbitrary parameterized(free)energy function.Ngiam et al.[154]have used hybrid Monte Carlo[153],but other options include contrastive divergence(CD)[88],[94],score matching[97],[99],denoising score matching[112],[210], ratio matching[98],and noise-contrastive estimation[80]. 5S INGLE-L AYER L EARNING M ODULES

Within the community of researchers interested in repre-sentation learning,there has developed two broad parallel lines of inquiry:one rooted in probabilistic graphical models and one rooted in neural networks.Fundamentally, the difference between these two paradigms is whether the layered architecture of a deep learning model is to be interpreted as describing a probabilistic graphical model or as describing a computation graph.In short,are hidden units considered latent random variables or as computa-tional nodes?

To date,the dichotomy between these two paradigms has remained in the background,perhaps because they appear to have more characteristics in common than separating them.We suggest that this is likely a function of the fact that much recent progress in both of these areas has focused on single-layer greedy learning modules and the similarities between the types of single-layer models that have been explored:mainly,the RBM on the probabilistic side and the autoencoder variants on the neural network side.Indeed,as shown by one of us[210]and others[200], in the case of the RBM,training the model via an inductive principle known as score matching[97](to be discussed in Section6.4.3)is essentially identical to applying a regular-ized reconstruction objective to an autoencoder.Another strong link between pairs of models on both sides of this divide is when the computational graph for computing representation in the neural network model corresponds exactly to the computational graph that corresponds to inference in the probabilistic model,and this happens to also correspond to the structure of graphical model itself (e.g.,as in the RBM).

The connection between these two paradigms becomes more tenuous when we consider deeper models,where in

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES1803

the case of a probabilistic model,exact inference typically becomes intractable.In the case of deep models,the computational graph diverges from the structure of the model.For example,in the case of a DBM,unrolling variational(approximate)inference into a computational graph results in a recurrent graph structure.We have performed preliminary exploration[179]of deterministic variants of deep autoencoders whose computational graph is similar to that of a DBM(in fact very close to the mean-field variational approximations associated with the Boltzmann machine),and that is one interesting inter-mediate point to explore(between the deterministic approaches and the graphical model approaches).

In the next few sections,we will review the major developments in single-layer training modules used to support feature learning and particularly deep learning.We divide these sections between(Section6)the probabilistic models,with inference and training schemes that directly parameterize the generative—or decoding—pathway and (Section7)the typically neural network-based models that directly parametrize the encoding pathway.Interestingly, some models like predictive sparse decomposition(PSD) [109]inherit both properties and will also be discussed (Section7.2.4).We then present a different view of representation learning,based on the associated geometry and the manifold assumption,in Section8.

First,let us consider an unsupervised single-layer representation learning algorithm spanning all three views: probabilistic,autoencoder,and manifold learning.

5.1PCA

We will use probably the oldest feature extraction algo-rithm,PCA,to illustrate the probabilistic,autoencoder,and manifold views of representation learning.PCA learns a linear transformation h?fexT?W T xtb of input x2I R d x, where the columns of d x?d h matrix W form an orthogonal basis for the d h orthogonal directions of greatest variance in the training data.The result is d h features(the components of representation h)that are decorrelated.The three interpretations of PCA are the following:1)It is related to probabilistic models(Section6)such as probabilistic PCA, factor analysis,and the traditional multivariate Gaussian distribution(the leading eigenvectors of the covariance matrix are the principal components);2)the representation it learns is essentially the same as that learned by a basic linear autoencoder(Section7.2);and3)it can be viewed as a simple linear form of linear manifold learning(Section8),i.e., characterizing a lower dimensional region in input space near which the data density is peaked.Thus,PCA may be in the back of the reader’s mind as a common thread relating these various viewpoints.Unfortunately,the expressive power of linear features is very limited:They cannot be stacked to form deeper,more abstract representations since the composition of linear operations yields another linear operation.Here,we focus on recent algorithms that have been developed to extract nonlinear features,which can be stacked in the construction of deep networks,although some authors simply insert a nonlinearity between learned single-layer linear projections[125],[43].

Another rich family of feature extraction techniques that this review does not cover in any detail due to space constraints is independent component analysis or ICA [108],[8].Instead,we refer the reader to[101],[103].Note that while in the simplest case(complete,noise free)ICA yields linear features,in the more general case it can be equated with a linear generative model with non-Gaussian independent latent variables,similar to sparse coding (Section6.1.1),which result in nonlinear features.Therefore, ICA and its variants like independent and topographic ICA [102]can and have been used to build deep networks[122], [125](see Section11.2).The notion of obtaining indepen-dent components also appears similar to our stated goal of disentangling underlying explanatory factors through deep networks.However,for complex real-world distributions,it is doubtful that the relationship between truly independent underlying factors and the observed high-dimensional data can be adequately characterized by a linear transformation. 6P ROBABILISTIC M ODELS

From the probabilistic modeling perspective,the question of feature learning can be interpreted as an attempt to recover a parsimonious set of latent random variables that describe a distribution over the observed data.We can express as pex;hTa probabilistic model over the joint space of the latent variables,h,and observed data or visible variables x.Feature values are conceived as the result of an inference process to determine the probability distribution of the latent variables given the data,i.e.,peh j xT,often referred to as the posterior probability.Learning is conceived in terms of estimating a set of model parameters that (locally)maximize the regularized likelihood of the training data.The probabilistic graphical model formalism gives us two possible modeling paradigms in which we can consider the question of inferring latent variables,directed and undirected graphical models,which differ in their para-meterization of the joint distribution pex;hT,yielding major impact on the nature and computational costs of both inference and learning.

6.1Directed Graphical Models

Directed latent factor models separately parameterize the conditional likelihood pex j hTand the prior pehTto construct the joint distribution,pex;hT?pex j hTpehT.Examples of this decomposition include:PCA[171],[206],sparse coding [155],sigmoid belief networks[152],and the newly introduced spike-and-slab sparse coding(S3C)model[72].

6.1.1Explaining Away

Directed models often lead to one important property: explaining away,i.e.,a priori independent causes of an event can become nonindependent given the observation of the f61efec4998fcc22bcd10d6ftent factor models can generally be interpreted as latent cause models,where the h activations cause the observed x.This renders the a priori independent h to be nonindependent.As a consequence,recovering the posterior distribution of h,peh j xT(which we use as a basis for feature representation),is often computationally challenging and can be entirely intractable,especially when h is discrete.

A classic example that illustrates the phenomenon is to imagine you are on vacation away from home and you receive a phone call from the security system company

1804IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST2013

telling you that the alarm has been activated.You begin worrying your home has been burglarized,but then you hear on the radio that a minor earthquake has been reported in the area of your home.If you happen to know from prior experience that earthquakes sometimes cause your home alarm system to activate,then suddenly you relax,con-fident that your home has very likely not been burglarized.

The example illustrates how the alarm activation rendered two otherwise entirely independent causes,burglarized and earthquake,to become dependent—in this case,the depen-dency is one of mutual exclusivity.Since both burglarized and earthquake are very rare events and both can cause alarm activation,the observation of one explains away the other. Despite the computational obstacles we face when attempt-ing to recover the posterior over h,explaining away promises to provide a parsimonious peh j xT,which can be an extremely useful characteristic of a feature encoding scheme.If one thinks of a representation as being composed of various feature detectors and estimated attributes of the observed input,it is useful to allow the different features to compete and collaborate with each other to explain the input.This is naturally achieved with directed graphical models,but can also be achieved with undirected models (see Section6.2)such as Boltzmann machines if there are lateral connections between the corresponding units or corresponding interaction terms in the energy function that defines the probability model.

Probabilistic interpretation of PCA.PCA can be given a natural probabilistic interpretation[171],[206]as factor analysis:

pehT?N à

h;0; 2

h

I

á

;

pex j hT?N à

x;Wht x; 2

x

I

á

;

e1T

where x2I R d x,h2I R d h,Nev; ;?Tis the multivariate normal density of v with mean and covariance?,and columns of W span the same space as leading d h principal components,but are not constrained to be orthonormal.

Sparse coding.Like PCA,sparse coding has both a probabilistic and nonprobabilistic interpretation.Sparse coding also relates a latent representation h(either a vector of random variables or a feature vector,depending on the interpretation)to the data x through a linear mapping W, which we refer to as the dictionary.The difference between sparse coding and PCA is that sparse coding includes a penalty to ensure a sparse activation of h is used to encode each input x.From a nonprobabilistic perspective,sparse coding can be seen as recovering the code or feature vector associated with a new input x via

h??fexT?argmin

h k xàWh

2

2

t

h k

1

:e2T

Learning the dictionary W can be accomplished by optimiz-ing the following training criterion with respect to W:

J SC?

X

t

k xetTàWh?etTk22;e3T

where xetTis the t th example and h?etTis the corresponding sparse code determined by(2).W is usually constrained to have unit-norm columns(because one can arbitrarily exchange scaling of column i with scaling of hetTi,such a constraint is necessary for the L1penalty to have any effect).

The probabilistic interpretation of sparse coding differs from that of PCA in that instead of a Gaussian prior on the latent random variable h,we use a sparsity inducing Laplace prior(corresponding to an L1penalty):

pehT?

Y d h

i

2

expeà j h i jT;

pex j hT?N

à

x;Wht x; 2

x

I

á

:

e4T

In the case of sparse coding,because we will seek a sparse representation(i.e.,one with many features set to exactly zero),we will be interested in recovering the maximum a posteriori value(MAP value of h,i.e.,h??argmax h peh j xTrather than its expected value I E?h j x ).Under this interpretation,dictionary learning proceeds as maximizing the likelihood of the data given these MAP values of h?: argmax W

Q

t

pexetTj h?etTTsubject to the norm constraint on W.Note that this parameter learning scheme,subject to the MAP values of the latent h,is not standard practice in the probabilistic graphical model literature.Typically,the likelihood of the data pexT?

P

h

pex j hTpehTis maximized directly.In the presence of latent variables,expectation maximization is employed,where the parameters are optimized with respect to the marginal likelihood,i.e., summing or integrating the joint log likelihood over all values of the latent variables under their posterior Peh j xT, rather than considering only the single MAP value of h. The theoretical properties of this form of parameter learning are not yet well understood but seem to work well in practice(e.g.,k-means versus Gaussian mixture models and Viterbi training for HMMs).Note also that the interpretation of sparse coding as a MAP estimation can be questioned[77]because,even though the interpretation of the L1penalty as a log prior is a possible interpretation, there can be other Bayesian interpretations compatible with the training criterion.

Sparse coding is an excellent example of the power of explaining away.Even with a very overcomplete diction-ary,11the MAP inference process used in sparse coding to find h?can pick out the most appropriate bases and zero the others,despite them having a high degree of correla-tion with the input.This property arises naturally in directed graphical models such as sparse coding and is entirely due to the explaining away effect.It is not seen in commonly used undirected probabilistic models such as the RBM,nor is it seen in parametric feature encoding methods such as autoencoders.The tradeoff is that compared to methods such as RBMs and autoencoders, inference in sparse coding involves an extra inner loop of optimization to find h?with a corresponding increase in the computational cost of feature f61efec4998fcc22bcd10d6fpared to autoencoders and RBMs,the code in sparse coding is a free variable for each example,and in that sense,the implicit encoder is nonparametric.

One might expect that the parsimony of the sparse coding representation and its explaining away effect would be advantageous and indeed it seems to be the case.Coates

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES1805

11.Overcomplete:with more dimensions of h than dimensions of x.

and Ng[48]demonstrated on the CIFAR-10object classification task[116]with a patch-base feature extraction pipeline that in the regime with few(<1;000)labeled training examples per class,the sparse coding representa-tion significantly outperformed other highly competitive encoding schemes.Possibly because of these properties and because of the very computationally efficient algorithms that have been proposed for it(in comparison with the general case of inference in the presence of explaining away),sparse coding enjoys considerable popularity as a feature learning and encoding paradigm.There are numer-ous examples of its successful application as a feature representation scheme,including natural image modeling [159],[109],[48],[224],audio classification[78],NLP[4],as well as being a very successful model of the early visual cortex[155].Sparsity criteria can also be generalized successfully to yield groups of features that prefer to all be zero,but if one or a few of them are active,then the penalty for activating others in the group is small.Different group sparsity patterns can incorporate different forms of prior knowledge[110],[107],[3],[76].

S3C.S3C is one example of a promising variation on sparse coding for feature learning[73].The S3C model possesses a set of latent binary spike variables together with a set of latent real-valued slab variables.The activation of the spike variables dictates the sparsity pattern.S3C has been applied to the CIFAR-10and CIFAR-100object classification tasks[116],and shows the same pattern as sparse coding of superior performance in the regime of relatively few(<1;000)labeled examples per class[73].In fact,in both the CIFAR-100dataset(with500examples per class)and the CIFAR-10dataset(when the number of examples is reduced to a similar range),the S3C representa-tion actually outperforms sparse coding representations. This advantage was revealed clearly with S3C winning the NIPS’11Transfer Learning Challenge[72].

6.2Undirected Graphical Models

Undirected graphical models,also called Markov random fields(MRFs),parameterize the joint pex;hTthrough a product of unnormalized nonnegative clique potentials:

pex;hT?

1

Z

Y

i

i

exT

Y

j

jehT

Y

k

kex;hT;e5T

where iexT, jehT,and kex;hTare the clique potentials describing the interactions between the visible elements, between the hidden variables,and those interaction between the visible and hidden variables,respectively. The partition function Z ensures that the distribution is normalized.Within the context of unsupervised feature learning,we generally see a particular form of MRF called a Boltzmann distribution with clique potentials constrained to be positive:

pex;hT?

1

Z

expàE ex;hT

eT;e6T

where E ex;hTis the energy function and contains the interactions described by the MRF clique potentials and are the model parameters that characterize these interactions.

The Boltzmann machine was originally defined as a network of symmetrically coupled binary random variables or units.These stochastic units can be divided into two groups:1)the visible units x2f0;1g d x that represent the data,and2)the hidden or latent units h2f0;1g d h that mediate dependencies between the visible units through their mutual interactions.The pattern of interaction is specified through the energy function:

E BM

ex;hT?à

1

2

x T Uxà

1

2

h T V hàx T Whàb T xàd T h;e7Twhere ?f U;V;W;b;d g are the model parameters that, respectively,encode the visible-to-visible interactions,the hidden-to-hidden interactions,the visible-to-hidden inter-actions,the visible self-connections,and the hidden self-connections(called biases).To avoid overparameterization, the diagonals of U and V are set to zero.

The Boltzmann machine energy function specifies the probability distribution over?x;h ,via the Boltzmann distribution,(6),with the partition function Z given by

Z ?

X

x1?1

x1?0

ááá

X

x d x?1

x d x?0

X

h1?1

h1?0

ááá

X

h d

h

?1

h d

h

?0

exp

à

àE BM

ex;h; T

á

:e8T

This joint probability distribution gives rise to the set of conditional distributions of the form:

Peh i j x;h n iT?sigmoid

X

j

W ji x jt

X

i0?i

V ii0h i0td i

;e9TPex j j h;x n jT?sigmoid

X

i

W ji x jt

X

j0?j

U jj0x j0tb j

:e10T

In general,inference in the Boltzmann machine is intract-able.For example,computing the conditional probability of h i given the visibles,Peh i j xT,requires marginalizing over the rest of the hiddens,which implies evaluating a sum with2d hà1terms:

Peh i j xT?

X

h1?1

h1?0

ááá

X

h ià1?1

h ià1?0

X

h it1?1

h it1?0

ááá

X

h d

h

?1

h d

h

?0

Peh j xT:e11T

However,with some judicious choices in the pattern of interactions between the visible and hidden units,more tractable subsets of the model family are possible,as we discuss next.

RBMs.The RBM is likely the most popular subclass of Boltzmann machine[190].It is defined by restricting the interactions in the Boltzmann energy function,in(7),to only

those between h and x,i.e.,E RBM

is E BM with U?0and V?0.As such,the RBM can be said to form a bipartite graph with the visibles and the hiddens forming two layers of vertices in the graph(and no connection between units of the same layer).With this restriction,the RBM possesses the useful property that the conditional distribution over the hidden units factorizes given the visibles:

Peh j xT?

Y

i

Peh i j xT;

Peh i?1j xT?sigmoid

X

j

W ji x jtd i

:

e12T

1806IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST2013

Likewise,the conditional distribution over the visible units given the hiddens also factorizes:

Pex j hT?

Y

j

Pex j j hT;

Pex j?1j hT?sigmoid

X

i W ji h itb j

:

e13T

This makes inferences readily tractable in RBMs.For example,the RBM feature representation is taken to be the set of posterior marginals Peh i j xTwhich,given the conditional independence described in(12),are immedi-ately available.Note that this is in stark contrast to the situation with popular directed graphical models for unsupervised feature extraction,where computing the posterior probability is intractable.

Importantly,the tractability of the RBM does not extend to its partition function,which still involves summing an exponential number of terms.It does imply,however,that we can limit the number of terms to min f2d x;2d h f61efec4998fcc22bcd10d6fually, this is still an unmanageable number of terms,and therefore we must resort to approximate methods to deal with its estimation.

It is difficult to overstate the impact the RBM has had to the fields of unsupervised feature learning and deep learning.It has been used in a truly impressive variety of applications,including fMRI image classification[180], motion and spatial transformations[201],[144],collabora-tive filtering[178],and natural image modeling[160],[53].

6.3Generalizations of the RBM to Real-Valued Data Important progress has been made in the last few years in defining generalizations of the RBM that better capture real-valued data,in particular real-valued image data,by better modeling the conditional covariance of the input pixels.The standard RBM,as discussed above,is defined with both binary visible variables v2f0;1g and binary latent vari-ables h2f0;1g.The tractability of inference and learning in the RBM has inspired many authors to extend it,via modifications of its energy function,to model other kinds of data distributions.In particular,there have been multiple attempts to develop RBM-type models of real-valued data, where x2I R d x.The most straightforward approach to modeling real-valued observations within the RBM frame-work is the so-called Gaussian RBM(GRBM),where the only change in the RBM energy function is to the visible units biases,by adding a bias term that is quadratic in the visible units x.While it probably remains the most popular way to model real-valued data within the RBM framework, Ranzato and Hinton[160]suggest that the GRBM has proven to be a somewhat unsatisfactory model of natural images.The trained features typically do not represent sharp edges that occur at object boundaries and lead to latent representations that are not particularly useful features for classification tasks.Ranzato and Hinton[160] argue that the failure of the GRBM to adequately capture the statistical structure of natural images stems from the exclusive use of the model capacity to capture the conditional mean at the expense of the conditional covariance.Natural images,they argue,are chiefly char-acterized by the covariance of the pixel values,not by their absolute values.This point is supported by the common use of preprocessing methods that standardize the global scaling of the pixel values across images in a dataset or across the pixel values within each image.

These kinds of concerns about the ability of the GRBM to model natural image data has led to the development of alternative RBM-based models that each attempt to take on this objective of better modeling nondiagonal conditional covariances.Ranzato and Hinton[160]introduced the mean and covariance RBM(mcRBM).Like the GRBM,the mcRBM is a two-layer Boltzmann machine that explicitly models the visible units as Gaussian distributed quantities.How-ever,unlike the GRBM,the mcRBM uses its hidden layer to independently parametrize both the mean and covariance of the data through two sets of hidden units.The mcRBM is a combination of the covariance RBM(cRBM)[163], which models the conditional covariance,and the GRBM, which captures the conditional mean.While the GRBM has shown considerable potential as the basis of a highly successful phoneme recognition system[54],it seems that due to difficulties in training the mcRBM,the model has been largely superseded by the mPoT model.The mPoT model(mean-product of Student’s T-distributions model)[164] is a combination of the GRBM and the product of Student’s T-distributions model[216].It is an energy-based model where the conditional distribution over the visible units conditioned on the hidden variables is a multivariate Gaussian(nondiagonal covariance)and the complementary conditional distribution over the hidden variables given the visibles are a set of independent Gamma distributions.

The PoT model has recently been generalized to the mPoT model[164]to include nonzero Gaussian means by the addition of GRBM-like hidden units,similarly to how the mcRBM generalizes the cRBM.The mPoT model has been used to synthesize large-scale natural images[164]that show large-scale features and shadowing structure.It has been used to model natural textures[113]in a tiled-convolution configuration(see Section11.2).

Another recently introduced RBM-based model with the objective of having the hidden units encode both the mean and covariance information is the spike-and-slab restricted Boltzmann machine(ssRBM)[52],[53].The ssRBM is defined as having both a real-valued“slab”variable and a binary“spike”variable associated with each unit in the hidden layer.The ssRBM has been demonstrated as a feature learning and extraction scheme in the context of CIFAR-10object classification[116]from natural images and has performed well in the role[52],[53].When trained convolutionally(see Section11.2)on full CIFAR-10natural images,the model demonstrated the ability to generate natural image samples that seem to capture the broad statistical structure of natural images better than previous parametric generative models,as illustrated with the samples of Fig.2.

The mcRBM,mPoT,and ssRBM each set out to model real-valued data such that the hidden units encode not only the conditional mean of the data but also its conditional covariance.Other than differences in the training schemes, the most significant difference between these models is how they encode their conditional covariance.While the mcRBM and the mPoT use the activation of the hidden units to

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES1807

enforce constraints on the covariance of x,the ssRBM uses the hidden unit to pinch the precision matrix along the direction specified by the corresponding weight vector. These two ways of modeling conditional covariance diverge when the dimensionality of the hidden layer is significantly different from that of the input.In the overcomplete setting,sparse activation with the ssRBM parametrization permits variance only in the select direc-tions of the sparsely activated hidden units.This is a property the ssRBM shares with sparse coding models [155],[78].On the other hand,in the case of the mPoT or mcRBM,an overcomplete set of constraints on the covar-iance implies that capturing arbitrary covariance along a particular direction of the input requires decreasing potentially all constraints with positive projection in that direction.This perspective would suggest that the mPoT and mcRBM do not appear to be well suited to provide a sparse representation in the overcomplete setting.

6.4RBM Parameter Estimation

Many of the RBM training methods we discuss here are applicable to more general undirected graphical models, but are particularly practical in the RBM setting.Freund and Haussler[66]proposed a learning algorithm for harmoniums(RBMs)based on projection pursuit.CD[88], [94]has been used most often to train RBMs,and many recent papers use stochastic maximum likelihood(SML) [220],[204].

As discussed in Section 6.1,in training probabilistic models,parameters are typically adapted to maximize the likelihood of the training data(or equivalently the log likelihood,or its penalized version,which adds a regular-ization term).With T training examples,the log likelihood is given by

X T

t?1

log P

à

xetT;

á

?

X T

t?1

log

X

h2f0;1g d h

P

à

xetT;h;

á

:e14T

Gradient-based optimization requires its gradient,which for Boltzmann machines is given by

@

@ i

X T

t?1

log pexetTT?à

X T

t?1

I E peh j xetTT

@

@ i

E BM

exetT;hT

!

t

X T

t?1

I E pex;hT

@

@ i

E BM

ex;hT

!

;

e15T

where we have the expectations with respect to pehetTj xetTTunder the“clamped”condition(also called the positive phase)and over the full joint pex;hTunder the“unclamped”condition(also called the negative phase).Intuitively,the gradient acts to locally move the model distribution(the negative phase distribution)toward the data distribution (positive phase distribution)by pushing down the energy of eh;xetTTpairs(for h$Peh j xetTT)while pushing up the energy ofeh;xTpairs(foreh;xT$Peh;xT)until the two forces are in equilibrium,at which point the sufficient statistics(gradient of the energy function)have equal expectations with x sampled from the training distribution or with x sampled from the model.

The RBM conditional independence properties imply that the expectation in the positive phase of(15)is tractable.The negative phase term—arising from the partition function’s contribution to the log-likelihood gradient—is more problematic because the computation of the expectation over the joint is not tractable.The various ways of dealing with the partition function’s contribution to the gradient have brought about a number of different training algorithms,many trying to approx-imate the log-likelihood gradient.

To approximate the expectation of the joint distribution in the negative phase contribution to the gradient,it is natural to again consider exploiting the conditional in-dependence of the RBM to specify a Monte Carlo approx-imation of the expectation over the joint:

I E pex;hT

@

@ i

E RBM

ex;hT

!

%

1

L

X L

l?1

@

@ i

E RBM

à

~xelT;~helT

á

;e16T

with the samplese~xelT;~helTTdrawn by a block Gibbs Markov chain Monte Carlo(MCMC)sampling procedure:

~xelT$Pex j~helà1TT;

~helT$Peh j~xelTT:

Naively,for each gradient update step one would start a Gibbs sampling chain,wait until the chain converges to the equilibrium distribution,and then draw a sufficient number of samples to approximate the expected gradient with respect to the model(joint)distribution in(16).Then,restart the process for the next step of approximate gradient ascent on the log-likelihood.This procedure has the obvious flaw that waiting for the Gibbs chain to“burn-in”and reach equilibrium anew for each gradient update cannot form the basis of a practical training algorithm.CD[88],[94],SML [220],[204],and fast-weights persistent CD or fast-weights

1808IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST

2013 Fig. 2.(Top)Samples from convolutionally trained -ssRBM from

[53].(Bottom)Images in the CIFAR-10training set closest

(L2distance with contrast normalized training images)to correspond-

ing model samples on top.The model does not appear to be

overfitting particular training examples.

persistent contrastive divergence(FPCD)[205]are all ways to avoid or reduce the need for burn in.

6.4.1CD

CD estimation[88],[94]estimates the negative phase expectation(15)with a very short Gibbs chain(often just one step)initialized at the training data used in the positive phase.This reduces the variance of the gradient estimator and still moves in a direction that pulls the negative chain samples toward the associated positive chain samples. Much has been written about the properties and alternative interpretations of CD and its similarity to autoencoder training,for example,[42],[225],[14],[197].

6.4.2SML

The SML algorithm(also known as persistent contrastive divergence or PCD)[220],[204]is an alternative way to sidestep an extended burn in of the negative phase Gibbs sampler.At each gradient update,rather than initializing the Gibbs chain at the positive phase sample as in CD,SML initializes the chain at the last state of the chain used for the previous update.In other words,SML uses a continually running Gibbs chain(or often a number of Gibbs chains run in parallel)from which samples are drawn to estimate the negative phase expectation.Despite the model para-meters changing between updates,these changes should be small enough that only a few steps of Gibbs(in practice, often one step is used)are required to maintain samples from the equilibrium distribution of the Gibbs chain,i.e., the model distribution.

A troublesome aspect of SML is that it relies on the Gibbs chain to mix well(especially between modes)for learning to succeed.Typically,as learning progresses and the weights of the RBM grow,the ergodicity of the Gibbs sample begins to break down.12If the learning rate associated with gradient

ascent t ^g(with E?^g %@log p exT

@ )is not reduced to

compensate,then the Gibbs sampler will diverge from the model distribution and learning will fail.Desjardins et al.

[58],Cho et al.[44],and Salakhutdinov[173],[174]have all considered various forms of tempered transitions to address the failure of Gibbs chain mixing,and convincing solutions have not yet been clearly demonstrated.A recently intro-duced promising avenue relies on depth itself,relying on the observation that mixing between modes is much easier on deeper layers[27](Section9.4).

Tieleman and Hinton[205]have proposed quite a different approach to addressing potential mixing problems of SML with their FPCD,and it has also been exploited to train DBMs[173]and construct a pure sampling algorithm for RBMs[39].FPCD builds on the surprising but robust tendency of Gibbs chains to mix better during SML learning than when the model parameters are fixed.The phenomenon is rooted in the form of the likelihood gradient itself(15).The samples drawn from the SML Gibbs chain are used in the negative phase of the gradient,which implies that the learning update will slightly increase the energy(decrease the probability)of those samples,making the region in the neighborhood of those samples less likely to be resampled and therefore making it more likely that the samples will move somewhere else (typically going near another mode).Rather than drawing samples from the distribution of the current model(with parameters ),FPCD exaggerates this effect by drawing samples from a local perturbation of the model with parameters ?and an update:

?

tt1

?e1à T tt1t ?tt ?

@

@ i

X T

t?1

log pexetTT

;e17T

where ?is the relatively large fast-weight learning rate ( ?> )and0< <1(but near1)is a forgetting factor that keeps the perturbed model close to the current model. Unlike tempering,FPCD does not converge to the model distribution as and ?go to0,and further work is necessary to characterize the nature of its approximation to the model distribution.Nevertheless,FPCD is a popular and apparently effective means of drawing approximate samples from the model distribution that faithfully repre-sent its diversity at the price of sometimes generating spurious samples in between two modes(because the fast weights roughly correspond to a smoothed view of the current model’s energy function).It has been applied in a variety of applications[205],[165],[113],and it has been transformed into a sampling algorithm[39]that also shares this fast mixing property with herding[215]for the same reason,i.e.,introducing negative correlations between con-secutive samples of the chain to promote faster mixing. 6.4.3Pseudolikelihood,Ratio Matching and More While CD,SML,and FPCD are by far the most popular methods for training RBMs and RBM-based models,all of these methods are perhaps most naturally described as offering different approximations to maximum likelihood training.There exist other inductive principles that are alternatives to maximum likelihood that can also be used to train RBMs.In particular,these include pseudolikelihood [32]and ratio matching[98].Both of these inductive principles attempt to avoid explicitly dealing with the partition function,and their asymptotic efficiency has been analyzed[140].Pseudolikelihood seeks to maximize the product of all1D conditional distributions of the form Pex d j x n dT,while ratio matching can be interpreted as an extension of score matching[97]to discrete data types.Both methods amount to weighted differences of the gradient of the RBM free energy13evaluated at a data point and at neighboring points.One potential drawback of these methods is that depending on the parameterization of the energy function,their computational requirements may scale up to Oen dTworse than CD,SML,FPCD,or denoising score matching[112],[210],discussed below.Marlin et al. [141]empirically compared all of these methods(except denoising score matching)on a range of classification,

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES1809

12.When weights become large,the estimated distribution is more

peaky,and the chain takes a very long time to mix,to move from mode to

mode,so that practically the gradient estimator can be very poor.This is a serious chicken-and-egg problem because if sampling is not effective,then neither is the training procedure,which may seem to stall and yield even larger weights.

13.The free energy Fex; Tis the energy associated with the data marginal probability,Fex; T?àlog PexTàlog Z and is tractable for the RBM.

reconstruction,and density modeling tasks and found that, in general,SML provided the best combination of overall performance and computational tractability.However,in a later study,the same authors[200]found denoising score matching to be a competitive inductive principle both in terms of classification performance(with respect to SML) and in terms of computational efficiency(with respect to analytically obtained score matching).Denoising score matching is a special case of the denoising autoencoder (DAE)training criterion(Section7.2.2)when the reconstruc-tion error residual equals a gradient,i.e.,the score function associated with an energy function,as shown in[210].

In the spirit of the Boltzmann machine gradient(15), several approaches have been proposed to train energy-based models.One is noise-contrastive estimation[80],in which the training criterion is transformed into a probabil-istic classification problem:distinguish between(positive) training examples and(negative)noise samples generated by a broad distribution(such as the Gaussian).Another family of approaches,more in the spirit of CD,relies on distinguishing positive examples(of the training distribu-tion)and negative examples obtained by perturbations of the positive examples[50],[33],[218].

7D IRECTLY L EARNING A P ARAMETRIC M AP FROM

I NPUT TO R EPRESENTATION

Within the framework of probabilistic models adopted in Section6,the learned representation is always associated with latent variables,specifically with their posterior distribution given an observed input x.Unfortunately,this posterior distribution tends to become very complicated and intractable if the model has more than a couple of interconnected layers,whether in the directed or undir-ected graphical model frameworks.It then becomes necessary to resort to sampling or approximate inference techniques,and to pay the associated computational and approximation error price.If the true posterior has a large number of modes that matter,then current inference techniques may face an unsurmountable challenge or endure a potentially serious approximation.This is in addition to the difficulties raised by the intractable partition function in undirected graphical models.Moreover, a posterior distribution over latent variables is not yet a simple usable feature vector that can,for example,be fed to a classifier.So,actual feature values are typically derived from that distribution,taking the latent variable’s expectation (as is typically done with RBMs),their marginal probability, or finding their most likely value(as in sparse coding).If we are to extract stable deterministic numerical feature values in the end anyway,an alternative(apparently) nonprobabilistic feature learning paradigm that focuses on carrying out this part of the computation very efficiently is that of autoencoders and other directly parameterized feature or representation functions.The commonality between these methods is that they learn a direct encoding, i.e.,a parametric map from inputs to their representation.

Regularized autoencoders,discussed next,also involve learning a decoding function that maps back from representation to input space.Sections8.1and11.3discuss direct encoding methods that do not require a decoder, such as semi-supervised embedding[217]and slow feature analysis[219].

7.1Autoencoders

In the autoencoder framework[129],[37],[93],one starts by explicitly defining a feature-extracting function in a specific parameterized closed form.This function,which we will denote f ,is called the encoder and will allow the straightforward and efficient computation of a feature vector h?f exTfrom an input x.For each example xetTfrom a dataset f xe1T;...;xeTTg,we define

hetT?f exetTT;e18Twhere hetTis the feature vector or representation or code computed from xetT.Another closed-form parameterized function g ,called the decoder,maps from feature space back into input space,producing a reconstruction r?g ehT. Whereas probabilistic models are defined from an explicit probability function and are trained to maximize(often approximately)the data likelihood(or a proxy),autoenco-ders are parameterized through their encoder and decoder and are trained using a different training principle.The set of parameters of the encoder and decoder are learned simultaneously on the task of reconstructing as well as possibly the original input,i.e.,attempting to incur the lowest possible reconstruction error Lex;rT—a measure of the discrepancy between x and its reconstruction r—over training examples.Good generalization means low recon-struction error at test examples,while having high reconstruction error for most other x configurations.To capture the structure of the data-generating distribution,it is therefore important that something in the training criterion or the parameterization prevents the autoencoder from learning the identity function,which has zero reconstruction error everywhere.This is achieved through various means in the different forms of autoencoders,as described below in more detail,and we call these regularized autoencoders.A particular form of regularization consists of constraining the code to have a low dimension,and this is what the classical autoencoder or PCA do.

In summary,basic autoencoder training consists in finding a value of parameter vector minimizing recon-struction error:

J AEe T?

X

t

LexetT;g ef exetTTTT;e19T

where xetTis a training example.This minimization is usually carried out by stochastic gradient descent as in the training of MLPs.Since autoencoders were primarily developed as MLPs,predicting their input,the most commonly used forms for the encoder and decoder are affine mappings,optionally followed by a nonlinearity:

f exT?s febtWxT;e20T

g ehT?s gedtW0hT;e21T

where s f and s g are the encoder and decoder activation functions(typically,the elementwise sigmoid or hyperbolic tangent nonlinearity,or the identity function if staying

1810IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST2013

linear).The set of parameters of such a model is ?f W;b;W 0;d g ,where b and d are called encoder and decoder bias vectors and W and W 0are the encoder and decoder weight matrices.

The choice of s g and L depends largely on the input domain range and nature and is usually chosen so that L returns a negative log likelihood for the observed value of x .A natural choice for an unbounded domain is a linear decoder with a squared reconstruction error,i.e.,s g ea T?a and L ex;r T?k x àr k 2.If inputs are bounded between 0and 1,however,ensuring a similarly bounded reconstruction can be achieved by using s g ?sigmoid .In addition,if the inputs are of a binary nature,a binary cross-entropy loss 14is sometimes used.

If both encoder and decoder use a sigmoid nonlinearity,then f ex Tand g eh Thave the exact same form as the conditionals P eh j v Tand P ev j h Tof binary RBMs (see Section 6.2).This similarity motivated an initial study [23]of the possibility of replacing RBMs with autoencoders as the basic pretraining strategy for building deep networks,as well as the comparative analysis of autoencoder reconstruc-tion error gradient and CD updates [14].

One notable difference in the parameterization is that RBMs use a single weight matrix,which follows naturally from their energy function,whereas the autoencoder framework allows for a different matrix in the encoder and decoder.In practice,however,weight tying in which one defines W 0?W T may be (and is most often)used,rendering the parameterizations identical.The usual train-ing procedures,however,differ greatly between the two approaches.A practical advantage of training autoencoder variants is that they define a simple tractable optimization objective that can be used to monitor progress .

In the case of a linear autoencoder (linear encoder and decoder)with squared reconstruction error,minimizing (19)learns the same subspace 15as PCA.This is also true when using a sigmoid nonlinearity in the encoder [37],but not if the weights W and W 0are tied (W 0?W T )because W cannot be forced into being small and W 0large to achieve a linear encoder.

Similarly,Le et al.[124]recently showed that adding a regularization term of the form P t P

j s 3eW j x et TTto a linear autoencoder with tied weights,where s 3is a nonlinear convex function,yields an efficient algorithm for learning linear ICA .

7.2Regularized Autoencoders

Like PCA,autoencoders were originally seen as a dimen-sionality reduction technique and thus used a bottleneck ,i.e.,d h d x .This can allow the autoencoder to simply duplicate the input in the features with perfect reconstruc-tion without having extracted more meaningful features.Recent research has demonstrated very successful alter-native ways,called regularized autoencoders ,to “constrain”

the representation,even when it is overcomplete.The effect of a bottleneck or of this regularization is that the autoencoder cannot reconstruct everything well that it is trained to reconstruct well;the training examples and generalization mean that reconstruction error is also small on test examples.An interesting justification [162]for the sparsity penalty (or any penalty that restricts in a soft way the volume of hidden configurations easily accessible by the learner)is that it acts in spirit like the partition function of RBMs by making sure that only few input configurations can have a low reconstruction error.

Alternatively,one can view the objective of the regular-ization applied to an autoencoder as making the represen-tation as “constant”(insensitive)as possible with respect to changes in input.This view immediately justifies two variants of regularized autoencoders described below:Contractive autoencoders (CAEs)reduce the number of effective degrees of freedom of the representation (around each point)by making the encoder contractive,i.e.,making the derivative of the encoder small (thus making the hidden units saturate),while the DAE makes the whole mapping “robust,”i.e.,insensitive to small random perturbations,or contractive,making sure that the recon-struction cannot stay good when moving in most directions around a training example.

7.2.1Sparse Autoencoders

The earliest uses of single-layer autoencoders for building deep architectures by stacking them [23]considered the idea of tying the encoder weights and decoder weights to restrict capacity as well as the idea of introducing a form of sparsity regularization [161].Sparsity in the representation can be achieved by penalizing the hidden unit biases (making these additive offset parameters more negative)[161],[134],[71],[118]or by directly penalizing the output of the hidden unit activations (making them closer to their saturating value at 0)[162],[123],[227].Penalizing the bias runs the danger that the weights could compensate for the bias,which could hurt numerical optimization.When directly penalizing the hidden unit outputs,several variants can be found in the literature,but a clear comparative analysis is still lacking.Although the L1penalty (i.e.,simply the sum of output elements h j in the case of sigmoid nonlinearity)would seem the most natural (because of its use in sparse coding),it is used in few papers involving sparse autoencoders.A close cousin of the L1penalty is the Student-t penalty (log e1th 2j T),originally proposed for sparse coding [155].Several papers penalize the average

output "h

j (e.g.,over a minibatch),and instead of pushing it to 0encourage it to approach a fixed target,either through a mean-square error penalty or maybe,more sensibly (because h j behaves like a probability),a Kullback-Liebler divergence with respect to the binomial distribution with

probability :à log "h

j àe1à Tlog e1à"h j Ttconstant,for example,with ?0:05.

7.2.2DAEs

Vincent et al.[212],[213]proposed altering the training objective in (19)from mere reconstruction to that of denoising an artificially corrupted input,i.e.,learning to reconstruct the clean input from a corrupted version.

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES 1811

14.L ex;r T?àP d x

i ?1x i log er i Tte1àr i Tlog e1àr i T.

15.Contrary to traditional PCA loading factors,but similarly to the parameters learned by probabilistic PCA,the weight vectors learned by a linear autoencoder are not constrained to form an orthonormal basis nor to have a meaningful ordering.They will,however,span the same subspace.

Learning the identity is no longer enough:The learner must capture the structure of the input distribution to optimally undo the effect of the corruption process,with the reconstruction essentially being a nearby but higher density point than the corrupted input.Fig.3illustrates that the DAE is learning a reconstruction function that corresponds to a vector field pointing toward high-density regions (the manifold where examples concentrate).

Formally,the objective optimized by a DAE is

J DAE?

X

t I E qe~x j xetTT

?

LexetT;g ef e~xTTT

?

;e22T

where I E qe~x j xetTT?á averages over corrupted examples~x drawn from corruption process qe~x j xetTT.In practice,this is optimized by stochastic gradient descent,where the stochastic gradient is estimated by drawing one or a few corrupted versions of xetTeach time xetTis considered. Corruptions considered in Vincent et al.[213]include additive isotropic Gaussian noise,salt and pepper noise for gray-scale images,and masking noise(salt or pepper only), for example,setting some randomly chosen inputs to0 (independently per example).Masking noise has been used in most of the simulations.Qualitatively better features are reported with denoising,resulting in improved classifica-tion,and DAE features performed similarly or better than RBM features.Chen et al.[43]show that a simpler alternative with a closed-form solution can be obtained when restricting to a linear autoencoder and have success-fully applied it to domain adaptation.

Vincent[210]relates DAEs to energy-based probabilistic models:DAEs basically learn in re~xTà~x a vector pointing

in the direction of the estimated score@log pe~xT

@~x (Fig.3).In the

special case of linear reconstruction and squared error, Vincent[210]shows that training an affine-sigmoid-affine DAE amounts to learning an energy-based model whose energy function is very close to that of a GRBM.Training uses a regularized variant of the score matching parameter estimation technique[97],[99],[112]termed denoising score matching[210].Swersky[199]had shown that training GRBMs with score matching is equivalent to training a regular autoencoder with an additional regularization term, while follow up on the theoretical results in[210],[200] showed the practical advantage of denoising to implement score matching efficiently.Finally,Alain and Bengio[1] generalize[210]and prove that DAEs of arbitrary para-meterization with small Gaussian corruption noise are general estimators of the score.

7.2.3CAEs

CAEs,proposed by Rifai et al.[167],follow up on DAEs and share a similar motivation of learning robust representa-tions.CAEs achieve this by adding an analytic contractive penalty to(19),the Frobenius norm of the encoder’s Jacobian,and results in penalizing the sensitivity of learned features to infinitesimal input variations.Let JexT?@f

@x

exTbe the Jacobian matrix of the encoder at x.The CAE’s training objective is

J CAE?

X

t

L

à

xetT;g

à

f exetTT

áá

t k JexetTTk2F;e23T

where is a hyperparameter controlling the strength of the regularization.For an affine sigmoid encoder,the contrac-tive penalty term is easy to compute:

J jexT?f exTje1àf exTjTW j;

JexT

k k2

F

?

X

j

ef exTje1àf exTjTT2k W j k2:e24T

There are at least three notable differences with DAEs, which may be partly responsible for the better performance that CAE features seem to empirically demonstrate:1)The sensitivity of the features is penalized16rather than that of the reconstruction.2)The penalty is analytic rather than stochastic:An efficiently computable expression replaces what might otherwise require d x corrupted samples to size up(i.e.,the sensitivity in d x directions).3)A hyperpara-meter allows fine control of the tradeoff between reconstruction and robustness(while the two are mingled in a DAE).Note,however,that there is a tight connection between the DAE and the CAE:As shown in[1],a DAE with small corruption noise can be seen(through a Taylor expansion)as a type of contractive autoencoder where the contractive penalty is on the whole reconstruction function rather than just on the encoder.17

A potential disadvantage of the CAE’s analytic penalty is that it amounts to only encouraging robustness to infinitesimal input variations.This is remedied in[168]with the CAE+H,which penalizes all higher order derivatives in an efficient stochastic manner by adding a term that encourages JexTand Jext Tto be close:

J CAEtH?

X

t

L

à

xetT;g exetTT

á

t k JexetTTk2F

t I E

?

k JexTàJext Tk2F

?

;

e25T

where $Ne0; 2IT,and is the associated regularization strength hyperparameter.

The DAE and CAE have been successfully used to win the final phase of the Unsupervised and Transfer Learning Challenge[145].The representation learned by the CAE

1812IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST2013 Fig.3.When data concentrate near a lower dimensional manifold,the

corruption vector is typically almost orthogonal to the manifold,and the

reconstruction function learns to denoise,map from low-probability

configurations(corrupted inputs)to high-probability ones(original

inputs),creating a vector field aligned with the score(derivative of the

estimated density).

16.That is,the robustness of the representation is encouraged.

17.But note that in the CAE,the decoder weights are tied to the encoder

weights to avoid degenerate solutions,and this should also make the

decoder contractive.

tends to be saturated rather than sparse,i.e.,most hidden units are near the extremes of their range(e.g.,0or1),and

their derivative@h iexT

@x is near0.The nonsaturated units are

few and sensitive to the inputs,with their associated filters (weight vectors)together forming a basis explaining the local changes around x,as discussed in Section8.2. Another way to get saturated(nearly binary)units is semantic hashing[175].

7.2.4PSD

Sparse coding[155]may be viewed as a kind of auto-encoder that uses a linear decoder with a squared reconstruction error,but whose nonparametric encoder f performs the comparatively nontrivial and relatively costly iterative minimization of(2).A practically successful variant of sparse coding and autoencoders,named PSD [109],replaces that costly and highly nonlinear encoding step by a fast noniterative approximation during recogni-tion(computing the learned features).PSD has been applied to object recognition in images and video[110],[111],[106], but also to audio[84],mostly within the framework of multistage convolutional deep architectures(Section11.2). The main idea can be summarized by the following equation for the training criterion,which is simultaneously optimized with respect to hidden codes(representation)hetTand with respect to parameterseW; T:

J PSD?

X

t

k hetTk1tk xetTàWhetTk22tk hetTàf exetTTk22;

e26Twhere xetTis the input vector for example t,hetTis the optimized hidden code for that example,and f eáTis the encoding function,the simplest variant being

f exetTT?tanhebtW T xetTT;e27Twhere encodin

g weights are the transpose of decoding weights.Many variants have been proposed,including the use of a shrinkage operation instead of the hyperbolic tangent[111].Note that the L1penalty on

h tends to make them sparse,and how this is the same criterion as sparse coding with dictionary learning(3)except for the additional constraint that one should be able to approximate the sparse codes h with a parameterized encoder f exT.One can thus view PSD as an approximation to sparse coding, where we obtain a fast approximate encoder.Once PSD is trained,object representations f exTare used to feed a classifier.They are computed quickly and can be further fine tuned:The encoder can be viewed as one stage or one layer of a trainable multistage system such as a feedforward neural network.

PSD can also be seen as a kind of autoencoder,where the codes h are given some freedom that can help to further improve reconstruction.One can also view the encoding penalty added on top of sparse coding as a kind of regularizer that forces the sparse codes to be nearly computable by a smooth and efficient encoder.This is in contrast with the codes obtained by complete optimization of the sparse coding criterion,which are highly nonsmooth or even nondifferentiable,a problem that motivated other approaches to smooth the inferred codes of sparse coding [4],so a sparse coding stage could be jointly optimized along with following stages of a deep architecture.

8R EPRESENTATION L EARNING AS M ANIFOLD L EARNING

Another important perspective on representation learning is based on the geometric notion of manifold.Its premise is the manifold hypothesis,according to which real-world data presented in high-dimensional spaces are expected to concentrate in the vicinity of a manifold M of much lower dimensionality d M,embedded in high-dimensional input space I R d x.This prior seems particularly well suited for AI tasks such as those involving images,sounds,or text,for which most uniformly sampled input configurations are unlike natural stimuli.As soon as there is a notion of “representation,”one can think of a manifold by consider-ing the variations in input space which are captured by or reflected(by corresponding changes)in the learned repre-sentation.To first approximation,some directions are well preserved(the tangent directions of the manifold),while others are not(directions orthogonal to the manifolds).With this perspective,the primary unsupervised learning task is then seen as modeling the structure of the data-supporting manifold.18The associated representation being learned can be associated with an intrinsic coordinate system on the embedded manifold.The archetypal manifold modeling algorithm is,not surprisingly,also the archetypal low-dimensional representation learning algorithm:PCA,which models a linear manifold.It was initially devised with the objective of finding the closest linear manifold to a cloud of data points.The principal components,i.e.,the representa-tion f exTthat PCA yields for an input point x,uniquely locates its projection on that manifold:It corresponds to intrinsic coordinates on the manifold.Data manifolds for complex real-world domains are,however,expected to be strongly nonlinear.Their modeling is sometimes approached as patchworks of locally linear tangent spaces[211],[38]. The large majority of algorithms built on this geometric perspective adopt a nonparametric approach,based on a training set nearest neighbor graph[181],[172],[203],[38], [7],[62],[214],[91],[209].In these nonparametric ap-proaches,each high-dimensional training point has its own set of free low-dimensional embedding coordinates, which are optimized so that certain properties of the neighborhood graph computed in original high-dimen-sional input space are best preserved.These methods, however,do not directly learn a parameterized feature extraction function f exTapplicable to new test points,19 which seriously limits their use as feature extractors,except in a transductive f61efec4998fcc22bcd10d6fparatively,few nonlinear manifold learning methods have been proposed that learn a parametric map that can directly compute a representation for new points;we will focus on these.

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES1813

18.Actually,data points need not strictly lie on the“manifold,”but the

probability density is expected to fall off sharply as one moves away from it,

and it may actually be constituted of several possibly disconnected

manifolds with different intrinsic dimensionality.

19.For several of these techniques,representations for new points

can be computed using the Nystro¨m approximation as has been

proposed as an extension in[20],but this remains cumbersome and

computationally expensive.

8.1Learning a Parametric Mapping Based on a

Neighborhood Graph

Some of the above nonparametric manifold learning algorithms can be modified to learn a parametric mapping f ,i.e.,applicable to new points:Instead of having free low-dimensional embedding coordinate“parameters”for each training point,these coordinates are obtained through an explicitly parameterized function,as with the parametric variant[208]of t-SNE[209].

Instead,semi-supervised embedding[217]learns a direct encoding while taking into account the manifold hypothesis through a neighborhood graph.A parameterized neural network architecture simultaneously learns a manifold embedding and a classifier.The training criterion encourages training set neigbhors to have similar representations.

The reduced and tightly controlled number of free parameters in such parametric methods,compared to their pure nonparametric counterparts,forces models to general-ize the manifold shape nonlocally[22],which can translate into better features and final performance[209].However, basing the modeling of manifolds on training set neighbor-hood relationships might be risky statistically in high-dimensional spaces(sparsely populated due to the curse of dimensionality)as,for example,most euclidean nearest neighbors risk having too little in common semantically. The nearest neighbor graph is simply not sufficiently densely populated to map out satisfyingly the wrinkles of the target manifold[17],[22],[16].It can also become problematic computationally to consider all pairs of data points,20which scales quadratically with training set size.

8.2Learning to Represent Nonlinear Manifolds Can we learn a manifold without requiring nearest neighbor searches?Yes,for example,with regularized autoencoders or PCA.In PCA,the sensitivity of the extracted components(the code)to input changes is the same regardless of position x.The tangent space is the same everywhere along the linear manifold.By contrast,for a nonlinear manifold,the tangent of the manifold changes as we move on the manifold,as illustrated in Fig. 6.In nonlinear representation-learning algorithms,it is conveni-ent to think about the local variations in the representation as the input x is varied on the manifold,i.e.,as we move among high-probability configurations.As we discuss below,the first derivative of the encoder therefore specifies the shape of the manifold(its tangent plane)around an example x lying on it.If the density was really concentrated on the manifold and the encoder had captured that,we would find the encoder derivatives to be nonzero only in the directions spanned by the tangent plane.

Let us consider sparse coding in this light:Parameter matrix W may be interpreted as a dictionary of input directions from which a different subset will be picked to model the local tangent space at an x on the manifold.That subset corresponds to the active,i.e.,nonzero,features for input x.Nonzero component h i will be sensitive to small changes of the input in the direction of the associated weight vector W:;i,whereas inactive features are more likely to be stuck at0until a significant displacement has taken place in input space.

The local coordinate coding(LCC)algorithm[223]is very similar to sparse coding,but is explicitly derived from a manifold f61efec4998fcc22bcd10d6fing the same notation as that of sparse coding in(2),LCC replaces regularization term k hetTk1?

P

j

j hetTj j yielding objective

J LCC?

X

t

k xetTàWhetTk22t

X

j

j hetTj jk W:;jàxetTk1tp

:

e28TThis is identical to sparse coding when p?à1,but with larger p it encourages the active anchor points for xetT(i.e.,the codebook vectors W:;j with nonnegligible j hetTj j that are combined to reconstruct xetT)to be not too far from xetT; hence the local aspect of the algorithm.An important theoretical contribution of Yu et al.[223]is to show that that any Lipschitz-smooth function :M!I R defined on a smooth nonlinear manifold M embedded in I R d x can be well approximated by a globally linear function with respect to the resulting coding scheme(i.e.,linear in h),where the accuracy of the approximation and required number d h of anchor points depend on d M rather than d x.This result has been further extended with the use of local tangent directions[222],as well as to multiple layers[137].

Let us now consider the efficient noniterative“feedfor-ward”encoders f ,used by PSD and the autoencoders reviewed in Section7.2,which are in the form of(20)or(27). The computed representation for x will only be significantly sensitive to input space directions associated with non-saturated hidden units(see,e.g.,(24)for the Jacobian of a sigmoid layer).These directions to which the representation is significantly sensitive,like in the case of PCA or sparse coding,may be viewed as spanning the tangent space of the manifold at point x.

Rifai et al.[167]empirically analyze in this light the singular value spectrum of the Jacobian(derivatives of representation vector with respect to input vector)of a trained CAE.Here,the SVD provides an ordered ortho-normal basis of most sensitive directions.The spectrum is sharply decreasing,indicating a relatively small number of significantly sensitive directions.This is taken as empirical evidence that the CAE indeed modeled the tangent space of a low-dimensional manifold.The leading singular vectors form a basis for the tangent plane of the estimated manifold,as illustrated in Fig. 4.The CAE criterion is believed to achieve this thanks to its two opposing terms: The isotropic contractive penalty that encourages the representation to be equally insensitive to changes in any input directions,and the reconstruction term that pushes different training points(in particular neighbors)to have a different representation(so they may be reconstructed accurately),thus counteracting the isotropic contractive pressure only in directions tangent to the manifold.

Analyzing learned representations through the lens of the spectrum of the Jacobian and relating it to the notion of tangent space of a manifold is feasible whenever the mapping is differentiable and regardless of how it was learned,whether as direct encoding(as in autoencoder variants)or derived from latent variable inference(as in

1814IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST2013

20.Even if pairs are picked stochastically,many must be considered

before obtaining one that weighs significantly on the optimization objective.

sparse coding or RBMs).Exact low-dimensional manifold models (like PCA)would yield nonzero singular values associated to directions along the manifold,and exact zeros for directions orthogonal to the manifold.But in smooth models like the CAE or the RBM,we will instead have large versus relatively small singular values (as opposed to nonzero versus exactly zero).

8.3Leveraging the Modeled Tangent Spaces

The local tangent space,at a point along the manifold,can be thought of as capturing locally valid transformations that were prominent in the training data.For example,Rifai et al.[169]examine the tangent directions extracted with an SVD of the Jacobian of CAEs trained on digits,images,or text-document data:They appear to correspond to small translations or rotations for images or digits and to substitutions of words within the same theme for documents.Such very local transformations along a data manifold are not expected to change class identity.To build their MTC,Rifai et al.[169]then apply techniques such as tangent distance [189]and tangent propagation [188],which were initially developed to build classifiers that are insensitive to input deformations provided as prior domain knowledge.Now,these techniques are applied using the local leading tangent directions extracted by a CAE,i.e.,not using any prior domain knowledge (except the broad prior about the existence of a manifold).This approach set a new record for MNIST digit classification among prior-knowledge free approaches.21

9

C ONNECTIONS BETWEEN P ROBABILISTIC AN

D D IRECT

E NCODING M ODELS

The standard likelihood framework for probabilistic models decomposes the training criterion for models with para-meters in two parts:the log-likelihood log P ex j T(or log P ex j h; Twith latent variables h )and the prior log P e T(or log P eh j Ttlog P e Twith latent variables).

9.1PSD:A Probabilistic Interpretation

In the case of the PSD algorithm,a connection can be made between the above standard probabilistic view and the direct encoding computation graph.The probabilistic model of PSD is the same directed generative model

P ex j h Tof sparse coding (Section 6.1.1),which only accounts for the decoder.The encoder is viewed as an approximate inference mechanism to guess P eh j x Tand initialize a MAP iterative inference (where the sparse prior P eh Tis taken into account).However,in PSD,the encoder is trained jointly with the decoder ,rather than simply taking the end result of iterative inference as a target to approximate.An interesting view 22to reconcile these facts is that the encoder is a parametric approximation for the MAP solution of a variational lower bound on the joint log likelihood .When MAP learning is viewed as a special case of variational learning (where the approximation of the joint log likelihood is with a dirac distribution located at the MAP solution),the variational recipe tells us to simultaneously improve the likelihood (reduce reconstruction error)and improve the variational approximation (reduce the discrepancy between the encoder output and the latent variable value).Hence,PSD sits at the intersection of probabilistic models (with latent variables)and direct encoding methods (which directly parameterize the mapping from input to represen-tation).RBMs also sit at the intersection because their particular parameterization includes an explicit mapping from input to representation,thanks to the restricted connectivity between hidden units.However,this nice property does not extend to their natural deep general-izations,i.e.,DBMs,discussed in Section 10.2.

9.2

Regularized Autoencoders Capture Local Structure of the Density

Can we also say something about the probabilistic inter-pretation of regularized autoencoders?Their training criterion does not fit the standard likelihood framework because this would involve a data-dependent “prior.”An interesting hypothesis emerges to answer that question,out of recent theoretical results [210],[1]:The training criterion of regularized autoencoders,instead of being a form of maximum likelihood,corresponds to a different inductive principle,such as score matching .The score matching connection is discussed in Section 7.2.2and has been shown for a particular parametrization of DAE and equivalent GRBM [210].The work in [1]generalizes this idea to a broader class of parameterizations (arbitrary encoders and decoders)and shows that by regularizing the autoencoder so that it be contractive,one obtains that the reconstruction function and its derivative estimate first and second derivatives of the underlying data-generative density.This view can be exploited to successfully sample from autoencoders,as shown in [170],[26].The proposed sampling algorithms are MCMCs similar to Langevin MCMC,using not just the estimated first derivative of the density,but also the estimated manifold tangents so as to stay close to manifolds of high density.

This interpretation connects well with the geometric perspective introduced in Section 8.The regularization effects (e.g.,due to a sparsity regularizer,a contractive regularizer,or the denoising criterion)ask the learned representation to be as insensitive as possible to the input,while minimizing reconstruction error on the training examples forces the representation to contain just enough

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES 1815

21.It yielded 0.81percent error rate using the full MNIST training set,with no prior deformations,and no

convolution.

22.

Suggested

by Ian

Goodfellow,personal

communication.

Fig.4.The tangent vectors to the high-density manifold as estimated by a contractive autoencoder [167].The original input is shown on the top left.Each tangent vector (images on right side of first row)corresponds to a plausible additive deformation of the original input,as illustrated on the second row,where a bit of the third singular vector is added to the original,to form a translated and deformed image.Unlike in PCA,the tangent vectors are different for different inputs because the estimated manifold is highly nonlinear.

information to distinguish them.The solution is that variations along the high-density manifolds are preserved,while other variations are compressed:The reconstruction function should be as constant as possible while reprodu-cing training examples,i.e.,points near a training example should be mapped to that training example (Fig.5).The reconstruction function should map an input toward the nearest point manifold,i.e.,the difference between reconstruction and input is a vector aligned with the estimated score (the derivative of the log density with respect to the input).The score can be zero on the manifold (where reconstruction error is also zero)at local maxima of the log density,but it can also be zero at local minima.It means that we cannot equate low reconstruction error with high estimated probability.The second deriva-tives of the log density correspond to the first derivatives of the reconstruction function,and on the manifold (where the first derivative is 0)they indicate the tangent directions of the manifold (where the first derivative remains near 0).As illustrated in Fig.6,the basic idea of the auto-encoder sampling algorithms in [170],[26]is to make MCMC moves where one 1)moves toward the manifold by following the density gradient (i.e.,applying a reconstruction)and 2)adds noise in the directions of the

leading singular vectors of the reconstruction (or encoder)Jacobian,corresponding to those associated with smallest second derivative of the log density.

9.3Learning Approximate Inference

Let us now consider from closer how a representation is computed in probabilistic models with latent variables,when iterative inference is required.There is a computation graph (possibly with random number generation in some of the nodes,in the case of MCMC)that maps inputs to representation and,in the case of deterministic inference (e.g.,MAP inference or variational inference),that function could be optimized directly.This is a way to generalize PSD that has been explored in recent work on probabilistic models at the intersection of inference and learning [4],[75],[79],[177],[195],[63],where a central idea is that instead of using a generic inference mechanism,one can use one that is learned and is more efficient,taking advantage of the specifics of the type of data on which it is applied.9.4Sampling Challenges

A troubling challenge with many probabilistic models with latent variables like most Boltzmann machine variants is that good MCMC sampling is required as part of the learning procedure,but that sampling becomes extremely inefficient (or unreliable)as training progresses because the modes of the learned distribution become sharper,making mixing between modes very slow .Whereas initially during training a learner assigns mass almost uniformly,as training progresse,its entropy decreases,approaching the entropy of the target distribution as more examples and more computation are provided.According to our manifold and natural clustering priors of Section 3.1,the target distribution has sharp modes (manifolds)separated by extremely low density areas.Mixing then becomes more difficult because MCMC methods,by their very nature,tend to make small steps to nearby high-probability configurations.This is illustrated in Fig.7.

Bengio et al.[27]suggest that deep representations could help mixing between such well-separated modes,based on both theoretical arguments and on empirical evidence.The idea is that if higher level representations disentangle the underlying abstract factors better,then small steps in this abstract space (e.g.,swapping from one category to another)can easily be done by MCMC.The high-level representations can then be mapped back to

1816IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST

2013

Fig.5.Reconstruction function r ex T(green)learned by a high-capacity autoencoder on 1D input,minimizing reconstruction error at training examples x et T(r ex et TTin red)while trying to be as constant as possible otherwise.The dotted line is the identity reconstruction (which might be obtained without the regularizer).The blue arrows show the vector field of r ex Tàx pointing toward high-density peaks estimated by the model and estimating the score (log-density

derivative).

Fig.6.Sampling from regularized autoencoders [170],[26]:Each MCMC step adds to current state x the noise ,mostly in the direction of the estimated manifold tangent plane H and projects back toward the manifold (high-density regions)by performing a reconstruction

step.

Fig.7.Top:Early,during training,MCMC mixes easily between modes because the estimated distribution has high entropy and puts enough mass everywhere for small-steps movements (MCMC)to go from mode to mode.Bottom:Later on,training relying on good mixing can stall because estimated modes are separated by wide low-density deserts.

the input space to obtain input-level samples,as in the DBNs sampling algorithm[94].This has been demon-strated both with DBNs and with the newly proposed algorithm for sampling from contracting and DAEs[170], [26].This observation alone does not suffice to solve the problem of training a DBN or a DBM,but it may provide a crucial ingredient,and it makes it possible to consider successfully sampling from deep models trained by procedures that do not require an MCMC,like the stacked regularized autoencoders used in[170].

9.5Evaluating and Monitoring Performance

It is always possible to evaluate a feature learning algorithm in terms of its usefulness with respect to a particular task(e.g.,object classification),with a predictor that is fed or initialized with the learned features.In practice,we do this by saving the features learned(e.g.,at regular intervals during training,to perform early stop-ping)and training a cheap classifier on top(such as a linear classifier).However,training the final classifier can be a substantial computational overhead(e.g.,supervised fine tuning a deep neural network usually takes more training iterations than the feature learning itself),so we may want to avoid having to train a classifier for every training iteration of the unsupervised learner and every hyperpara-meter setting.More importantly,this may give an incom-plete evaluation of the features(what would happen for other tasks?).All these issues motivate the use of methods to monitor and evaluate purely unsupervised performance. This is rather easy with all the autoencoder variants(with some caution outlined below)and rather difficult with the undirected graphical models such as the RBM and Boltzmann machines.

For autoencoder and sparse coding variants,test set reconstruction error can be readily computed,but by itself may be misleading because larger capacity(e.g.,more features,more training time)tends to systematically lead to lower reconstruction error,even on the test set.Hence,it cannot be used reliably for selecting most hyperparameters. On the other hand,denoising reconstruction error is clearly immune to this problem,so that solves the problem for DAEs.Based on the connection between DAEs and CAEs uncovered in[26],[1],this immunity can be extended to DAEs,but not to the hyperparameter controlling the amount of noise or of contraction.

For RBMs and some(not too deep)Boltzmann ma-chines,one option is the use of annealed importance sampling[150]to estimate the partition function(and thus the test log likelihood).Note that this estimator can have high variance and that it becomes less reliable(variance becomes too large)as the model becomes more interesting, with larger weights,more nonlinearity,sharper modes,and a sharper probability density function(see our previous discussion in Section9.4).Another interesting and recently proposed option for RBMs is to track the partition function during training[59],which could be useful for early stopping and reducing the cost of ordinary AIS.For toy RBMs(e.g.,25hidden units or less,or25inputs or less), the exact log likelihood can also be computed analytically, and this can be a good way to debug and verify some properties of interest.10G LOBAL T RAINING OF D EEP M ODELS

One of the most interesting challenges raised by deep architectures is:How should we jointly train all the levels?In the previous section and in Section4,we have only discussed how single-layer models could be combined to form a deep model.Here,we consider joint training of all the levels and the difficulties that may arise.

10.1The Challenge of Training Deep Architectures Higher level abstraction means more nonlinearity.It means that two nearby input configurations may be interpreted very differently because a few surface details change the underlying semantics,whereas most other changes in the surface details would not change the underlying semantics. The representations associated with input manifolds may be complex because the mapping from input to representa-tion may have to unfold and distort input manifolds that generally have complicated shapes into spaces where distributions are much simpler,where relations between factors are simpler,maybe even linear or involving many (conditional)independencies.Our expectation is that modeling the joint distribution between high-level abstrac-tions and concepts should be much easier in the sense of requiring much less data to learn.The hard part is learning a good representation that does this unfolding and disentangling.This may be at the price of a more difficult training problem,possibly involving ill conditioning and local minima.

It is only since2006that researchers have seriously investigated ways to train deep architectures,to the excep-tion of the convolutional networks[133].The first realization (Section4)was that unsupervised or supervised layerwise training was easier and that this could be taken advantage of by stacking single-layer models into deeper ones.

It is interesting to ask why does the layerwise unsupervised pretraining procedure sometimes help a supervised learner[65]. There seems to be a more general principle at play23of guiding the training of intermediate representations,which may be easier than trying to learn it all in one go.This is nicely related to the curriculum learning idea[24]that it may be much easier to learn simpler concepts first and then build higher level ones on top of simpler ones.This is also coherent with the success of several deep learning algo-rithms that provide some such guidance for intermediate representations,like semi-supervised embedding[217].

The question of why unsupervised pretraining could be helpful was extensively studied[65],trying to dissect the answer into a regularization effect and an optimization effect. The regularization effect is clear from the experiments, where the stacked RBMs or DAEs are used to initialize a supervised classification neural network[65].It may simply come from the use of unsupervised learning to bias the learning dynamics and initialize it in the basin of attraction of a“good”local minimum(of the training criterion),where “good”is in terms of generalization error.The underlying hypothesis exploited by this procedure is that some of the features or latent factors that are good at capturing the leading variations in the input distribution are also good at

BENGIO ET AL.:REPRESENTATION LEARNING:A REVIEW AND NEW PERSPECTIVES1817

23.First suggested to us by Leon Bottou.

capturing the variations in the target output random variables of interest(e.g.,classes).The optimization effect is more difficult to tease out because the top two layers of a deep neural net can just overfit the training set whether the lower layers compute useful features or not,but there are several indications that optimizing the lower levels with respect to a supervised training criterion can be challenging.

One such indication is that changing the numerical conditions of the optimization procedure can have a profound impact on the joint training of a deep architec-ture,for example,by changing the initialization range and changing the type of nonlinearity used[68],much more so than with shallow architectures.One hypothesis to explain some of the difficulty in the optimization of deep architectures is centered on the singular values of the Jacobian matrix associated with the transformation from the features at one level into the features at the next level[68].If these singular values are all small(less than1),then the mapping is contractive in every direction and gradients would vanish when propagated backward through many layers.This is a problem already discussed for recurrent neural networks[18],which can be seen as very deep networks with shared parameters at each layer,when unfolded in time.This optimization difficulty has moti-vated the exploration of second-order methods for deep architectures and recurrent networks,in particular Hessian-free second-order methods[142],[143].Unsupervised pretraining has also been proposed to help training recurrent networks and temporal RBMs[198],i.e.,at each time step,there is a local signal to guide the discovery of good features to capture in the state variables:model with the current state(as hidden units)the joint distribution of the previous state and the current input.Natural gradient [2]methods that can be applied to networks with millions of parameters(i.e.,with good scaling properties)have also been proposed[127],[157].Cho et al.[45]propose using adaptive learning rates for RBM training,along with a novel and interesting idea for a gradient estimator that takes into account the invariance of the model to flipping hidden unit bits and inverting signs of corresponding weight vectors.At least one study indicates that the choice of initialization(to make the Jacobian of each layer closer to 1across all its singular values)could substantially reduce the training difficulty of deep networks[68]and this is coherent with the success of the initialization procedure of echo state networks[104],as recently studied by Sutskever [196].There are also several experimental results[68],[69], [151]showing that the choice of hidden units nonlinearity could influence both training and generalization perfor-mance,with particularly interesting results obtained with sparse rectifying units[106],[151],[69],[117].An old idea regarding the ill-conditioning issue with neural networks is that of symmetry breaking:Part of the slowness of conver-gence may be due to many units moving together(like sheep)and all trying to reduce the output error for the same examples.By initializing with sparse weights[142]or by using often saturated nonlinearities(such as rectifiers as max-pooling units),gradients only flow along a few paths, which may help hidden units to specialize more quickly. Another promising idea to improve the conditioning of neural network training is to nullify the average value and slope of each hidden unit output[158],and possibly locally normalize magnitude as well[106].The debate still rages between using online methods such as stochastic gradient descent and using second-order methods on large mini-batches(of several thousand examples)[142],[123],with a variant of stochastic gradient descent recently winning an optimization challenge.24

Finally,several recent results exploiting large quantities of labeled data suggest that with proper initialization and choice of nonlinearity,very deep purely supervised net-works can be trained successfully without any layerwise pretraining[47],[69],[183],[117].Researchers report that in such conditions,layerwise unsupervised pretraining brought little or no improvement over pure supervised learning from scratch when training for long enough.This reinforces the hypothesis that unsupervised pretraining acts as a prior,which may be less necessary when very large quantities of labeled data are available,but which begs the question of why this had not been discovered earlier. The latest results reported in this respect[117]are particularly interesting because they allowed drastically reducing the error rate of object recognition on a benchmark (the1,000-class ImageNet task),where many more tradi-tional computer vision approaches had been evaluated (f61efec4998fcc22bcd10d6f/challenges/LSVRC/2012/ results).The main techniques that allowed this success include the following:efficient GPU training allowing one to train longer(more than100million visits of examples),an aspect first reported by Lee et al.[135]and Ciresan et al.

[47],a large number of labeled examples,artificially transformed examples(see Section11.1),a large number of tasks(1,000or 10,000classes for ImageNet),convolutional architecture with max-pooling(see Section11for these latter two techniques), rectifying nonlinearities(discussed above),careful initialization (discussed above),careful parameter update and adaptive learning rate heuristics,layerwise feature normalization(across features),and a new dropout trick based on injecting strong binary multiplicative noise on hidden units.This trick is similar to the binary noise injection used at each layer of a stack of DAEs.Future work is hopefully going to help identify which of these elements matter most,how to generalize them across a large variety of tasks and architectures,and,in particular,contexts where most examples are unlabeled,i.e.,including an unsupervised component in the training criterion.

10.2Joint Training of DBMs

We now consider the problem of joint training of all layers of a specific unsupervised model,the DBM.Whereas much progress(albeit with many unanswered questions)has been made on jointly training all the layers of deep architectures using back-propagated gradients(i.e.,mostly in the supervised setting),much less work has been done on their purely unsupervised counterpart,for example,with DBMs.25Note,however,that one could hope that the

1818IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.35,NO.8,AUGUST2013

24.https://f61efec4998fcc22bcd10d6f/site/nips2011workshop/optimization-

challenges.

25.Joint training of all the layers of a deep belief net is much more

challenging because of the much harder inference problem involved.

本文来源:https://www.bwwdw.com/article/xv0e.html

Top