Ronan Collobert
I hold a research scientist position in machine learning at the IDIAP Research Institute, in Switzerland.
Ronan Collobert
IDIAP
Rue Marconi 19
CP 592
1920 Martigny
Switzerland
Email: ronan [at] collobert [dot] com
Phone: +41 27 721 77 06
I was previously a Research Staff Member at
NEC Laboratories of America.
Research Projects
Deep Learning for Spoken Term Detection.
Semantic Analysis of Text.
Face Technologies.
This project aims at enhancing various face technologies (head pose estimation, gender recognition, etc.)
in the framework of a Swiss CTI project with the startup KeyLemon.
Collaborators: Yann Rodriguez,
KeyLemon.
Funding:
Swiss CTI.
Software
Torch7 is a machine learning library which aims at including state-of-the-art algorithms.
Torch7 is the latest version of Torch. It provides a Matlab-like
environment for state-of-the-art machine learning algorithms. It is
easy to use and provides a very efficient implementation, thanks to
an easy and fast scripting language (Lua) and an underlying C
implementation. It is distributed under a BSD license.
Torch5 was the previous official version. Torch7 is built on
Torch5, bringing more flexibility in Tensor types, as well as many
optimizations (including SSE, OpenMP and CUDA).
Torch3 was written completely in C++. While it has been used in many
projects, I always found it too complicated myself. It also lacked
documentation.
Other versions, such as Torch4, have also been written.
Torch4 was developed in Objective C. While simpler than Torch3, it did
not spread because sheep prefer complicated languages like C++.
SVMTorch, a Support Vector Machine library.
Written while I was a PhD student, it was efficient at the time. I
would now recommend using LIBSVM, as SVMTorch has not been updated in
a long while, or the new Torch5 software, which includes efficient SVMs.
SENNA, a Natural Language Processing (NLP) tagger. Now version 3.0 (August 2011).
SENNA is software distributed under a non-commercial license, which
outputs a host of Natural Language Processing (NLP) predictions:
part-of-speech (POS) tags, chunking (CHK), named entity recognition
(NER) and semantic role labeling (SRL).
SENNA is fast because it uses a simple architecture, self-contained
because it does not rely on the output of existing NLP systems, and
accurate because it offers state-of-the-art or near state-of-the-art
performance.
SENNA is written in ANSI C, with about 2500 lines of code. It requires
about 150MB of RAM and should run on any IEEE floating point computer.
Publications
See also my Google Scholar page.
2012
R. Collobert, K. Kavukcuoglu and C. Farabet.
Implementing Neural Networks Efficiently.
In Neural Networks: Tricks of the Trade, G. Montavon, G. Orr and K-R. Muller (Ed), Springer, 2012.
Neural networks and machine learning algorithms in general require a
flexible environment where new algorithm prototypes and experiments
can be set up as quickly as possible with best possible computational
performance. To that end, we provide a new framework called Torch7,
that is especially suited to achieve both of these competing
goals. Torch7 is a versatile numeric computing framework and machine
learning library that extends a very lightweight and powerful
programming language Lua. Its goal is to provide a flexible
environment to design, train and deploy learning machines. Flexibility
is obtained via Lua, an extremely lightweight scripting language. High
performance is obtained via efficient OpenMP/SSE and CUDA
implementations of low-level numeric routines. Torch7 can also easily
be interfaced to third-party software thanks to Lua’s light C
interface.
@incollection{collobert:2012,
title = {Implementing Neural Networks Efficiently},
author = {R. Collobert and K. Kavukcuoglu and C. Farabet},
booktitle = {Neural Networks: Tricks of the Trade},
publisher = {Springer},
editor = {G. Montavon and G. Orr and K-R. Muller},
year = {2012}
}
J. Weston, F. Ratle, H. Mobahi and R. Collobert.
Deep Learning via Semi-Supervised Embedding.
In Neural Networks: Tricks of the Trade, G. Montavon, G. Orr and K-R. Muller (Ed), Springer, 2012.
We show how nonlinear embedding algorithms popular for use with
"shallow" semi-supervised learning techniques such as kernel methods
can be easily applied to deep multi-layer architectures, either as a
regularizer at the output layer, or on each layer of the
architecture. This trick provides a simple alternative to existing
approaches to deep learning whilst yielding competitive error rates
compared to those methods, and existing shallow semi-supervised
techniques.
@incollection{weston:2012,
title = {Deep Learning via Semi-Supervised Embedding},
author = {J. Weston and F. Ratle and H. Mobahi and R. Collobert},
booktitle = {Neural Networks: Tricks of the Trade},
publisher = {Springer},
editor = {G. Montavon and G. Orr and K-R. Muller},
year = {2012}
}
2011
R. Collobert, K. Kavukcuoglu and C. Farabet.
Torch7: A Matlab-like Environment for Machine Learning.
In BigLearn, NIPS Workshop, 2011.
Torch7 is a versatile numeric computing framework and machine
learning library that extends Lua. Its goal is to provide a
flexible environment to design and train learning
machines. Flexibility is obtained via Lua, an extremely
lightweight scripting language. High performance is obtained via
efficient OpenMP/SSE and CUDA implementations of low-level
numeric routines. Torch7 can easily be interfaced to
third-party software thanks to Lua’s light interface.
@inproceedings{collobert:2011c,
title = {Torch7: A Matlab-like Environment for Machine Learning},
author = {R. Collobert and K. Kavukcuoglu and C. Farabet},
booktitle = {BigLearn, NIPS Workshop},
year = {2011}
}
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa.
Natural Language Processing (Almost) from Scratch.
Journal of Machine Learning Research, 12:2493-2537, 2011.
We propose a unified neural network architecture and learning
algorithm that can be applied to various natural language
processing tasks including part-of-speech tagging, chunking,
named entity recognition, and semantic role labeling. This
versatility is achieved by trying to avoid task-specific
engineering and therefore disregarding a lot of prior
knowledge. Instead of exploiting man-made input features
carefully optimized for each task, our system learns internal
representations on the basis of vast amounts of mostly
unlabeled training data. This work is then used as a basis for
building a freely available tagging system with good
performance and minimal computational requirements.
@article{collobert:2011b,
title = {Natural Language Processing (Almost) from Scratch},
author = {R. Collobert and J. Weston and L. Bottou and M. Karlen and K. Kavukcuoglu and P. Kuksa},
journal = {Journal of Machine Learning Research},
volume = {12},
pages = {2493--2537},
year = {2011}
}
R. Collobert.
Deep Learning for Efficient Discriminative Parsing.
In AISTATS, 2011.
We propose a new fast purely discriminative algorithm for
natural language parsing, based on a “deep” recurrent
convolutional graph transformer network (GTN). Assuming a
decomposition of a parse tree into a stack of “levels”, the
network predicts a level of the tree taking into account
predictions of previous levels. Using only few basic text
features which leverage word representations from Collobert
and Weston (2008), we show similar performance (in F1 score)
to existing pure discriminative parsers and existing
“benchmark” parsers (like Collins parser, probabilistic
context-free grammars based), with a huge speed advantage.
I apologize for the incorrect F1 score I first reported for
Carreras et al.'s parser (90.5% instead of 91.1%). I confused the
performance on WSJ sections 23 and 24. Thanks to Michael Collins, who
reported the bug.
@inproceedings{collobert:2011,
title = {Deep Learning for Efficient Discriminative Parsing},
author = {R. Collobert},
booktitle = {AISTATS},
year = {2011}
}
A. Bordes, J. Weston, R. Collobert and Y. Bengio.
Learning Structured Embeddings of Knowledge Bases.
In AAAI, 2011.
Many Knowledge Bases (KBs) are now readily available and
encompass colossal quantities of information thanks to either
a long-term funding effort (e.g. WordNet, OpenCyc) or a
collaborative process (e.g. Freebase, DBpedia). However, each
of them is based on a different rigorous symbolic framework
which makes it hard to use their data in other systems. It is
unfortunate because such rich structured knowledge might lead
to a huge leap forward in many other areas of AI like natural
language processing (word-sense disambiguation, natural
language understanding, ...), vision (scene
classification, image semantic annotation, ...) or
collaborative filtering. In this paper, we present a learning
process based on an innovative neural network architecture
designed to embed any of these symbolic representations into
a more flexible continuous vector space in which the original
knowledge is kept and enhanced. These learnt embeddings would
allow data from any KB to be easily used in recent machine
learning methods for prediction and information retrieval. We
illustrate our method on WordNet and Freebase and also
present a way to adapt it to knowledge extraction from raw
text.
@inproceedings{bordes:2011,
title = {Learning Structured Embeddings of Knowledge Bases},
author = {A. Bordes and J. Weston and R. Collobert and Y. Bengio},
booktitle = {AAAI},
year = {2011}
}
2010
P. Kuksa, Y. Qi, B. Bai, R. Collobert, J. Weston, V. Pavlovic and X. Ning.
Semi-Supervised Abstraction-Augmented String Kernel for Multi-Level Bio-Relation Extraction.
In ECML PKDD, 2010.
Bio-relation extraction (bRE), an important goal in bio-text
mining, involves subtasks identifying relationships between
bio-entities in text at multiple levels, e.g., at the article,
sentence or relation level. A key limitation of current bRE
systems is that they are restricted by the availability of
annotated corpora. In this work we introduce a semi-supervised
approach that can tackle multi-level bRE via string comparisons
with mismatches in the string kernel framework. Our string kernel
implements an abstraction step, which groups similar words to
generate more abstract entities, which can be learnt with
unlabeled data. Specifically, two unsupervised models are proposed
to capture contextual (local or global) semantic similarities
between words from a large unannotated corpus. This
Abstraction-augmented String Kernel (ASK) allows for better
generalization of patterns learned from annotated data and
provides a unified framework for solving bRE with multiple degrees
of detail. ASK shows effective improvements over classic string
kernels on four datasets and achieves state-of-the-art bRE
performance without the need for complex linguistic features.
@inproceedings{qi:2010,
title = {Semi-Supervised Abstraction-Augmented String Kernel for Multi-Level Bio-Relation Extraction},
author = {P. Kuksa and Y. Qi and B. Bai and R. Collobert and J. Weston and V. Pavlovic and X. Ning},
booktitle = {ECML PKDD},
year = {2010}
}
A. Bordes, N. Usunier, R. Collobert and J. Weston.
Towards Understanding Situated Natural Language.
In AISTATS, 2010.
We present a general framework and learning algorithm for the task
of concept labeling: each word in a given sentence has to be
tagged with the unique physical entity (e.g. person, object or
location) or abstract concept it refers to. Our method allows
both world knowledge and linguistic information to be used during
learning and prediction. We show experimentally that we can learn
to use world knowledge to resolve ambiguities in language, such as
word senses or reference resolution, without the use of
handcrafted rules or features.
@inproceedings{bordes:2010,
title = {Towards Understanding Situated Natural Language},
author = {A. Bordes and N. Usunier and R. Collobert and J. Weston},
booktitle = {AISTATS},
year = {2010}
}
B. Bai, J. Weston, D. Grangier, R. Collobert, C. Cortes and M. Mohri.
Half Transductive Ranking.
In Artificial Intelligence and Statistics (AISTATS), 2010.
We study the standard retrieval task of ranking a fixed set of
items given a previously unseen query and pose it as the
half-transductive ranking problem. Transductive representations
(where the vector representation of each example is learned) allow
the generation of highly nonlinear embeddings that capture the
characteristics of object relationships without relying on a
specific choice of features, and require only relatively simple
optimization. Unfortunately, they have no direct out-of-sample
extension. Inductive approaches on the other hand allow for the
representation of unknown queries. We describe algorithms for this
setting which have the advantages of both transductive and
inductive approaches, and can be applied in unsupervised (either
reconstruction-based or graph-based) and supervised ranking
setups. We show empirically that our methods give strong
performance on all three tasks.
@inproceedings{bai:2010,
title = {Half Transductive Ranking},
author = {B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2010}
}
2009
A. Bordes, N. Usunier, J. Weston and R. Collobert.
Learning to Disambiguate Natural Language Using World Knowledge.
In NIPS workshop on Grammar Induction, Representation of Language and Language Learning, 2009.
We present a general framework and learning algorithm for the task
of concept labeling: each word in a given sentence has to be
tagged with the unique physical entity (e.g. person, object or
location) or abstract concept it refers to. Our method allows both
world knowledge and linguistic information to be used during
learning and prediction. We show experimentally that we can handle
natural language and learn to use world knowledge to resolve
ambiguities in language, such as word senses or coreference,
without the use of hand-crafted rules or features.
@inproceedings{bordes:2009,
title = {Learning to Disambiguate Natural Language Using World Knowledge},
author = {A. Bordes and N. Usunier and J. Weston and R. Collobert},
booktitle = {NIPS workshop on Grammar Induction, Representation of Language and Language Learning},
year = {2009}
}
B. Bai, J. Weston, D. Grangier, R. Collobert, C. Cortes and M. Mohri.
Ranking with Half Transductive Models.
In NIPS Workshop on Advances in Ranking, 2009.
We study the standard retrieval task of ranking a fixed set of
documents given a previously unseen query and pose it as the
half-transductive ranking problem. The task is partly transductive
as the document set is fixed. Existing transductive approaches are
natural non-linear methods for this set, but have no direct
out-of-sample extension. Functional approaches, on the other hand,
can be applied to the unseen queries, but fail to exploit the
availability of the document set in its full extent. This work
introduces a half-transductive approach to benefit from the
advantages of both transductive and functional approaches and show
its empirical advantage in supervised ranking setups.
@inproceedings{bai:2009d,
title = {Ranking with Half Transductive Models},
author = {B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri},
booktitle = {NIPS Workshop on Advances in Ranking},
year = {2009}
}
B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, C. Cortes and M. Mohri.
Polynomial Semantic Indexing.
In Advances in Neural Information Processing Systems (NIPS), 2009.
We present a class of nonlinear (polynomial) models that are
discriminatively trained to directly map from the word content in
a query-document or document-document pair to a ranking
score. Dealing with polynomial models on word features is
computationally challenging. We propose a low-rank (but diagonal
preserving) representation of our polynomial models to induce
feasible memory and computation requirements. We provide an
empirical study on retrieval tasks based on Wikipedia documents,
where we obtain state-of-the-art performance while providing
realistically scalable methods.
@inproceedings{bai:2009c,
title = {Polynomial Semantic Indexing},
author = {B. Bai and J. Weston and D. Grangier and R. Collobert and K. Sadamasa and Y. Qi and C. Cortes and M. Mohri},
booktitle = {Advances in Neural Information Processing Systems {(NIPS)}},
year = {2009}
}
R. Collobert and J. Weston.
Deep Learning in Natural Language Processing.
Tutorial at NIPS, 2009.
This tutorial will describe recent advances in deep learning
techniques for Natural Language Processing (NLP). Traditional NLP
approaches favour shallow systems, possibly cascaded, with
adequate hand-crafted features. In contrast, we are interested in
end-to-end architectures: these systems include several feature
layers, with increasing abstraction at each layer. Compared to
shallow systems, these feature layers are learnt for the task of
interest, and do not require any engineering. We will show how
neural networks are naturally well suited for end-to-end learning
in NLP tasks. We will study multi-tasking different tasks, new
semi-supervised learning techniques adapted to these deep
architectures, and review end-to-end structured output
learning. Finally, we will highlight how some of these advances
can be applied to other fields of research, like computer vision,
as well.
@misc{collobert:2009,
title = {Deep Learning in Natural Language Processing},
author = {R. Collobert and J. Weston},
howpublished = {Tutorial at NIPS},
year = {2009}
}
Y. Qi, P. Kuksa, R. Collobert, K. Sadamasa, K. Kavukcuoglu and J. Weston.
Semi-Supervised Sequence Labeling with Self-Learned Features.
In IEEE International Conference on Data Mining (ICDM), 2009.
Typical information extraction (IE) systems can be seen as tasks
assigning labels to words in a natural language sequence. The
performance is restricted by the availability of labeled words. To
tackle this issue, we propose a semi-supervised approach to
improve the sequence labeling procedure in IE through a class of
algorithms with self-learned features (SLF). A supervised
classifier can be trained with annotated text sequences and used
to classify each word in a large set of unannotated sentences. By
averaging predicted labels over all cases in the unlabeled corpus,
SLF training builds class label distribution patterns for each
word (or word attribute) in the dictionary and re-trains the
current model iteratively adding these distributions as extra word
features. Basic SLF models how likely a word could be assigned to
target class types. Several extensions are proposed, such as
learning words’ class boundary distributions. SLF exhibits robust
and scalable behaviour and is easy to tune. We applied this
approach on four classical IE tasks: named entity recognition
(German and English), part-of-speech tagging (English) and one
gene name recognition corpus. Experimental results show effective
improvements over the supervised baselines on all tasks. In
addition, when compared with the closely related self-training
idea, this approach shows favorable advantages.
@inproceedings{qi:2009a,
title = {Semi-Supervised Sequence Labeling with Self-Learned Features},
author = {Y. Qi and P. Kuksa and R. Collobert and K. Sadamasa and K. Kavukcuoglu and J. Weston},
booktitle = {IEEE International Conference on Data Mining ({ICDM})},
year = {2009}
}
B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle and K. Weinberger.
Learning to Rank with (a Lot of) Word Features.
Journal of Information Retrieval, volume Special Issue on Learning to Rank for Information Retrieval, 2009.
In this article we present Supervised Semantic Indexing (SSI)
which defines a class of nonlinear (quadratic) models that are
discriminatively trained to directly map from the word content in
a query-document or document-document pair to a ranking
score. Like Latent Semantic Indexing (LSI), our models take
account of correlations between words (synonymy,
polysemy). However, unlike LSI our models are trained from a
supervised signal directly on the ranking task of interest, which
we argue is the reason for our superior results. As the query and
target texts are modeled separately, our approach is easily
generalized to different retrieval tasks, such as cross-language
retrieval or online advertising placement. Dealing with models on
all pairs of words features is computationally challenging. We
propose several improvements to our basic model for addressing
this issue, including low rank (but diagonal preserving)
representations, correlated feature hashing (CFH) and
sparsification. We provide an empirical study of all these methods
on retrieval tasks based on Wikipedia documents as well as an
Internet advertisement task. We obtain state-of-the-art
performance while providing realistically scalable methods.
@article{bai:2009b,
title = {Learning to Rank with (a Lot of) Word Features},
author = {B. Bai and J. Weston and D. Grangier and R. Collobert and K. Sadamasa and Y. Qi and O. Chapelle and K. Weinberger},
journal = {Journal of Information Retrieval},
volume = {Special Issue on Learning to Rank for Information Retrieval},
year = {2009}
}
T. Barnickel, J. Weston, R. Collobert, H-W. Mewes and V. Stümpflen.
Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts.
PLoS one, 4(7), July 2009.
To reduce the increasing amount of time spent on literature search
in the life sciences, several methods for automated knowledge
extraction have been developed. Co-occurrence based approaches can
deal with large text corpora like MEDLINE in an acceptable time
but are not able to extract any specific type of semantic
relation. Semantic relation extraction methods based on syntax
trees, on the other hand, are computationally expensive and the
interpretation of the generated trees is difficult. Several
natural language processing (NLP) approaches for the biomedical
domain exist focusing specifically on the detection of a limited
set of relation types. For systems biology, generic approaches for
the detection of a multitude of relation types which in addition
are able to process large text corpora are needed but the number
of systems meeting both requirements is very limited. We introduce
the use of SENNA (‘Semantic Extraction using a Neural Network
Architecture’), a fast and accurate neural network based Semantic
Role Labeling (SRL) program, for the large scale extraction of
semantic relations from the biomedical literature. A comparison of
processing times of SENNA and other SRL systems or syntactical
parsers used in the biomedical domain revealed that SENNA is the
fastest Proposition Bank (PropBank) conforming SRL program
currently available. 89 million biomedical sentences were tagged
with SENNA on a 100 node cluster within three days. The accuracy
of the presented relation extraction approach was evaluated on two
test sets of annotated sentences resulting in precision/recall
values of 0.71/0.43. We show that the accuracy as well as
processing speed of the proposed semantic relation extraction
approach is sufficient for its large scale application on
biomedical text. The proposed approach is highly generalizable
regarding the supported relation types and appears to be
especially suited for general-purpose, broad-scale text mining
systems. The presented approach bridges the gap between fast,
cooccurrence-based approaches lacking semantic relations and
highly specialized and computationally demanding NLP approaches.
@article{barnickel:2009,
title = {Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts},
author = {T. Barnickel and J. Weston and R. Collobert and H-W. Mewes and V. St\"umpflen},
journal = {PLoS one},
volume = {4},
number = {7},
month = {July},
year = {2009}
}
Y. Qi, R. Collobert, P. Kuksa, K. Kavukcuoglu and J. Weston.
Combining Labeled and Unlabeled Data with Word-Class Distribution Learning.
In The 18th ACM Conference on Information and Knowledge Management (CIKM), 2009.
We describe a novel simple and highly scalable semi-supervised
method called Word-Class Distribution Learning (WCDL), and apply
it to the task of information extraction (IE) by utilizing unlabeled
sentences to improve supervised classification methods. WCDL
iteratively builds class label distributions for each word in the
dictionary by averaging predicted labels over all cases in the
unlabeled corpus, and re-training a base classifier adding these
distributions as word features. In contrast, traditional
self-training or co-training methods add self-labeled examples
(rather than features) which can degrade performance due to
incestuous learning bias. WCDL exhibits robust behavior, and has
no difficult parameters to tune. We applied our method on German
and English named entity recognition (NER) tasks. WCDL shows
improvements over self-training, multi-task semi-supervision or
supervision alone, in particular yielding a state-of-the-art 75.72
F1 score on the German NER task.
@inproceedings{qi:2009,
title = {Combining Labeled and Unlabeled Data with Word-Class Distribution Learning},
author = {Y. Qi and R. Collobert and P. Kuksa and K. Kavukcuoglu and J. Weston},
booktitle = {The 18th ACM Conference on Information and Knowledge Management ({CIKM})},
year = {2009}
}
B. Bai, J. Weston, D. Grangier, R. Collobert, O. Chapelle and K. Weinberger.
Supervised Semantic Indexing.
In The 18th ACM Conference on Information and Knowledge Management (CIKM), 2009.
In this article we propose Supervised Semantic Indexing (SSI), an
algorithm that is trained on (query, document) pairs of text
documents to predict the quality of their match. Like Latent
Semantic Indexing (LSI), our models take account of correlations
between words (synonymy, polysemy). However, unlike LSI our models
are trained with a supervised signal directly on the ranking task
of interest, which we argue is the reason for our superior
results. As the query and target texts are modeled separately, our
approach is easily generalized to different retrieval tasks, such
as online advertising placement. Dealing with models on all pairs
of words features is computationally challenging. We propose
several improvements to our basic model for addressing this issue,
including low rank (but diagonal preserving) representations, and
correlated feature hashing (CFH). We provide an empirical study of
all these methods on retrieval tasks based on Wikipedia documents
as well as an Internet advertisement task. We obtain
state-of-the-art performance while providing realistically
scalable methods.
@inproceedings{bai:2009a,
title = {Supervised Semantic Indexing},
author = {B. Bai and J. Weston and D. Grangier and R. Collobert and O. Chapelle and K. Weinberger},
booktitle = {The 18th ACM Conference on Information and Knowledge Management ({CIKM})},
year = {2009}
}
H. Mobahi, R. Collobert and J. Weston.
Deep Learning from Temporal Coherence in Video.
In International Conference on Machine Learning, ICML, 2009.
This work proposes a learning method for deep architectures that
takes advantage of sequential data, in particular from the
temporal coherence that naturally exists in unlabeled video
recordings. That is, two successive frames are likely to contain
the same object or objects. This coherence is used as a
supervisory signal over the unlabeled data, and is used to improve
the performance on a supervised task of interest. We demonstrate
the effectiveness of this method on some pose invariant object and
face recognition tasks.
@inproceedings{mobahi:2009,
title = {Deep Learning from Temporal Coherence in Video},
author = {H. Mobahi and R. Collobert and J. Weston},
booktitle = {International Conference on Machine Learning, {ICML}},
year = {2009}
}
Y. Bengio, J. Louradour, R. Collobert and J. Weston.
Curriculum Learning.
In International Conference on Machine Learning, ICML, 2009.
Humans and animals learn much better when the examples are not
randomly presented but organized in a meaningful order which
illustrates gradually more concepts, and more complex ones. Here,
we formalize such training strategies in the context of machine
learning, and call them curriculum learning. In the context of
recent research studying the difficulty of training in the
presence of non-convex training criteria (for deep deterministic
and stochastic neural networks), we explore curriculum learning in
various set-ups. The experiments show that significant
improvements in generalization can be achieved by using a
particular curriculum, i.e., the selection and order of training
examples. We hypothesize that curriculum learning has both an
effect on the speed of convergence of the training process to a
minimum and, in the case of non-convex criteria, on the quality of
the local minima obtained: curriculum learning can be seen as a
particular form of continuation method (a general strategy for
global optimization of non-convex functions).
@inproceedings{bengio:2009,
title = {Curriculum Learning},
author = {Y. Bengio and J. Louradour and R. Collobert and J. Weston},
booktitle = {International Conference on Machine Learning, {ICML}},
year = {2009}
}
B. Bai, J. Weston, R. Collobert and D. Grangier.
Supervised Semantic Indexing.
In 31st European Conference on Information Retrieval, 2009.
We present a class of models that are discriminatively trained to
directly map from the word content in a query-document or
document-document pair to a ranking score. Like Latent Semantic
Indexing (LSI), our models take account of correlations between
words (synonymy, polysemy). However, unlike LSI our models are
trained with a supervised signal directly on the task of interest,
which we argue is the reason for our superior results. We provide
an empirical study on Wikipedia documents, using the links to
define document-document or query-document pairs, where we obtain
state-of-the-art performance using our method.
@inproceedings{bbai:2009,
title = {Supervised Semantic Indexing},
author = {B. Bai and J. Weston and R. Collobert and D. Grangier},
booktitle = {31st European Conference on Information Retrieval},
year = {2009}
}
2008
R. Collobert.
Torch.
NIPS Workshop on Machine Learning Open Source Software, 2008.
Torch provides a Matlab-like environment for state-of-the-art machine
learning algorithms. It is easy to use and very efficient, thanks to a
simple-yet-powerful fast scripting language (Lua), and an underlying C/C++
implementation. Torch is easily extensible and has been shown to scale to
very large applications.
The slides have been made in Torch!
@misc{collobert:2008a,
title = {Torch},
author = {R. Collobert},
howpublished = {NIPS Workshop on Machine Learning Open Source Software},
year = {2008}
}
M. Karlen, J. Weston, A. Erkan and R. Collobert.
Large Scale Manifold Transduction.
In International Conference on Machine Learning, ICML, 2008.
We show how the regularizer of Transductive Support Vector Machines (TSVM)
can be trained by stochastic gradient descent for linear models and
multi-layer architectures. The resulting methods can be trained
online, have vastly superior training and testing speed to existing TSVM
algorithms, can encode prior knowledge in the network architecture, and
obtain competitive error rates. We then go on to propose a natural
generalization of the TSVM loss function that takes into account
neighborhood and manifold information directly, unifying the two-stage Low
Density Separation method into a single criterion, and leading to
state-of-the-art results.
@inproceedings{karlen:2008,
title = {Large Scale Manifold Transduction},
author = {M. Karlen and J. Weston and A. Erkan and R. Collobert},
booktitle = {International Conference on Machine Learning, {ICML}},
year = {2008}
}
R. Collobert and J. Weston.
A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning.
In International Conference on Machine Learning, ICML, 2008.
We describe a single convolutional neural network architecture that, given
a sentence, outputs a host of language processing predictions:
part-of-speech tags, chunks, named entity tags, semantic roles,
semantically similar words and the likelihood that the sentence makes sense
(grammatically and semantically) using a language model. The entire
network is trained jointly on all these tasks using weight-sharing, an
instance of multitask learning. All the tasks use labeled data except
the language model which is learnt from unlabeled text and represents a
novel form of semi-supervised learning for the shared tasks. We show how
both multitask learning and semi-supervised learning improve the
generalization of the shared tasks, resulting in state-of-the-art
performance.
@inproceedings{collobert:2008,
title = {A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning},
author = {R. Collobert and J. Weston},
booktitle = {International Conference on Machine Learning, {ICML}},
year = {2008}
}
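The weight-sharing idea in the abstract above can be illustrated with a toy sketch: several task-specific heads read from one shared word-embedding table, so an update made for one task changes the representation seen by the others. This is a strong simplification and an assumption on my side: the paper uses a convolutional network over sentences, whereas the sketch uses a single word, linear heads, and a perceptron-style update. All class and function names are hypothetical.

```python
import random

class SharedLookup:
    """Word-embedding table shared by every task."""
    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(vocab_size)]

class TaskHead:
    """Task-specific linear scorer on top of the shared embedding."""
    def __init__(self, shared, dim, n_classes, seed):
        rng = random.Random(seed)
        self.shared = shared
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(n_classes)]

    def scores(self, word_id):
        emb = self.shared.table[word_id]
        return [sum(w * e for w, e in zip(row, emb)) for row in self.weights]

    def sgd_step(self, word_id, target, lr=0.1):
        """Perceptron-style update of this head AND the shared embedding."""
        s = self.scores(word_id)
        pred = max(range(len(s)), key=s.__getitem__)
        if pred == target:
            return
        emb = self.shared.table[word_id]
        old_emb = list(emb)
        for j in range(len(emb)):
            # The shared embedding receives gradient through this task's weights.
            emb[j] += lr * (self.weights[target][j] - self.weights[pred][j])
            self.weights[target][j] += lr * old_emb[j]
            self.weights[pred][j] -= lr * old_emb[j]

def train_jointly(heads, datasets, steps, seed=0):
    """Pick a task at random at each step and update on one of its examples."""
    rng = random.Random(seed)
    for _ in range(steps):
        t = rng.randrange(len(heads))
        word_id, target = rng.choice(datasets[t])
        heads[t].sgd_step(word_id, target)
```

In this sketch a language-model task trained on unlabeled text would simply be one more head over the same table, which is how unlabeled data could influence the supervised tasks.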
J. Weston, F. Ratle and R. Collobert.
Deep Learning via Semi-Supervised Embedding.
In International Conference on Machine Learning, ICML, 2008.
We show how nonlinear embedding algorithms popular for use with shallow
semi-supervised learning techniques such as kernel methods can be applied
to deep multi-layer architectures, either as a regularizer at the output
layer, or on each layer of the architecture. This provides a simple
alternative to existing approaches to deep learning whilst yielding
competitive error rates compared to those methods, and existing shallow
semi-supervised techniques.
@inproceedings{weston:2008,
title = {Deep Learning via Semi-Supervised Embedding},
author = {J. Weston and F. Ratle and R. Collobert},
booktitle = {International Conference on Machine Learning, {ICML}},
year = {2008}
}
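To make "embedding as a regularizer" in the abstract above more concrete, here is a small sketch of one possible pairwise embedding loss added to a supervised loss. The specific margin-based form (pull neighbors together, push non-neighbors at least `margin` apart) and the names `embedding_loss`, `total_loss`, and `lam` are assumptions chosen for illustration; the abstract itself does not fix a particular loss.

```python
import math

def embedding_loss(f_i, f_j, neighbors, margin=1.0):
    """Pairwise loss on two layer outputs f_i and f_j.

    Neighbors are pulled together (squared distance); non-neighbors are
    pushed apart until their distance is at least `margin`.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    if neighbors:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

def total_loss(supervised_loss, pair_losses, lam=0.1):
    """Supervised loss plus the embedding regularizer, weighted by lam."""
    return supervised_loss + lam * sum(pair_losses)
```

The vectors `f_i` and `f_j` may be taken at the output layer or at any hidden layer, which corresponds to the two placements named in the abstract.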
2007
R. Collobert and J. Weston.
Fast Semantic Extraction Using a Novel Neural Network Architecture.
In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 560-567, June 2007.
We describe a novel neural network architecture for the problem of semantic
role labeling. Many current solutions are complicated, consist of several
stages and handbuilt features, and are too slow to be applied as part of
real applications that require such semantic labels, partly because of
their use of a syntactic parser (Pradhan et al., 2004; Gildea and Jurafsky,
2002). Our method instead learns a direct mapping from source sentence to
semantic tags for a given predicate without the aid of a parser or a
chunker. Our resulting system obtains accuracies comparable to the current
state-of-the-art at a fraction of the computational cost.
@inproceedings{collobert:2007,
title = {Fast Semantic Extraction Using a Novel Neural Network Architecture},
author = {R. Collobert and J. Weston},
booktitle = {Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics},
pages = {560--567},
month = {June},
year = {2007}
}
2006
J. Weston, R. Collobert, F. Sinz, L. Bottou and V. Vapnik.
Inference with the Universum.
In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), pages 1009-1016, ACM Press, 2006.
In this paper we study a new framework introduced by Vapnik (1998; 2006)
that is an alternative capacity concept to the large margin approach. In
the particular case of binary classification, we are given a set of labeled
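examples, and a collection of "non-examples" that do not belong to either
class of interest. This collection, called the Universum, allows one to
encode prior knowledge by representing meaningful concepts in the same
domain as the problem at hand. We describe an algorithm to
leverage the Universum by maximizing the number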
of observed contradictions, and show experimentally that this approach
delivers accuracy improvements over using labeled data alone.
@inproceedings{weston:2006,
title = {Inference with the Universum},
author = {J. Weston and R. Collobert and F. Sinz and L. Bottou and V. Vapnik},
booktitle = {Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006)},
publisher = {ACM Press},
pages = {1009--1016},
location = {Pittsburgh, Pennsylvania},
year = {2006}
}
R. Collobert, F. Sinz, J. Weston and L. Bottou.
Large Scale Transductive SVMs.
Journal of Machine Learning Research, 7:1687-1712, September 2006.
We show how the Concave-Convex Procedure can be applied to Transductive SVMs, which
traditionally require solving a combinatorial search problem. This provides for the first
time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the
utility of our approach. Software is available at
http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction.html.
@article{collobert:2006a,
title = {Large Scale Transductive SVMs},
author = {R. Collobert and F. Sinz and J. Weston and L. Bottou},
journal = {Journal of Machine Learning Research},
volume = {7},
pages = {1687--1712},
month = {September},
year = {2006}
}
R. Collobert, F. Sinz, J. Weston and L. Bottou.
Trading convexity for scalability.
In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), pages 201-208, ACM Press, 2006.
Convex learning algorithms, such as Support Vector Machines (SVMs), are
often seen as highly desirable because they offer strong practical
properties and are amenable to theoretical analysis. However, in this work
we show how non-convexity can provide scalability advantages over
convexity. We show how concave-convex programming can be applied to produce
(i) faster SVMs where training errors are no longer support vectors, and
(ii) much faster Transductive SVMs.
This paper received the best paper award at the ICML 2006 conference.
@inproceedings{collobert:2006,
title = {Trading convexity for scalability},
author = {R. Collobert and F. Sinz and J. Weston and L. Bottou},
booktitle = {Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006)},
publisher = {ACM Press},
pages = {201--208},
location = {Pittsburgh, Pennsylvania},
year = {2006}
}
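A short sketch of why a non-convex loss can make training errors stop being support vectors, as claimed in the abstract above. It assumes the ramp loss written as a difference of two hinge losses, ramp(z) = H_1(z) - H_s(z) with H_a(z) = max(0, a - z) and z = y f(x); the default s = -1 and the function names are illustrative choices. The last function gives the slope of the concave part, which is what a concave-convex step would linearize.

```python
def hinge(z, a=1.0):
    """Hinge loss H_a(z) = max(0, a - z), with z = y * f(x)."""
    return max(0.0, a - z)

def ramp(z, s=-1.0):
    """Ramp loss as a difference of hinges: H_1(z) - H_s(z), with s < 1."""
    return hinge(z, 1.0) - hinge(z, s)

def ramp_slope(z, s=-1.0):
    """Slope of the ramp loss in z: -1 on s < z < 1, flat elsewhere."""
    return -1.0 if s < z < 1.0 else 0.0

def concave_part_slopes(margins, s=-1.0):
    """Slope of the concave part -H_s(z) at each margin z.

    d/dz [-max(0, s - z)] is +1 when z < s and 0 otherwise. A
    concave-convex iteration replaces -H_s by this linear term and then
    solves the remaining convex problem.
    """
    return [1.0 if z < s else 0.0 for z in margins]
```

For a badly misclassified point (z well below s) the ramp loss is flat, so the point contributes no gradient, whereas the plain hinge loss keeps a slope of -1 there and the point remains a support vector.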
2004
R. Collobert.
Large Scale Machine Learning.
Université Paris VI, 2004.
This thesis aims to address machine learning in general, with a particular
focus on large models and large databases. After introducing the learning
problem in a formal way, we first review several important machine learning
algorithms, particularly Multi Layer Perceptrons, Mixture of Experts and
Support Vector Machines. We then present a training method for Support
Vector Machines, adapted to reasonably large datasets. However the training
of such a model is still intractable on very large databases. We thus
propose a divide and conquer approach based on a kind of Mixture of Experts
in order to break up the training problem into small pieces, while keeping
good generalization performance. This mixture model can be applied to any
kind of existing machine learning algorithm. Even though it performs well
in practice the major drawback of this algorithm is the number of
hyper-parameters to tune, which makes it difficult to use. We thus prefer
afterward to focus on training improvements for Multi Layer Perceptrons,
which are easier to tune, and more suitable than Support Vector Machines
for large databases. We finally show that the margin idea introduced with
Support Vector Machines can be applied to a certain class of Multi Layer
Perceptrons, which leads to a fast algorithm with powerful generalization
performance.
@phdthesis{collobert:2004b,
title = {Large Scale Machine Learning},
author = {R. Collobert},
school = {Universit\'e Paris {VI}},
year = {2004}
}
R. Collobert and S. Bengio.
Links Between Perceptrons, MLPs and SVMs.
In International Conference on Machine Learning, ICML, 2004.
We propose to study links between three important classification
algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector
Machines (SVMs). We first study ways to control the capacity of Perceptrons
(mainly regularization parameters and early stopping), using the margin
idea introduced with SVMs. After showing that under simple conditions a
Perceptron is equivalent to an SVM, we show it can be computationally
expensive in time to train an SVM (and thus a Perceptron) with stochastic
gradient descent, mainly because of the margin maximization term in the
cost function. We then show that if we remove this margin maximization
term, the learning rate or the use of early stopping can still control the
margin. These ideas are extended afterward to the case of MLPs. Moreover,
under some assumptions it also appears that MLPs are a kind of mixture of
SVMs, maximizing the margin in the hidden layer space. Finally, we present
a very simple MLP based on the previous findings, which yields better
performances in generalization and speed than the other models.
Neural networks with the right criterion (like a hinge loss) work well,
with better scaling properties than SVMs...
Also, interestingly, each neuron in the hidden layer of a neural network
acts as a kind of SVM on a subset of the training set.
@inproceedings{collobert:2004a,
title = {Links Between Perceptrons, {MLPs} and {SVMs}},
author = {R. Collobert and S. Bengio},
booktitle = {International Conference on Machine Learning, {ICML}},
year = {2004}
}
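The comparison between the Perceptron criterion and a margin criterion discussed above can be sketched for a linear model as follows. This is an illustration under assumptions: plain SGD with no margin-maximization (weight decay) term, a fixed learning rate, and a fixed number of epochs standing in for early stopping. Setting `margin=0` gives the Perceptron criterion and `margin=1` gives the hinge loss used by SVMs; the function names are hypothetical.

```python
def score(w, b, x):
    """Linear score w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sgd_margin(data, dim, margin, lr=0.1, epochs=50):
    """SGD on max(0, margin - y * score) without any weight-decay term.

    Capacity is controlled only by the learning rate and by the number
    of epochs (early stopping).
    """
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * score(w, b, x) <= margin:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def min_margin(w, b, data):
    """Smallest value of y * score over the dataset."""
    return min(y * score(w, b, x) for x, y in data)
```

On separable data the `margin=0` run stops as soon as every example is on the correct side, while the `margin=1` run keeps updating until every example is at least one unit away from the boundary.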
R. Collobert and S. Bengio.
A Gentle Hessian for Efficient Gradient Descent.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2004.
Several second-order optimization methods for gradient descent algorithms
have been proposed over the years, but they usually need to compute the
inverse of the Hessian of the cost function (or an approximation of this
inverse) during training. In most cases, this leads to an O(n^2) cost in
time and space per iteration, where n is the number of parameters, which
is prohibitive for large n. We propose instead a study of the Hessian
before training. Based on a second order analysis, we show that a
block-diagonal Hessian yields an easier optimization problem than a full
Hessian. We also show that the condition of block-diagonality in common
machine learning models can be achieved by simply selecting an appropriate
training criterion. Finally, we propose a version of the SVM criterion
applied to MLPs, which verifies the aspects highlighted in this second
order analysis, but also yields very good generalization performance in
practice, taking advantage of the margin effect. Several empirical
comparisons on two benchmark datasets are given to illustrate this
approach.
Probably because neural networks were studied on very small databases in
the past, many people believe that neural networks overfit easily. I would
correct this: if not well tuned (like an SVM having a Gaussian kernel with a
small variance!), neural networks do overfit. But in fact, in many cases,
they are hard to train.
We show here that the choice of the architecture itself has an impact on
the optimization.
In particular we show that the margin criterion used in SVMs is well suited
for neural network optimization: with the hinge loss, the Hessian is better
conditioned than with a classical loss like the Mean Squared Error.
@inproceedings{collobert:2004,
title = {A Gentle Hessian for Efficient Gradient Descent},
author = {R. Collobert and S. Bengio},
booktitle = {{IEEE} International Conference on Acoustics, Speech, and Signal Processing, {ICASSP}},
year = {2004}
}
2003
R. Collobert, Y. Bengio and S. Bengio.
Scaling Large Learning Problems with Hard Parallel Mixtures.
International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 17(3):349-365, 2003.
A challenge for statistical learning is to deal with large data sets,
e.g. in data mining. The training time of ordinary Support Vector
Machines is at least quadratic, which raises a serious research challenge
if we want to deal with data sets of millions of examples. We propose a
``hard parallelizable mixture'' methodology which yields significantly
reduced training time through modularization and parallelization: the
training data is iteratively partitioned by a ``gater'' model in such a way
that it becomes easy to learn an ``expert'' model separately in each region
of the partition. A probabilistic extension and the use of a set of
generative models allows representing the gater so that all pieces of the
model are locally trained. For SVMs, time complexity appears empirically
to locally grow linearly with the number of examples, while
generalization performance can be enhanced. For the probabilistic version
of the algorithm, the iterative algorithm provably goes down in a cost
function that is an upper bound on the negative log-likelihood.
The aim was to use a divide-and-conquer method to break up the SVM
complexity and solve large scale classification tasks. While these mixtures
do work, they are unfortunately quite difficult to tune, because of the
additional hyper-parameters involved in the architecture.
This paper was originally presented at the
International Workshop on Pattern Recognition with Support Vector Machines (SVM'2002).
The original paper, with fewer experiments and without probabilistic mixtures, was published in NIPS.
A variant, including more experiments than the NIPS version, was published in Neural Computation.
@article{collobert:2003,
title = {Scaling Large Learning Problems with Hard Parallel Mixtures},
author = {R. Collobert and Y. Bengio and S. Bengio},
journal = {International Journal of Pattern Recognition and Artificial Intelligence ({IJPRAI})},
volume = {17},
number = {3},
pages = {349--365},
year = {2003}
}
C. Sanderson, S. Bengio, H. Bourlard, J. Mariéthoz, R. Collobert, M.F. BenZeghiba, F. Cardinaux and S. Marcel.
Speech & Face Based Biometric Authentication at IDIAP.
In International Conference on Multimedia and Expo, ICME, volume 3, pages 1-4, 2003.
We present an overview of recent research at IDIAP on speech & face based
biometric authentication. This paper covers user-customised passwords,
adaptation techniques, confidence measures (for use in fusion of audio &
visual scores), face verification in difficult image conditions, as well as
other related research issues. We also overview the open source Torch
library, which has aided in the implementation of the above mentioned
techniques.
@inproceedings{collobert:2003a,
title = {Speech \& Face Based Biometric Authentication at {IDIAP}},
author = {C. Sanderson and S. Bengio and H. Bourlard and J. Mari\'ethoz and R. Collobert and M.F. BenZeghiba and F. Cardinaux and S. Marcel},
booktitle = {International Conference on Multimedia and Expo, {ICME}},
volume = {3},
pages = {1--4},
year = {2003}
}
2002
R. Collobert, S. Bengio and J. Mariéthoz.
Torch: a modular machine learning software library.
Technical Report IDIAP-RR 02-46, IDIAP, 2002.
Many scientific communities have expressed a growing interest in machine
learning algorithms recently, mainly due to the generally good results they
provide, compared to traditional statistical or AI approaches. However,
these machine learning algorithms are often complex to implement and to use
properly and efficiently. We thus present in this paper a new machine
learning software library in which most state-of-the-art algorithms have
already been implemented and are available in a unified framework, in order
for scientists to be able to use them, compare them, and even extend them
for their own purposes. More interestingly, this library is freely
available under a BSD license and can be retrieved from the web by
everyone.
This report presented the first version of the Torch machine learning library.
Several versions have been developed since then, culminating with Torch5,
the latest official version.
@techreport{collobert:2002,
title = {{T}orch: a modular machine learning software library},
author = {R. Collobert and S. Bengio and J. Mari\'ethoz},
institution = {IDIAP},
type = {Technical Report IDIAP-RR},
number = {02-46},
year = {2002}
}
R. Collobert, S. Bengio and Y. Bengio.
A Parallel Mixture of SVMs for Very Large Scale Problems.
In T.G. Dietterich, S. Becker and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, NIPS 14, pages 633-640, MIT Press, 2002.
Support Vector Machines (SVMs) are currently the state-of-the-art models
for many classification problems but they suffer from the complexity of
their training algorithm which is at least quadratic with respect to the
number of examples. Hence, it is hopeless to try to solve real-life
problems having more than a few hundreds of thousands examples with
SVMs. The present paper proposes a new mixture of SVMs that can be easily
implemented in parallel and where each SVM is trained on a small subset of
the whole dataset. Experiments on a large benchmark dataset (Forest) as
well as a difficult speech database, yielded significant time improvement
(time complexity appears empirically to locally grow linearly with the
number of examples). In addition, and that is a surprise, a significant
improvement in generalization was observed on Forest.
This is our first paper on Mixture of SVMs. The aim was to use a
divide-and-conquer method to break up the SVM complexity and solve large
scale classification tasks. While these mixtures do work, they are
unfortunately quite difficult to tune, because of the additional
hyper-parameters involved in the architecture.
A variant of this paper, with more experiments, has been published in Neural Computation.
An extended version, including more experiments and probabilistic mixtures,
has been published in IJPRAI and presented at SVM'2002.
@inproceedings{collobert:2002a,
title = {A Parallel Mixture of {SVMs} for Very Large Scale Problems},
author = {R. Collobert and S. Bengio and Y. Bengio},
booktitle = {Advances in Neural Information Processing Systems, {NIPS} 14},
publisher = {MIT Press},
editor = {Dietterich, T.G. and Becker, S. and Ghahramani, Z.},
pages = {633--640},
year = {2002}
}
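The divide-and-conquer idea described above (train each expert on a small subset, independently and possibly in parallel, then combine) can be sketched as follows. Several simplifications are assumptions on my side and are not the method of the paper: the experts are linear models trained by SGD on the hinge loss rather than SVMs, the partition is random, and the combination is a plain average rather than a trained gater. `quadratic_cost` only illustrates the arithmetic behind the speed-up when training cost grows quadratically with the number of examples.

```python
import random

def train_expert(subset, dim, lr=0.1, epochs=50):
    """Train one linear expert on its own subset (stand-in for an SVM)."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in subset:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def partition(data, k, seed=0):
    """Randomly split the data into k disjoint subsets of similar size."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def mixture_predict(experts, x):
    """Combine experts by summing their scores (uniform gater, an assumption)."""
    total = sum(sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in experts)
    return 1 if total >= 0 else -1

def quadratic_cost(n_examples, n_experts):
    """Total cost if training one expert on m examples costs m**2 units."""
    per_expert = n_examples // n_experts
    return n_experts * per_expert ** 2
```

With 100,000 examples, one quadratic-cost model needs 10^10 units, while 50 experts on 2,000 examples each need 50 x 2,000^2 = 2 x 10^8 units in total, a factor of 50 before any parallelism is used.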
R. Collobert, S. Bengio and Y. Bengio.
A Parallel Mixture of SVMs for Very Large Scale Problems.
Neural Computation, 14(5):1105-1114, 2002.
Support Vector Machines (SVMs) are currently the state-of-the-art models
for many classification problems but they suffer from the complexity of
their training algorithm which is at least quadratic with respect to the
number of examples. Hence, it is hopeless to try to solve real-life
problems having more than a few hundreds of thousands examples with
SVMs. The present paper proposes a new mixture of SVMs that can be easily
implemented in parallel and where each SVM is trained on a small subset of
the whole dataset. Experiments on a large benchmark dataset (Forest)
yielded significant time improvement (time complexity appears empirically
to locally grow linearly with the number of examples). In addition, and
that is a surprise, a significant improvement in generalization was
observed.
The aim was to use a divide-and-conquer method to break up the SVM
complexity and solve large scale classification tasks. While these mixtures
do work, they are unfortunately quite difficult to tune, because of the
additional hyper-parameters involved in the architecture.
The original paper, with fewer experiments, has been published in NIPS.
An extended version, including more experiments and probabilistic mixtures,
has been published in IJPRAI and presented at SVM'2002.
@article{collobert:2002b,
title = {A Parallel Mixture of {SVMs} for Very Large Scale Problems},
author = {R. Collobert and S. Bengio and Y. Bengio},
journal = {Neural Computation},
volume = {14},
number = {5},
pages = {1105--1114},
year = {2002}
}
2001
R. Collobert and S. Bengio.
SVMTorch: Support Vector Machines for Large-Scale Regression Problems.
Journal of Machine Learning Research, 1:143-160, 2001.
Support Vector Machines (SVMs) for regression problems are trained by
solving a quadratic optimization problem which needs on the order of l
square memory and time resources to solve, where l is the number of
training examples. In this paper, we propose a decomposition algorithm,
SVMTorch (available at
http://www.idiap.ch/learning/SVMTorch.html),
which is similar to SVM-Light proposed by Joachims (1999) for
classification problems, but adapted to regression problems. With this
algorithm, one can now efficiently solve large-scale regression problems
(more than 20000 examples). Comparisons with Nodelib, another publicly
available SVM algorithm for large-scale regression problems from Flake and
Lawrence (2000) yielded significant time improvements. Finally, based on a
recent paper from Lin (2000), we show that a convergence proof exists for
our algorithm.
Our contribution extends Joachims' ideas to the regression SVM
problem. Though nowadays it may seem obvious, curiously it was not the
technique used to train regression SVMs at the time we proposed this
extension.
@article{collobert:2001,
title = {{SVMT}orch: Support Vector Machines for Large-Scale Regression Problems},
author = {R. Collobert and S. Bengio},
journal = {Journal of Machine Learning Research},
volume = {1},
pages = {143--160},
year = {2001}
}
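The "order of l square" memory requirement mentioned in the abstract above can be made concrete with a little arithmetic. The sketch assumes a dense kernel matrix stored as 8-byte floating point numbers, and compares it with keeping only the kernel rows of a small working set, which is the kind of saving a decomposition method aims for. The working set size `q` and the function names are hypothetical.

```python
def kernel_matrix_bytes(n_examples, bytes_per_entry=8):
    """Memory for a dense l x l kernel matrix."""
    return n_examples * n_examples * bytes_per_entry

def working_set_bytes(n_examples, q, bytes_per_entry=8):
    """Memory for the kernel rows of a working set of q variables."""
    return q * n_examples * bytes_per_entry
```

For l = 20000 examples the full matrix takes 20000 x 20000 x 8 = 3.2 x 10^9 bytes, whereas the rows for a working set of q = 2 variables take 2 x 20000 x 8 = 320,000 bytes.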
2000
Ronan Collobert.
Support Vector Machines: Théorie et Applications.
Université de Rennes I, 2000.
@mastersthesis{collobert:2000,
title = {Support Vector Machines: Th\'eorie et Applications},
author = {Ronan Collobert},
school = {Universit\'e de Rennes {I}},
year = {2000}
}
EPFL Courses
Collaborators
- Bing Bai, NEC Labs America, Princeton, USA
- Samy Bengio, Google Inc, Mountain View, USA
- Yoshua Bengio, Université de Montréal, Canada
- Antoine Bordes, CNRS, Heudiasyc, France
- Léon Bottou, NEC Labs America, Princeton, USA
- Clément Farabet, NYU, NY, USA
- David Grangier, AT&T, San Francisco, USA
- Koray Kavukcuoglu, NEC Labs America, Princeton, USA
- Conrad Sanderson, NICTA, Brisbane, Australia
- Vladimir Vapnik, NEC Labs America, Princeton, USA
- Jason Weston, Google Inc, New York, USA
Last Modified 12/9/12 15:15