Ronan Collobert


Research scientist at Apple, Machine Learning Research.

Before that, I was a research scientist at Facebook AI Research, the Idiap Research Institute, and NEC Laboratories America.

collobert [at] apple [dot] com



Software


See also my GitHub page.


flashlight

An efficient modern C++ machine learning framework.

flashlight is written entirely in modern C++, with flexibility and efficiency in mind. It integrates a modern autograd and supports most deep-learning models. It is lightweight, fully customizable, and extensible to go beyond classical models. A research framework for research in machine learning frameworks!

flashlight blog post

wav2letter

A toolbox for end-to-end speech recognition.

wav2letter is written in C++, and provides a number of recipes to train and evaluate various end-to-end speech recognition models. It originally advocated fully convolutional models for efficiency, and has since evolved into a flexible toolbox.

It ships with a very efficient standalone decoder.

wav2letter is now part of flashlight. The wav2letter repository only contains recipes.

wav2letter

torch

A machine learning library that aims to include state-of-the-art algorithms.

Torch7 is the last version of Torch. It provides a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and very efficient, thanks to a simple and fast scripting language (Lua) and an underlying C implementation. It is distributed under a BSD license.

Torch 7 Torch 7 Git Torch 7 Overview Lua Torch 5 Torch 4 Torch 3

senna

A Natural Language Processing (NLP) tagger. Now version 3.0 (August 2011).

SENNA is a software distributed under a non-commercial license, which outputs a host of Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER) and semantic role labeling (SRL).

SENNA is fast because it uses a simple architecture, self-contained because it does not rely on the output of existing NLP systems, and accurate because it offers state-of-the-art or near state-of-the-art performance.

SENNA is written in ANSI C, with about 2500 lines of code. It requires about 150MB of RAM and should run on any machine with IEEE floating-point arithmetic.

SENNA

SVMTorch

A Support Vector Machine library.

Written while I was a PhD student, it was efficient at the time. I would now recommend LIBSVM, as SVMTorch has not been updated in a long while.

SVMTorch

Publications


See also my Google Scholar page.


2021

V. Pratap, Q. Xu, T. Likhomanenko, G. Synnaeve and R. Collobert. Word Order Does Not Matter For Speech Recognition. arXiv, volume abs/2110.05994, 2021.

In this paper, we study the training of automatic speech recognition systems in a weakly supervised setting where the order of words in the transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using a LogSumExp operation, and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated from this model on the training set, we then train a letter-based acoustic model using the Connectionist Temporal Classification loss. Our system achieves 2.3%/4.6% WER on the test-clean/test-other subsets of LibriSpeech, which closely matches the supervised baseline's performance.

@article{pratap:2021,
  title = {Word Order Does Not Matter For Speech Recognition},
  author = {V. Pratap and Q. Xu and T. Likhomanenko and G. Synnaeve and R. Collobert},
  journal = {arXiv},
  volume = {abs/2110.05994},
  year = {2021}
}
PDF
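
The core aggregation is compact enough to sketch. Below is a toy numpy illustration (mine, not the paper's code; the shapes, vocabulary size, and plain log-softmax normalization are simplifications): frame-level word scores are pooled over time with LogSumExp, normalized into a word distribution, and matched against the transcript's bag of words with cross-entropy.

import numpy as np

def logsumexp(x, axis=0):
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True)), axis=axis)

T, V = 50, 1000                              # output frames, vocabulary size (toy values)
rng = np.random.default_rng(0)
frame_scores = rng.normal(size=(T, V))       # acoustic model output: one score per word per frame

pooled = logsumexp(frame_scores, axis=0)     # aggregate all frames into one score per word
log_probs = pooled - logsumexp(pooled)       # log-softmax over the vocabulary

transcript = [3, 17, 42, 17]                 # word ids of the transcript; order is irrelevant
target = np.bincount(transcript, minlength=V).astype(float)
target /= target.sum()                       # ground-truth word distribution

loss = -(target * log_probs).sum()           # cross-entropy against the pooled distribution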

T. Likhomanenko, Q. Xu, J. Kahn, G. Synnaeve and R. Collobert. slimIPL: Language-model-free iterative pseudo-labeling. In Interspeech, 2021.

Recent results in end-to-end automatic speech recognition have demonstrated the efficacy of pseudo-labeling for semi-supervised models trained both with Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (seq2seq) losses. Iterative Pseudo-Labeling (IPL), which continuously trains a single model using pseudo-labels iteratively re-generated as the model learns, has been shown to further improve performance in ASR. We improve upon the IPL algorithm: as the model learns, we propose to iteratively re-generate transcriptions with hard labels (the most probable tokens), that is, without a language model. We call this approach Language-Model-Free IPL (slimIPL) and give a resultant training setup for low-resource settings with CTC-based models. slimIPL features a dynamic cache for pseudo-labels which reduces sensitivity to changes in relabeling hyperparameters and results in improved training stability. slimIPL is also highly efficient and requires 3.5-4x fewer computational resources to converge than other state-of-the-art semi/self-supervised approaches. With only 10 hours of labeled audio, slimIPL is competitive with self-supervised approaches, and is state-of-the-art with 100 hours of labeled audio without the use of a language model both at test time and during pseudo-label generation.

@inproceedings{likhomanenko:2021b,
  title = {slim{IPL}: Language-model-free iterative pseudo-labeling},
  author = {T. Likhomanenko and Q. Xu and J. Kahn and G. Synnaeve and R. Collobert},
  booktitle = {Interspeech},
  year = {2021}
}
PDF
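
The dynamic cache is the distinctive ingredient; here is a schematic Python skeleton of it (my reading of the idea, not the released code; train_step, hard_labels, and all hyperparameter values are placeholders): pseudo-labeled batches are served from a cache and only occasionally re-labeled with the current model, which decouples training from relabeling.

import random

def train_step(model, batch, labels):        # placeholder for one CTC gradient step
    pass

def hard_labels(model, batch):               # placeholder for greedy, LM-free transcription
    return model(batch)

def slimipl(model, labeled, unlabeled, cache_size=1000, p_keep=0.5, steps=100000):
    cache = []                               # (unlabeled batch, pseudo-labels) pairs
    for _ in range(steps):
        batch, labels = random.choice(labeled)
        train_step(model, batch, labels)     # supervised update on labeled data

        if len(cache) < cache_size:          # fill the cache before using it
            u = random.choice(unlabeled)
            cache.append((u, hard_labels(model, u)))
            continue

        i = random.randrange(cache_size)     # train on a cached pseudo-labeled batch
        u, pl = cache[i]
        train_step(model, u, pl)
        if random.random() > p_keep:         # sometimes refresh the slot with the current model
            u = random.choice(unlabeled)
            cache[i] = (u, hard_labels(model, u))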

W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve and M. Auli. Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2021.

Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of labeled data from Libri-light and 53k hours of unlabeled data from LibriVox, we achieve WERs of 3.0%/5.2% on the clean and other test sets of LibriSpeech, rivaling the best published systems trained on 960 hours of labeled data only a year ago. Training on all labeled data of LibriSpeech achieves WERs of 1.5%/3.1%.

@inproceedings{hsu:2021,
  title = {Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training},
  author = {W.-N. Hsu and A. Sriram and A. Baevski and T. Likhomanenko and Q. Xu and V. Pratap and J. Kahn and A. Lee and R. Collobert and G. Synnaeve and M. Auli},
  booktitle = {{IEEE} International Conference on Acoustic, Speech, and Signal Processing, {ICASSP}},
  year = {2021}
}
PDF

T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G. Avidov, R. Collobert and G. Synnaeve. Rethinking evaluation in ASR: Are our models robust enough?. In Interspeech, 2021.

Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic modeling are typically evaluated based on performance on a single dataset. While the research community has coalesced around various benchmarks, we set out to understand generalization performance in acoustic modeling across datasets – in particular, if models trained on a single dataset transfer to other (possibly out-of-domain) datasets. We show that, in general, reverberative and additive noise augmentation improves generalization performance across domains. Further, we demonstrate that when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world noisy data. Finally, we show that training a single acoustic model on the most widely-used datasets – combined – reaches competitive performance on both research and real-world benchmarks.

@inproceedings{likhomanenko:2021a,
  title = {Rethinking evaluation in {ASR}: Are our models robust enough?},
  author = {T. Likhomanenko and Q. Xu and V. Pratap and P. Tomasello and J. Kahn and G. Avidov and R. Collobert and G. Synnaeve},
  booktitle = {Interspeech},
  year = {2021}
}
PDF

A. Conneau, A. Baevski, R. Collobert, A. Mohamed and M. Auli. Unsupervised cross-lingual representation learning for speech recognition. In Interspeech, 2021.

This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to a comparable system. Our approach enables a single multilingual speech recognition model which is competitive to strong individual models. Analysis shows that the latent discrete speech representations are shared across languages with increased sharing for related languages. We hope to catalyze research in low-resource speech understanding by releasing XLSR-53, a large model pretrained in 53 languages.

@inproceedings{conneau:2021,
  title = {Unsupervised cross-lingual representation learning for speech recognition},
  author = {A. Conneau and A. Baevski and R. Collobert and A. Mohamed and M. Auli},
  booktitle = {Interspeech},
  year = {2021}
}
PDF

V. Manohar, T. Likhomanenko, Q. Xu, W.-N. Hsu, R. Collobert, Y. Saraf, G. Zweig and A. Mohamed. Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition. In Automatic Speech Recognition and Understanding Workshop, ASRU, 2021.

In this paper, we introduce the Kaizen framework, which uses a continuously improving teacher to generate pseudo-labels for semi-supervised training. The proposed approach uses a teacher model which is updated as the exponential moving average of the student model parameters. This can be seen as a continuous version of the iterative pseudo-labeling approach for semi-supervised training. It is applicable to different training criteria, and in this paper we demonstrate it for frame-level hybrid hidden Markov model - deep neural network (HMM-DNN) models and sequence-level connectionist temporal classification (CTC) based models. The proposed approach shows more than 10% word error rate (WER) reduction over standard teacher-student training, and more than 50% relative WER reduction over a 10-hour supervised baseline, when using large-scale realistic unsupervised public videos in UK English and Italian.

@inproceedings{manohar:2021,
  title = {Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition},
  author = {V. Manohar and T. Likhomanenko and Q. Xu and W.-N. Hsu and R. Collobert and Y. Saraf and G. Zweig and A. Mohamed},
  booktitle = {Automatic Speech Recognition and Understanding Workshop, {ASRU}},
  year = {2021}
}
PDF
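
The teacher update itself is a one-liner; a minimal numpy sketch (alpha is an illustrative value, not the paper's setting):

import numpy as np

def ema_update(teacher, student, alpha=0.999):
    # teacher, student: lists of numpy parameter arrays of matching shapes
    for t, s in zip(teacher, student):
        t *= alpha                   # decay the running average in place
        t += (1.0 - alpha) * s       # mix in the current student weights

# after every student optimizer step:
#   ema_update(teacher_params, student_params)
# pseudo-labels for unlabeled data are then generated with the teacher.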

C. Talnikar, T. Likhomanenko, R. Collobert and G. Synnaeve. Joint masked CPC and CTC training for ASR. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2021.

Self-supervised learning (SSL) has shown promise in learning representations of audio that are useful for automatic speech recognition (ASR). But, training SSL models like wav2vec 2.0 requires a two-stage pipeline. In this paper we demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data. During training, we alternately minimize two losses: an unsupervised masked Contrastive Predictive Coding (CPC) loss and the supervised audio-to-text alignment loss Connectionist Temporal Classification (CTC). We show that this joint training method directly optimizes performance for the downstream ASR task using unsupervised data while achieving similar word error rates to wav2vec 2.0 on the Librispeech 100-hours dataset. Finally, we postulate that solving the contrastive task is a regularization for the supervised CTC loss.

@inproceedings{talnikar:2021,
  title = {Joint masked {CPC} and {CTC} training for {ASR}},
  author = {C. Talnikar and T. Likhomanenko and R. Collobert and G. Synnaeve},
  booktitle = {{IEEE} International Conference on Acoustic, Speech, and Signal Processing, {ICASSP}},
  year = {2021}
}
PDF
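
Schematically, the single-stage recipe is one optimizer alternating two losses on the same encoder. A hedged PyTorch-style skeleton (cpc_loss, ctc_loss, and the strict alternation are placeholders; the paper's exact schedule may differ):

def joint_training(model, optimizer, labeled, unlabeled, steps=10000):
    # cpc_loss and ctc_loss stand in for the masked contrastive loss and the
    # supervised CTC loss, both computed on the same encoder
    for step in range(steps):
        if step % 2 == 0:
            loss = cpc_loss(model, next(unlabeled))    # unsupervised batch
        else:
            loss = ctc_loss(model, *next(labeled))     # supervised batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()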

T. Likhomanenko, Q. Xu, R. Collobert, G. Synnaeve and A. Rogozhnikov. CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings. In Advances in Neural Information Processing Systems, (NeurIPS), 2021.

Without positional information, attention-based transformer neural networks are permutation-invariant. Absolute or relative positional embeddings are the most popular ways to feed transformer models positional information. Absolute positional embeddings are simple to implement, but suffer from generalization issues when evaluating on sequences of different length than those seen at training time. Relative positions are more robust to length change, but are more complex to implement and yield inferior model throughput. In this paper, we propose an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute (simplicity and speed) and relative position embeddings (better generalization). In addition, our empirical evaluation on state-of-the-art models in machine translation, image and speech recognition demonstrates that CAPE leads to better generalization performance as well as increased stability with respect to training hyper-parameters.

@inproceedings{likhomanenko:2021,
  title = {{CAPE}: Encoding Relative Positions with Continuous Augmented Positional Embeddings},
  author = {T. Likhomanenko and Q. Xu and R. Collobert and G. Synnaeve and A. Rogozhnikov},
  booktitle = {Advances in Neural Information Processing Systems, ({NeurIPS})},
  year = {2021}
}
PDF
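
A rough numpy sketch of the idea as I read it (the augmentation parameters and their ranges are illustrative, not the paper's values): keep sinusoidal absolute embeddings, but randomly shift and rescale the continuous positions during training so the model cannot rely on exact absolute values.

import numpy as np

def sinusoidal(positions, dim=64):
    # standard fixed absolute embeddings, evaluated at continuous positions
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    ang = positions[:, None] * inv_freq[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def cape(seq_len, train=True, max_global_shift=5.0, max_local_shift=0.5, max_scale=1.03):
    pos = np.arange(seq_len, dtype=float)
    if train:                                # augment positions, not embeddings
        pos += np.random.uniform(-max_global_shift, max_global_shift)
        pos += np.random.uniform(-max_local_shift, max_local_shift, size=seq_len)
        pos *= np.exp(np.random.uniform(-np.log(max_scale), np.log(max_scale)))
    return sinusoidal(pos)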

2020

R. Collobert, A. Hannun and G. Synnaeve. Word-level Speech Recognition with a Letter to Word Encoder. In International Conference on Machine Learning, ICML, 2020.

We propose a direct-to-word sequence model which uses a word network to learn word embeddings from letters. The word network can be integrated seamlessly with arbitrary sequence models including Connectionist Temporal Classification and encoder-decoder models with attention. We show our direct-to-word model can achieve word error rate gains over sub-word level models for speech recognition. We also show that our direct-to-word approach retains the ability to predict words not seen at training time without any retraining. Finally, we demonstrate that a word-level model can use a larger stride than a sub-word level model while maintaining accuracy. This makes the model more efficient both for training and inference.

@inproceedings{collobert:2020,
  title = {Word-level Speech Recognition with a Letter to Word Encoder},
  author = {R. Collobert and A. Hannun and G. Synnaeve},
  booktitle = {International Conference on Machine Learning, {ICML}},
  year = {2020}
}
PDF
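
To make the mechanism concrete, a toy sketch (mine; the actual word network is learned, and frame_vec here is a made-up stand-in for a pooled acoustic representation): word embeddings are computed from letters, so scoring unseen words needs no retraining.

import numpy as np

rng = np.random.default_rng(0)
letter_vecs = rng.normal(size=(26, 64))      # one vector per letter
frame_vec = rng.normal(size=64)              # stand-in for an acoustic representation

def word_embedding(word):
    # the paper learns a network over letters; a mean of letter vectors is the
    # simplest stand-in with the same key property: any word, seen at training
    # time or not, gets an embedding from its spelling alone
    return np.mean([letter_vecs[ord(c) - ord('a')] for c in word], axis=0)

score = frame_vec @ word_embedding("hello")  # acoustic-to-word matching score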

V. Pratap, Q. Xu, J. Kahn, G. Avidov, T. Likhomanenko, A. Hannun, V. Liptchinsky, G. Synnaeve and R. Collobert. Scaling Up Online Speech Recognition Using ConvNets. In Interspeech, 2020.

We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate. Also important to the efficiency of the recognizer is our highly optimized beam search decoder. To show the impact of our design choices, we analyze throughput, latency, accuracy, and discuss how these metrics can be tuned based on the user requirements.

@inproceedings{pratap:2020b,
  title = {Scaling Up Online Speech Recognition Using ConvNets},
  author = {V. Pratap and Q. Xu and J. Kahn and G. Avidov and T. Likhomanenko and A. Hannun and V. Liptchinsky and G. Synnaeve and R. Collobert},
  booktitle = {Interspeech},
  year = {2020}
}
PDF

S. Subramanian, R. Collobert, M.'A. Ranzato and Y.-L. Boureau. Multi-scale Transformer Language Models. arXiv, volume abs/2005.00581, 2020.

We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three different architectures that have an inductive bias to handle the hierarchical nature of language. Experiments on large-scale language modeling benchmarks empirically demonstrate favorable likelihood vs memory footprint trade-offs, e.g. we show that it is possible to train a hierarchical variant with 30 layers that has 23% smaller memory footprint and better perplexity, compared to a vanilla transformer with less than half the number of layers, on the Toronto BookCorpus. We analyze the advantages of learned representations at multiple scales in terms of memory footprint, compute time, and perplexity, which are particularly appealing given the quadratic scaling of transformers’ run time and memory usage with respect to sequence length.

@article{subramanian:2020,
  title = {Multi-scale Transformer Language Models},
  author = {S. Subramanian and R. Collobert and M.'A. Ranzato and Y.-L. Boureau},
  journal = {arXiv},
  volume = {abs/2005.00581},
  year = {2020}
}
PDF

V. Pratap, Q. Xu, A. Sriram, G. Synnaeve and R. Collobert. MLS: A large-scale multilingual dataset for speech research. In Interspeech, 2020.

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org.

@inproceedings{pratap:2020a,
  title = {{MLS}: A large-scale multilingual dataset for speech research},
  author = {V. Pratap and Q. Xu and A. Sriram and G. Synnaeve and R. Collobert},
  booktitle = {Interspeech},
  year = {2020}
}
PDF

V. Pratap, A. Sriram, P. Tomasello, A. Hannun, V. Liptchinsky, G. Synnaeve and R. Collobert. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. In Interspeech, 2020.

We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and overall simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data by language (from 100 hours to 1100 hours). We compare three variants of multilingual training, from a single joint model without knowing the input language, to using this information, to multiple heads (one per language “cluster”). We show that multilingual training of ASR models on several languages can improve recognition performance, in particular on low-resource languages. We see 20.9%, 23% and 28.8% average relative WER reduction compared to monolingual baselines for the joint model, the joint model with language input, and the multi-head model, respectively. To our knowledge, this is the first work studying multilingual ASR at massive scale, with more than 50 languages and more than 16,000 hours of audio across them.

@inproceedings{pratap:2020,
  title = {Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters},
  author = {V. Pratap and A. Sriram and P. Tomasello and A. Hannun and V. Liptchinsky and G. Synnaeve and R. Collobert},
  booktitle = {Interspeech},
  year = {2020}
}
PDF

J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A.-R. Mohamed and E. Dupoux. Libri-Light: A Benchmark for ASR with Limited or No Supervision. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2020.

We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

@inproceedings{kahn:2020,
  title = {Libri-Light: A Benchmark for {ASR} with Limited or No Supervision},
  author = {J. Kahn and M. Rivi{\`e}re and W. Zheng and E. Kharitonov and Q. Xu and P.-E. Mazar\'e and J. Karadayi and V. Liptchinsky and R. Collobert and C. Fuegen and T. Likhomanenko and G. Synnaeve and A. Joulin and A.-R. Mohamed and E. Dupoux},
  booktitle = {{IEEE} International Conference on Acoustic, Speech, and Signal Processing, {ICASSP}},
  year = {2020}
}
PDF

Q. Xu, T. Likhomanenko, J. Kahn, A. Hannun, G. Synnaeve and R. Collobert. Iterative pseudo-labeling for speech recognition. In Interspeech, 2020.

Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data. We study the main components of IPL: decoding with a language model and data augmentation. We then demonstrate the effectiveness of IPL by achieving state-of-the-art word error rates on the LibriSpeech test sets in both standard and low-resource settings. We also study the effect of language models trained on different corpora to show IPL can effectively utilize additional text. Finally, we release a new large in-domain text corpus which does not overlap with the LibriSpeech training transcriptions to foster research in low-resource, semi-supervised ASR.

@inproceedings{xu:2020,
  title = {Iterative pseudo-labeling for speech recognition},
  author = {Q. Xu and T. Likhomanenko and J. Kahn and A. Hannun and G. Synnaeve and R. Collobert},
  booktitle = {Interspeech},
  year = {2020}
}
PDF

G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky and R. Collobert. End-to-end ASR: from supervised to semi-supervised learning with modern architectures. In Workshop on Self-supervision in Audio and Speech, International Conference on Machine Learning, ICML, 2020.

We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways of evaluating the characteristics of unlabeled audio which improve acoustic modeling, and show that acoustic models trained with more audio rely less on external language models.

@inproceedings{synnaeve:2020,
  title = {End-to-end {ASR}: from supervised to semi-supervised learning with modern architectures},
  author = {G. Synnaeve and Q. Xu and J. Kahn and T. Likhomanenko and E. Grave and V. Pratap and A. Sriram and V. Liptchinsky and R. Collobert},
  booktitle = {Workshop on Self-supervision in Audio and Speech, International Conference on Machine Learning, {ICML}},
  year = {2020}
}
PDF

2019

S. Schneider, A. Baevski, R. Collobert and M. Auli. wav2vec: unsupervised pre-training for speech recognition. In Interspeech, 2019.

We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.

@inproceedings{schneider:2019,
  title = {wav2vec: unsupervised pre-training for speech recognition},
  author = {S. Schneider and A. Baevski and R. Collobert and M. Auli},
  booktitle = {Interspeech},
  year = {2019}
}
PDF
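
The pre-training objective, reduced to a toy numpy form (my illustration; the real model applies a step-specific transformation to the context vectors and sums the loss over several step sizes k):

import numpy as np

def wav2vec_step_loss(c, z, k=1, n_negatives=10, rng=np.random.default_rng(0)):
    # c: (T, d) context vectors; z: (T, d) latent encodings of the raw audio.
    # Contrastive task: c_t must give its true future z_{t+k} a high score and
    # distractors drawn from the same sequence a low one.
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    T = len(c)
    loss = 0.0
    for t in range(T - k):
        pos = sigmoid(c[t] @ z[t + k])                   # true future
        negs = z[rng.integers(0, T, size=n_negatives)]   # distractors
        loss += -np.log(pos + 1e-9) - np.log(1.0 - sigmoid(negs @ c[t]) + 1e-9).sum()
    return loss / (T - k)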

V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky and R. Collobert. wav2letter++: The Fastest Open-source Speech Recognition System. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2019.

This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2× faster than other optimized frameworks for training end-to-end neural networks for speech recognition. We also show that wav2letter++’s training times scale linearly to 64 GPUs, the highest we tested, for models with 100 million parameters. High-performance frameworks enable fast iteration, which is often a crucial factor in successful research and model tuning on new datasets and tasks.

@inproceedings{pratap:2019,
  title = {wav2letter++: The Fastest Open-source Speech Recognition System},
  author = {V. Pratap and A. Hannun and Q. Xu and J. Cai and J. Kahn and G. Synnaeve and V. Liptchinsky and R. Collobert},
  booktitle = {{IEEE} International Conference on Acoustic, Speech, and Signal Processing, {ICASSP}},
  year = {2019}
}
PDF

A. Hannun, A. Lee, Q. Xu and R. Collobert. Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions. In Interspeech, 2019.

We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient than a strong RNN baseline. Key to our approach is a time-depth separable convolution block which dramatically reduces the number of parameters in the model while keeping the receptive field large. We also give a stable and efficient beam search inference procedure which allows us to effectively integrate a language model. Coupled with a convolutional language model, our time-depth separable convolution architecture improves by more than 22% relative WER over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set.

@inproceedings{hannun:2019,
  title = {Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions},
  author = {A. Hannun and A. Lee and Q. Xu and R. Collobert},
  booktitle = {Interspeech},
  year = {2019}
}
PDF

Y. Adi, N. Zeghidour, R. Collobert, N. Usunier, V. Liptchinsky and G. Synnaeve. To reverse the gradient or not: an empirical comparison of adversarial and multi-task learning in speech recognition. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2019.

Transcribed datasets typically contain speaker identity for each instance in the data. We investigate two ways to incorporate this information during training: Multi-Task Learning and Adversarial Learning. In multi-task learning, the goal is speaker prediction; we expect a performance improvement with this joint training if the two tasks of speech recognition and speaker recognition share a common set of underlying features. In contrast, adversarial learning is a means to learn representations invariant to the speaker. We then expect better performance if this learnt invariance helps generalizing to new speakers. While the two approaches seem natural in the context of speech recognition, they are incompatible because they correspond to opposite gradients back-propagated to the model. In order to better understand the effect of these approaches in terms of error rates, we compare both strategies in controlled settings. Moreover, we explore the use of additional un-transcribed data in a semi-supervised, adversarial learning manner to improve error rates. Our results show that deep models trained on big datasets already develop invariant representations to speakers without any auxiliary loss. When considering adversarial learning and multi-task learning, the impact on the acoustic model seems minor. However, models trained in a semi-supervised manner can improve error-rates.

@inproceedings{adi:2019,
  title = {To reverse the gradient or not: an empirical comparison of adversarial and multi-task learning in speech recognition},
  author = {Y. Adi and N. Zeghidour and R. Collobert and N. Usunier and V. Liptchinsky and G. Synnaeve},
  booktitle = {{IEEE} International Conference on Acoustic, Speech, and Signal Processing, {ICASSP}},
  year = {2019}
}
PDF
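
The two strategies differ only in the sign of the gradient reaching the shared encoder, which a standard gradient reversal layer makes explicit. A common PyTorch formulation (generic, not the paper's code; speaker_head is a hypothetical classifier on top of the shared features):

import torch

class GradReverse(torch.autograd.Function):
    # identity in the forward pass; flips (and scales) the gradient in backward
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# multi-task learning:  speaker_head(features)               (shared gradient)
# adversarial learning: speaker_head(grad_reverse(features)) (reversed gradient)
# the head learns to predict the speaker either way; only the sign of the
# gradient reaching the shared encoder differs.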

T. Likhomanenko, G. Synnaeve and R. Collobert. Who Needs Words? Lexicon-free Speech Recognition. In Interspeech, 2019.

Lexicon-free speech recognition naturally deals with the problem of out-of-vocabulary (OOV) words. In this paper, we show that character-based language models (LM) can perform as well as word-based LMs for speech recognition, in word error rates (WER), even without restricting the decoding to a lexicon. We study character-based LMs and show that convolutional LMs can effectively leverage large (character) contexts, which is key for good speech recognition performance downstream. We specifically show that the lexicon-free decoding performance (WER) on utterances with OOV words using character-based LMs is better than lexicon-based decoding, both with character or word-based LMs.

@inproceedings{likhomanenko:2019,
  title = {Who Needs Words? Lexicon-free Speech Recognition},
  author = {T. Likhomanenko and G. Synnaeve and R. Collobert},
  booktitle = {Interspeech},
  year = {2019}
}
PDF

R. Collobert, A. Hannun and G. Synnaeve. A fully differentiable beam search decoder. In International Conference on Machine Learning, ICML, 2019.

We introduce a new beam search decoder that is fully differentiable, making it possible to optimize at training time through the inference procedure. Our decoder allows us to combine models which operate at different granularities (e.g. acoustic and language models). It can be used when target sequences are not aligned to input sequences by considering all possible alignments between the two. We demonstrate our approach scales by applying it to speech recognition, jointly training acoustic and word-level language models. The system is end-to-end, with gradients flowing through the whole architecture from the word-level transcriptions. Recent research efforts have shown that deep neural networks with attention-based mechanisms can successfully train an acoustic model from the final transcription, while implicitly learning a language model. Instead, we show that it is possible to discriminatively train an acoustic model jointly with an explicit and possibly pretrained language model.

@inproceedings{collobert:2019,
  title = {A fully differentiable beam search decoder},
  author = {R. Collobert and A. Hannun and G. Synnaeve},
  booktitle = {International Conference on Machine Learning, {ICML}},
  year = {2019}
}
PDF

D. Palaz, M. Magimai-Doss and R. Collobert. End-to-End Acoustic Modeling using Convolutional Neural Networks for HMM-based Automatic Speech Recognition. Speech Communication, 2019.

In hidden Markov model (HMM) based automatic speech recognition (ASR) systems, modeling the statistical relationship between the acoustic speech signal and the HMM states that represent linguistically motivated subword units, such as phonemes, is a crucial step. This is typically achieved by first extracting acoustic features from the speech signal based on prior knowledge, such as speech perception and/or speech production knowledge, and then training a classifier, such as an artificial neural network (ANN) or a Gaussian mixture model, that estimates the emission probabilities of the HMM states. This paper investigates an end-to-end acoustic modeling approach using convolutional neural networks (CNNs), where the CNN takes the raw speech signal as input and estimates the HMM state class conditional probabilities at the output. In other words, as opposed to a divide-and-conquer strategy (i.e., separating the feature extraction and statistical modeling steps), in the proposed acoustic modeling approach the relevant features and the classifier are jointly learned from the raw speech signal. Through ASR studies and analyses on multiple languages and multiple tasks, we show that: (a) the proposed approach consistently yields a better system with fewer parameters when compared to the conventional approach of cepstral feature extraction followed by ANN training, (b) unlike conventional methods of speech processing, in the proposed approach the relevant feature representations are learned by first processing the input raw speech at the sub-segmental level (≈ 2 ms); specifically, through an analysis we show that the filters in the first convolution layer automatically learn “in-parts” formant-like information present in the sub-segmental speech, and (c) the intermediate feature representations obtained by subsequent filtering of the first convolution layer output are more discriminative compared to standard cepstral features and could be transferred across languages and domains.

@article{palaz:2019,
  title = {End-to-End Acoustic Modeling using Convolutional Neural Networks for {HMM}-based Automatic Speech Recognition},
  author = {D. Palaz and M. Magimai-Doss and R. Collobert},
  journal = {Speech Communication},
  year = {2019}
}
PDF

2018

N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert and E. Dupoux. End-to-End Speech Recognition From the Raw Waveform. In Interspeech, 2018.

State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone filterbanks, and the second one by the scattering transform. We propose two modifications to these architectures and systematically compare them to mel-filterbanks, on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves on the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second one relates to the low-pass filter used in these approaches. These modifications consistently improve performance for both approaches, and remove the need for a careful initialization in scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks. It is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large vocabulary task under clean recording conditions.

@inproceedings{zeghidour:2018a,
  title = {End-to-End Speech Recognition From the Raw Waveform},
  author = {N. Zeghidour and N. Usunier and G. Synnaeve and R. Collobert and E. Dupoux},
  booktitle = {Interspeech},
  year = {2018}
}
PDF

N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve and R. Collobert. Fully Convolutional Speech Recognition. arXiv, volume abs/1812.06864, 2018.

Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language modeling. This fully convolutional approach is trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether. An external convolutional language model is used to decode words. On Wall Street Journal, our model matches the current state-of-the-art. On Librispeech, we report state-of-the-art performance among end-to-end models, including Deep Speech 2, which was trained with 12 times more acoustic data and significantly more linguistic data.

@article{zeghidour:2018,
  title = {Fully Convolutional Speech Recognition},
  author = {N. Zeghidour and Q. Xu and V. Liptchinsky and N. Usunier and G. Synnaeve and R. Collobert},
  journal = {arXiv},
  volume = {abs/1812.06864},
  year = {2018}
}
PDF

2017

V. Liptchinsky, G. Synnaeve and R. Collobert. Letter-Based Speech Recognition with Gated ConvNets. arXiv, volume abs/1712.09444, 2017.

In the recent literature, "end-to-end" speech systems often refer to letter-based acoustic models trained in a sequence-to-sequence manner, either via a recurrent model or via a structured output learning approach (such as CTC). In contrast to traditional phone (or senone)-based approaches, these "end-to-end" approaches alleviate the need for word pronunciation modeling, and do not require a "forced alignment" step at training time. Phone-based approaches remain however state of the art on classical benchmarks. In this paper, we propose a letter-based speech recognition system, leveraging a ConvNet acoustic model. Key ingredients of the ConvNet are Gated Linear Units and high dropout. The ConvNet is trained to map audio sequences to their corresponding letter transcriptions, either via a classical CTC approach, or via a recent variant called ASG. Coupled with a simple decoder at inference time, our system matches the best existing letter-based systems on WSJ (in word error rate), and shows near state of the art performance on LibriSpeech.

@article{liptchinsky:2017,
  title = {Letter-Based Speech Recognition with Gated ConvNets},
  author = {V. Liptchinsky and G. Synnaeve and R. Collobert},
  journal = {arXiv},
  volume = {abs/1712.09444},
  year = {2017}
}
PDF
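
For reference, the Gated Linear Unit named as a key ingredient is tiny; a numpy sketch (the surrounding convolutions and dropout are omitted):

import numpy as np

def glu(x):
    # Gated Linear Unit: split the channels in two, gate one half by a
    # sigmoid of the other
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

x = np.random.randn(10, 2 * 128)   # e.g. a conv output with doubled channels
y = glu(x)                         # shape (10, 128)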

2016

R. Collobert, C. Puhrsch and G. Synnaeve. Wav2Letter: an End-to-End ConvNet-based Speech Recognition System. arXiv, volume abs/1609.03193, 2016.

This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and graph decoding. It is trained to output letters, with transcribed speech, without the need for forced alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from raw waveform.

@article{collobert:2016a,
  title = {Wav2Letter: an End-to-End ConvNet-based Speech Recognition System},
  author = {R. Collobert and C. Puhrsch and G. Synnaeve},
  journal = {arXiv},
  volume = {abs/1609.03193},
  year = {2016}
}
PDF

D. Palaz, G. Synnaeve and R. Collobert. Jointly Learning to Locate and Classify Words using Convolutional Networks. In Interspeech, 2016.

In this paper, we propose a novel approach for weakly-supervised word recognition. Most state-of-the-art automatic speech recognition systems are based on frame-level labels obtained through forced alignments or through a sequential loss. Recently, weakly-supervised trained models have been proposed in vision, that can learn which part of the input is relevant for classifying a given pattern. Our system is composed of a convolutional neural network and a temporal score aggregation mechanism. For each sentence, it is trained using as supervision only some of the words (the most frequent) that are present in a given sentence, without knowing their order nor quantity. We show that our proposed system is able to jointly classify and localise words. We also evaluate the system on a keyword spotting task, and show that it can yield performance similar to a strong supervised HMM/GMM baseline.

@inproceedings{palaz:2016,
  title = {Jointly Learning to Locate and Classify Words using Convolutional Networks},
  author = {D. Palaz and G. Synnaeve and R. Collobert},
  booktitle = {Interspeech},
  year = {2016}
}
PDF

C. Sun, M. Paluri, R. Collobert, R. Nevatia and L. Bourdev. ProNet: Learning to Propose Object-specific Boxes for Cascaded Neural Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

This paper aims to classify and locate objects accurately and efficiently, without using bounding box annotations. It is challenging as objects in the wild could appear at arbitrary locations and in different scales. In this paper, we propose a novel classification architecture ProNet based on convolutional neural networks. It uses computationally efficient neural networks to propose image regions that are likely to contain objects, and applies more powerful but slower networks on the proposed regions. The basic building block is a multi-scale fully-convolutional network which assigns object confidence scores to boxes at different locations and scales. We show that such networks can be trained effectively using image-level annotations, and can be connected into cascades or trees for efficient object classification. ProNet outperforms previous state-of-the-art significantly on PASCAL VOC 2012 and MS COCO datasets for object classification and point-based localization.

@inproceedings{chen:2016,
  title = {ProNet: Learning to Propose Object-specific Boxes for Cascaded Neural Networks},
  author = {C. Sun and M. Paluri and R. Collobert and R. Nevatia and L. Bourdev},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2016}
}
PDF

R. Collobert, L. Van der Maaten and A. Joulin. Torchnet: An Open-Source Platform for (Deep) Learning Research. In ICML Machine Learning Systems Workshop, 2016.

Torch 7 is a scientific computing platform that supports both CPU and GPU computation, has a lightweight wrapper in a simple scripting language, and provides fast implementations of common algebraic operations. It has become one of the main frameworks for research in (deep) machine learning. Torch does not, however, provide abstractions and boilerplate code for machine-learning experiments. As a result, researchers repeatedly re-implement experimentation logics that are not interoperable. We introduce Torchnet: an open-source framework that provides abstractions and boilerplate logic for machine learning. It encourages modular programming and code re-use, which reduces the chance of bugs, and it makes it straightforward to use asynchronous data loading and efficient multi-GPU computations. Torchnet is written in pure Lua, which makes it easy to install on any architecture with a Torch installation. We envision Torchnet to become a platform to which the community contributes via plugins.

@inproceedings{collobert:2016,
  title = {Torchnet: An Open-Source Platform for (Deep) Learning Research},
  author = {R. Collobert and L. Van der Maaten and A. Joulin},
  booktitle = {ICML Machine Learning Systems Workshop},
  year = {2016}
}
PDF

J. Legrand and R. Collobert. Phrase Representations for Multiword Expressions. In Workshop on Multiword Expressions (MWE), 2016.

Recent works in Natural Language Processing (NLP) using neural networks have focused on learning dense word representations to perform classification tasks. When dealing with phrase prediction problems, it is common practice to use special tagging schemes to identify segment boundaries. This allows these tasks to be expressed as common word tagging problems. In this paper, we propose to learn fixed-size representations for arbitrarily sized chunks. We introduce a model that takes advantage of such representations to perform phrase tagging by directly identifying and classifying phrases. We evaluate our approach on the task of multiword expression (MWE) tagging and show that our model outperforms the state-of-the-art model for this task.

@inproceedings{legrand:2016c,
  title = {Phrase Representations for Multiword Expressions},
  author = {J. Legrand and R. Collobert},
  booktitle = {Workshop on Multiword Expressions (MWE)},
  year = {2016}
}
PDF

J. Legrand and R. Collobert. Deep Neural Networks for Syntactic Parsing of Morphologically Rich Languages. In Association for Computational Linguistics (ACL), 2016.

Morphologically rich languages (MRL) are languages in which much of the structural information is contained at the word level, leading to a high level of word-form variation. Historically, syntactic parsing has been mainly tackled using generative models. These models assume input features to be conditionally independent, making it difficult to incorporate arbitrary features. In this paper, we investigate the greedy discriminative parser described in (Legrand and Collobert, 2015), which relies on word embeddings, in the context of MRL. We propose to learn morphological embeddings and propagate morphological information through the tree using a recursive composition procedure. Experiments show that such embeddings can dramatically improve the average performance on different languages. Moreover, they yield state-of-the-art performance for a majority of languages.

@inproceedings{legrand:2016b,
  title = {Deep Neural Networks for Syntactic Parsing of Morphologically Rich Languages},
  author = {J. Legrand and R. Collobert},
  booktitle = {Association for Computational Linguistics (ACL)},
  year = {2016}
}
PDF

J. Legrand, M. Auli and R. Collobert. Neural Network-based Word Alignment through Score Aggregation. In Workshop on Machine Translation (WMT), 2016.

We present a simple neural network for word alignment that builds source and target word window representations to compute alignment scores for sentence pairs. To enable unsupervised training, we use an aggregation operation that summarizes the alignment scores for a given target word. A soft-margin objective increases scores for true target words while decreasing scores for target words that are not present. Compared to the popular Fast Align model, our approach improves alignment accuracy by 7 AER on English-Czech, by 6 AER on Romanian-English and by 1.7 AER on English-French alignment.

@inproceedings{legrand:2016a,
  title = {Neural Network-based Word Alignment through Score Aggregation},
  author = {J. Legrand and M. Auli and R. Collobert},
  booktitle = {Workshop on Machine Translation {(WMT)}},
  year = {2016}
}
PDF

P. H. O. Pinheiro, T. Y. Lin, R. Collobert and P. Dollar. Learning to Refine Object Segments. In European Conference on Computer Vision (ECCV), 2016.

In this work we propose to augment feedforward nets for object segmentation with a novel top-down refinement approach. The resulting bottom-up/top-down architecture is capable of efficiently generating high-fidelity object masks. Similarly to skip connections, our approach leverages features at all layers of the net. Unlike them, our approach does not attempt to output independent predictions at each layer. Instead, we first output a coarse ‘mask encoding’ in a feedforward pass, then refine this mask encoding in a top-down pass utilizing features at successively lower layers.

@inproceedings{pinheiro:2016,
  title = {Learning to Refine Object Segments},
  author = {P. H. O. Pinheiro and T. Y. Lin and R. Collobert and P. Dollar},
  booktitle = {European Conference on Computer Vision {(ECCV)}},
  year = {2016}
}
PDF

2015

P. H. O. Pinheiro, R. Collobert and P. Dollar. Learning to Segment Object Candidates. In Advances in Neural Information Processing Systems (NIPS), 2015.

In this paper, we propose a new way to generate object proposals, introducing an approach based on a discriminative convolutional network. Our model is trained jointly with two objectives: given an image patch, the first part of the system outputs a class-agnostic segmentation mask, while the second part of the system outputs the likelihood of the patch being centered on a full object. At test time, the model is efficiently applied on the whole test image and generates a set of segmentation masks, each of them being assigned with a corresponding object likelihood score.

@inproceedings{pinheiro:2015b,
  title = {Learning to Segment Object Candidates},
  author = {P. H. O. Pinheiro and R. Collobert and P. Dollar},
  booktitle = {Advances in Neural Information Processing Systems {(NIPS)}},
  year = {2015}
}
PDF

P. H. O. Pinheiro and R. Collobert. From Image-level to Pixel-level Labeling with Convolutional Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

We are interested in inferring object segmentation by leveraging only object class information, and by considering only minimal priors on the object segmentation task. This problem could be viewed as a kind of weakly supervised segmentation task, and naturally fits the Multiple Instance Learning (MIL) framework: every training image is known to have (or not) at least one pixel corresponding to the image class label, and the segmentation task can be rewritten as inferring the pixels belonging to the class of the object (given one image, and its object class). We propose a Convolutional Neural Network-based model, which is constrained during training to put more weight on pixels which are important for classifying the image. We show that at test time, the model has learned to discriminate the right pixels well enough, such that it performs very well on an existing segmentation benchmark, by adding only a few smoothing priors. Our system is trained using a subset of the Imagenet dataset and the segmentation experiments are performed on the challenging Pascal VOC dataset (with no fine-tuning of the model on Pascal VOC). Our model beats the state-of-the-art results on the weakly supervised object segmentation task by a large margin. We also compare the performance of our model with state-of-the-art fully-supervised segmentation approaches.

@inproceedings{pinheiro:2015a,
  title = {From Image-level to Pixel-level Labeling with Convolutional Networks},
  author = {P. H. O. Pinheiro and R. Collobert},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2015}
}
PDF

R. Lebret and R. Collobert. N-gram-Based Low-Dimensional Representation for Document Classification. In International Conference on Learning Representations (ICLR), 2015.

The bag-of-words (BOW) model is the common approach for classifying documents, where words are used as features for training a classifier. This generally involves a huge number of features. Some techniques, such as Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA), have been designed to summarize documents in a lower dimension with the least semantic information loss. Some semantic information is nevertheless always lost, since only words are considered. Instead, we aim at using information coming from n-grams to overcome this limitation, while remaining in a low-dimensional space. Many approaches, such as the Skip-gram model, provide good word vector representations very quickly. We propose to average these representations to obtain representations of n-grams. All n-grams are thus embedded in the same semantic space. A K-means clustering can then group them into semantic concepts. The number of features is therefore dramatically reduced and documents can be represented as bags of semantic concepts. We show that this model outperforms LSA and LDA on a sentiment classification task, and yields results similar to a traditional BOW model with far fewer features.

@inproceedings{lebret:2015c,
  title = {N-gram-Based Low-Dimensional Representation for Document Classification},
  author = {R. Lebret and R. Collobert},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2015}
}
PDF
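
The pipeline is short enough to sketch end to end; a toy Python version (random vectors stand in for real skip-gram embeddings, and the vocabulary and n-grams are made up):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate("not good bad very movie great".split())}
word_vecs = rng.normal(size=(len(vocab), 50))    # stand-in for skip-gram vectors

def ngram_vec(ngram):
    # an n-gram representation is simply the mean of its word vectors
    return np.mean([word_vecs[vocab[w]] for w in ngram.split()], axis=0)

ngrams = ["not good", "very bad", "great movie", "not bad"]
X = np.stack([ngram_vec(g) for g in ngrams])
concepts = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # semantic concepts

def bag_of_concepts(doc_ngrams, k=2):
    # a document becomes a histogram over concepts instead of a bag of words
    bag = np.zeros(k)
    for g in doc_ngrams:
        bag[concepts[ngrams.index(g)]] += 1
    return bag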

R. Lebret and R. Collobert. Phrase-Based Image Captioning. In International Conference on Machine Learning (ICML), 2015.

Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representation (generated from a previously trained Convolutional Neural Network) and phrases that are used to describe them. The system is then able to infer phrases from a given image sample. Based on caption syntax statistics, we propose a simple language model that can produce relevant descriptions for a given test image using the phrases inferred. Our approach, which is considerably simpler than state-of-the-art models, achieves comparable results in two popular datasets for the task: Flickr30k and the recently proposed Microsoft COCO.

@inproceedings{lebret:2015b,
  title = {Phrase-Based Image Captioning},
  author = {R. Lebret and R. Collobert},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2015}
}
PDF

D. Palaz, M. Magimai-Doss and R. Collobert. Analysis of CNN-based Speech Recognition System using Raw Speech as Input. In 16th Annual Conference of the International Speech Communication Association (Interspeech), 2015.

Automatic speech recognition systems typically model the relationship between the acoustic speech signal and the phones in two separate steps: feature extraction and classifier training. In our recent works, we have shown that, in the framework of convolutional neural networks (CNN), the relationship between the raw speech signal and the phones can be directly modeled, and ASR systems competitive with the standard approach can be built. In this paper, we first analyze and show that, between the first two convolutional layers, the CNN learns (in parts) and models the phone-specific spectral envelope information of 24 ms of speech. Given that, we show that the CNN-based approach yields ASR trends similar to a standard short-term spectral-based ASR system under mismatched (noisy) conditions, with the CNN-based approach being more robust.

@inproceedings{palaz:2015b,
  title = {Analysis of CNN-based Speech Recognition System using Raw Speech as Input},
  author = {D. Palaz and M. Magimai-Doss and R. Collobert},
  booktitle = {16th Annual Conference of the International Speech Communication Association (Interspeech)},
  year = {2015}
}
PDF

J. Legrand and R. Collobert. Joint RNN-Based Greedy Parsing and Word Composition. In International Conference on Learning Representations (ICLR), 2015.

This paper introduces a greedy parser based on neural networks, which leverages a new compositional sub-tree representation. The greedy parser and the compositional procedure are jointly trained, and tightly depend on each other. The composition procedure outputs a vector representation which summarizes sub-trees syntactically (parsing tags) and semantically (words). Composition and tagging are achieved over continuous (word or tag) representations, using recurrent neural networks. We reach F1 performance on par with well-known existing parsers, while having the advantage of speed, thanks to the greedy nature of the parser. We provide a fully functional implementation of the method described in this paper.

@inproceedings{legrand:2015,
  title = {Joint RNN-Based Greedy Parsing and Word Composition},
  author = {J. Legrand and R. Collobert},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2015}
}
PDF DeepParse

D. Palaz, M. Magimai-Doss and R. Collobert. Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal. In 40th International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.

State-of-the-art automatic speech recognition systems model the relationship between the acoustic speech signal and phone classes in two stages, namely, extraction of spectral-based features based on prior knowledge, followed by training of an acoustic model, typically an artificial neural network (ANN). In our recent work, it was shown that Convolutional Neural Networks (CNNs) can model phone classes from the raw acoustic speech signal, reaching performance on par with other existing feature-based approaches. This paper extends the CNN-based approach to a large vocabulary speech recognition task. More precisely, we compare the CNN-based approach against the conventional ANN-based approach on the Wall Street Journal corpus. Our studies show that the CNN-based approach achieves better performance than the conventional ANN-based approach with the same number of parameters. We also show that the features learned from raw speech by the CNN-based approach can generalize across different databases.

@inproceedings{palaz:2015a,
  title = {Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal},
  author = {D. Palaz and M. Magimai-Doss and R. Collobert},
  booktitle = {40th International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year = {2015}
}
PDF

R. Lebret and R. Collobert. Rehabilitation of Count-based Models for Word Vector Representations. In Conference on Intelligent Text Processing and Computational Linguistics (CICLing), 2015.

Recent works on word representations mostly rely on predictive models. Distributed word representations (aka word embeddings) are trained to optimally predict the contexts in which the corresponding words tend to appear. Such models have succeeded in capturing word similarities as well as semantic and syntactic regularities. Instead, we aim at reviving interest in a model based on counts. We present a systematic study of the use of the Hellinger distance to extract semantic representations from the word co-occurrence statistics of large text corpora. We show that this distance gives good performance on word similarity and analogy tasks, with a proper type and size of context, and a dimensionality reduction based on a stochastic low-rank approximation. Besides being both simple and intuitive, this method also provides an encoding function which can be used to infer unseen words or phrases. This becomes a clear advantage compared to predictive models, which must be trained on these new words.

@inproceedings{lebret:2015a,
  title = {Rehabilitation of Count-based Models for Word Vector Representations},
  author = {R. Lebret and R. Collobert},
  booktitle = {Conference on Intelligent Text Processing and Computational Linguistics (CICLing)},
  year = {2015}
}
PDF

2014

D. Palaz, M. Magimai-Doss and R. Collobert. Joint phoneme segmentation inference and classification using CRFs. In 2nd Global Conference on Signal and Information Processing (GlobalSIP), 2014.

State-of-the-art phoneme sequence recognition systems are based on the hybrid hidden Markov model/artificial neural network (HMM/ANN) framework. In this framework, the local classifier, the ANN, is typically trained using the Viterbi expectation-maximization algorithm, which involves two separate steps: phoneme sequence segmentation and training of the ANN. In this paper, we propose a CRF-based phoneme sequence recognition approach that simultaneously infers the phoneme segmentation and classifies the phoneme sequence. More specifically, the phoneme sequence recognition system consists of a local classifier ANN followed by a conditional random field (CRF) whose parameters are trained jointly, using a cost function that discriminates the true phoneme sequence against all competing sequences. In order to efficiently train such a system, we introduce a novel CRF-based segmentation using an acyclic graph. We study the viability of the proposed approach on the TIMIT phoneme recognition task. Our studies show that the proposed approach is capable of achieving performance similar to standard hybrid HMM/ANN and ANN/CRF systems where the ANN is trained with manual segmentation.

@inproceedings{palaz:2014,
  title = {Joint phoneme segmentation inference and classification using CRFs},
  author = {D. Palaz and M. Magimai-Doss and R. Collobert},
  booktitle = {2nd Global Conference on Signal and Information Processing (GlobalSIP)},
  year = {2014}
}
PDF

P. H. O. Pinheiro and R. Collobert. Recurrent Convolutional Neural Networks for Scene Labeling. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

The goal of the scene labeling task is to assign a class label to each pixel in an image. To ensure a good visual coherence and a high class accuracy, it is essential for a model to capture long range (pixel) label dependencies in images. In a feed-forward architecture, this can be achieved simply by considering a sufficiently large input context patch, around each pixel to be labeled. We propose an approach that consists of a recurrent convolutional neural network which allows us to consider a large input context while limiting the capacity of the model. Contrary to most standard approaches, our method does not rely on any segmentation technique nor any task-specific features. The system is trained in an end-to-end manner over raw pixels, and models complex spatial dependencies with low inference cost. As the context size increases with the built-in recurrence, the system identifies and corrects its own errors. Our approach yields state-of-the-art performance on both the Stanford Background Dataset and the SIFT Flow Dataset, while remaining very fast at test time.

@inproceedings{pinheiro:2014,
  title = {Recurrent Convolutional Neural Networks for Scene Labeling},
  author = {P. H. O. Pinheiro and R. Collobert},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning (ICML)},
  year = {2014}
}
PDF
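
The recurrence here amounts to applying one shared network several times, feeding it the input image together with its own previous label predictions. A toy numpy sketch of that loop, with a 1x1 "convolution" standing in for the real convolutional network (all sizes hypothetical):

import numpy as np

def cnn(x, W):
    # stand-in for the shared convolutional net: a 1x1 "convolution" over channels
    h = np.einsum('oc,chw->ohw', W, x)
    e = np.exp(h - h.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)      # per-pixel class probabilities

n_classes, H, Wd = 8, 16, 16
rng = np.random.default_rng(0)
image = rng.normal(size=(3, H, Wd))
W = rng.normal(scale=0.1, size=(n_classes, 3 + n_classes))

labels = np.full((n_classes, H, Wd), 1.0 / n_classes)   # uninformative start
for _ in range(3):                                      # built-in recurrence
    x = np.concatenate([image, labels], axis=0)         # image + own predictions
    labels = cnn(x, W)                                  # refine the labeling
print(labels.argmax(axis=0).shape)                      # (16, 16) per-pixel class map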

J. Legrand and R. Collobert. Recurrent Greedy Parsing with Neural Networks. In Proceedings of the European Conference on Machine Learning, Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2014.

In this paper, we propose a bottom-up greedy and purely discriminative syntactic parsing approach that relies only on a few simple features. The core of the architecture is a simple neural network architecture, trained with an objective function similar to a Conditional Random Field. This parser leverages continuous word vector representations to model the conditional distributions of context-aware syntactic rules. The learned distribution rules are naturally smoothed, thanks to the continuous nature of the input features and the model. Generalization accuracy compares very well with the existing generative or discriminative (non-reranking) parsers (despite the greedy nature of our approach), and prediction speed is very fast.

@inproceedings{legrand:2014,
  title = {Recurrent Greedy Parsing with Neural Networks},
  author = {J. Legrand and R. Collobert},
  booktitle = {Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD)},
  year = {2014}
}
PDF

R. Lebret and R. Collobert. Word Embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 482-490, Association for Computational Linguistics, 2014.

Word embeddings resulting from neural language models have been shown to be a great asset for a large variety of NLP tasks. However, such architectures might be difficult and time-consuming to train. Instead, we propose to drastically simplify the word embeddings computation through a Hellinger PCA of the word co-occurrence matrix. We compare those new word embeddings with some well-known embeddings on named entity recognition and movie review tasks and show that we can reach similar or even better performance. Although deep learning is not really necessary for generating good word embeddings, we show that it can provide an easy way to adapt embeddings to specific tasks.

@inproceedings{lebret:2014,
  title = {Word Embeddings through Hellinger PCA},
  author = {R. Lebret and R. Collobert},
  booktitle = {Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  publisher = {Association for Computational Linguistics},
  pages = {482--490},
  year = {2014}
}
PDF Word Embeddings
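
The construction is short enough to sketch end to end: normalize co-occurrence counts into distributions, take element-wise square roots (so Euclidean distance between rows is proportional to the Hellinger distance), then reduce with PCA. A minimal numpy version on toy counts; sizes are hypothetical, and at scale a randomized/truncated solver would replace the exact SVD:

import numpy as np

rng = np.random.default_rng(0)
# toy co-occurrence counts: counts[w, c] = how often context word c occurs near word w
counts = rng.integers(0, 50, size=(1000, 2000)).astype(float)

P = counts / counts.sum(axis=1, keepdims=True)   # rows are distributions P(c | w)
H = np.sqrt(P)       # Euclidean distance between rows of H ~ Hellinger distance
mu = H.mean(axis=0, keepdims=True)

# PCA via SVD
U, S, Vt = np.linalg.svd(H - mu, full_matrices=False)
dim = 128
embeddings = U[:, :dim] * S[:dim]                # one 128-d vector per word

# the projection Vt also acts as an encoding function for unseen words/phrases
new_counts = rng.integers(0, 50, size=2000).astype(float)
new_vec = (np.sqrt(new_counts / new_counts.sum()) - mu[0]) @ Vt[:dim].T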

2013

D. Palaz, R. Collobert and M. Magimai-Doss. Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks. In Interspeech, 2013.

In hybrid hidden Markov model/artificial neural network (HMM/ANN) automatic speech recognition (ASR) systems, the phoneme class conditional probabilities are estimated by first extracting acoustic features from the speech signal based on prior knowledge such as speech perception and/or speech production knowledge, and then modeling the acoustic features with an ANN. Recent advances in machine learning techniques, more specifically in the fields of image processing and text processing, have shown that such a divide-and-conquer strategy (i.e., separating the feature extraction and modeling steps) may not be necessary. Motivated by these studies, in the framework of convolutional neural networks (CNNs), this paper investigates a novel approach, where the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates. On the TIMIT phoneme recognition task, we study different ANN architectures to show the benefit of CNNs and compare the proposed approach against the conventional approach, where spectral-based MFCC features are extracted and modeled by a multilayer perceptron. Our studies show that the proposed approach can yield comparable or better phoneme recognition performance when compared to the conventional approach. This indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.

@inproceedings{palaz:2013,
  title = {Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks},
  author = {D. Palaz and R. Collobert and M. Magimai-Doss},
  booktitle = {Interspeech},
  year = {2013}
}
PDF

A. Bordes, L. Bottou, R. Collobert, D. Roth, J. Weston and L. Zettlemoyer. Introduction to the Special Issue on Learning Semantics. Machine Learning, 94:127-131, 2013.

@article{bordes:2013,
  title = {Introduction to the Special Issue on Learning Semantics},
  author = {A. Bordes and L. Bottou and R. Collobert and D. Roth and J. Weston and L. Zettlemoyer},
  journal = {Machine Learning},
  volume = {94},
  publisher = {Springer},
  pages = {127--131},
  year = {2013}
}
PDF Springer

M. Yazdani, R. Collobert and A. Popescu-Belis. Learning to Rank on Network Data. In Eleventh Workshop on Mining and Learning with Graphs, ACM, 2013.

This paper proposes a method for learning to rank over network data. The ranking is performed with respect to a query object which can be part of the network or outside it. The ranking method makes use of the features of the nodes as well as the existing links between them. First, a neighbors-aware ranker is trained using a large margin pairwise loss function. The neighbors-aware ranker uses target neighbors' scores in addition to the objects' content, and therefore the scoring is consistent in every neighborhood. Then, collective inference is performed using an iterative ranking algorithm, which propagates the results of rankers over the network. By formulating link prediction as a ranking problem, the method is tested on several networks, with papers/citations and webpages/hyperlinks. The results show that the proposed algorithm, which uses both the attributes of the nodes and the structure of the links, outperforms several other methods: a content-only ranker, a link-only one, a random walk method, a relational topic model, and a method based on the weighted number of common neighbors. In addition, the propagation algorithm improves results even when the query object is not part of the network, and scales efficiently to large networks.

@inproceedings{collobert:2013,
  title = {Learning to Rank on Network Data},
  author = {M. Yazdani and R. Collobert and A. Popescu-Belis},
  booktitle = {Eleventh Workshop on Mining and Learning with Graphs},
  publisher = {ACM},
  year = {2013}
}
PDF

2012

R. Collobert, K. Kavukcuoglu and C. Farabet. Implementing Neural Networks Efficiently. In Neural Networks: Tricks of the Trade, G. Montavon, G. Orr and K-R. Muller (Ed), Springer, 2012.

Neural networks and machine learning algorithms in general require a flexible environment where new algorithm prototypes and experiments can be set up as quickly as possible with best possible computational performance. To that end, we provide a new framework called Torch7, that is especially suited to achieve both of these competing goals. Torch7 is a versatile numeric computing framework and machine learning library that extends a very lightweight and powerful programming language Lua. Its goal is to provide a flexible environment to design, train and deploy learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can also easily be interfaced to third-party software thanks to Lua’s light C interface.

@incollection{collobert:2012,
  title = {Implementing Neural Networks Efficiently},
  author = {R. Collobert and K. Kavukcuoglu and C. Farabet},
  booktitle = {Neural Networks: Tricks of the Trade},
  publisher = {Springer},
  editor = {G. Montavon and G. Orr and K-R. Muller},
  year = {2012}
}
PDF Torch 7 Torch 7 Git

J. Weston, F. Ratle, H. Mobahi and R. Collobert. Deep Learning via Semi-Supervised Embedding. In Neural Networks: Tricks of the Trade, G. Montavon, G. Orr and K-R. Muller (Ed), Springer, 2012.

We show how nonlinear embedding algorithms popular for use with "shallow" semi-supervised learning techniques such as kernel methods can be easily applied to deep multi-layer architectures, either as a regularizer at the output layer, or on each layer of the architecture. This trick provides a simple alternative to existing approaches to deep learning whilst yielding competitive error rates compared to those methods, and existing shallow semi-supervised techniques.

@incollection{weston:2012,
  title = {Deep Learning via Semi-Supervised Embedding},
  author = {J. Weston and F. Ratle and H. Mobahi and R. Collobert},
  booktitle = {Neural Networks: Tricks of the Trade},
  publisher = {Springer},
  editor = {G. Montavon and G. Orr and K-R. Muller},
  year = {2012}
}
PDF
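
Concretely, the embedding "trick" is an auxiliary loss on some layer's activations: pull together the representations of examples assumed similar (e.g. neighbors in a k-NN graph), and push dissimilar pairs at least a margin apart. A hedged numpy sketch of that regularizer alone (names and the margin value are illustrative):

import numpy as np

def embedding_regularizer(f_i, f_j, similar, margin=1.0):
    # f_i, f_j: activations of some layer for a pair of examples
    d = np.linalg.norm(f_i - f_j)
    if similar:                          # neighbors: pull the pair together
        return d ** 2
    return max(0.0, margin - d) ** 2     # non-neighbors: push at least `margin` apart

f1, f2 = np.ones(4), np.zeros(4)
print(embedding_regularizer(f1, f2, similar=True))    # 4.0
print(embedding_regularizer(f1, f2, similar=False))   # 0.0 (distance 2 > margin 1)
# total loss = supervised loss + lambda * sum of this term over pairs, applied
# either at the output layer or at each layer of the architecture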

2011

R. Collobert, K. Kavukcuoglu and C. Farabet. Torch7: A Matlab-like Environment for Machine Learning. In BigLearn, NIPS Workshop, 2011.

Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be interfaced to third-party software thanks to Lua’s light interface.

@inproceedings{collobert:2011c,
  title = {Torch7: A Matlab-like Environment for Machine Learning},
  author = {R. Collobert and K. Kavukcuoglu and C. Farabet},
  booktitle = {BigLearn, NIPS Workshop},
  year = {2011}
}
PDF Torch 7 Torch 7 Git

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.

We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

@article{collobert:2011b,
  title = {Natural Language Processing (Almost) from Scratch},
  author = {R. Collobert and J. Weston and L. Bottou and M. Karlen and K. Kavukcuoglu and P. Kuksa},
  journal = {Journal of Machine Learning Research},
  volume = {12},
  pages = {2493--2537},
  year = {2011}
}
PDF SENNA
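
One of the paper's architectures, the window approach, can be sketched compactly: look up an embedding for each word in a fixed window, concatenate, and pass the result through a small network to score each tag for the center word. A toy numpy forward pass, with all dimensions and weights hypothetical (this is not SENNA's code):

import numpy as np

vocab, d_word, win, hidden, n_tags = 10000, 50, 5, 300, 45
rng = np.random.default_rng(0)
E  = rng.normal(scale=0.01, size=(vocab, d_word))      # word embeddings (learned)
W1 = rng.normal(scale=0.01, size=(hidden, win * d_word))
W2 = rng.normal(scale=0.01, size=(n_tags, hidden))

def tag_scores(window_word_ids):
    x = E[window_word_ids].reshape(-1)         # concatenate the window's embeddings
    h = np.maximum(-1, np.minimum(1, W1 @ x))  # HardTanh non-linearity
    return W2 @ h                              # one score per tag for the center word

print(tag_scores([3, 17, 256, 8, 42]).shape)   # (45,)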

R. Collobert. Deep Learning for Efficient Discriminative Parsing. In AISTATS, 2011.

We propose a new fast purely discriminative algorithm for natural language parsing, based on a “deep” recurrent convolutional graph transformer network (GTN). Assuming a decomposition of a parse tree into a stack of “levels”, the network predicts a level of the tree taking into account predictions of previous levels. Using only a few basic text features which leverage word representations from Collobert and Weston (2008), we show similar performance (in F1 score) to existing pure discriminative parsers and existing “benchmark” parsers (like the Collins parser, based on probabilistic context-free grammars), with a huge speed advantage.

I apologize for the incorrect F1 score I first reported for the Carreras et al. parser (90.5% instead of 91.1%). I confused WSJ sections 23 and 24 performance. Thanks to Michael Collins who reported the bug.

@inproceedings{collobert:2011,
  title = {Deep Learning for Efficient Discriminative Parsing},
  author = {R. Collobert},
  booktitle = {AISTATS},
  year = {2011}
}
PDF SENNA

A. Bordes, J. Weston, R. Collobert and Y. Bengio. Learning Structured Embeddings of Knowledge Bases. In AAAI, 2011.

Many Knowledge Bases (KBs) are now readily available and encompass colossal quantities of information thanks to either a long-term funding effort (e.g. WordNet, OpenCyc) or a collaborative process (e.g. Freebase, DBpedia). However, each of them is based on a different rigorous symbolic framework which makes it hard to use their data in other systems. It is unfortunate because such rich structured knowledge might lead to a huge leap forward in many other areas of AI like natural language processing (word-sense disambiguation, natural language understanding, ...), vision (scene classification, image semantic annotation, ...) or collaborative filtering. In this paper, we present a learning process based on an innovative neural network architecture designed to embed any of these symbolic representations into a more flexible continuous vector space in which the original knowledge is kept and enhanced. These learnt embeddings would allow data from any KB to be easily used in recent machine learning methods for prediction and information retrieval. We illustrate our method on WordNet and Freebase and also present a way to adapt it to knowledge extraction from raw text.

@inproceedings{bordes:2011,
  title = {Learning Structured Embeddings of Knowledge Bases},
  author = {A. Bordes and J. Weston and R. Collobert and Y. Bengio},
  booktitle = {AAAI},
  year = {2011}
}
PDF
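
In this structured embeddings model, every entity gets a vector and every relation a pair of matrices; a triple (lhs, relation, rhs) scores high when the two projected entities land close in embedding space. A minimal numpy sketch of the scoring function (training with a margin ranking loss against corrupted triples is omitted; names and sizes are illustrative):

import numpy as np

n_entities, n_relations, d = 1000, 20, 50
rng = np.random.default_rng(0)
E     = rng.normal(size=(n_entities, d))        # entity embeddings
R_lhs = rng.normal(size=(n_relations, d, d))    # one pair of matrices
R_rhs = rng.normal(size=(n_relations, d, d))    # per relation

def score(h, r, t):
    # higher is better: negative L1 distance between the projected entities
    return -np.abs(R_lhs[r] @ E[h] - R_rhs[r] @ E[t]).sum()

# training would enforce score(true triple) > score(corrupted triple) + margin
print(score(3, 5, 7))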

2010

P. Kuksa, Y. Qi, B. Bai, R. Collobert, J.Weston, V. Pavlovic and X. Ning. Semi-Supervised Abstraction-Augmented String Kernel for Multi-Level Bio-Relation Extraction. In ECML PKDD, 2010.

Bio-relation extraction (bRE), an important goal in bio-text mining, involves subtasks identifying relationships between bio-entities in text at multiple levels, e.g., at the article, sentence or relation level. A key limitation of current bRE systems is that they are restricted by the availability of annotated corpora. In this work we introduce a semi-supervised approach that can tackle multi-level bRE via string comparisons with mismatches in the string kernel framework. Our string kernel implements an abstraction step, which groups similar words to generate more abstract entities, which can be learnt with unlabeled data. Specifically, two unsupervised models are proposed to capture contextual (local or global) semantic similarities between words from a large unannotated corpus. This Abstraction-augmented String Kernel (ASK) allows for better generalization of patterns learned from annotated data and provides a unified framework for solving bRE with multiple degrees of detail. ASK shows effective improvements over classic string kernels on four datasets and achieves state-of-the-art bRE performance without the need for complex linguistic features.

@inproceedings{qi:2010,
  title = {Semi-Supervised Abstraction-Augmented String Kernel for Multi-Level Bio-Relation Extraction},
  author = {P. Kuksa and Y. Qi and B. Bai and R. Collobert and J.Weston and V. Pavlovic and X. Ning},
  booktitle = {ECML PKDD},
  year = {2010}
}
PDF

A. Bordes, N. Usunier, R. Collobert and J. Weston. Towards Understanding Situated Natural Language. In AISTATS, 2010.

We present a general framework and learning algorithm for the task of concept labeling: each word in a given sentence has to be tagged with the unique physical entity (e.g. person, object or location) or abstract concept it refers to. Our method allows both world knowledge and linguistic information to be used during learning and prediction. We show experimentally that we can learn to use world knowledge to resolve ambiguities in language, such as word senses or reference resolution, without the use of handcrafted rules or features.

@inproceedings{bordes:2010,
  title = {Towards Understanding Situated Natural Language},
  author = {A. Bordes and N. Usunier and R. Collobert and J. Weston},
  booktitle = {AISTATS},
  year = {2010}
}
PDF

B. Bai, J. Weston, D. Grangier, R. Collobert, C. Cortes and M. Mohri. Half Transductive Ranking. In Artificial Intelligence and Statistics (AISTATS), 2010.

We study the standard retrieval task of ranking a fixed set of items given a previously unseen query and pose it as the half-transductive ranking problem. Transductive representations (where the vector representation of each example is learned) allow the generation of highly nonlinear embeddings that capture the characteristics of object relationships without relying on a specific choice of features, and require only relatively simple optimization. Unfortunately, they have no direct out-of-sample extension. Inductive approaches on the other hand allow for the representation of unknown queries. We describe algorithms for this setting which have the advantages of both transductive and inductive approaches, and can be applied in unsupervised (either reconstruction-based or graph-based) and supervised ranking setups. We show empirically that our methods give strong performance on all three tasks.

@inproceedings{bai:2010,
  title = {Half Transductive Ranking},
  author = {B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri},
  booktitle = {Artificial Intelligence and Statistics (AISTATS)},
  year = {2010}
}
PDF
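
A half-transductive ranker can be written down compactly: each document in the fixed set owns a free embedding vector (the transductive part), while queries are mapped through a learned function (the inductive part), so unseen queries can still be scored. A schematic numpy sketch under those assumptions (all names hypothetical):

import numpy as np

n_docs, d_feat, d_emb = 500, 1000, 64
rng = np.random.default_rng(0)
V  = rng.normal(size=(n_docs, d_emb))               # free per-document vectors (learned)
Wq = rng.normal(scale=0.01, size=(d_emb, d_feat))   # inductive query map phi

def rank(query_features):
    q = Wq @ query_features        # phi(q): also works for previously unseen queries
    return np.argsort(-(V @ q))    # documents sorted by score v_i . phi(q)

print(rank(rng.normal(size=d_feat))[:5])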

2009

A. Bordes, N. Usunier, J. Weston and R. Collobert. Learning to Disambiguate Natural Language Using World Knowledge. In NIPS workshop on Grammar Induction, Representation of Language and Language Learning, 2009.

We present a general framework and learning algorithm for the task of concept labeling: each word in a given sentence has to be tagged with the unique physical entity (e.g. person, object or location) or abstract concept it refers to. Our method allows both world knowledge and linguistic information to be used during learning and prediction. We show experimentally that we can handle natural language and learn to use world knowledge to resolve ambiguities in language, such as word senses or coreference, without the use of hand-crafted rules or features.

@inproceedings{bordes:2009,
  title = {Learning to Disambiguate Natural Language Using World Knowledge},
  author = {A. Bordes and N. Usunier and J. Weston and R. Collobert},
  booktitle = {NIPS workshop on Grammar Induction, Representation of Language and Language Learning},
  year = {2009}
}
PDF

B. Bai, J. Weston, D. Grangier, R. Collobert, C. Cortes and M. Mohri. Ranking with Half Transductive Models. In NIPS Workshop on Advances in Ranking, 2009.

We study the standard retrieval task of ranking a fixed set of documents given a previously unseen query and pose it as the half-transductive ranking problem. The task is partly transductive as the document set is fixed. Existing transductive approaches are natural non-linear methods for this set, but have no direct out-of-sample extension. Functional approaches, on the other hand, can be applied to the unseen queries, but fail to exploit the availability of the document set in its full extent. This work introduces a half-transductive approach to benefit from the advantages of both transductive and functional approaches and show its empirical advantage in supervised ranking setups.

@inproceedings{bai:2009d,
  title = {Ranking with Half Transductive Models},
  author = {B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri},
  booktitle = {NIPS Workshop on Advances in Ranking},
  year = {2009}
}
PDF

B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, C. Cortes and M. Mohri. Polynomial Semantic Indexing. In Advances in Neural Information Processing Systems (NIPS), 2009.

We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods.

@inproceedings{bai:2009c,
  title = {Polynomial Semantic Indexing},
  author = {B. Bai and J. Weston and D. Grangier and R. Collobert and K. Sadamasa and Y. Qi and C. Cortes and M. Mohri},
  booktitle = {Advances in Neural Information Processing Systems {(NIPS)}},
  year = {2009}
}
PDF

R. Collobert and J. Weston. Deep Learning in Natural Language Processing. Tutorial at NIPS, 2009.

This tutorial will describe recent advances in deep learning techniques for Natural Language Processing (NLP). Traditional NLP approaches favour shallow systems, possibly cascaded, with adequate hand-crafted features. In contrast, we are interested in end-to-end architectures: these systems include several feature layers, with increasing abstraction at each layer. Compared to shallow systems, these feature layers are learnt for the task of interest, and do not require any engineering. We will show how neural networks are naturally well suited for end-to-end learning in NLP tasks. We will study multi-tasking of different tasks, new semi-supervised learning techniques adapted to these deep architectures, and review end-to-end structured output learning. Finally, we will highlight how some of these advances can be applied to other fields of research, like computer vision, as well.

@misc{collobert:2009,
  title = {Deep Learning in Natural Language Processing},
  author = {R. Collobert and J. Weston},
  howpublished = {Tutorial at NIPS},
  year = {2009}
}
PDF NIPS Tutorials

Y. Qi, P. Kuksa, R. Collobert, K. Sadamasa, K. Kavukcuoglu and J. Weston. Semi-Supervised Sequence Labeling with Self-Learned Features. In IEEE International Conference on Data Mining (ICDM), 2009.

Typical information extraction (IE) systems can be seen as tasks assigning labels to words in a natural language sequence. The performance is restricted by the availability of labeled words. To tackle this issue, we propose a semi-supervised approach to improve the sequence labeling procedure in IE through a class of algorithms with self-learned features (SLF). A supervised classifier can be trained with annotated text sequences and used to classify each word in a large set of unannotated sentences. By averaging predicted labels over all cases in the unlabeled corpus, SLF training builds class label distribution patterns for each word (or word attribute) in the dictionary and re-trains the current model iteratively, adding these distributions as extra word features. Basic SLF models how likely a word is to be assigned to each target class type. Several extensions are proposed, such as learning words’ class boundary distributions. SLF exhibits robust and scalable behaviour and is easy to tune. We applied this approach to four classical IE tasks: named entity recognition (German and English), part-of-speech tagging (English) and one gene name recognition corpus. Experimental results show effective improvements over the supervised baselines on all tasks. In addition, when compared with the closely related self-training idea, this approach shows favorable advantages.

@inproceedings{qi:2009a,
  title = {Semi-Supervised Sequence Labeling with Self-Learned Features},
  author = {Y. Qi and P. Kuksa and R. Collobert and K. Sadamasa and K. Kavukcuoglu and J. Weston},
  booktitle = {IEEE International Conference on Data Mining ({ICDM})},
  year = {2009}
}
PDF
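
The SLF loop itself is simple: classify the unlabeled corpus, average the predicted labels per dictionary word into a distribution, append those distributions as extra word features, and retrain. A toy numpy sketch with a stand-in base learner (the nearest-class-mean classifier and all sizes are illustrative, not the paper's setup):

import numpy as np

n_classes, n_words = 3, 50
rng = np.random.default_rng(0)

def train_classifier(X, y):
    # stand-in base learner (nearest class mean); any supervised tagger would do
    W = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    return lambda X: (X @ W.T).argmax(axis=1)

X_lab = rng.normal(size=(300, 10)); y_lab = rng.integers(0, n_classes, 300)
w_lab = rng.integers(0, n_words, 300)       # dictionary word id of each labeled token
X_unl = rng.normal(size=(5000, 10))
w_unl = rng.integers(0, n_words, 5000)      # word ids in the unlabeled corpus

feats = np.zeros((n_words, n_classes))      # self-learned feature per dictionary word
for _ in range(3):                          # SLF iterations
    clf = train_classifier(np.hstack([X_lab, feats[w_lab]]), y_lab)
    pred = clf(np.hstack([X_unl, feats[w_unl]]))      # label the unlabeled corpus
    feats = np.zeros((n_words, n_classes))
    np.add.at(feats, w_unl, np.eye(n_classes)[pred])  # count predicted labels per word
    feats /= np.maximum(feats.sum(axis=1, keepdims=True), 1)  # -> label distributions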

B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle and K. Weinberger. Learning to Rank with (a Lot of) Word Features. Journal of Information Retrieval, volume Special Issue on Learning to Rank for Information Retrieval, 2009.

In this article we present Supervised Semantic Indexing (SSI) which defines a class of nonlinear (quadratic) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained from a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as cross-language retrieval or online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, correlated feature hashing (CFH) and sparsification. We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.

@article{bai:2009b,
  title = {Learning to Rank with (a Lot of) Word Features},
  author = {B. Bai and J. Weston and D. Grangier and R. Collobert and K. Sadamasa and Y. Qi and O. Chapelle and K. Weinberger},
  journal = {Journal of Information Retrieval},
  volume = {Special Issue on Learning to Rank for Information Retrieval},
  year = {2009}
}
PDF
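
The basic SSI model scores a (query, document) pair with a bilinear form over bag-of-words vectors, parameterized as low rank plus identity ("diagonal preserving") so the vocab x vocab weight matrix is never materialized. A numpy sketch of the scoring function only (sizes hypothetical; a real system would use sparse vectors and learn U, V from the supervised ranking signal):

import numpy as np

vocab, rank = 30000, 50
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(rank, vocab))   # low-rank factors, to be learned
V = rng.normal(scale=0.01, size=(rank, vocab))

def score(q, d):
    # f(q, d) = q^T (U^T V + I) d, without ever forming the vocab x vocab matrix
    return (U @ q) @ (V @ d) + q @ d             # q @ d is the "diagonal" part

q = np.zeros(vocab); q[[10, 99]] = 1.0           # bag-of-words query
d = np.zeros(vocab); d[[99, 500]] = 1.0          # bag-of-words document
print(score(q, d))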

T. Barnickel, J. Weston, R. Collobert, H-W. Mewes and V. Stümpflen. Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts. PLoS one, 4(7), July 2009.

To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence based approaches can deal with large text corpora like MEDLINE in an acceptable time but are not able to extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive and the interpretation of the generated trees is difficult. Several natural language processing (NLP) approaches for the biomedical domain exist focusing specifically on the detection of a limited set of relation types. For systems biology, generic approaches for the detection of a multitude of relation types which in addition are able to process large text corpora are needed but the number of systems meeting both requirements is very limited. We introduce the use of SENNA (‘Semantic Extraction using a Neural Network Architecture’), a fast and accurate neural network based Semantic Role Labeling (SRL) program, for the large scale extraction of semantic relations from the biomedical literature. A comparison of processing times of SENNA and other SRL systems or syntactical parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. 89 million biomedical sentences were tagged with SENNA on a 100 node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences resulting in precision/recall values of 0.71/0.43. We show that the accuracy as well as processing speed of the proposed semantic relation extraction approach is sufficient for its large scale application on biomedical text. The proposed approach is highly generalizable regarding the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems. The presented approach bridges the gap between fast, cooccurrence-based approaches lacking semantic relations and highly specialized and computationally demanding NLP approaches.

@article{barnickel:2009,
  title = {Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts},
  author = {T. Barnickel and J. Weston and R. Collobert and H-W. Mewes and V. St\"umpflen},
  journal = {PLoS one},
  volume = {4},
  number = {7},
  month = {July},
  year = {2009}
}
PDF PLoS one

Y. Qi, R. Collobert, P. Kuksa, K. Kavukcuoglu and J. Weston. Combining Labeled and Unlabeled Data with Word-Class Distribution Learning. In The 18th ACM Conference on Information and Knowledge Management (CIKM), 2009.

We describe a novel, simple and highly scalable semi-supervised method called Word-Class Distribution Learning (WCDL), and apply it to the task of information extraction (IE) by utilizing unlabeled sentences to improve supervised classification methods. WCDL iteratively builds class label distributions for each word in the dictionary by averaging predicted labels over all cases in the unlabeled corpus, and re-training a base classifier adding these distributions as word features. In contrast, traditional self-training or co-training methods add self-labeled examples (rather than features), which can degrade performance due to incestuous learning bias. WCDL exhibits robust behavior, and has no difficult parameters to tune. We applied our method on German and English named entity recognition (NER) tasks. WCDL shows improvements over self-training, multi-task semi-supervision or supervision alone, in particular yielding a state-of-the-art 75.72 F1 score on the German NER task.

@inproceedings{qi:2009,
  title = {Combining Labeled and Unlabeled Data with Word-Class Distribution Learning},
  author = {Y. Qi and R. Collobert and P. Kuksa and K. Kavukcuoglu and J. Weston},
  booktitle = {The 18th ACM Conference on Information and Knowledge Management ({CIKM})},
  year = {2009}
}
PDF

B. Bai, J. Weston, D. Grangier, R. Collobert, O. Chapelle and K. Weinberger. Supervised Semantic Indexing. In The 18th ACM Conference on Information and Knowledge Management (CIKM), 2009.

In this article we propose Supervised Semantic Indexing (SSI) an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH). We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.

@inproceedings{bai:2009a,
  title = {Supervised Semantic Indexing},
  author = {B. Bai and J. Weston and D. Grangier and R. Collobert and O. Chapelle and K. Weinberger},
  booktitle = {The 18th ACM Conference on Information and Knowledge Management ({CIKM})},
  year = {2009}
}
PDF

H. Mobahi, R. Collobert and J. Weston. Deep Learning from Temporal Coherence in Video. In International Conference on Machine Learning, ICML, 2009.

This work proposes a learning method for deep architectures that takes advantage of sequential data, in particular from the temporal coherence that naturally exists in unlabeled video recordings. That is, two successive frames are likely to contain the same object or objects. This coherence is used as a supervisory signal over the unlabeled data, and is used to improve the performance on a supervised task of interest. We demonstrate the effectiveness of this method on some pose invariant object and face recognition tasks.

@inproceedings{mobahi:2009,
  title = {Deep Learning from Temporal Coherence in Video},
  author = {H. Mobahi and R. Collobert and J. Weston},
  booktitle = {International Conference on Machine Learning, {ICML}},
  year = {2009}
}
PDF
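
The temporal-coherence signal boils down to a pairwise loss on the learned representation: consecutive frames should map close together, random frame pairs should stay at least a margin apart. A small numpy sketch of such a loss; the L1 distance and margin value here are illustrative choices, not necessarily the paper's:

import numpy as np

def coherence_loss(z1, z2, consecutive, margin=1.0):
    # z1, z2: representations of two video frames from the same deep network
    d = np.abs(z1 - z2).sum()        # L1 distance in representation space
    if consecutive:                  # adjacent frames: likely the same object(s)
        return d                     # pull the representations together
    return max(0.0, margin - d)      # random pair: push apart up to the margin

rng = np.random.default_rng(0)
za, zb = rng.normal(size=16), rng.normal(size=16)
print(coherence_loss(za, zb, consecutive=True))
print(coherence_loss(za, zb, consecutive=False))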

Y. Bengio, J. Louradour, R. Collobert and J. Weston. Curriculum Learning. In International Conference on Machine Learning, ICML, 2009.

Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them curriculum learning. In the context of recent research studying the difficulty of training in the presence of non-convex training criteria (for deep deterministic and stochastic neural networks), we explore curriculum learning in various set-ups. The experiments show that significant improvements in generalization can be achieved by using a particular curriculum, i.e., the selection and order of training examples. We hypothesize that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).

Our large-scale language model in our unified NLP paper has been trained using the curriculum idea.

@inproceedings{bengio:2009,
  title = {Curriculum Learning},
  author = {Y. Bengio and J. Louradour and R. Collobert and J. Weston},
  booktitle = {International Conference on Machine Learning, {ICML}},
  year = {2009}
}
PDF
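
As a training-loop modification, a curriculum can be as simple as ranking examples by a difficulty score and growing the training set from easy to hard. A schematic sketch of the schedule only; the difficulty measure and stage fractions are made up, and the actual SGD step is left as a comment:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
difficulty = np.abs(X).sum(axis=1)     # stand-in difficulty measure
order = np.argsort(difficulty)         # easiest examples first

for frac in (0.25, 0.5, 1.0):          # gradually admit harder examples
    subset = order[: int(frac * len(order))]
    for i in rng.permutation(subset):
        pass   # one SGD step on example X[i] with the current model would go here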

B. Bai, J. Weston, R. Collobert and D. Grangier. Supervised Semantic Indexing. In 31st European Conference on Information Retrieval, 2009.

We present a class of models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the task of interest, which we argue is the reason for our superior results. We provide an empirical study on Wikipedia documents, using the links to define document-document or query-document pairs, where we obtain state-of-the-art performance using our method.

@inproceedings{bbai:2009,
  title = {Supervised Semantic Indexing},
  author = {B. Bai and J. Weston and R. Collobert and D. Grangier},
  booktitle = {31st European Conference on Information Retrieval},
  year = {2009}
}
PDF

2008

R. Collobert. Torch. NIPS Workshop on Machine Learning Open Source Software, 2008.

Torch provides a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and very efficient, thanks to a simple-yet-powerful fast scripting language (Lua) and an underlying C/C++ implementation. Torch is easily extensible and has been shown to scale to very large applications.

The slides have been made in Torch!

@misc{collobert:2008a,
  title = {Torch},
  author = {R. Collobert},
  howpublished = {NIPS Workshop on Machine Learning Open Source Software},
  year = {2008}
}
MOV Torch 5 Torch 3

M. Karlen, J. Weston, A. Erkan and R. Collobert. Large Scale Manifold Transduction. In International Conference on Machine Learning, ICML, 2008.

We show how the regularizer of Transductive Support Vector Machines (TSVM) can be trained by stochastic gradient descent for linear models and multi-layer architectures. The resulting methods can be trained online, have vastly superior training and testing speed to existing TSVM algorithms, can encode prior knowledge in the network architecture, and obtain competitive error rates. We then go on to propose a natural generalization of the TSVM loss function that takes into account neighborhood and manifold information directly, unifying the two-stage Low Density Separation method into a single criterion, and leading to state-of-the-art results.

@inproceedings{karlen:2008,
  title = {Large Scale Manifold Transduction},
  author = {M. Karlen and J. Weston and A. Erkan and R. Collobert},
  booktitle = {International Conference on Machine Learning, {ICML}},
  year = {2008}
}
PDF

R. Collobert and J. Weston. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In International Conference on Machine Learning, ICML, 2008.

We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense (grammatically and semantically) using a language model. The entire network is trained jointly on all these tasks using weight-sharing, an instance of multitask learning. All the tasks use labeled data except the language model which is learnt from unlabeled text and represents a novel form of semi-supervised learning for the shared tasks. We show how both multitask learning and semi-supervised learning improve the generalization of the shared tasks, resulting in state-of-the-art performance.

@inproceedings{collobert:2008,
  title = {A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning},
  author = {R. Collobert and J. Weston},
  booktitle = {International Conference on Machine Learning, {ICML}},
  year = {2008}
}
PDF SENNA

J. Weston, F. Ratle and R. Collobert. Deep Learning via Semi-Supervised Embedding. In International Conference on Machine Learning, ICML, 2008.

We show how nonlinear embedding algorithms popular for use with shallow semi-supervised learning techniques such as kernel methods can be applied to deep multi-layer architectures, either as a regularizer at the output layer, or on each layer of the architecture. This provides a simple alternative to existing approaches to deep learning whilst yielding competitive error rates compared to those methods, and existing shallow semi-supervised techniques.

@inproceedings{weston:2008,
  title = {Deep Learning via Semi-Supervised Embedding},
  author = {J. Weston and F. Ratle and R. Collobert},
  booktitle = {International Conference on Machine Learning, {ICML}},
  year = {2008}
}
PDF

2007

R. Collobert and J. Weston. Fast Semantic Extraction Using a Novel Neural Network Architecture. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 560-567, June 2007.

We describe a novel neural network architecture for the problem of semantic role labeling. Many current solutions are complicated, consist of several stages and hand-built features, and are too slow to be applied as part of real applications that require such semantic labels, partly because of their use of a syntactic parser (Pradhan et al., 2004; Gildea and Jurafsky, 2002). Our method instead learns a direct mapping from source sentence to semantic tags for a given predicate without the aid of a parser or a chunker. Our resulting system obtains accuracies comparable to the current state-of-the-art at a fraction of the computational cost.

@inproceedings{collobert:2007,
  title = {Fast Semantic Extraction Using a Novel Neural Network Architecture},
  author = {R. Collobert  and  J. Weston},
  booktitle = {Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics},
  pages = {560--567},
  month = {June},
  year = {2007}
}
PDF SENNA

2006

J. Weston, R. Collobert, F. Sinz, L. Bottou and V. Vapnik. Inference with the Universum. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), pages 1009-1016, ACM Press, 2006.

In this paper we study a new framework introduced by Vapnik (1998; 2006) that is an alternative capacity concept to the large margin approach. In the particular case of binary classification, we are given a set of labeled examples, and a collection of unlabeled examples known to belong to neither class of interest. This collection, called the Universum, allows one to encode prior knowledge about the domain of the problem at hand. We describe an algorithm to leverage the Universum by maximizing the number of observed contradictions, and show experimentally that this approach delivers accuracy improvements over using labeled data alone.

@inproceedings{weston:2006,
  title = {Inference with the Universum},
  author = {J. Weston and R. Collobert and F. Sinz  and L. Bottou and V. Vapnik},
  booktitle = {Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006)},
  publisher = {ACM Press},
  pages = {1009--1016},
  location = {Pittsburgh, Pennsylvania},
  year = {2006}
}
PDF UniverSVM SVQP2 DC algorithms ABCDETC

R. Collobert, F. Sinz, J. Weston and L. Bottou. Large Scale Transductive SVMs. Journal of Machine Learning Research, 7:1687-1712, September 2006.

We show how the Concave-Convex Procedure can be applied to Transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction.html.

This is a derivative of the original paper Trading Convexity for Scalability.

@article{collobert:2006a,
  title = {Large Scale Transductive SVMs},
  author = {R. Collobert and F. Sinz and J. Weston and L. Bottou},
  journal = {Journal of Machine Learning Research},
  volume = {7},
  pages = {1687-1712},
  month = {September},
  year = {2006}
}
PDF UniverSVM SVQP2 DC algorithms

R. Collobert, F. Sinz, J. Weston and L. Bottou. Trading convexity for scalability. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), pages 201-208, ACM Press, 2006.

Convex learning algorithms, such as Support Vector Machines (SVMs), are often seen as highly desirable because they offer strong practical properties and are amenable to theoretical analysis. However, in this work we show how non-convexity can provide scalability advantages over convexity. We show how concave-convex programming can be applied to produce (i) faster SVMs where training errors are no longer support vectors, and (ii) much faster Transductive SVMs.

This paper received the best paper award at ICML 2006 conference.

@inproceedings{collobert:2006,
  title = {Trading convexity for scalability},
  author = {R. Collobert and F. Sinz and J. Weston and L. Bottou},
  booktitle = {Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006)},
  publisher = {ACM Press},
  pages = {201--208},
  location = {Pittsburgh, Pennsylvania},
  year = {2006}
}
PDF UniverSVM SVQP2 DC algorithms
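
The ramp loss behind these speedups is a difference of two convex hinges, which is exactly the shape the Concave-Convex Procedure needs: linearize the concave part at the current solution, solve the remaining convex problem, and repeat. A compact numpy illustration on a linear model, with a plain subgradient inner solver and made-up hyper-parameters (a sketch of the idea, not the paper's solver):

import numpy as np

def cccp_ramp_svm(X, y, s=-0.5, lam=0.1, outer=5, inner=300, lr=0.05):
    # ramp loss R_s(z) = H_1(z) - H_s(z), with H_a(z) = max(0, a - z), z = y * f(x)
    w = np.zeros(X.shape[1])
    for _ in range(outer):                        # CCCP outer loop
        beta = (y * (X @ w) < s).astype(float)    # fixed linearization of -H_s
        for _ in range(inner):                    # subgradient solver, convex part
            z = y * (X @ w)
            active = (z < 1).astype(float)        # examples where H_1 is active
            grad = lam * w - ((active - beta) * y) @ X / len(y)
            w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5))
y[:20] *= -1                                      # outliers, whose influence the ramp caps
print(np.mean(np.sign(X @ cccp_ramp_svm(X, y)) == y))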

2004

R. Collobert. Large Scale Machine Learning. Université Paris VI, 2004.

This thesis aims to address machine learning in general, with a particular focus on large models and large databases. After introducing the learning problem in a formal way, we first review several important machine learning algorithms, particularly Multi Layer Perceptrons, Mixtures of Experts and Support Vector Machines. We then present a training method for Support Vector Machines, adapted to reasonably large datasets. However, the training of such a model is still intractable on very large databases. We thus propose a divide-and-conquer approach based on a kind of Mixture of Experts in order to break up the training problem into small pieces, while keeping good generalization performance. This mixture model can be applied to any kind of existing machine learning algorithm. Even though it performs well in practice, the major drawback of this algorithm is the number of hyper-parameters to tune, which makes it difficult to use. We thus prefer afterwards to focus on training improvements for Multi Layer Perceptrons, which are easier to tune, and more suitable than Support Vector Machines for large databases. We finally show that the margin idea introduced with Support Vector Machines can be applied to a certain class of Multi Layer Perceptrons, which leads to a fast algorithm with powerful generalization performance.

This is my PhD thesis. I did my PhD both at IDIAP and Université de Montréal. I defended at Université de Paris VI, in the LIP6 lab.

@phdthesis{collobert:2004b,
  title = {Large Scale Machine Learning},
  author = {R. Collobert},
  school = {Universit\'e Paris {VI}},
  year = {2004}
}
PDF

R. Collobert and S. Bengio. Links Between Perceptrons, MLPs and SVMs. In International Conference on Machine Learning, ICML, 2004.

We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show that it can be computationally expensive to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin. These ideas are extended afterwards to the case of MLPs. Moreover, under some assumptions it also appears that MLPs are a kind of mixture of SVMs, maximizing the margin in the hidden layer space. Finally, we present a very simple MLP based on the previous findings, which yields better performance in generalization and speed than the other models.

Neural networks with the right criterion (like a hinge loss) work well, with better scaling properties than SVMs... Also, each neuron in the hidden layer of a neural network interestingly acts as a kind of SVM, on a subset of the training set.

@inproceedings{collobert:2004a,
  title = {Links Between Perceptrons, {MLPs} and {SVMs}},
  author = {R. Collobert and S. Bengio},
  booktitle = {International Conference on Machine Learning, {ICML}},
  year = {2004}
}
PDF
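
The Perceptron/SVM connection can be made concrete in a few lines: stochastic gradient descent on the hinge loss with a plain linear model, where weight decay plays the role of the SVM margin-maximization term. A minimal numpy sketch (hyper-parameters illustrative):

import numpy as np

def hinge_sgd(X, y, lr=0.1, decay=1e-3, epochs=10):
    w = np.zeros(X.shape[1])
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            if y[i] * (X[i] @ w) < 1:       # inside the margin: Perceptron-like update
                w += lr * y[i] * X[i]
            w *= 1 - lr * decay             # weight decay ~ margin maximization term
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0]))
print(np.mean(np.sign(X @ hinge_sgd(X, y)) == y))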

R. Collobert and S. Bengio. A Gentle Hessian for Efficient Gradient Descent. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2004.

Several second-order optimization methods for gradient descent algorithms have been proposed over the years, but they usually need to compute the inverse of the Hessian of the cost function (or an approximation of this inverse) during training. In most cases, this leads to an O(n^2) cost in time and space per iteration, where n is the number of parameters, which is prohibitive for large n. We propose instead a study of the Hessian before training. Based on a second order analysis, we show that a block-diagonal Hessian yields an easier optimization problem than a full Hessian. We also show that the condition of block-diagonality in common machine learning models can be achieved by simply selecting an appropriate training criterion. Finally, we propose a version of the SVM criterion applied to MLPs, which verifies the aspects highlighted in this second order analysis, but also yields very good generalization performance in practice, taking advantage of the margin effect. Several empirical comparisons on two benchmark datasets are given to illustrate this approach.

Probably because in the past neural networks were studied on very small databases, many people believe neural networks overfit easily. I would put it this way: if not well tuned (like an SVM having a Gaussian kernel with a small variance!), neural networks do overfit. But in fact, in many cases, they are hard to train. We show here that the choice of the architecture itself has an impact on the optimization. In particular, we show that the margin criterion used in SVMs is well suited for neural network optimization: with the hinge loss, the Hessian is better conditioned than with classical losses like the Mean Squared Error.

@inproceedings{collobert:2004,
  title = {A Gentle Hessian for Efficient Gradient Descent},
  author = {R. Collobert and S. Bengio},
  booktitle = {{IEEE} International Conference on Acoustic, Speech, and Signal Processing, {ICASSP}},
  year = {2004}
}
PDF

2003

R. Collobert, Y. Bengio and S. Bengio. Scaling Large Learning Problems with Hard Parallel Mixtures. International Journal on Pattern Recognition and Artificial Intelligence (IJPRAI), 17(3):349-365, 2003.

A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a "hard parallelizable mixture" methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a "gater" model in such a way that it becomes easy to learn an "expert" model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably decreases a cost function that is an upper bound on the negative log-likelihood.

The aim was to use a divide-and-conquer method to break up the SVM complexity and solve large scale classification tasks. While these mixtures do work, they are unfortunately quite difficult to tune, because of the additional hyper-parameters involved in the architecture. This paper was originally presented at the International Workshop on Pattern Recognition with Support Vector Machines (SVM'2002). The original paper, with fewer experiments and without probabilistic mixtures, has been published in NIPS. A variant, including more experiments than the NIPS version, has been published in Neural Computation.

@article{collobert:2003,
  title = {Scaling Large Learning Problems with Hard Parallel Mixtures},
  author = {R. Collobert and Y. Bengio and S. Bengio},
  journal = {International Journal on Pattern Recognition and Artificial Intelligence ({IJPRAI})},
  volume = {17},
  number = {3},
  pages = {349--365},
  year = {2003}
}
PDF

C. Sanderson, S. Bengio, H. Bourlard, J. Mariéthoz, R. Collobert, M.F. BenZeghiba, F. Cardinaux and S. Marcel. Speech & Face Based Biometric Authentication at IDIAP. In International Conference on Multimedia and Expo, ICME, volume 3, pages 1-4, 2003.

We present an overview of recent research at IDIAP on speech & face based biometric authentication. This paper covers user-customised passwords, adaptation techniques, confidence measures (for use in fusion of audio & visual scores), face verification in difficult image conditions, as well as other related research issues. We also overview the open source Torch library, which has aided in the implementation of the above mentioned techniques.

@inproceedings{collobert:2003a,
  title = {Speech \& Face Based Biometric Authentication at {IDIAP}},
  author = {C. Sanderson and S. Bengio and H. Bourlard and J. Mari\'ethoz and R. Collobert and M.F. BenZeghiba and F. Cardinaux and S. Marcel},
  booktitle = {International Conference on Multimedia and Expo, {ICME}},
  volume = {3},
  pages = {1--4},
  year = {2003}
}
PDF

2002

R. Collobert, S. Bengio and J. Mariéthoz. Torch: a modular machine learning software library. Technical Report IDIAP-RR 02-46, IDIAP, 2002.

Many scientific communities have expressed a growing interest in machine learning algorithms recently, mainly due to the generally good results they provide, compared to traditional statistical or AI approaches. However, these machine learning algorithms are often complex to implement and to use properly and efficiently. We thus present in this paper a new machine learning software library in which most state-of-the-art algorithms have already been implemented and are available in a unified framework, in order for scientists to be able to use them, compare them, and even extend them for their own purposes. More interestingly, this library is freely available under a BSD license and can be retrieved from the web by everyone.

This presented the first version of the Torch machine learning library. Several versions have been developed since then, culminating in Torch5, the last official version.
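
To give a flavor of the modular design the report advocates, here is a toy Python sketch — with entirely hypothetical interface names, not Torch's actual API — in which any "machine" can be combined with any "trainer":

  class Linear:
      """A 'machine': forward pass, parameter gradients, parameter update."""
      def __init__(self, n_in):
          self.w = [0.0] * n_in
      def forward(self, x):
          return sum(wi * xi for wi, xi in zip(self.w, x))
      def grad(self, x, grad_out):
          return [grad_out * xi for xi in x]
      def update(self, g, lr):
          self.w = [wi - lr * gi for wi, gi in zip(self.w, g)]

  class SGDTrainer:
      """A 'trainer': works with any machine exposing the interface above."""
      def __init__(self, lr=0.01):
          self.lr = lr
      def train(self, machine, data, epochs=10):
          for _ in range(epochs):
              for x, y in data:
                  err = machine.forward(x) - y   # squared-error criterion
                  machine.update(machine.grad(x, 2.0 * err), self.lr)

  # learn y = 2*x0 - x1 from three consistent examples
  data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
  machine = Linear(2)
  SGDTrainer(lr=0.1).train(machine, data, epochs=200)
  print(machine.w)                               # close to [2.0, -1.0]

Decoupling models from training procedures in this way is what lets a library offer many algorithms "in a unified framework" that users can compare and extend.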

@techreport{collobert:2002,
  title = {{T}orch: a modular machine learning software library},
  author = {R. Collobert and S. Bengio and J. Mari\'ethoz},
  institution = {IDIAP},
  type = {Technical Report IDIAP-RR},
  number = {02-46},
  year = {2002}
}
PDF Torch 5 Torch 3

R. Collobert, S. Bengio and Y. Bengio. A Parallel Mixture of SVMs for Very Large Scale Problems. In T.G. Dietterich, S. Becker and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, NIPS 14, pages 633-640, MIT Press, 2002.

Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest), as well as on a difficult speech database, yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed on Forest.

This is our first paper on mixtures of SVMs. The aim was to use a divide-and-conquer method to break up the SVM complexity and solve large-scale classification tasks. While these mixtures do work, they are unfortunately quite difficult to tune, because of the additional hyper-parameters involved in the architecture. A variant of this paper, with more experiments, was published in Neural Computation. An extended version, including more experiments and probabilistic mixtures, was published in IJPRAI and presented at SVM'2002.

@inproceedings{collobert:2002a,
  title = {A Parallel Mixture of {SVMs} for Very Large Scale Problems},
  author = {R. Collobert and S. Bengio and Y. Bengio},
  booktitle = {Advances in Neural Information Processing Systems, {NIPS} 14},
  publisher = {MIT Press},
  editor = {Dietterich, T.G. and Becker, S. and Ghahramani, Z.},
  pages = {633--640},
  year = {2002}
}
PDF

R. Collobert, S. Bengio and Y. Bengio. A Parallel Mixture of SVMs for Very Large Scale Problems. Neural Computation, 14(5):1105-1114, 2002.

Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed.

The aim was to use a divide-and-conquer method to break up the SVM complexity and solve large-scale classification tasks. While these mixtures do work, they are unfortunately quite difficult to tune, because of the additional hyper-parameters involved in the architecture. The original paper, with fewer experiments, was published in NIPS. An extended version, including more experiments and probabilistic mixtures, was published in IJPRAI and presented at SVM'2002.

@article{collobert:2002b,
  title = {A Parallel Mixture of {SVMs} for Very Large Scale Problems},
  author = {R. Collobert and S. Bengio and Y. Bengio},
  journal = {Neural Computation},
  volume = {14},
  number = {5},
  pages = {1105--1114},
  year = {2002}
}
PDF

2001

R. Collobert and S. Bengio. SVMTorch: Support Vector Machines for Large-Scale Regression Problems. Journal of Machine Learning Research, 1:143-160, 2001.

Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l^2 memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at https://ronan.collobert.com/SVMTorch), which is similar to SVM-Light proposed by Joachims (1999) for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve large-scale regression problems (more than 20000 examples). Comparisons with Nodelib, another publicly available SVM algorithm for large-scale regression problems from Flake and Lawrence (2000), yielded significant time improvements. Finally, based on a recent paper from Lin (2000), we show that a convergence proof exists for our algorithm.

Our contribution extends Joachims' ideas to the regression SVM problem. Though it may seem obvious nowadays, it was curiously not the technique used to train regression SVMs at the time we proposed this extension.
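
The decomposition itself lives inside the solver, but the problem it solves is easy to demonstrate. A quick usage sketch with scikit-learn's SVR, which wraps LIBSVM — a solver from the same working-set-decomposition family — on synthetic data (all parameter values here are illustrative):

  import numpy as np
  from sklearn.svm import SVR

  rng = np.random.default_rng(0)
  X = rng.uniform(-3, 3, size=(500, 1))
  y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

  # epsilon sets the width of the insensitive tube around the fit: only
  # examples falling outside the tube end up as support vectors
  model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
  print(len(model.support_), "support vectors out of", len(X))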

@article{collobert:2001,
  title = {{SVMT}orch: Support Vector Machines for Large-Scale Regression Problems},
  author = {R. Collobert and S. Bengio},
  journal = {Journal of Machine Learning Research},
  volume = {1},
  pages = {143--160},
  year = {2001}
}
PDF SVMTorch

2000

R. Collobert. Support Vector Machines: Théorie et Applications. Université de Rennes I, 2000.

This is my master's thesis (called a DEA in the French system) from Université de Rennes I. I did an internship at IDIAP for this work.

@mastersthesis{collobert:2000,
  title = {Support Vector Machines: Th\'eorie et Applications},
  author = {R. Collobert},
  school = {Universit\'e de Rennes {I}},
  year = {2000}
}
PDF SVMTorch

Former PhD students or Postdocs