J. Puigcerver, C. Riquelme, B. Mustafa, C. Renggli, A. Susano Pinto, S. Gelly, D. Keysers, N. Houlsby.
Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task. This strategy scales the process of transferring to new tasks, since it does not revisit the pre-training data during transfer. Accordingly, it requires little extra compute per target task, and results in a speed-up of 2-3 orders of magnitude compared to competing approaches. Further, we provide an adapter-based architecture able to compress many experts into a single model. We evaluate our approach on two different data sources and demonstrate that it outperforms baselines on over 20 diverse vision tasks in both cases.
arXiv, 2020
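A minimal sketch of the expert-selection step described above, assuming a k-nearest-neighbor accuracy on frozen features as the cheap performance proxy; the paper's exact proxies and expert checkpoints are not reproduced here, and `experts`, `images`, and `labels` are placeholder inputs:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_proxy_score(features, labels, k=5):
    """Cheap proxy for downstream accuracy: k-NN cross-validation score."""
    knn = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(knn, features, labels, cv=3).mean()

def select_expert(experts, images, labels):
    """Pick the expert whose frozen features score best under the proxy.

    `experts` maps an expert name to a feature extractor: images -> (N, D).
    """
    scores = {name: knn_proxy_score(extract(images), labels)
              for name, extract in experts.items()}
    return max(scores, key=scores.get)
```

Because only the proxy (and not fine-tuning on the pre-training data) is computed per target task, selection stays cheap regardless of how many experts exist.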
J. Djolonga, J. Yung, M. Tschannen, R. Romijnders, L. Beyer, A. Kolesnikov, J. Puigcerver, M. Minderer, A. D'Amour, D. Moldovan, S. Gelly, N. Houlsby, X. Zhai, and M. Lucic.
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. In this work we revisit the out-of-distribution and transfer performance of modern image classification CNNs and investigate the impact of the pre-training data size, the model scale, and the data preprocessing pipeline. We find that increasing both the training set and model sizes significantly improves robustness to distribution shift. Furthermore, we show that, perhaps surprisingly, simple changes in the preprocessing, such as modifying the image resolution, can significantly mitigate robustness issues in some cases. Finally, we outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset we use for a systematic analysis across common factors of variation.
arXiv, 2020
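The preprocessing finding suggests a simple experiment: evaluate a fixed classifier at several test-time resolutions. A hedged sketch, assuming a PyTorch model and a torchvision-style dataset factory (both placeholders, not artifacts from the paper):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

def accuracy_at_resolution(model, dataset_fn, resolution, device="cpu"):
    """Top-1 accuracy with test images resized/cropped to `resolution`."""
    preprocess = transforms.Compose([
        transforms.Resize(resolution),
        transforms.CenterCrop(resolution),
        transforms.ToTensor(),
    ])
    loader = DataLoader(dataset_fn(preprocess), batch_size=64)
    model = model.eval().to(device)
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Sweeping resolutions traces out the effect of this preprocessing choice:
# for res in (128, 224, 384):
#     print(res, accuracy_at_resolution(model, dataset_fn, res))
```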
A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby.
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct a detailed analysis of the main components that lead to high transfer performance.
arXiv, 2019
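The abstract mentions transferring with "a simple heuristic" that adapts fine-tuning hyperparameters to the target task (the paper calls it BiT-HyperRule). A sketch of that idea, where the thresholds and settings below are illustrative assumptions rather than the published rule:

```python
# Illustrative only: pick fine-tuning settings from basic task statistics.
# The concrete cut-offs and values are assumptions, not the paper's rule.
def fine_tune_schedule(num_examples):
    """Scale schedule length and regularization with dataset size."""
    if num_examples < 20_000:      # small task: short schedule, no MixUp
        return {"steps": 500, "mixup": False}
    if num_examples < 500_000:     # medium task
        return {"steps": 10_000, "mixup": True}
    return {"steps": 20_000, "mixup": True}  # large task
```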
X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, M. Lucic, J. Djolonga, A. Susano Pinto, M. Neumann, A. Dosovitskiy, L. Beyer, O. Bachem, M. Tschannen, M. Michalski, O. Bousquet, S. Gelly, and N. Houlsby.
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, reconstruction error). We present the Visual Task Adaptation Benchmark (VTAB), which defines good representations as those that adapt to diverse, unseen tasks with few examples. With VTAB, we conduct a large-scale study of many popular publicly-available representation learning algorithms. We carefully control confounders such as architecture and tuning budget. We address questions like: How effective are ImageNet representations beyond standard natural datasets? How do representations trained via generative and discriminative models compare? To what extent can self-supervision replace labels? And, how close are we to general visual representations?
arXiv, 2019
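A minimal sketch of the evaluation protocol the benchmark defines: adapt one representation to many diverse tasks from few examples and average the resulting scores. The task interface and the `fine_tune` callable are assumptions for illustration:

```python
def vtab_style_score(fine_tune, tasks, examples_per_task=1000):
    """Mean accuracy after adapting a representation to each task.

    `fine_tune(train_set)` returns an adapted model; each task exposes
    `sample_train(n)` and `evaluate(model)` (assumed interfaces).
    """
    accuracies = []
    for task in tasks:
        train_set = task.sample_train(examples_per_task)  # few-example budget
        model = fine_tune(train_set)
        accuracies.append(task.evaluate(model))
    return sum(accuracies) / len(accuracies)
```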
A. H. Toselli, E. Vidal, J. Puigcerver, and E. Noya-GarcĂ­a.
Keyword spotting techniques are becoming cost-effective solutions for information retrieval in handwritten documents. We explore the extension of the single-word, line-level probabilistic indexing approach described in our previous works to allow for page-level search of queries consisting of Boolean combinations of several single keywords. We propose heuristic rules to combine the single-word relevance probabilities into probabilistically consistent confidence scores for the multi-word Boolean combinations. An empirical study, also presented in this paper, evaluates the search performance of word-pair queries involving AND and OR Boolean operations. The results of this study support the proposed approach and clearly show its effectiveness. Finally, a web-based demonstration system based on the proposed methods is presented.
Pattern Analysis and Applications (PAA), 2019
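One probabilistically consistent way to combine single-keyword relevance probabilities into Boolean-query scores is to assume independence between keywords; the paper's exact heuristic rules may differ from the rules sketched here:

```python
# Combine page-level single-keyword relevance probabilities into scores
# for Boolean queries, assuming independence between keywords.
def p_and(p_a, p_b):
    """P(both keywords are relevant in the page)."""
    return p_a * p_b

def p_or(p_a, p_b):
    """P(at least one keyword is relevant): inclusion-exclusion."""
    return p_a + p_b - p_a * p_b

def p_not(p_a):
    return 1.0 - p_a

# Example: a page scoring 0.9 for "church" and 0.7 for "baptism"
# gets p_and(0.9, 0.7) == 0.63 for the query "church AND baptism".
```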
E. Lang, J. Puigcerver, A. H. Toselli, and E. Vidal.
We endeavor to perform very large scale indexing of an ancient German collection of manuscript parish records. To this end we will compute "probabilistic indexes" (PIs), which are known to allow for very accurate and efficient implementation of (single-)keyword spotting. PIs may become prohibitively large for vast manuscript collections. Therefore, we analyze simple index pruning methods to achieve adequate tradeoffs between memory requirements and search performance. We also study how to adequately deal with the large variety of non-ASCII symbols and handwritten word spelling variations (accents, umlauts, etc.) which appear in this kind of historical collection. Finally, and most importantly, since most of the images of the collection we aim to index are handwritten tables, we explore the use of PIs to support structured queries for information extraction from untranscribed handwritten images containing tabular data. Empirical results on a small but complex and representative dataset extracted from the collection confirm the viability and adequacy of the chosen approaches.
ICFHR 2018 (Poster)
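A hedged sketch of the index-pruning idea: drop probabilistic index entries below a relevance threshold to trade memory for search performance. The posting-list layout is an assumption, not the paper's data structure:

```python
def prune_index(index, threshold):
    """Keep only entries with relevance probability >= threshold.

    `index` maps a (pseudo-)word to a list of (page_id, probability).
    """
    pruned = {}
    for word, postings in index.items():
        kept = [(page, p) for page, p in postings if p >= threshold]
        if kept:
            pruned[word] = kept
    return pruned

# Sweeping the threshold traces out the memory/recall trade-off:
# for t in (0.01, 0.05, 0.1, 0.3):
#     print(t, sum(len(v) for v in prune_index(index, t).values()))
```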
J. Puigcerver
Current state-of-the-art approaches to offline Handwritten Text Recognition rely extensively on Multidimensional Long Short-Term Memory networks. However, these architectures come at a considerable computational cost, and we observe that they extract features visually similar to those of convolutional layers, which are computationally cheaper. This suggests that the two-dimensional long-term dependencies, which are potentially modeled by multidimensional recurrent layers, may not be essential for good recognition accuracy, at least in the lower layers of the architecture. In this work, we explore an alternative model that relies only on convolutional and one-dimensional recurrent layers; it achieves results better than or equivalent to those of the current state-of-the-art architecture, and runs significantly faster. In addition, we observe that using random distortions during training as synthetic data augmentation dramatically improves the accuracy of our model. Thus, are multidimensional recurrent layers really necessary for Handwritten Text Recognition? Probably not.
ICDAR 2017 (Oral)
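A minimal sketch of the kind of architecture the paper argues for: convolutional feature extraction followed by one-dimensional bidirectional LSTMs (no multidimensional recurrences), producing per-column logits suitable for CTC training. Layer sizes here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConvBLSTM(nn.Module):
    def __init__(self, num_classes, img_height=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        feat_height = img_height // 4        # two 2x2 poolings
        self.blstm = nn.LSTM(64 * feat_height, 128, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * 128, num_classes)  # CTC labels + blank

    def forward(self, x):                     # x: (N, 1, H, W)
        f = self.conv(x)                      # (N, C, H', W')
        f = f.permute(0, 3, 1, 2).flatten(2)  # (N, W', C*H'): one step per column
        h, _ = self.blstm(f)
        return self.out(h)                    # per-frame logits for CTC loss

# logits = ConvBLSTM(num_classes=80)(torch.randn(2, 1, 64, 256))
```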
T. Bluche, S. Hamel, C. Kermorvant, J. Puigcerver, D. Stutzmann, A. H. Toselli, and E. Vidal.
Making large-scale collections of digitized historical documents searchable is being earnestly demanded by many archives and libraries. Probabilistically indexing the text images of these collections by means of keyword spotting techniques is currently seen as perhaps the only feasible approach to meet this demand. A vast medieval manuscript collection, written in both Latin and French, called "Chancery", is currently being considered for indexing at large. In addition to its bilingual nature, one of the major difficulties of this collection is the very high rate of abbreviated words, which are, however, fully expanded in the available ground-truth transcripts. In preparation for full indexing of Chancery, experiments have been carried out on a relatively small but fully representative subset of this collection. To this end, a keyword spotting approach has been adopted which computes word relevance probabilities using character lattices produced by a recurrent neural network and an N-gram character language model. The results confirm the viability of the chosen approach for the intended large-scale indexing and show the ability of the proposed modeling and training approaches to properly deal with the aforementioned abbreviation difficulties.
ICDAR 2017 (Poster)
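A toy illustration of lattice-based relevance scoring: the relevance probability of a keyword in a line is the total posterior mass of the transcription hypotheses that contain it. Real systems sum over all lattice paths; the n-best simplification below is an assumption for brevity:

```python
import re

def relevance_probability(keyword, nbest):
    """`nbest` is a list of (transcript, posterior) pairs for one line."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    return sum(p for text, p in nbest if pattern.search(text))

# nbest = [("anno domini", 0.6), ("anno dni", 0.3), ("armo domini", 0.1)]
# relevance_probability("domini", nbest)  # -> 0.7
```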
A.H. Toselli, J. Puigcerver, and E. Vidal
Two methods are presented to improve word confidence scores for line-level, query-by-string, lexicon-free Keyword Spotting (KWS) in handwritten text images. The first approximates true relevance probabilities by means of computations carried out directly on character lattices obtained from the line images considered. The second method uses the same character lattices, but obtains relevance scores by first computing frame-level character sequence scores which resemble the word posteriorgrams used in previous approaches for lexicon-based KWS. The first method results from a formal probabilistic derivation, which allows us to better understand and further develop the underlying ideas. The second one is less formal but, according to the experiments presented in the paper, obtains almost identical results at a much lower computational cost. Moreover, in contrast with the first method, the second one makes it possible to directly obtain accurate bounding boxes for the spotted words.
ICFHR 2016 (Poster)
I. Pratikakis, K. Zagoris, B. Gatos, J. Puigcerver, A.H. Toselli, and E. Vidal
The H-KWS 2016 competition, organized in the context of the ICFHR 2016 conference, aims at setting up an evaluation framework for benchmarking handwritten keyword spotting (KWS), examining both the Query-by-Example (QbE) and the Query-by-String (QbS) approaches. The two KWS approaches were hosted in two different tracks, each of which was in turn split into two distinct challenges, namely a segmentation-based and a segmentation-free one, to accommodate the different perspectives adopted by researchers in the KWS field. In addition, the competition aims to evaluate the submitted training-based methods under different amounts of training data. Four participants submitted at least one solution to one of the challenges, according to the capabilities and/or restrictions of their systems. The data used in the competition consisted of historical German and English documents with their own characteristics and complexities. This paper presents the details of the competition, including the data, the evaluation metrics, and the results of the best run of each participating method.
ICFHR 2016 (Competition)
J. Puigcerver, A.H. Toselli, and E. Vidal
Lexicon-based handwritten text keyword spotting (KWS) has proven to be a faster and more accurate alternative to lexicon-free methods. Nevertheless, since lexicon-based KWS relies on a predefined vocabulary, fixed in the training phase, it does not support queries involving out-of-vocabulary (OOV) keywords. In this paper, we outline previous work aimed at solving this problem and present a new approach based on smoothing the (null) scores of OOV keywords by means of the information provided by "similar" in-vocabulary words. The good results achieved using this approach are compared with previously published alternatives on different datasets.
Neural Computing and Applications (NCAA), 2016
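A sketch of the smoothing idea described above: an OOV keyword, which the lexicon-based system would score as zero, borrows a discounted score from "similar" in-vocabulary words. Using a difflib string-similarity match and a fixed discount are illustrative assumptions, not the paper's exact formulation:

```python
import difflib

def smoothed_oov_score(keyword, in_vocab_scores, discount=0.5):
    """`in_vocab_scores` maps in-vocabulary words to their KWS scores."""
    matches = difflib.get_close_matches(keyword, list(in_vocab_scores), n=3)
    if not matches:
        return 0.0
    # Discounted average over the most similar in-vocabulary words.
    return discount * sum(in_vocab_scores[w] for w in matches) / len(matches)

# smoothed_oov_score("recieve", {"receive": 0.8, "recipe": 0.2, "cat": 0.9})
```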
J. Puigcerver, A.H. Toselli, and E. Vidal
Traditionally, the HMM-Filler approach has been widely used in the fields of speech recognition and handwritten text recognition to tackle lexicon-free, query-by-string keyword spotting (KWS). It computes a score to determine whether a given keyword is written in a certain image region, and this score is conjectured to be related to the confidence of the system in that decision. However, it is still not clear what this relationship is. In this paper, the HMM-Filler score is derived from a probabilistic formulation of KWS, which gives a better understanding of its behavior and limits. Additionally, the same probabilistic framework is used to present a new algorithm to compute the KWS scores, which results in better average precision (AP) for a keyword spotting task on the widely used IAM database. We show that the new algorithm improves the HMM-Filler results by up to 10.4% relative (5.3% absolute) points in AP on the considered task.
ICDAR 2015 (Oral)
A.H. Toselli, J. Puigcerver, and E. Vidal
The so-called filler or garbage Hidden Markov Models (HMM-Filler) are among the most widely used models for lexicon-free, query-by-string keyword spotting in the fields of speech recognition and (lately) handwritten text recognition. This approach has important drawbacks. First, the keyword-specific HMM Viterbi decoding process needed to obtain the confidence scores of each spotted word involves a large computational cost. Second, in its traditional conception, the "filler" does not take into account any context information; and even when it does, beyond the greater computational cost involved, the required keyword-specific language model building can become quite intricate. This paper presents novel keyword spotting results obtained with a character-lattice-based KWS approach, with context information provided by high-order N-gram models. This approach has proved to be faster than the traditional HMM-Filler approach: the required confidence scores are computed directly from character lattices produced during a single Viterbi decoding process using N-gram models. Experiments show that, compared with the HMM-Filler approach using a 2-gram model, the character-lattice-based method requires between one and two orders of magnitude less query computing time.
ICDAR 2015 (Poster)
E. Vidal, A.H. Toselli, and J. Puigcerver
Keyword Spotting (KWS) has been traditionally considered under two distinct frameworks: Query-by-Example (QbE) and Query-by-String (QbS). In both cases, the user of the system wishes to find occurrences of a particular keyword in a collection of document images. The difference is that in QbE the keyword is given as an exemplar image while, in the case of QbS, the keyword is given as a text string. In several works, the QbS scenario has been approached using QbE techniques; but the converse has not been studied in depth yet, despite the fact that QbS systems typically achieve higher accuracy. In the present work, we introduce a very effective probabilistic approach to QbE KWS, based on highly accurate QbS KWS techniques. To assess the effectiveness of this approach, we tackle the segmentation-free QbE task of the ICFHR-2014 Competition on Handwritten KWS. Our approach achieves a mean average precision (mAP) as high as 0.715, which improves by more than 70% on the best mAP achieved in this competition (0.419 under the same experimental conditions).
ICDAR 2015 (Poster)
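A sketch of the probabilistic reduction described above: a query-by-example score is obtained by recognizing the exemplar image into a distribution over text strings and marginalizing query-by-string scores over that distribution. The hypothesis list and the `qbs_score` interface are assumptions for illustration:

```python
def qbe_score(exemplar_hypotheses, qbs_score, region):
    """QbE relevance of `region` for an exemplar query image.

    `exemplar_hypotheses`: list of (word, P(word | exemplar image)).
    `qbs_score(word, region)`: QbS relevance score for a text query.
    """
    return sum(p * qbs_score(word, region)
               for word, p in exemplar_hypotheses)

# hyps = [("king", 0.7), ("ring", 0.2), ("kind", 0.1)]
# qbe_score(hyps, qbs_score, region)
```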
J. Puigcerver, A.H. Toselli, and E. Vidal
The principal goal of the Competition on Keyword Spotting for Handwritten Documents was to promote different approaches used in the field of keyword spotting and to compare them fairly using uniform data and metrics. To accommodate the different perspectives adopted by researchers in this field, the competition was divided into two distinct tracks, namely a training-free and a training-based track, and each track entailed two optional assignments. Six participants submitted solutions to one or both assignments, depending on the capabilities and/or restrictions of their systems. The data used in the competition consisted of historical documents in English with different levels of complexity. This paper presents the details of the competition, including the data, the evaluation metrics, and the results of the best participating methods.
ICDAR 2015 (Competition)
J. Puigcerver, A.H. Toselli, and E. Vidal
Lexicon-based handwritten text keyword spotting (KWS) has proven to be a very fast and accurate alternative to lexicon-free methods. Nevertheless, since lexicon-based KWS methods rely on a predefined vocabulary, fixed in the training phase, they perform poorly for any query keyword that was not included in it (i.e. out-of-vocabulary (OOV) keywords). This renders the KWS system useless for that particular type of query. In this paper, we present a new way of smoothing the scores of OOV keywords and compare it with previously published alternatives on different datasets.
IbPRIA 2015 (Oral)
J. Puigcerver, A.H. Toselli, and E. Vidal
We present a handwritten text Keyword Spotting (KWS) approach based on the combination of KWS methods using word-graphs (WGs) and character-lattices (CLs). It aims to solve the problem that WG-based models present for out-of-vocabulary (OOV) keywords: since there is no information about them in the lexicon or the language model, null scores are assigned. As we show, OOV keywords may have a significant impact on the global performance of KWS systems. By using a CL approach, which does not suffer from this problem, to estimate the OOV scores, we take advantage of both models: the speed and accuracy that WGs provide for in-vocabulary keywords, and the flexibility of the CL approach. This combination significantly improves both average precision and mean average precision over either method alone.
ICFHR 2014 (Poster)
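A minimal sketch of the combination described above: word-graph scores for in-vocabulary keywords, with a character-lattice fallback for OOV ones. The `wg_score` and `cl_score` callables are assumed interfaces, not code from the paper:

```python
def combined_score(keyword, lexicon, wg_score, cl_score):
    """Route each keyword to the model that can actually score it."""
    if keyword in lexicon:
        return wg_score(keyword)  # fast and accurate for known words
    return cl_score(keyword)      # flexible: handles OOV keywords

# combined_score("abbot", {"abbot", "monk"}, wg_score, cl_score)
```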
J. Puigcerver, A.H. Toselli, and E. Vidal
Thanks to the use of lexical and syntactic information, Word Graphs (WGs) have been shown to provide competitive precision-recall performance, along with fast lookup times, in comparison to other techniques used for Keyword Spotting (KWS) in handwritten text images. However, a problem of WG approaches is that they assign a null score to any keyword that was not part of the training data, i.e. Out-of-Vocabulary (OOV) keywords, whereas other techniques are able to estimate a reasonable score even for such keywords. We present a smoothing technique which estimates the score of an OOV keyword based on the scores of similar keywords. This makes WG-based KWS as flexible as other techniques, with the benefit of much faster lookup times.
ICPR 2014 (Oral)