Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1551 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Certifying Some Distributional Robustness with Principled Adversarial Training

Aman Sinha and Hongseok Namkoong and John Duchi

arXiv e-Print archive - 2017 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2017/10/29 (3 years ago)

**Abstract:** Neural networks are vulnerable to adversarial examples and researchers have
proposed many heuristic attack and defense mechanisms. We address this problem
through the principled lens of distributionally robust optimization, which
guarantees performance under adversarial input perturbations. By considering a
Lagrangian penalty formulation of perturbing the underlying data distribution
in a Wasserstein ball, we provide a training procedure that augments model
parameter updates with worst-case perturbations of training data. For smooth
losses, our procedure provably achieves moderate levels of robustness with
little computational or statistical cost relative to empirical risk
minimization. Furthermore, our statistical guarantees allow us to efficiently
certify robustness for the population loss. For imperceptible perturbations,
our method matches or outperforms heuristic approaches.
more
less

Aman Sinha and Hongseok Namkoong and John Duchi

arXiv e-Print archive - 2017 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
Sinha et al. introduce a variant of adversarial training based on distributional robust optimization. I strongly recommend reading the paper for understanding the introduced theoretical framework. The authors also provide guarantees on the obtained adversarial loss – and show experimentally that this guarantee is a realistic indicator. The adversarial training variant itself follows the general strategy of training on adversarially perturbed training samples in a min-max framework. In each iteration, an attacker crafts an adversarial examples which the network is trained on. In a nutshell, their approach differs from previous ones (apart from the theoretical framework) in the used attacker. Specifically, their attacker optimizes $\arg\max_z l(\theta, z) - \gamma \|z – z^t\|_p^2$ where $z^t$ is a training sample chosen randomly during training. On a side note, I also recommend reading the reviews of this paper: https://openreview.net/forum?id=Hk6kPgZA- Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Taskonomy: Disentangling Task Transfer Learning

Amir Zamir and Alexander Sax and William Shen and Leonidas Guibas and Jitendra Malik and Silvio Savarese

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.AI, cs.LG, cs.NE, cs.RO

**First published:** 2018/04/23 (3 years ago)

**Abstract:** Do visual tasks have a relationship, or are they unrelated? For instance,
could having surface normals simplify estimating the depth of an image?
Intuition answers these questions positively, implying existence of a structure
among visual tasks. Knowing this structure has notable values; it is the
concept underlying transfer learning and provides a principled way for
identifying redundancies across tasks, e.g., to seamlessly reuse supervision
among related tasks or solve many tasks in one system without piling up the
complexity.
We proposes a fully computational approach for modeling the structure of
space of visual tasks. This is done via finding (first and higher-order)
transfer learning dependencies across a dictionary of twenty six 2D, 2.5D, 3D,
and semantic tasks in a latent space. The product is a computational taxonomic
map for task transfer learning. We study the consequences of this structure,
e.g. nontrivial emerged relationships, and exploit them to reduce the demand
for labeled data. For example, we show that the total number of labeled
datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3
(compared to training independently) while keeping the performance nearly the
same. We provide a set of tools for computing and probing this taxonomical
structure including a solver that users can employ to devise efficient
supervision policies for their use cases.
more
less

Amir Zamir and Alexander Sax and William Shen and Leonidas Guibas and Jitendra Malik and Silvio Savarese

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.AI, cs.LG, cs.NE, cs.RO

[link]
The goal of this work is to perform transfer learning among numerous tasks and to discover visual relationships among them. Specifically, while we intiutively might guess the depth of an image and surface normals are related, this work takes a step forward and discovers a beneficial relationship among 26 tasks in terms of task transferability - many of them are not obvious. This is important for scenarios when an insufficient budget is available for target task for annotation, thus, learned representation from the 'cheaper' task could be used along with small dataset for the target task to reach sufficient performance on par with fully supervised training on a large dataset. The basis of the approach is to compute an affinity matrix among tasks based on whether the solution for one task can be sufficiently easily used for another task. This approach does not impose human intuition about the task relationships and chooses task transferability based on the quality of a transfer operation in a fully computational manner. The task taxonomy (i.e. **taskonomy**) is a computationally found directed hypergraph that captures the notion of task transferability over any given task dictionary. It performed using a four-step process depicted in the figure below: ![Process overview. The steps involved in creating the taxonomy.](http://taskonomy.stanford.edu/img/svg/Process.svg) - In stage I (**Task-specific Modelling**), a task-specific network is trained in a fully supervised manner. The network is composed of the encoder (modified ResNet-50), and fully convolutional decoder for pixel-to-pixel tasks, or 2-3 FC layers for low-dimensional tasks. Dataset consists of 4 million images of indoor scenes from about 600 buildings; every image has an annotation for every task. - In stage II (**Transfer modeling**), all feasible transfers between sources and targets are trained (multiple inputs task to single target transfer is also considered). Specifically, after the task-specific networks are trained in stage I, the weights of an encoder are fixed (frozen network is used to extract representations only) and the representation from the encoder is used to train a small readout network (similar to a decoder from stage I) with a new task as a target (i.e. ground truth is available). In total, about 3000 transfer possibilities are trained. - In stage III (**Taxonomy solver**), the task affinities acquired from the transfer functions performance are normalized. This is needed because different tasks lie in different spaces and transfer function scale. This is performed using ordinal normalization - Analytical Hierarchy Process (details are in the paper - Section 3.3). This results in an affinity matrix where a complete graph of relationships is completely normalized and this graph quantifies a pair-wise set of tasks evaluated in terms of a transfer function (i.e. task dependency). - In stage IV (**Computed Taxonomy**), a hypergraph which can predict the performance of any transfer policy and optimize for the optimal one is synthesized. This is solved using Binary Integer Program as a subgraph selection problem where tasks are nodes and transfers are edges. After the optimization process, the solution devices a connectivity that solves all target tasks, maximizes their collective performance while using only available source tasks under user-specified constraints (e.g. budget). So, if you want to train your network on an unseen task, you can obtain pretrained weights for existing tasks from the [project page](https://github.com/StanfordVL/taskonomy/tree/master/taskbank), train readout functions against each task (as well as combination of multiple inputs), build an affinity matrix to know where your task is positioned against the other ones, and through subgraph selection procedure observe what tasks have favourable influence on your task. Consequently, you can train your task with much less data by utilizing representations from the existing tasks which share visual significance with your task. Magnificent! |

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2016/02/16 (5 years ago)

**Abstract:** Despite widespread adoption, machine learning models remain mostly black
boxes. Understanding the reasons behind predictions is, however, quite
important in assessing trust, which is fundamental if one plans to take action
based on a prediction, or when choosing whether to deploy a new model. Such
understanding also provides insights into the model, which can be used to
transform an untrustworthy model or prediction into a trustworthy one. In this
work, we propose LIME, a novel explanation technique that explains the
predictions of any classifier in an interpretable and faithful manner, by
learning an interpretable model locally around the prediction. We also propose
a method to explain models by presenting representative individual predictions
and their explanations in a non-redundant way, framing the task as a submodular
optimization problem. We demonstrate the flexibility of these methods by
explaining different models for text (e.g. random forests) and image
classification (e.g. neural networks). We show the utility of explanations via
novel experiments, both simulated and with human subjects, on various scenarios
that require trust: deciding if one should trust a prediction, choosing between
models, improving an untrustworthy classifier, and identifying why a classifier
should not be trusted.
more
less

Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
This paper describes how to find local interpretable model-agnostic explanations (LIME) why a black-box model $m_B$ came to a classification decision for one sample $x$. The key idea is to evaluate many more samples around $x$ (local) and fit an interpretable model $m_I$ to it. The way of sampling and the kind of interpretable model depends on the problem domain. For computer vision / image classification, the image $x$ is divided into superpixels. Single super-pixels are made black, the new image $x'$ is evaluated $p' = m_B(x')$. This is done multiple times. The paper is also explained in [this YouTube video](https://www.youtube.com/watch?v=KP7-JtFMLo4) by Marco Tulio Ribeiro. A very similar idea is already in the [Zeiler & Fergus paper](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma). ## Follow-up Paper * June 2016: [Model-Agnostic Interpretability of Machine Learning](https://arxiv.org/abs/1606.05386) * November 2016: * [Nothing Else Matters: Model-Agnostic Explanations By Identifying Prediction Invariance](https://arxiv.org/abs/1611.05817) * [An unexpected unity among methods for interpreting model predictions](https://arxiv.org/abs/1611.07478) |

Big Bird: Transformers for Longer Sequences

Zaheer, Manzil and Guruganesh, Guru and Dubey, Avinava and Ainslie, Joshua and Alberti, Chris and Ontañón, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and Ahmed, Amr

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: transfer-learning, pre-trained, transformer, bert

Zaheer, Manzil and Guruganesh, Guru and Dubey, Avinava and Ainslie, Joshua and Alberti, Chris and Ontañón, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and Ahmed, Amr

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: transfer-learning, pre-trained, transformer, bert

[link]
Transformers - powered by self-attention mechanisms - have been a paradigm shift in NLP, and are now the standard choice for training large language models. However, while transformers do have many benefits in terms of computational constraints - most saliently, that attention between tokens can be computed in parallel, rather than needing to be evaluated sequentially like in a RNN - a major downside is their memory (and, secondarily, computational) requirements. The baseline form of self-attention works by having every token attend to every other token, where "attend" here means that a query from each token A will take an inner product with each other token -A, and then be elementwise-multiplied with the values of every other token -A. This implies a O(N^2) memory and computation requirement, where N is your sequence length. So, the question this paper asks is: how do you get the benefits, or most of the benefits, of a full-attention network, while reducing the number of other tokens each token attends to. The authors' solution - Big Bird - has three components. First, they approach the problem of approximating the global graph as a graph theory problem, where each token attending to every other is "fully connected," and the goal is to try to sparsify the graph in a way that keeps shortest path between any two nodes low. They use the fact that in an Erdos-Renyi graph - where very edge is simply chosen to be on or off with some fixed probability - the shortest path is known to be logN. In the context of aggregating information about a sequence, a short path between nodes means that the number of iterations, or layers, that it will take for information about any given node A to be part of the "receptive field" (so to speak) of node B, will be correspondingly short. Based on this, they propose having the foundation of their sparsified attention mechanism be simply a random graph, where each node attends to each other with probability k/N, where k is a tunable hyperparameter representing how many nodes each other node attends to on average. To supplement, the authors further note that sequence tasks of interest - particularly language - are very local in their information structure, and, while it's important to understand the global context of the full sequence, tokens close to a given token are most likely to be useful in constructing a representation of it. Given this, they propose supplementing their random-graph attention with a block diagonal attention, where each token attends to w/2 tokens prior to and subsequent to itself. (Where, again, w is a tunable hyperparameter) However, the authors find that these components aren't enough, and so they add a final component: having some small set of tokens that attend to all tokens, and are attended to by all tokens. This allows them to theoretically prove that Big Bird can approximate full sequences, and is a universal Turing machine, both of which are true for full Transformers. I didn't follow the details of the proof, but, intuitively, my reading of this is that having a small number of these global tokens basically acts as a shortcut way for information to get between tokens in the sequence - if information is globally valuable, it can be "written" to one of these global aggregator nodes, and then all tokens will be able to "read" it from there. The authors do note that while their sparse model approximates the full transformer well in many settings, there are some problems - like needing to find the token in the sequence that a given token is farthest from in vector space - that a full attention mechanism could solve easily (since it directly calculates all pairwise comparisons) but that a sparse attention mechanism would require many layers to calculate. Empirically, Big Bird ETC (a version which adds on additional tokens for the global aggregators, rather than making existing tokens serve thhttps://i.imgur.com/ks86OgJ.pnge purpose) performs the best on a big language model training objective, has comparable performance to existing models on questionhttps://i.imgur.com/x0BdamC.png answering, and pretty dramatic performance improvements in document summarization. It makes sense for summarization to be a place where this model in particular shines, because it's explicitly designed to be able to integrate information from very large contexts (albeit in a randomly sampled way), where full-attention architectures must, for reasons of memory limitation, do some variant of a sliding window approach. |

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size

Iandola, Forrest N. and Moskewicz, Matthew W. and Ashraf, Khalid and Han, Song and Dally, William J. and Keutzer, Kurt

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

Iandola, Forrest N. and Moskewicz, Matthew W. and Ashraf, Khalid and Han, Song and Dally, William J. and Keutzer, Kurt

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

[link]
$\bf Summary:$ The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels)$\times$(number of filters)$\times$(size of filter$\times$size of filter). The paper proposes 2 strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. They assume the budget of the filter is given, i,e., they do not tinker with the number of filters. Decrease in number of parameters will lead to less accuracy. To compensate, the authors propose to downsample late in the network. The results are quite impressive. Compared to AlexNet, they achieve a 50x reduction is model size while preserving the accuracy. Their model can be further compressed with existing methods like Deep Compression which are orthogonal to this paper's approach and this can give in total of around 510x reduction while still preserving accuracy of AlexNet. $\bf Question$: The impact on running times (specially on feed forward phase which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well or at least be *no worse* than the baseline models? |

About