1 | What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations ...

BASE

2 | Benchmarking Scalable Methods for Streaming Cross Document Entity Coreference ...

3 | Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP ...

4 | Enforcing Consistency in Weakly Supervised Semantic Parsing ...

5 | Competency Problems: On Finding and Removing Artifacts in Language Data ...

6 | Beyond Accuracy: Behavioral Testing of NLP models with CheckList ...

7 | Evaluating Models' Local Decision Boundaries via Contrast Sets ...

Gardner, Matt; Artzi, Yoav; Basmova, Victoria; Berant, Jonathan; Bogin, Ben; Chen, Sihao; Dasigi, Pradeep; Dua, Dheeru; Elazar, Yanai; Gottumukkala, Ananth; Gupta, Nitish; Hajishirzi, Hanna; Ilharco, Gabriel; Khashabi, Daniel; Lin, Kevin; Liu, Jiangming; Liu, Nelson F.; Mulcaire, Phoebe; Ning, Qiang; Singh, Sameer; Smith, Noah A.; Subramanian, Sanjay; Tsarfaty, Reut; Wallace, Eric; Zhang, Ally; Zhou, Ben. arXiv, 2020

Abstract:
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model ...
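The perturb-then-evaluate idea described in this abstract can be sketched in a few lines; the toy classifier, example instances, and the `contrast_consistency` helper below are invented for illustration and are not the paper's released code or datasets:

```python
# Hypothetical sketch of contrast-set evaluation (all names invented
# for illustration; not the authors' implementation).

def toy_sentiment_model(text):
    # Stand-in classifier: a brittle keyword rule, the kind of "simple
    # decision rule" that standard test sets can fail to expose.
    negative_cues = {"boring", "awful", "never"}
    words = set(text.lower().replace(".", "").split())
    return "neg" if words & negative_cues else "pos"

# One original test instance plus small, meaningful perturbations that
# flip the gold label -- the contrast set around that instance.
original = ("The movie was thrilling from start to finish.", "pos")
contrast_set = [
    ("The movie was boring from start to finish.", "neg"),
    ("The movie was never thrilling.", "neg"),
]

def contrast_consistency(model, original, contrast_set):
    # The model is "consistent" on this local neighborhood only if it
    # labels the original AND every perturbation correctly.
    return all(model(text) == gold
               for text, gold in [original] + contrast_set)

print(contrast_consistency(toy_sentiment_model, original, contrast_set))
```

Aggregating this all-correct check over every annotated instance gives a contrast-consistency score, a stricter measure of a model's local decision boundary than per-example accuracy.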

Keyword:
Computation and Language cs.CL; FOS Computer and information sciences

URL: https://arxiv.org/abs/2004.02709 https://dx.doi.org/10.48550/arxiv.2004.02709

9 | Distributed Non-Parametric Representations for Vital Filtering: UW at TREC KBA 2014

In: DTIC (2014)

14 | A Pilot Study on Gender Differences in Conversational Speech on Lexical Richness Measures

15 | Evaluation of an objective technique for analysing temporal variables in DAT spontaneous speech