1 |
What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations ...
|
|
|
|
BASE
|
|
2 |
Benchmarking Scalable Methods for Streaming Cross Document Entity Coreference ...
|
|
|
|
BASE
|
|
3 |
Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP ...
|
|
|
|
BASE
|
|
4 |
Enforcing Consistency in Weakly Supervised Semantic Parsing ...
|
|
|
|
BASE
|
|
5 |
Competency Problems: On Finding and Removing Artifacts in Language Data ...
|
|
|
|
BASE
|
|
6 |
Beyond Accuracy: Behavioral Testing of NLP models with CheckList ...
|
|
|
|
Abstract:
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many ...
|
|
Keyword:
Computation and Language cs.CL; FOS Computer and information sciences; Machine Learning cs.LG
|
|
URL: https://dx.doi.org/10.48550/arxiv.2005.04118 https://arxiv.org/abs/2005.04118
|
|
BASE
|
|
7 |
Evaluating Models' Local Decision Boundaries via Contrast Sets ...
|
|
|
|
BASE
|
|
9 |
Distributed Non-Parametric Representations for Vital Filtering: UW at TREC KBA 2014
|
|
|
|
In: DTIC (2014)
|
|
BASE
|
|
14 |
A Pilot Study on Gender Differences in Conversational Speech on Lexical Richness Measures
|
|
|
|
BASE
|
|
15 |
Evaluation of an objective technique for analysing temporal variables in DAT spontaneous speech
|
|
|
|
BASE