# Pavlick and Kwiatkowski: Inherent Disagreements in Human Textual Inferences

Tags: nlu, papers

## Summary

• Argues that disagreement in human annotations is not just statistical noise
• More context does not necessarily mean more agreement; once the input reaches sentence/passage length, disagreement holds relatively steady
• We assume what people write is what they mean, but actually there’s a pretty big gap between what people mean <-> what they write <-> what they interpret <-> what they infer. NLP models currently only model the 2nd and 3rd arrows.
• Punted on precise definitions of the “real world” and instead tried to have their models approximate “what humans do”

## Annotations

• Uses a combination of RTE2, SNLI, MNLI, JOCI, and DNC
• RTE2 - premise/hypothesis pairs
• SNLI - premises drawn from image captions, hypotheses written by crowdworkers
• MNLI - same collection procedure as SNLI, but with premises from a range of text genres
• JOCI - “common sense” inferences
• DNC - mostly naturally occurring premises paired with template-generated hypotheses

### Preprocessing

• Had 500 workers rate items on a continuous slider
• A continuous scale isn’t always directly comparable across annotators, so they z-score normalized the ratings to make all the data make sense
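The per-annotator z-score normalization described above can be sketched like this; the annotator IDs and raw slider values are made up for illustration:

```python
import numpy as np

# Hypothetical raw slider ratings (0-100), keyed by annotator.
# "a1" only uses the top of the scale; "a2" spreads across it.
raw = {
    "a1": [90.0, 85.0, 95.0],
    "a2": [10.0, 50.0, 30.0],
}

def z_normalize(values):
    """Map one annotator's ratings to zero mean / unit variance,
    so annotators who use the slider differently become comparable."""
    v = np.asarray(values, dtype=float)
    std = v.std()
    return (v - v.mean()) / std if std > 0 else v - v.mean()

normalized = {worker: z_normalize(vals) for worker, vals in raw.items()}
```

After normalization, both annotators' ratings live on the same scale, so their judgments on a shared item can be pooled.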

## Analysis

• Ultimately wanted to judge how much “noise” exists in the annotation process
• If there is a single truth, then the “noise” should be well modeled by a single Gaussian distribution
• If there are multiple “truths”, then a Gaussian mixture model (GMM) should fit better
• Assumption: if a single truth exists, the fitted GMM would be effectively identical to the single Gaussian
• In practice, the fit prefers the GMM over the single Gaussian
• Example: is the word “swat” forceful? Is “confess that” factive?
• In the first analysis, they note that the annotations are better modeled by the GMM, implying that humans believe there are multiple “truths”
• NLP models currently assume there is only one truth to model
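The single-Gaussian vs. GMM comparison can be sketched with scikit-learn; note this uses BIC for model selection on invented bimodal ratings, which is an assumption, not necessarily the paper's exact fitting procedure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical z-normalized annotations for one item: half the raters
# cluster near -1 ("not entailed"), half near +1 ("entailed").
ratings = np.concatenate([
    rng.normal(-1.0, 0.2, 250),
    rng.normal(+1.0, 0.2, 250),
]).reshape(-1, 1)

single = GaussianMixture(n_components=1, random_state=0).fit(ratings)
mixture = GaussianMixture(n_components=2, random_state=0).fit(ratings)

# Lower BIC = better fit after penalizing the mixture's extra parameters.
# If the mixture still wins, the disagreement is structured, not noise.
prefer_mixture = mixture.bic(ratings) < single.bic(ratings)
```

If annotators really did share a single truth plus noise, the two-component fit would collapse onto one mode and the BIC penalty would favor the single Gaussian.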

### Context

• Sampled sentences from Wikipedia, treated each sentence as a premise, and generated a hypothesis by replacing a word from the premise with a substitute word, where the substitute is either a hypernym/hyponym, antonym, or co-hyponym
• Collected ratings at 3 levels
• word
• sentence
• passage
• Disagreements among raters actually increase when more context is shown
• Definitely some confounds here, though
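One simple way to quantify "disagreement" at each context level is the per-item standard deviation of ratings; all of the numbers below are hypothetical:

```python
import numpy as np

# Hypothetical z-normalized ratings for one substitution pair,
# collected at the three context levels.
ratings = {
    "word":     np.array([0.5, 0.6, 0.4, 0.5, 0.6]),
    "sentence": np.array([0.2, 0.8, -0.1, 0.9, 0.3]),
    "passage":  np.array([-0.5, 0.9, 0.1, 1.0, -0.2]),
}

# Rater standard deviation as a crude disagreement measure: higher
# spread across raters = less consensus on the inference judgment.
disagreement = {level: float(v.std()) for level, v in ratings.items()}
```

Under the paper's finding, this measure would grow (or at least not shrink) as context expands from word to passage.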

## Model Predictions

• Is this a problem at all?
• What if the underlying label distributions already reflect the distributions observed in human judgments, and the models already adequately capture that with a softmax?
• Since NLI is usually treated as classification, they discretize the z-normalized human scores by mapping them into bins
• entailment / contradiction / neutral label bins
• They fine-tuned a pretrained BERT on the discrete labels
• Tested how well BERT’s softmax captured the underlying multi-modal human distribution
• Found that the softmax is a poor approximation
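The mismatch between a model's softmax output and the human label distribution can be measured with, e.g., KL divergence; the distributions below are invented for illustration, and the paper's exact comparison metric may differ:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions,
    with a small epsilon to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical item (labels: entailment, neutral, contradiction).
# Raters split between two "truths"; the model commits to one.
human_dist = [0.45, 0.10, 0.45]
model_softmax = [0.70, 0.20, 0.10]

gap = kl_divergence(human_dist, model_softmax)  # > 0 means a mismatch
```

A softmax trained on discretized majority labels tends to put its mass on one class, so on items where raters genuinely split, this gap stays large.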