Text Classification
supervised learning: machines learn from past instances.
training phase: information is gathered, and a model is built.
inference phase: the model is applied to unlabeled data.
labeled inputs: labels are known.
model is built: the model built during training is the classification model.
unlabeled input -> classification model -> labeled outputs.
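A minimal sketch of this workflow, assuming scikit-learn and a tiny hypothetical corpus: training builds the classification model from labeled instances, then inference applies it to unlabeled text.

```python
# Minimal train -> infer sketch with scikit-learn; the corpus and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_docs = ["the movie was great fun", "the algorithm runs in linear time"]
labels = ["entertainment", "CS"]

# Training phase: gather labeled instances and build the classification model.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(labeled_docs)
model = MultinomialNB().fit(X_train, labels)

# Inference phase: apply the model to unlabeled input to get labeled output.
X_new = vectorizer.transform(["a fun python library for parsing"])
print(model.predict(X_new))
```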
Learn a classification model on properties (“features”) and their importance (“weights”) from labeled instances.
- X: the set of attributes or features {x1, x2, …, xn}; the input.
- y: a “class” label drawn from the label set Y = {y1, y2, …, yk}
Apply the model to instances to predict the label.
Validation set: the portion of the labeled training data held out to test the model on.
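A small sketch of holding out a validation set with scikit-learn's train_test_split; the feature vectors and labels below are made up.

```python
# Hold out part of the labeled training data as a validation set (data is hypothetical).
from sklearn.model_selection import train_test_split

X = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1], [1, 2]]  # feature vectors {x1, ..., xn}
y = ["spam", "ham", "spam", "ham", "spam", "ham"]     # class labels from Y

# Reserve 25% of the labeled data for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
print(len(X_train), len(X_val))
```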
Classification Paradigms.
- binary classification: the number of possible classes is two. |Y| = 2
- multi-class: number of classes is greater than two. |Y| > 2
- multi-label classification: each instance can have two or more labels.
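A sketch of how the label sets differ across the three paradigms; the labels below are hypothetical.

```python
# Binary classification: |Y| = 2
y_binary = ["spam", "ham", "spam"]

# Multi-class classification: |Y| > 2, but still one label per instance
y_multiclass = ["entertainment", "CS", "zoology"]

# Multi-label classification: an instance may carry several labels;
# a common representation is a binary indicator matrix (one column per label).
y_multilabel = [
    [1, 0, 1],  # instance 1 has labels 1 and 3
    [0, 1, 0],  # instance 2 has label 2 only
]
```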
Training phase:
- what are the features and how do you represent them?
- What is the classification model or algorithm?
- What are the model parameters?
Inference phase:
- What is the expected performance? What is a good measure?
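One common answer to the performance question, sketched with scikit-learn's metrics on hypothetical predictions, is accuracy together with a class-averaged F1 score.

```python
# Evaluate predictions from the inference phase (labels here are hypothetical).
from sklearn.metrics import accuracy_score, f1_score

y_true = ["CS", "entertainment", "CS", "zoology"]
y_pred = ["CS", "CS", "CS", "zoology"]

print(accuracy_score(y_true, y_pred))             # fraction of instances labeled correctly
print(f1_score(y_true, y_pred, average="macro"))  # F1 averaged over the classes
```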
Identifying Features from Text
Types of textual features (a feature-extraction sketch follows this list):
- words
- stop words: commonly occurring words such as “the”
- normalization: e.g., lowercasing all tokens
- stemming/lemmatizing: reduce words to a base form so plurals match their singulars
- case: capitalization can itself carry meaning
- White House vs. white house
- parts of speech
- whether (conjunction) vs. weather (noun)
- grammatical structure, sentence parsing
- semantics: one feature for a group of words with similar meaning
- {buy, purchase}
- honorifics, numbers, dates
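A sketch of extracting several of these features with NLTK, assuming the punkt, stopwords, wordnet, and averaged_perceptron_tagger resources have already been downloaded; the sentence is made up.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "The White House said the dates of the meetings would change"

tokens = nltk.word_tokenize(text)                    # words
lowered = [t.lower() for t in tokens]                # normalization: lowercase
content = [t for t in lowered
           if t not in stopwords.words("english")]   # remove stop words
lemmas = [WordNetLemmatizer().lemmatize(t)
          for t in content]                          # plurals -> singulars
pos_tags = nltk.pos_tag(tokens)                      # parts of speech

print(lemmas)
print(pos_tags)
```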
Naive Bayes Classifiers
- prior probability: the probability of each class before any evidence is seen
- Pr(y=entertainment), Pr(y=CS), Pr(y=zoology)
- sum equals 1
- update the probability of the class given new information (the evidence).
- posterior probability: Pr(y=entertainment|x=’Python’)
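A sketch of estimating the prior probabilities from label counts in the training data (the counts below are made up); note that they sum to 1.

```python
from collections import Counter

# Hypothetical training labels.
train_labels = ["entertainment"] * 6 + ["CS"] * 3 + ["zoology"] * 1

counts = Counter(train_labels)
total = sum(counts.values())
priors = {label: n / total for label, n in counts.items()}

print(priors)                # {'entertainment': 0.6, 'CS': 0.3, 'zoology': 0.1}
print(sum(priors.values()))  # 1.0
```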
Bayes’ Rule
- \( \text{posterior probability} = \frac{\text{prior probability} \times \text{likelihood}}{\text{evidence}} \)
- \( Pr(y|X) = \frac{Pr(y) \times Pr(X|y)}{Pr(X)} \)
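As a worked illustration with assumed numbers (all three probabilities below are hypothetical): suppose \( Pr(y=\text{CS}) = 0.3 \), \( Pr(x=\text{'Python'} \mid y=\text{CS}) = 0.4 \), and \( Pr(x=\text{'Python'}) = 0.15 \). Then
- \( Pr(y=\text{CS} \mid x=\text{'Python'}) = \frac{0.3 \times 0.4}{0.15} = 0.8 \)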