Last Updated on November 20, 2022 by David Vause
The DataFrame apply() Function
An example using the Natural Language Toolkit (NLTK)
moby_raw holds the raw text of Moby Dick, read from the file moby.txt.
The first assignment to sentences produces a list of strings, one per sentence.
The second assignment makes sentences a DataFrame with one column, sentence.
The next statement adds a column, count.
With axis=1, apply() calls count_t() once for each row in sentences.
Overall, the code reads moby_raw from moby.txt and finds the average number of tokens (words and punctuation marks) per sentence.
import nltk
import pandas as pd
import numpy as np
# If you would like to work with the raw text you can use 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
def count_t(row):
    row['count'] = len(nltk.word_tokenize(row['sentence']))
    return row
sentences = nltk.sent_tokenize(moby_raw)
sentences = pd.DataFrame(sentences, columns=['sentence'])
sentences['count'] = ''
sentences = sentences.apply(count_t, axis=1)
mean = sentences['count'].mean()
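The row-wise pattern above can be tried without the Moby Dick file. The following is a minimal sketch, not the original code: it assumes two hand-typed sample sentences and uses str.split() as a stand-in for nltk.word_tokenize, so the counts are whitespace-separated words rather than NLTK tokens. It also shows a column-wise alternative that calls apply() on the sentence column alone, which may be simpler when only one column is needed.

```python
import pandas as pd


def count_t(row):
    # Assumption: str.split() stands in for nltk.word_tokenize here,
    # so punctuation is not counted as a separate token.
    row['count'] = len(row['sentence'].split())
    return row


# Two hand-typed sample sentences (illustrative data, not read from moby.txt).
sentences = pd.DataFrame(
    ['Call me Ishmael.', 'It is a damp, drizzly November in my soul.'],
    columns=['sentence'],
)
sentences['count'] = 0

# Row-wise: axis=1 passes each row to count_t as a Series.
by_row = sentences.apply(count_t, axis=1)

# Column-wise: Series.apply passes each sentence string to the function.
by_column = sentences['sentence'].apply(lambda s: len(s.split()))

mean_by_row = by_row['count'].mean()
mean_by_column = by_column.mean()
```

Both approaches give the same counts; the column-wise form avoids building a Series for every row.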
Unpacking a List of Tuples
Another example from the Natural Language Toolkit
moby_tokens contains a list of tokens in the text of Moby Dick:
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', '
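nltk.pos_tag returns a list of tuples, each containing a word and an abbreviation for its part of speech. The list comprehension that reassigns pos_lst keeps only the abbreviation (tup[1]) from each tuple.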
import nltk
import pandas as pd
import numpy as np
moby_tokens = nltk.word_tokenize(moby_raw)
pos_lst = nltk.pos_tag(moby_tokens)
pos_lst = [tup[1] for tup in pos_lst]
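The unpacking step can be sketched on its own, without running the tagger. In the example below, tagged is a hand-written list in the same (word, tag) shape that nltk.pos_tag returns; it is illustrative data, not actual tagger output. The names tagged, tags_via_zip and tag_counts are introduced here for illustration only.

```python
from collections import Counter

# Hand-written sample in the (word, tag) shape; not real nltk.pos_tag output.
tagged = [
    ('Moby', 'NNP'),
    ('Dick', 'NNP'),
    ('by', 'IN'),
    ('Herman', 'NNP'),
    ('Melville', 'NNP'),
]

# Tuple unpacking in the comprehension names both parts of each tuple.
tags = [tag for _word, tag in tagged]

# zip(*...) splits the list of pairs into two parallel tuples.
words, tags_via_zip = zip(*tagged)

# Counting the tags is a common next step once they are in a flat list.
tag_counts = Counter(tags)
```

Unpacking with `for _word, tag in tagged` does the same job as indexing with tup[1], and it makes clear which element of the tuple is being kept.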