Last Updated on November 20, 2022 by David Vause

The DataFrame apply() Function

An example using the Natural Language Toolkit (NLTK)

moby_raw holds the raw text of Moby Dick, read from the file moby.txt.
The first assignment to sentences produces a list of sentence strings.
The second assignment turns sentences into a DataFrame with one column, sentence.
The next statement adds a column, count.
The apply function, called with axis=1, passes each row of sentences to count_t(), which stores the number of tokens for that row in count.

Overall, the code reads moby_raw from moby.txt and finds the average number of tokens (words and punctuation marks) per sentence.

 

import nltk
import pandas as pd
import numpy as np

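# Read the raw text of the novel into 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()

def count_t(row):
    # Store the number of tokens in this row's sentence
    row['count'] = len(nltk.word_tokenize(row['sentence']))
    return row

# Split the text into a list of sentence strings
sentences = nltk.sent_tokenize(moby_raw)
# Wrap the list in a one-column DataFrame
sentences = pd.DataFrame(sentences, columns=['sentence'])
# Add a placeholder column for count_t() to fill in
sentences['count'] = ''
# axis=1 passes each row to count_t()
sentences = sentences.apply(count_t, axis=1)
# Average number of tokens per sentence
mean = sentences['count'].mean()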
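
When only one column is needed, the function can also be applied to that column directly instead of to whole rows. The following is a minimal sketch of that variation, not the code used above. It assumes two hand-written sample sentences in place of moby.txt, and the hypothetical helper count_words() uses str.split() as a simplified stand-in for nltk.word_tokenize so that it runs without the NLTK data files.

```python
import pandas as pd

# Hypothetical sample sentences standing in for nltk.sent_tokenize(moby_raw)
sample = [
    "Call me Ishmael.",
    "It is a damp, drizzly November in my soul.",
]

def count_words(sentence):
    # Assumption: whitespace splitting is a simplified stand-in
    # for nltk.word_tokenize, which also separates punctuation
    return len(sentence.split())

sentences = pd.DataFrame(sample, columns=["sentence"])

# Series.apply passes each cell of the column to count_words(),
# so no placeholder column and no axis argument are needed
sentences["count"] = sentences["sentence"].apply(count_words)

mean = sentences["count"].mean()
print(sentences)
print(mean)
```

Applying the function to a single column avoids building a full row object for every sentence, which tends to matter on a text as long as a novel.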

Unpacking a List of Tuples

Another example using the Natural Language Toolkit (NLTK)

moby_tokens contains a list of tokens in the text of Moby Dick:

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ...]

nltk.pos_tag returns a list of tuples, each containing a word and an abbreviation (tag) for its part of speech. The list comprehension in the last line keeps only the tag from each tuple.

 



import nltk
import pandas as pd
import numpy as np

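# moby_raw is the raw text read in the previous example
moby_tokens = nltk.word_tokenize(moby_raw)
# Each element of pos_lst is a (word, tag) tuple
pos_lst = nltk.pos_tag(moby_tokens)

# Keep only the tag, the second item of each tuple
pos_lst = [tup[1] for tup in pos_lst]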
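
If both the words and the tags are wanted, zip(*...) can unpack the list of tuples into two sequences in one step. This is a small sketch under stated assumptions: the tagged list below is hand-written sample data in the shape that nltk.pos_tag returns, not actual NLTK output, and the names tagged, words, tags and tag_counts are only for illustration.

```python
from collections import Counter

# Hypothetical (word, tag) tuples in the shape returned by nltk.pos_tag
tagged = [
    ("Moby", "NNP"),
    ("Dick", "NNP"),
    ("by", "IN"),
    ("Herman", "NNP"),
    ("Melville", "NNP"),
]

# zip(*tagged) regroups the tuples by position:
# all first items together, then all second items
words, tags = zip(*tagged)

# Count how often each part-of-speech tag occurs
tag_counts = Counter(tags)

print(words)
print(tag_counts.most_common(1))
```

Note that zip(*...) returns tuples rather than lists, and it raises a ValueError on unpacking if the input list is empty, so the list comprehension is the safer choice when the token list might be empty.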

 

 
