nltk

Reading a TextGrid file into NLTK

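Question: I'm trying to read a TextGrid file into NLTK, but I'm having some trouble. I understand that there is a parser for TextGrid (as seen here: http://nltk.googlecode.com/svn/trunk/nltk_contrib/nltk_contrib/textgrid.py). Unfortunately, I'm new to NLTK, and I have no idea how to use the parser. Any help would be much appreciated.

Answer 1: Unfortunately, knowing NLTK doesn't help here: I looked at the source of textgrid, and although it was written by NLTK's core team, it has nothing in common with the other NLTK "corpus readers". I suggest studying the header of the source file and experimenting a little; the documentation there is adequate. To get you started: it looks like you can load a TextGrid file by passing an open file pointer to the constructor of the class TextGrid:

    fp = open("grid_file.praat")
    grid = TextGrid(fp)
    for tier in grid:
        pass  # do something with the Tier object

P.S. This isn't a very complete answer, but I can't include code snippets in a comment.

Answer 2: A bit late to the party, but here goes: you can save the TextGrid object as a JSON file and then read it into NLTK using the standard Python library, as shown in this answer. Praat does not (at this time) include a JSON converter, but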
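The JSON route mentioned in Answer 2 might look roughly like the sketch below. The JSON layout used here (a top-level "tiers" list whose entries carry "name" and "intervals", each interval with "xmin", "xmax", and "text") is purely an assumption for illustration; the real layout depends on whichever converter produced the file. The helper name interval_labels is likewise hypothetical.

```python
import json

# Hypothetical JSON export of a TextGrid; every field name here is an assumption.
SAMPLE = """
{"tiers": [
  {"name": "words",
   "intervals": [
     {"xmin": 0.0, "xmax": 0.4, "text": "hello"},
     {"xmin": 0.4, "xmax": 0.5, "text": ""},
     {"xmin": 0.5, "xmax": 0.9, "text": "world"}
   ]}
]}
"""

def interval_labels(json_text, tier_name):
    """Return the non-empty interval labels of one tier, in order."""
    grid = json.loads(json_text)
    for tier in grid["tiers"]:
        if tier["name"] == tier_name:
            # Skip silent intervals, which carry an empty label.
            return [iv["text"] for iv in tier["intervals"] if iv["text"].strip()]
    raise KeyError(f"no tier named {tier_name!r}")

tokens = interval_labels(SAMPLE, "words")
```

The resulting list of strings can then be handed to NLTK utilities that accept a sequence of tokens, such as nltk.FreqDist.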

2021-09-24 05:21:07    Category: Tech Sharing    nltk   praat

Drawing a flatten NLTK Parse Tree with NP chunks

Question: I want to analyze sentences with NLTK and display their chunks as a tree. NLTK provides the tree.draw() method for drawing a tree. The following code draws a tree for the sentence "the little yellow dog barked at the cat":

    import nltk
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    pattern = "NP: {<DT>?<JJ>*<NN>}"
    NPChunker = nltk.RegexpParser(pattern)
    result = NPChunker.parse(sentence)
    result.draw()

The result is this tree: How can I get a tree like this one, with one more level?

Answer 1: You need to "promote" your non-NP words; here is a trick:

    import nltk
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    pattern
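The answer's code is cut off in this excerpt, so the following is only a sketch of one way to "promote" words, not necessarily the answer's own trick. It wraps every (word, tag) pair, inside and outside the NP chunks, in its own single-leaf subtree labelled with the POS tag, which adds one level to the tree. The helper name promote_words is hypothetical.

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
result = chunker.parse(sentence)

def promote_words(tree):
    """Return a copy of `tree` in which each (word, tag) leaf becomes Tree(tag, [word])."""
    children = []
    for child in tree:
        if isinstance(child, nltk.Tree):
            children.append(promote_words(child))  # recurse into NP chunks
        else:
            word, tag = child
            children.append(nltk.Tree(tag, [word]))  # promote the bare word
    return nltk.Tree(tree.label(), children)

promoted = promote_words(result)
# promoted.draw() opens the Tk viewer; print(promoted) shows the bracketing instead.
```

Building a new tree rather than mutating result in place keeps the original chunker output available for comparison.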

2021-09-24 02:00:09    Category: Tech Sharing    python   tree   nlp   draw   nltk

Dropping row containing non-English words in pandas dataframe

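Question: I turned this Twitter corpus into a pandas DataFrame, and I'm trying to find the tweets that aren't in English and remove them from the DataFrame, so I did this:

    for j in range(0,150):
        if not wordnet.synsets(df.i[j]):  # Comparing if word is non-English
            df.drop(j)
    print(df.shape)

But I checked the shape, and no rows were dropped. Am I using the drop function incorrectly, or do I need to keep track of the row index?

Answer 1: That's because df.drop() returns a copy rather than modifying the original DataFrame. Try setting inplace=True:

    for j in range(0,150):
        if not wordnet.synsets(df.i[j]):  # Comparing if word is non-English
            df.drop(j, inplace=True)
    print(df.shape)

Answer 2: This will filter out all non-English rows in our pandas DataFrame.

    import nltk
    nltk.download('words')
    from nltk.corpus import words
    import pandas as pd
    data1 = pd.read_csv("testdata.csv")
    Word = list(set(words.words()))
    df_final = data1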
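As an alternative to dropping rows one at a time inside a loop, a boolean mask keeps only the rows that pass a test. The sketch below uses a tiny hand-written word set (ENGLISH_WORDS, a stand-in for a real lexicon such as the NLTK words corpus) and a hypothetical column name text; both are assumptions for illustration. It also shows that drop() leaves the original frame untouched unless its result is kept.

```python
import pandas as pd

# Stand-in lexicon (assumption); a real run would use a proper word list.
ENGLISH_WORDS = {"good", "morning", "the", "weather", "is", "nice"}

df = pd.DataFrame({"text": ["good morning", "buenos dias", "the weather is nice"]})

def looks_english(tweet):
    """True if every whitespace-separated word is in the stand-in lexicon."""
    return all(word.lower() in ENGLISH_WORDS for word in tweet.split())

mask = df["text"].apply(looks_english)        # one boolean per row
df_english = df[mask].reset_index(drop=True)  # keep only the rows that passed

unchanged = df.drop(1)  # returns a new frame; df itself still has 3 rows
```

Filtering with a mask avoids mutating the frame while iterating over it, which is where index bookkeeping tends to go wrong.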

2021-09-23 23:05:57    Category: Tech Sharing    python   pandas   nltk

Force Stanford CoreNLP Parser to Prioritize 'S' Label at Root Level

Greetings NLP Experts, I am using the Stanford CoreNLP software package to produce constituency parses, using the most recent version (3.9.2) of the English language models JAR, downloaded from the CoreNLP Download page. I access the parser via the Python interface from the NLTK module nltk.parse.corenlp. Here is a snippet from the top of my main module:

    import nltk
    from nltk.tree import ParentedTree
    from nltk.parse.corenlp import CoreNLPParser
    parser = CoreNLPParser(url='http://localhost:9000')

I also fire up the server using the following (fairly generic) call from the terminal:

    java -mx4g

2021-09-23 22:55:41    Category: Q&A    python   nlp   nltk   stanford-nlp

Dropping specific words out of an NLTK distribution beyond stopwords

Question: I have a simple sentence like this. I want to remove prepositions and words such as A and IT from the list. I looked through the Natural Language Toolkit (NLTK) documentation but couldn't find anything. Can someone show me how to do this? Here is my code:

    import nltk
    from nltk.tokenize import RegexpTokenizer
    test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
    test = test.upper()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(test)
    fdist = nltk.FreqDist(tokens)
    common = fdist.most_common(100)

Answer 1: Essentially, nltk.probability.FreqDist is a collections.Counter object (https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L61). Given a dictionary object, there are several ways to filter it: 1. Read it into a FreqDist and filter it with a lambda function

    >>> import nltk
    >>>
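Since the answer describes FreqDist as a collections.Counter, dictionary-style filtering should carry over. The sketch below uses collections.Counter and re directly so that it runs without NLTK; the unwanted set is a hypothetical list of words to drop, and this is not the answer's own (truncated) code.

```python
import re
from collections import Counter

test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
tokens = re.findall(r"\w+", test.upper())  # same pattern as the RegexpTokenizer above

# Hypothetical extra words to remove, on top of any stopword list.
unwanted = {"A", "IT", "IN", "WITH"}

counts = Counter(tokens)
# Rebuild the counter, keeping only the words that are not in the unwanted set.
filtered = Counter({word: n for word, n in counts.items() if word not in unwanted})
common = filtered.most_common(100)
```

Filtering the token list before counting would give the same result; filtering afterwards keeps the full counts around in case they are needed.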

2021-09-23 21:55:16    Category: Tech Sharing    python   list   nltk

How to interpret Python NLTK bigram likelihood ratios?

I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).

    import nltk.collocations
    import nltk.corpus
    import collections

    bgm = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_words(nltk.corpus.brown.words())
    scored = finder.score_ngrams(bgm.likelihood_ratio)

    # Group bigrams by first word in bigram.
    prefix_keys = collections.defaultdict(list)
    for key, scores in scored:
        prefix_keys[key[0]].append((key[1], scores))

    for key in prefix_keys:
        prefix_keys[key].sort(key = lambda x: -x[1])
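One way to build intuition for such a score is to compute a log-likelihood ratio statistic by hand. The sketch below assumes the score is a Dunning-style G² statistic over a 2x2 contingency table of bigram counts; whether NLTK uses exactly this variant (and what smoothing it applies) would need to be checked against its source. The function and argument names are hypothetical.

```python
import math

def log_likelihood_ratio(both, first_only, second_only, neither):
    """G^2 for a 2x2 table: 2 * sum(observed * ln(observed / expected)).

    both        -- bigrams (w1, w2)
    first_only  -- bigrams (w1, not w2)
    second_only -- bigrams (not w1, w2)
    neither     -- all remaining bigrams
    """
    observed = ((both, first_only), (second_only, neither))
    total = both + first_only + second_only + neither
    rows = (both + first_only, second_only + neither)
    cols = (both + second_only, first_only + neither)
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            obs = observed[r][c]
            if obs > 0:  # a zero cell contributes nothing to the sum
                expected = rows[r] * cols[c] / total  # count expected under independence
                g2 += obs * math.log(obs / expected)
    return 2.0 * g2

independent = log_likelihood_ratio(10, 10, 10, 10)  # w1 and w2 co-occur at chance level
associated = log_likelihood_ratio(30, 10, 10, 50)   # w1 and w2 co-occur more than chance
```

Under this reading the score is not a probability: it is zero when the two words co-occur exactly as often as independence predicts, and it grows as the observed counts depart from that expectation.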

2021-09-23 20:50:56    Category: Q&A    nlp   nltk   n-gram

what is the difference between tfidf vectorizer and tfidf transformer

I know that the formula for the tfidf vectorizer is (count of word / total count) * log(number of documents / number of documents where the word is present). I saw there's a tfidf transformer in scikit-learn and I just wanted to know the difference between them. I couldn't find anything helpful.
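The formula quoted in the question can be written out directly, as in the minimal sketch below (plain Python, with a hypothetical helper name tf_idf and a toy corpus). Note that this is the textbook form: scikit-learn's defaults differ (a smoothed idf plus L2 normalization), and its documentation describes TfidfVectorizer as equivalent to CountVectorizer followed by TfidfTransformer, so the transformer expects a count matrix while the vectorizer starts from raw text.

```python
import math
from collections import Counter

# Toy corpus of pre-tokenized documents (assumption for illustration).
docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]

def tf_idf(term, doc, corpus):
    """(count of term / total terms in doc) * log(N / documents containing term)."""
    tf = Counter(doc)[term] / len(doc)
    containing = sum(1 for d in corpus if term in d)
    if containing == 0:
        return 0.0  # term never occurs; avoid dividing by zero
    return tf * math.log(len(corpus) / containing)

score_the = tf_idf("the", docs[0], docs)  # appears in every document, so idf is 0
score_cat = tf_idf("cat", docs[0], docs)  # appears in one of two documents
```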

2021-09-23 19:48:21    Category: Q&A    python   scikit-learn   nltk   tf-idf   tfidfvectorizer

How do you make NLTK draw() trees that are inline in iPython/Jupyter

Question: For Matplotlib plots in iPython/Jupyter, you can make the notebook render plots inline with %matplotlib inline. How can I do the same for NLTK's draw() for trees? Here is the documentation: http://www.nltk.org/api/nltk.draw.html

Answer 1: Based on this answer:

    import os
    from IPython.display import Image, display
    from nltk.draw import TreeWidget
    from nltk.draw.util import CanvasFrame

    def jupyter_draw_nltk_tree(tree):
        cf = CanvasFrame()
        tc = TreeWidget(cf.canvas(), tree)
        tc['node_font'] = 'arial 13 bold'
        tc['leaf_font'] = 'arial 14'
        tc['node_color'] = '#005990'
        tc['leaf_color'] = '#3F8F57'
        tc['line_color'] = '#175252'
        cf.add_widget(tc, 10, 10)
        cf.print_to_file('tmp_tree_output.ps')
        cf.destroy()
        os

2021-09-23 19:48:08    Category: Tech Sharing    ipython   draw   nltk   jupyter

Reading a TextGrid file into NLTK

I'm trying to read a TextGrid file into NLTK, but I'm having some trouble. I understand that there is a parser for TextGrid (as seen here: http://nltk.googlecode.com/svn/trunk/nltk_contrib/nltk_contrib/textgrid.py). Unfortunately, I'm new to NLTK, and I have no idea how to use the parser. Any help would be much appreciated.

2021-09-23 17:31:28    Category: Q&A    nltk   praat

How to add punctuation to text using python?

Question: I am using the IBM Watson Speech To Text Service API. For those who don't know, this service is used to transcribe audio: you upload an audio file to the service and it returns the text. So far the service has worked well, but the problem is that the returned text contains no punctuation. I tried to solve this with nltk, but without success. Here is some of the nltk code I tried:

    #string is the text
    string = """Hey guys today I'm gonna show you how to make bulletproof coffee so free guys and never heard a blooper coffee it's been around since like two thousand two been around for awhile but i think it's been a lot more popular now probably within the last maybe year two years so for me I recently just started doing bulletproof coffee say about a couple so basically regular coffee you know you get your regular coffee

2021-09-22 09:57:08    Category: Tech Sharing    python   nltk