
How to call the ClassifierBasedTagger() in NLTK

I have followed the documentation from the NLTK book (chapters 6 and 7) and other ideas to train my own model for named entity recognition. I built a feature function and a chunker that wraps ClassifierBasedTagger like this:

from collections.abc import Iterable  # on Python 2: from collections import Iterable

from nltk.tag import ClassifierBasedTagger
from nltk.chunk import ChunkParserI, conlltags2tree, tree2conlltags

class NamedEntityChunker(ChunkParserI):
    def __init__(self, train_sents, feature_detector=features, **kwargs):
        assert isinstance(train_sents, Iterable)

        # Convert each training tree to ((word, pos), iob) pairs for the tagger
        tagged_sents = [[((w, t), c) for (w, t, c) in tree2conlltags(sent)]
                        for sent in train_sents]

        # other possible option: self.feature_detector = features
        self.tagger = ClassifierBasedTagger(tagged_sents, feature_detector=feature_detector, **kwargs)

    def parse(self, tagged_sent):
        chunks = self.tagger.tag(tagged_sent)

        iob_triplets = [(w, t, c) for ((w, t), c) in chunks]

        # Transform the list of triplets to nltk.Tree format
        return conlltags2tree(iob_triplets)

I am having problems when calling the classifier-based tagger from another script where I load my training and test data. For testing purposes I call it on a portion of my training data:

chunker = NamedEntityChunker(training_samples[:500])
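For reference, the chunker expects training_samples to be a list of nltk.Tree objects with POS-tagged leaves and entity subtrees, since __init__ runs each one through tree2conlltags. Below is a tiny illustrative sketch of that format; the sentence, tags, and entity labels are made up, and the real features function is assumed to exist:

from nltk.chunk import conlltags2tree

# One sentence as (word, pos, iob) triplets, converted to the tree format
# that NamedEntityChunker.__init__ expects.
triplets = [
    ("Angela", "NNP", "B-PERSON"),
    ("Merkel", "NNP", "I-PERSON"),
    ("visited", "VBD", "O"),
    ("Paris", "NNP", "B-GPE"),
    (".", ".", "O"),
]
training_samples = [conlltags2tree(triplets)]  # a list of nltk.Tree objects
chunker = NamedEntityChunker(training_samples[:500])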

No matter what I change in my classifier, I keep getting the error:

   self.tagger = ClassifierBasedTagger(tagged_sents, feature_detector=feature_detector, **kwargs)
TypeError: __init__() got multiple values for argument 'feature_detector'

What am I doing wrong here? I assumed the feature function is working fine and that I don't have to pass anything else when calling NamedEntityChunker().

My second question: is there a way to save the trained model and reuse it later? How can I approach this? This is a follow-up to my last question on training data.

Thanks for any advice.

Comments

I finally realised what I was missing: the training sentences have to be passed to ClassifierBasedTagger through the train keyword argument rather than positionally, like this:

self.tagger = ClassifierBasedTagger(train=tagged_sents, feature_detector=feature_detector, **kwargs)

Now when I call NamedEntityChunker(), everything works.
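For the second question (saving the trained model), pickling the whole chunker seems to be the simplest route. This is a minimal sketch, assuming the feature function is a plain module-level function so the object pickles cleanly; the file name is arbitrary:

import pickle

from nltk import pos_tag, word_tokenize

# Train once and serialize the whole chunker (classifier included) to disk.
chunker = NamedEntityChunker(training_samples[:500])
with open("ner_chunker.pickle", "wb") as f:
    pickle.dump(chunker, f)

# Later, in another script, load it back and use it without retraining.
with open("ner_chunker.pickle", "rb") as f:
    chunker = pickle.load(f)

print(chunker.parse(pos_tag(word_tokenize("I will visit Germany next week."))))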

Are you sure your code is exactly as you report it? This should not produce the problem you report; but you will get this behavior if you pass a keyword argument that is also a key in the kwargs dictionary:

>>> def test(a, b):   # In fact the signature of `test` is irrelevant
...     pass
>>> args = {'a': 1, 'b': 2}
>>> test(a=0, **args)
TypeError: test() got multiple values for keyword argument 'a'

So, figure out where the problem arises and fix it. Have your methods print out their arguments to help you debug the problem.
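For instance, a temporary print at the top of __init__ (a sketch based on the class from the question, not a definitive fix) shows whether 'feature_detector' is also arriving inside kwargs:

class NamedEntityChunker(ChunkParserI):
    def __init__(self, train_sents, feature_detector=features, **kwargs):
        # Debugging aid: if 'feature_detector' also appears in kwargs, it will
        # be forwarded twice to ClassifierBasedTagger and cause the TypeError.
        print("feature_detector:", feature_detector)
        print("extra kwargs:", kwargs)
        # ... rest of __init__ unchanged ...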

