
Naive Bayes classifier dynamic training

Is it possible (and if so, how) to dynamically train a scikit-learn MultinomialNB classifier? I would like to train (update) my spam classifier every time I feed an email into it.

I want this (which does not work):

from sklearn.model_selection import train_test_split as tts
from sklearn.naive_bayes import MultinomialNB

x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    # fit() retrains from scratch on each call, so the final model
    # only reflects the very last sample
    clf.fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)

and get a similar result to this (which works fine):

x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
clf.fit(x_train, y_train)
preds = clf.predict(x_test)

Answer

Scikit-learn supports incremental learning for a number of estimators, including MultinomialNB; see the scikit-learn documentation on out-of-core learning.

You'll need to use the partial_fit() method instead of fit(), so your example code would look like:

import numpy
from sklearn.model_selection import train_test_split as tts
from sklearn.naive_bayes import MultinomialNB

x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    if i == 0:
        # the full set of classes must be passed on the first call to partial_fit()
        clf.partial_fit([x_train[i]], [y_train[i]], classes=numpy.unique(y_train))
    else:
        clf.partial_fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)

Edit: added the classes argument to partial_fit, as suggested by @BobWazowski
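For the spam use case in the question, the same mechanism lets you keep updating the model as new labelled emails arrive. Below is a minimal, self-contained sketch; the toy corpus and the names vectorizer and new_email are illustrative, not from the original post. It assumes the vocabulary is fixed by the initial CountVectorizer, so words first seen in later emails are silently ignored:

import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# initial labelled corpus (toy data for illustration)
corpus = ["cheap pills now", "meeting at noon", "win money fast", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
x_initial = vectorizer.fit_transform(corpus)

clf = MultinomialNB()
clf.partial_fit(x_initial, labels, classes=numpy.unique(labels))

# later: update the classifier with each new labelled email as it arrives
new_email = "cheap pills, win now"
x_new = vectorizer.transform([new_email])  # reuse the fitted vocabulary
clf.partial_fit(x_new, ["spam"])           # classes only needed on the first call

print(clf.predict(vectorizer.transform(["win cheap money"])))

Note that partial_fit() accepts mini-batches, so if emails arrive faster than one at a time you can update on batches rather than single samples, which is considerably cheaper per sample.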
