
train-test-split

ImportError: cannot import name 'LatentDirichletAllocation'

Question: I'm trying to import the following: from sklearn.model_selection import train_test_split and got the following error; here's the stack trace:

ImportError                               Traceback (most recent call last)
<ipython-input-1-bdd2a2f20673> in <module>
      2 import pandas as pd
      3 from sklearn.model_selection import train_test_split
----> 4 from sklearn.tree import DecisionTreeClassifier
      5 from sklearn.metrics import accuracy_score
      6 from sklearn import tree

~/.local/lib/python3.6/site-packages/sklearn/tree/__init__.py in <module>
      4 """
      5
----> 6 from ._classes import BaseDecisionTree
      7 from ._classes import DecisionTreeClassifier
      8 from ._classes import DecisionTreeRegressor
…
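The trace shows the failure happening inside sklearn.tree's own __init__.py rather than in the user's code, which typically points to a stale or partially upgraded scikit-learn installation (the private tree/_classes.py layout the __init__.py imports from was introduced in scikit-learn 0.22). A minimal diagnostic sketch, assuming that is the cause:

```python
# Quick diagnostic, assuming the root cause is a stale or partially
# upgraded scikit-learn install (tree/_classes.py appeared in 0.22):
import sklearn

print("scikit-learn version:", sklearn.__version__)
print("installed at:", sklearn.__path__[0])

# If the version predates 0.22, or if two copies are mixed under
# site-packages, a clean reinstall usually resolves the ImportError:
#     pip uninstall scikit-learn
#     pip install --upgrade scikit-learn
```

If two copies coexist (for example one under ~/.local and one system-wide, as the ~/.local path in the trace hints), uninstalling repeatedly until no copy remains and then reinstalling once is usually needed.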

2021-12-23 19:01:49    Category: Tech Sharing    python   python-3.x   scikit-learn   sklearn-pandas   train-test-split

Splitting data using time-based splitting in test and train datasets

Question: I know that train_test_split splits the data randomly, but I need to know how to split it based on time.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)  # this splits the data randomly as 67% train and 33% test

How can I split the same data set based on time, as 67% train and 33% test? The dataset has a TimeStamp column. I tried searching similar questions but was not sure about the approach. Can someone explain briefly?

Answer 1: A simple way to do this. First: sort the data by time. Second:

import numpy as np
train_set, test_set = np.split(data, [int(.67 * len(data))])

This gives train_set the first 67% of the data and test_set the remaining 33%.

Answer 2: On time-series datasets, data splitting is done differently; see this link for more information. Alternatively, you can try TimeSeriesSplit from the scikit-learn package. The main idea is this: suppose you have 10 data points ordered by timestamp. The splits will then look like this: Split 1: Train_indices: 1 …
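The two steps in answer 1 (sort by time, then cut at 67%) can be sketched end-to-end. The DataFrame below is illustrative stand-in data; only the TimeStamp column name is taken from the question:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in data with a 'TimeStamp' column, as in the question
df = pd.DataFrame({
    "TimeStamp": pd.date_range("2021-01-01", periods=10, freq="D"),
    "value": np.arange(10),
})

df = df.sort_values("TimeStamp")        # step 1: order rows by time
cut = int(0.67 * len(df))               # step 2: index of the 67/33 boundary
train_set, test_set = df.iloc[:cut], df.iloc[cut:]

print(len(train_set), len(test_set))    # 6 4
```

For cross-validation rather than a single split, sklearn.model_selection.TimeSeriesSplit (mentioned in answer 2) generates successive train/test index pairs in which the test fold never precedes its training fold.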

2021-12-23 05:47:02    Category: Tech Sharing    python   scikit-learn   timestamp   train-test-split

dimension mismatch error in CountVectorizer MultinomialNB

Question: Before I ask this question, I have to say I've thoroughly read more than 15 similar topics on this board, each with somewhat different recommendations, but none of them could get me right. OK, so I split my 'spam email' text data (originally in csv format) into training and test sets, using CountVectorizer and its fit_transform function to fit the vocabulary of the corpus and extract word-count features from the text. Then I applied MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB

# loading data
# data contains two columns ('text', 'target')
spam = pd.read_csv('spam.csv')
spam['target'] = np.where(spam_data['target']=='spam', 1, 0)

# split data
X_train, X_test, y_train, y_test = train_test_split(spam …
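The usual cause of a dimension mismatch in this setup is calling fit_transform on both the training and the test texts, which builds two different vocabularies with different column counts. A minimal sketch with made-up toy texts standing in for the truncated csv above: fit the vectorizer on the training texts only, then reuse the same fitted vectorizer to transform the test texts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data for the csv in the question
X_train = ["free prize now", "meeting at noon", "win free cash", "lunch tomorrow"]
y_train = [1, 0, 1, 0]
X_test = ["free cash prize", "see you at lunch"]

vect = CountVectorizer()
X_train_counts = vect.fit_transform(X_train)  # learn the vocabulary once
X_test_counts = vect.transform(X_test)        # reuse it -- do NOT fit again

# Both matrices now share one column space, so the classifier's
# learned dimensions match the test matrix
clf = MultinomialNB().fit(X_train_counts, y_train)
print(clf.predict(X_test_counts))
```

Note also that sklearn.cross_validation in the snippet above was removed in scikit-learn 0.20; current versions import train_test_split from sklearn.model_selection.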

2021-12-22 16:23:02    Category: Tech Sharing    python   naivebayes   countvectorizer   train-test-split

What is the correct procedure to split the Data sets for classification problem?

Question: I am new to Machine Learning and Deep Learning. I would like to clarify a doubt about train_test_split before training. I have a data set of size (302, 100, 5), where (207, 100, 5) belongs to class 0 and (95, 100, 5) belongs to class 1. I would like to perform classification using an LSTM (since it is sequence data). How can I split my data set for training, given that the classes do not have equal distributions?

Option 1: Consider the whole data [(302, 100, 5) - both classes (0 & 1)], shuffle it, train_test_split, proceed to training.

Option 2: Take equal amounts of both class data sets [(95, 100, 5) - class 0 & (95, 100, 5) - class 1], shuffle, train_test_split, proceed to training.

Which way of splitting before training is better, so that I can get better results in terms of lower loss, accuracy, and prediction? If there are other options besides the two above, please suggest them. Following the comment section, I am including part of my data:

X_train: shape (241 * 100 * 5). Each row in each 100*5 block corresponds to 1 time step; the last 100 rows correspond to 100 time steps, in milliseconds (ms).

array([[[0.98620635, 0.        , 0.12752912, 0.60897341, 0.46903766],
        [0.97345112, 0.        , 0.12752912 …
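For option 1, train_test_split can shuffle the whole set while preserving the 207:95 class ratio in both splits via its stratify parameter, which avoids discarding data the way option 2 does. A sketch with random arrays standing in for the real (302, 100, 5) set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with the shapes and class counts from the question
rng = np.random.default_rng(0)
X = rng.random((302, 100, 5))
y = np.array([0] * 207 + [1] * 95)

# stratify=y shuffles but keeps the 207:95 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape)                          # (241, 100, 5)
print(np.bincount(y_train), np.bincount(y_test))
```

With test_size=0.2 this reproduces the 241-sample training set mentioned in the question; both splits contain both classes in roughly the original proportion.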

2021-12-05 04:55:08    Category: Tech Sharing    python   machine-learning   lstm   train-test-split

Should Feature Selection be done before Train-Test Split or after?

Question: Actually, there is a contradiction between two facts that are the possible answers to the question:

The conventional answer is to do it after splitting, since information can leak from the test set if it is done before.

The contradicting answer is that, if only the training set chosen from the whole dataset is used for feature selection, then the feature selection or the ordering of feature-importance scores is likely to change dynamically with the random_state of the train_test_split. And if the feature selection for any particular task changes, then no generalization of feature importance is possible, which is undesirable. Secondly, if only the training set is used for feature selection, the test set may contain instances that conflict with or contradict the feature selection performed on the training set alone, since the overall historical data was not analyzed. Moreover, feature-importance scores can only be evaluated given a set of instances, not a single test/unknown instance.

Answer 1: The conventional answer #1 is correct here; the arguments in the contradicting answer #2 do not actually hold. When in doubt, it is useful to imagine that during the model-fitting process (which includes feature importance) you simply have no access to any test set; you should treat the test set as literally unseen data (and, being unseen, it cannot be used for feature-importance scores). Hastie and Tibshirani argued explicitly, long ago, about the correct and wrong way to perform such procedures; I have summarized the issue in a blog post, How NOT to perform feature selection! - and although the discussion there is about cross-validation, it is easy to see that the arguments also apply to the case of a train/test split. The only argument in your contradicting answer #2 that actually holds is that

the overall historical data is not analyzed

Nevertheless …
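The leak-free procedure the answer describes - split first, then fit the selector on the training fold only - can be sketched with a pipeline, which guarantees the selection never sees test data. The dataset and parameter values below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Illustrative synthetic data; split BEFORE any feature selection
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The selector inside the pipeline is fit on the training fold only
pipe = make_pipeline(SelectKBest(f_classif, k=5), LogisticRegression())
pipe.fit(X_train, y_train)

# The test set stays truly unseen until scoring
print(pipe.score(X_test, y_test))
```

Wrapping the selector in a pipeline also keeps cross_val_score leak-free, since the selection is refit from scratch inside every fold.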

2021-12-05 03:26:56    Category: Tech Sharing    machine-learning   feature-selection   train-test-split
