
label-encoding

Use same category labeling criteria on two different dataframes

I have a DataFrame that contains a categorical feature, which I have encoded in the following way:

    df['categorical_feature'] = df['categorical_feature'].astype('category')
    df['labels'] = df['categorical_feature'].cat.codes

If I apply the same code to another DataFrame with the same category field, the mapping is shuffled, but I need it to be consistent with the first DataFrame. Is there a way to apply the same category-to-label mapping to another DataFrame that has the same categorical values?
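One possible approach, sketched here under assumptions rather than taken from an accepted answer: learn the categories once from the first DataFrame, then cast the second DataFrame's column to that same categorical dtype so that `.cat.codes` uses an identical mapping. The frames `df1` and `df2` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; df2 deliberately lacks "green", which is the
# kind of situation where independent encoding would shift the codes.
df1 = pd.DataFrame({"categorical_feature": ["red", "green", "blue", "green"]})
df2 = pd.DataFrame({"categorical_feature": ["blue", "red", "blue"]})

# Encode the first frame exactly as in the question.
df1["categorical_feature"] = df1["categorical_feature"].astype("category")
df1["labels"] = df1["categorical_feature"].cat.codes

# Reuse the dtype (categories and their order) learned from df1.
shared_dtype = df1["categorical_feature"].dtype
df2["categorical_feature"] = df2["categorical_feature"].astype(shared_dtype)
df2["labels"] = df2["categorical_feature"].cat.codes

# Explicit category -> label mapping, handy for inspection or for persisting.
mapping = {cat: code for code, cat in enumerate(shared_dtype.categories)}
print(mapping)
print(df2)
```

Values in the second DataFrame that were never seen in the first become missing after the cast and receive the code -1, so it may be worth checking for -1 before using the labels.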

2022-05-02 04:19:25    Category: Q&A    pandas   label-encoding

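Encoding Categorical Variables like "State Names"

Question: I have a categorical column with "State Names". I'm not sure which type of categorical encoding I have to perform to convert them to a numeric type. There are 83 unique state names. Label Encoder is used for ordinal categorical variables, but OneHot would increase the number of columns, since there are 83 unique state names. Is there anything else I can try?

Answer 1: I would use scikit's OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) or CategoricalEncoder with the encoding set to "onehot". It automatically finds the unique values of each feature and turns them into one-hot vectors. It does increase the input dimensionality for that feature, but that is necessary if you are doing any kind of data science work. If you convert the feature to an ordinal integer (i.e. a single integer) instead of a vector of binary values, an algorithm may draw wrong conclusions about two (possibly completely unrelated) categorical values that just happen to sit close together in the categorical space.

Answer 2: Besides one-hot, there are other powerful encoding schemes that do not increase the number of columns. You can try the following (in increasing order of complexity): count encoding: encode each category by the number of times it occurs in the data, which is useful in some cases. For example, if you want to encode the information that New York is a big city, the count of NY in the data does contain that information, because we expect NY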
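As a rough illustration of the count encoding described in Answer 2, the sketch below replaces each category with the number of times it occurs, and compares the result with one-hot encoding. The column name `state` and the sample values are assumptions made for this example, not data from the question.

```python
import pandas as pd

# Hypothetical sample; the real column in the question has 83 unique values.
df = pd.DataFrame({"state": ["NY", "CA", "NY", "TX", "NY", "CA"]})

# Count encoding: one numeric column, regardless of how many categories exist.
counts = df["state"].value_counts()
df["state_count"] = df["state"].map(counts)

# One-hot encoding for comparison: one new column per unique category.
one_hot = pd.get_dummies(df["state"], prefix="state")

print(df)
print(one_hot.shape)
```

Count encoding keeps the feature in a single column, but categories that happen to share the same count become indistinguishable, which is one reason it is only useful in some cases.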

2021-10-25 23:11:15    Category: Tech sharing    python   categorical-data   one-hot-encoding   label-encoding

Encoding Categorical Variables like "State Names"

I have a categorical column with "State Names". I'm unsure which type of categorical encoding I have to perform to convert them to a numeric type. There are 83 unique state names. Label Encoder is used for ordinal categorical variables, but OneHot would increase the number of columns, since there are 83 unique state names. Is there anything else I can try?

2021-10-24 10:28:53    Category: Q&A    python   categorical-data   one-hot-encoding   label-encoding