
R: Fuzzy merge using agrep and data.table

I am trying to merge two data.tables, but due to different spellings of the stock names I lose a substantial number of data points. Hence, instead of an exact match I have been looking into a fuzzy merge.

library("data.table")
dt1 = data.table(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2))
dt2 = data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

When merging dt1 and dt2 on "Name", ASML HOLDING is excluded because of the added "NV", even though the underlying data would be accurate.

The preferred final output would look something like:

              Name A B
1:  ABN AMRO GROUP 2 q
2: ASML HOLDING NV 1 p
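For reference, an exact merge on the sample data above drops the ASML row, which is exactly the problem:

```r
library(data.table)

dt1 <- data.table(Name = c("ASML HOLDING", "ABN AMRO GROUP"), A = c(1, 2))
dt2 <- data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

# Only the identically spelled name survives the exact join:
merge(dt1, dt2, by = "Name")
#              Name A B
# 1: ABN AMRO GROUP 2 q
```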

What I tried next was the following:

dt1 = dt1[, dt2_NAME := agrep(dt1$Name, dt2$Name, ignore.case = TRUE, value = TRUE, max.distance = 0.05, useBytes = TRUE)]

However, I get the following warning:

argument 'pattern' has length > 1 and only the first element will be used

The warning makes sense, as dt1$Name has length greater than 1, but I believe this would be a viable solution if it considered dt1$Name row by row.
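Indeed, called with a single pattern, agrep() behaves as desired: it performs approximate substring matching, so the shorter name matches the longer one at distance 0. A quick check with the sample names:

```r
# One pattern at a time: "ASML HOLDING" is an exact substring of
# "ASML HOLDING NV", so it matches even with a small max.distance.
agrep("ASML HOLDING", c("ASML HOLDING NV", "ABN AMRO GROUP"),
      ignore.case = TRUE, value = TRUE,
      max.distance = 0.05, useBytes = TRUE)
# [1] "ASML HOLDING NV"
```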

It might be a simple mistake, but for some reason I just can't get my head around it. Furthermore, I prefer to use data.table, as my dataset is fairly large and up until now it has worked splendidly. Additionally, I am new to Stack Overflow, so apologies if my question is somewhat off.

Lastly, I found a piece of code that does the job but is too slow for practical use: Fuzzy merge in R

dt1$Name_dt2 <- ""  # empty column to hold the matched names
for (i in seq_len(nrow(dt1))) {
  x <- agrep(dt1$Name[i], dt2$Name,
             ignore.case = TRUE, value = TRUE,
             max.distance = 0.05, useBytes = TRUE)
  # collapse so that zero or multiple matches don't break the assignment
  dt1$Name_dt2[i] <- paste(x, collapse = "")
}
A possible solution using 'fuzzyjoin':

library(fuzzyjoin)
library(dplyr)  # for %>% and filter()

f <- Vectorize(function(x, y) agrepl(x, y,
                                     ignore.case = TRUE,
                                     max.distance = 0.05, useBytes = TRUE))

dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)
#          Name.x A          Name.y B
#1   ASML HOLDING 1 ASML HOLDING NV p
#2 ABN AMRO GROUP 2  ABN AMRO GROUP q

NOTE: The main problem, which you encountered too, is that agrep and agrepl do not expect their first argument to be a vector. That is why I wrapped the call in Vectorize.
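To see what Vectorize changes: the wrapped f compares its arguments pairwise instead of reusing only the first pattern. A small check with made-up candidate pairs (the "XYZ" name is just an illustrative non-match):

```r
f <- Vectorize(function(x, y) agrepl(x, y,
                                     ignore.case = TRUE,
                                     max.distance = 0.05, useBytes = TRUE))

# Each pattern is now tested against its corresponding candidate:
unname(f(c("ASML HOLDING", "ABN AMRO GROUP"),
         c("ASML HOLDING NV", "XYZ")))
# [1]  TRUE FALSE
```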

This method can also be used together with an equi-join (mind the order of the columns in by!):

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)
