
boyer-moore

Seeking Unicode-savvy function for searching text in binary data

I need to find Unicode text inside binary data (files). I'm looking for C or C++ code or a library that I can use on macOS. Since I expect this is useful on other platforms too, I'd rather not make this question specific to macOS. On macOS, the NSString functions, which meet my Unicode-savviness needs, can't be used because they do not work on binary data. As an alternative I've tried the POSIX-compliant regex functions provided on macOS, but they have some limitations: They are not normalization-savvy, i.e. if I search for a precomposed (NFC) character, it won't find the character if it's

2022-02-18 06:23:12    Category: Q&A    regex   search   boyer-moore
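
The normalization problem described in the excerpt above can be illustrated with a minimal Java sketch (the question asks for C or C++, so this only shows the idea). It assumes the binary data stores text as UTF-8 and that the text is either fully composed or fully decomposed; `NormalizedByteSearch`, `find`, and `indexOf` are hypothetical names, and the byte scan is deliberately naive rather than Boyer-Moore.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class NormalizedByteSearch {
    /** Naive byte scan: first offset of needle in haystack, or -1. */
    static int indexOf(byte[] haystack, byte[] needle) {
        if (needle.length == 0) return 0;
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) j++;
            if (j == needle.length) return i;
        }
        return -1;
    }

    /**
     * Searches for both the NFC and the NFD form of text, encoded as UTF-8
     * (an assumption about the data), and returns the earliest offset or -1.
     */
    static int find(byte[] data, String text) {
        int best = -1;
        Normalizer.Form[] forms = {Normalizer.Form.NFC, Normalizer.Form.NFD};
        for (Normalizer.Form form : forms) {
            byte[] needle = Normalizer.normalize(text, form)
                    .getBytes(StandardCharsets.UTF_8);
            int at = indexOf(data, needle);
            if (at >= 0 && (best < 0 || at < best)) best = at;
        }
        return best;
    }
}
```

Calling `NormalizedByteSearch.find(data, "caf\u00e9")` would then also match data that stores the word as `e` followed by U+0301. Text that mixes composed and decomposed characters within one word is not covered by this sketch.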

Boyer-Moore Practical in C#?

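Question: Boyer-Moore is probably the fastest non-indexed text-search algorithm known, so I'm implementing it in C# for my Black Belt Coder website. I had it working, and it showed roughly the expected performance improvement compared to String.IndexOf(). However, when I added the StringComparison.Ordinal argument to IndexOf, it started outperforming my Boyer-Moore implementation, sometimes by a considerable amount. I wonder if anyone can help me figure out why. I understand why StringComparison.Ordinal might speed things up, but how could it be faster than Boyer-Moore? Is it because of the overhead of the .NET platform itself, because array indexes must be validated to ensure they are in range, or something else? Are some algorithms just not practical in C#.NET? Below is the key code.
    // Base for search classes
    abstract class SearchBase
    {
        public const int InvalidIndex = -1;
        protected string _pattern;
        public SearchBase(string pattern) { _pattern = pattern; }
        public abstract int Search(string text, int startIndex);
        public int Search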

2021-12-03 00:47:37    Category: Tech Sharing    c#   .net   algorithm   performance   boyer-moore

Boyer-Moore Practical in C#?

Boyer-Moore is probably the fastest non-indexed text-search algorithm known. So I'm implementing it in C# for my Black Belt Coder website. I had it working and it showed roughly the expected performance improvements compared to String.IndexOf(). However, when I added the StringComparison.Ordinal argument to IndexOf, it started outperforming my Boyer-Moore implementation, sometimes by a considerable amount. I wonder if anyone can help me figure out why. I understand why StringComparison.Ordinal might speed things up, but how could it be faster than Boyer-Moore? Is it because of the

2021-11-21 05:14:47    Category: Q&A    c#   .net   algorithm   performance   boyer-moore

How to implement string matching algorithm with Hadoop?

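Question: I want to implement a string matching (Boyer-Moore) algorithm using Hadoop. I just started using Hadoop, so I have no idea how to write a Hadoop program in Java. All the sample programs I have seen so far are word-count examples, and I couldn't find any sample program for string matching. I tried searching for tutorials that teach how to write Hadoop applications in Java but couldn't find any. Can you suggest some tutorials where I can learn how to write Hadoop applications in Java? Thanks in advance.
Answer 1: I haven't tested the code below, but it should get you started. I used the BoyerMoore implementation provided here. What the code below does: the goal is to search for a pattern in the input documents. The BoyerMoore class is initialized in the setup method with the pattern set in the configuration. The mapper receives one line at a time and uses the BoyerMoore instance to look for the pattern. If a match is found, we write it out using the context. No reducer is needed here. If the pattern is found multiple times in different mappers, the output will contain multiple offsets (one per mapper). package hadoop.boyermoore; import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache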

2021-11-20 18:46:37    Category: Tech Sharing    java   hadoop   string-matching   boyer-moore

How to implement string matching algorithm with Hadoop?

I want to implement a string matching (Boyer-Moore) algorithm using Hadoop. I just started using Hadoop, so I have no idea how to write a Hadoop program in Java. All the sample programs that I have seen so far are word-count examples, and I couldn't find any sample programs for string matching. I tried searching for tutorials that teach how to write Hadoop applications using Java but couldn't find any. Can you suggest some tutorials where I can learn how to write Hadoop applications using Java? Thanks in advance.

2021-11-13 13:01:54    Category: Q&A    java   hadoop   string-matching   boyer-moore

StringUtils.contains of Apache and Boyer–Moore string search algorithm

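Question: To search for s in S (where size(S) >= size(s), returning a true/false value), is it better for performance to use Apache's StringUtils.contains() or a Boyer-Moore implementation that I found, written and well tested by someone else? Thanks
Answer 1: The last time I looked at the Java regex matching code while debugging, the Java 7 regex engine used the Boyer-Moore algorithm for matching literal text sequences. So the easiest way to find a String with Boyer-Moore is to prepare with p = Pattern.compile(searchString, Pattern.LITERAL) and then call p.matcher(toSearchOn).find(). No third-party library and no hand-written code is needed. And I believe the JRE classes are well tested...
Answer 2: Apache Lang implements its contains using the Java API's region matching. It's hard to say offhand which is faster. This sounds like an opportunity to build a simple test case, run it both ways, and see.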

2021-08-12 12:17:28    Category: Tech Sharing    java   string   algorithm   boyer-moore
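
The approach from Answer 1 above can be sketched as follows. `Pattern.compile`, `Pattern.LITERAL`, and `Matcher.find` are standard `java.util.regex` API; the class and method names are hypothetical. Whether a given JDK actually uses Boyer-Moore internally for literal patterns is the answerer's observation, not something this sketch guarantees.

```java
import java.util.regex.Pattern;

final class LiteralContains {
    /**
     * Returns true if needle occurs in haystack. Pattern.LITERAL makes the
     * regex engine treat needle as plain text, so characters such as '.'
     * and '*' have no special meaning.
     */
    static boolean containsLiteral(String haystack, String needle) {
        Pattern p = Pattern.compile(needle, Pattern.LITERAL);
        return p.matcher(haystack).find();
    }
}
```

When the same needle is searched in many inputs, compiling the `Pattern` once and reusing it avoids repeating the preparation step. As Answer 2 suggests, only a measurement on representative data can tell whether this beats `StringUtils.contains()` or `String.contains()`.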

Do I have to take the encoding into account when performing Boyer-Moore pattern matching?

I'm about to implement a variation of the Boyer-Moore pattern matching algorithm (the Sunday algorithm to be specific) and I was asking myself: What is my alphabet size? Does it depend on the encoding (= number of possible characters) or can I just assume my alphabet consists of 256 symbols (= number of symbols which can be represented by a byte)? In many other situations treating characters as bytes would be a problem because depending on the encoding a character can consist of multiple bytes, but if in my case both strings have the same encoding then equal characters are represented by equal

2021-08-02 16:30:14    Category: Q&A    string   character-encoding   pattern-matching   boyer-moore
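
The idea raised in the question above, treating both strings as byte sequences with a 256-symbol alphabet, can be sketched in Java as a Sunday-style search. This is a minimal sketch under the assumption that text and pattern use the same encoding; `SundayByteSearch` is a hypothetical name.

```java
import java.util.Arrays;

final class SundayByteSearch {
    /** Returns the first byte offset of pattern in text, or -1. */
    static int indexOf(byte[] text, byte[] pattern) {
        int n = text.length, m = pattern.length;
        if (m == 0) return 0;

        // Shift table over the 256 possible byte values. A byte that does
        // not occur in the pattern lets the window jump past it (m + 1).
        int[] shift = new int[256];
        Arrays.fill(shift, m + 1);
        for (int i = 0; i < m; i++) {
            shift[pattern[i] & 0xFF] = m - i; // rightmost occurrence wins
        }

        int pos = 0;
        while (pos + m <= n) {
            int j = 0;
            while (j < m && text[pos + j] == pattern[j]) j++;
            if (j == m) return pos;
            if (pos + m >= n) break; // no byte after the window to inspect
            // Sunday's rule: shift by the byte just past the current window.
            pos += shift[text[pos + m] & 0xFF];
        }
        return -1;
    }
}
```

The result is a byte offset, not a character index. With UTF-8 a match of a complete encoded pattern always starts on a character boundary; with other multi-byte encodings a byte-level match could begin in the middle of a character, so the caller may need an extra boundary check.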

StringUtils.contains of Apache and Boyer–Moore string search algorithm

To search for s in S (where size(S) >= size(s), returning a true/false value), is it better for performance to use Apache's StringUtils.contains() or a Boyer-Moore implementation that I found, written and well tested by someone else? Thanks

2021-07-31 18:15:58    Category: Q&A    java   string   algorithm   boyer-moore

Difference between original Boyer–Moore and Boyer–Moore–Horspool Algorithm [closed]

Question: I can't understand the changes Horspool made in his algorithm. If you have any links for the Boyer-Moore-Horspool algorithm, please let me know.
Answer 1: Here are some of my observations:
BM:
Preprocessing complexity: Θ(m + σ)
Worst case: Θ(nm) if the pattern exists, Θ(n+m) if the pattern doesn't exist
Best case: Θ(n/m)
Space: Θ(σ)
Comparisons: Θ(3n)
Preprocessing: Uses Good Suffix and Bad Character Shift. At every step, it slides the pattern by the max of the slides suggested by the two heuristics. So it uses the best of the two heuristics at every step. The Boyer-Moore algorithm uses the "bad" text character itself to determine

2021-07-28 13:54:24    Category: Tech Sharing    algorithm   pattern-matching   boyer-moore
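
For comparison with the Boyer-Moore notes above, here is a minimal Java sketch of the Horspool simplification. It drops the good-suffix rule and keeps a single shift table keyed on the text character aligned with the last pattern position. The class name is hypothetical, and the table spans all 65,536 `char` values, which is one way of making the Θ(σ) space cost concrete.

```java
import java.util.Arrays;

final class HorspoolSearch {
    /** Returns the first index of pattern in text, or -1. */
    static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;

        // One table entry per possible char value (sigma = 65536 here).
        int[] shift = new int[Character.MAX_VALUE + 1];
        Arrays.fill(shift, m);
        // The last pattern character is excluded so every shift is >= 1.
        for (int i = 0; i < m - 1; i++) {
            shift[pattern.charAt(i)] = m - 1 - i;
        }

        int pos = 0;
        while (pos + m <= n) {
            int j = m - 1;
            // Compare right to left, as in Boyer-Moore.
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
            if (j < 0) return pos;
            // Horspool: always shift by the character under the window's
            // last position, regardless of where the mismatch occurred.
            pos += shift[text.charAt(pos + m - 1)];
        }
        return -1;
    }
}
```

For example, `HorspoolSearch.indexOf("here is a simple example", "example")` returns 17. The sketch keeps the Θ(nm) worst case; the good-suffix rule of the original algorithm is what this simplification gives up.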