Writing a custom analyzer in PyLucene / inheritance using JCC?

I want to write a custom analyzer in PyLucene. In Java Lucene, you usually write an analyzer by having your class inherit from Lucene's Analyzer class.

But PyLucene uses JCC, the Java to C++/Python compiler.

So how do you make a Python class inherit from a Java class using JCC, and in particular, how do you write a custom PyLucene analyzer?

Thanks.

Comments

Here's an example of an Analyzer that wraps the EdgeNGram Filter.

import lucene
class EdgeNGramAnalyzer(lucene.PythonAnalyzer):
    '''
    This is an example of a custom Analyzer (in this case an edge-n-gram analyzer)
    EdgeNGram Analyzers are good for type-ahead
    '''

    def __init__(self, side, minlength, maxlength):
        '''
        Args:
            side[enum] Can be one of lucene.EdgeNGramTokenFilter.Side.FRONT or lucene.EdgeNGramTokenFilter.Side.BACK
            minlength[int]
            maxlength[int]
        '''
        lucene.PythonAnalyzer.__init__(self)
        self.side = side
        self.minlength = minlength
        self.maxlength = maxlength

    def tokenStream(self, fieldName, reader):
        # Only `lucene` is imported, so Version and StopAnalyzer must be qualified too.
        result = lucene.LowerCaseTokenizer(lucene.Version.LUCENE_CURRENT, reader)
        result = lucene.StandardFilter(result)
        result = lucene.StopFilter(True, result, lucene.StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        result = lucene.ASCIIFoldingFilter(result)
        result = lucene.EdgeNGramTokenFilter(result, self.side, self.minlength, self.maxlength)
        return result
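
To make the output of the last filter in that chain concrete, here is a minimal pure-Python sketch of front and back edge n-grams for a single token. It does not call Lucene at all: the function name `edge_ngrams` and its parameters are hypothetical, chosen only for illustration, and the code approximates the idea behind EdgeNGramTokenFilter rather than reproducing its exact behaviour.

```python
def edge_ngrams(token, min_length, max_length, side="front"):
    """Return the edge n-grams of one token, shortest first.

    Hypothetical helper for illustration only; it mimics the idea behind
    Lucene's EdgeNGramTokenFilter, not its actual implementation.
    """
    if min_length < 1 or max_length < min_length:
        raise ValueError("need 1 <= min_length <= max_length")
    grams = []
    # Never ask for a gram longer than the token itself.
    for size in range(min_length, min(max_length, len(token)) + 1):
        if side == "front":
            grams.append(token[:size])   # prefix of the token
        else:
            grams.append(token[-size:])  # suffix of the token
    return grams
```

Indexing the prefixes of each term is what makes this kind of analyzer useful for type-ahead: a partial query such as "luc" can match the stored gram of "lucene" directly.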

Here's another example, an analyzer that re-implements Porter stemming:

# This sample illustrates how to write an Analyzer 'extension' in Python.
# 
#   What is happening behind the scenes ?
#
# The PorterStemmerAnalyzer python class does not in fact extend Analyzer,
# it merely provides an implementation for Analyzer's abstract tokenStream()
# method. When an instance of PorterStemmerAnalyzer is passed to PyLucene,
# with a call to IndexWriter(store, PorterStemmerAnalyzer(), True) for
# example, the PyLucene SWIG-based glue code wraps it into an instance of
# PythonAnalyzer, a proper java extension of Analyzer which implements a
# native tokenStream() method whose job is to call the tokenStream() method
# on the python instance it wraps. The PythonAnalyzer instance is the
# Analyzer extension bridge to PorterStemmerAnalyzer.

'''
More explanation...
Analyzers split a chunk of text into tokens.
An analyzer applies to the whole index (unless you use PerFieldAnalyzerWrapper).
Analyzers are built from one Tokenizer and any number of TokenFilters.
A Tokenizer breaks a string into tokens. TokenFilters split tokens into more
tokens or filter tokens out.
'''

import sys, os
from datetime import datetime
from lucene import *
from IndexFiles import IndexFiles


class PorterStemmerAnalyzer(PythonAnalyzer):

    def tokenStream(self, fieldName, reader):

        #There can only be 1 tokenizer in each Analyzer
        result = StandardTokenizer(Version.LUCENE_CURRENT, reader)
        result = StandardFilter(result)
        result = LowerCaseFilter(result)
        result = PorterStemFilter(result)
        result = StopFilter(True, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        return result


if __name__ == '__main__':
    if len(sys.argv) < 2:
        sys.exit("requires at least one argument: lucene-index-path")
    initVM()
    start = datetime.now()
    try:
        IndexFiles(sys.argv[1], "index", PorterStemmerAnalyzer())
        end = datetime.now()
        print(end - start)
    except Exception as e:
        # "as" syntax and single-argument print() work on Python 2.6+ and Python 3
        print("Failed: %s" % e)
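
The tokenizer-then-filters chaining used in tokenStream() can also be sketched without Lucene. The following is a pure-Python analogy, not PyLucene code: the names `tokenize`, `lower_case_filter`, `stop_filter`, `analyze` and the tiny `STOP_WORDS` set are all assumptions made for illustration. Each stage wraps the previous one, just as each Lucene filter wraps the stream it is given.

```python
import re

# Tiny illustrative stop-word set; Lucene's ENGLISH_STOP_WORDS_SET is larger.
STOP_WORDS = frozenset(["a", "an", "and", "the", "of", "to"])


def tokenize(text):
    """Tokenizer stage: split raw text into alphanumeric tokens."""
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        yield match.group(0)


def lower_case_filter(tokens):
    """Filter stage: rewrite each token in lower case."""
    for token in tokens:
        yield token.lower()


def stop_filter(tokens, stop_words=STOP_WORDS):
    """Filter stage: drop tokens that appear in the stop-word set."""
    for token in tokens:
        if token not in stop_words:
            yield token


def analyze(text):
    """Chain one tokenizer and two filters, in the style of tokenStream()."""
    stream = tokenize(text)
    stream = lower_case_filter(stream)
    stream = stop_filter(stream)
    return list(stream)
```

Order matters in such a chain: here the stop filter runs after lower-casing, so "The" is removed as well as "the".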

Also check out PerFieldAnalyzerWrapper.java and KeywordAnalyzerTest.py:

        analyzer = PerFieldAnalyzerWrapper(SimpleAnalyzer())
        analyzer.addAnalyzer("partnum", KeywordAnalyzer())

        query = QueryParser(Version.LUCENE_CURRENT, "description",
                            analyzer).parse("partnum:Q36 AND SPACE")
        scoreDocs = self.searcher.search(query, 50).scoreDocs
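
As a rough illustration of why the per-field wrapper matters in that snippet, here is a pure-Python sketch of per-field dispatch. It is not the Lucene API: `PerFieldAnalyzer`, `simple_analyzer`, and `keyword_analyzer` are hypothetical stand-ins that only approximate what PerFieldAnalyzerWrapper, SimpleAnalyzer, and KeywordAnalyzer do.

```python
import re


def simple_analyzer(text):
    """Stand-in for SimpleAnalyzer: split on non-letters and lower-case."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def keyword_analyzer(text):
    """Stand-in for KeywordAnalyzer: keep the whole input as one token."""
    return [text]


class PerFieldAnalyzer(object):
    """Hypothetical analogue of PerFieldAnalyzerWrapper."""

    def __init__(self, default_analyzer):
        self.default_analyzer = default_analyzer
        self.by_field = {}

    def add_analyzer(self, field_name, analyzer):
        # Register an override for one field.
        self.by_field[field_name] = analyzer

    def analyze(self, field_name, text):
        # Fall back to the default analyzer for unregistered fields.
        analyzer = self.by_field.get(field_name, self.default_analyzer)
        return analyzer(text)
```

With the default analyzer alone, a part number such as "Q36" would lose its digits and case, so an exact match on the partnum field needs the keyword-style analyzer registered for that field.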

You can inherit from any class in PyLucene, but only the ones whose names start with Python also extend the underlying Java class, i.e., they make the relevant methods "virtual" when called from Java code. So for a custom analyzer, inherit from PythonAnalyzer and implement the tokenStream method.
