Parsing Chinese characters in Java shows weird behaviour

I have a CSV file in which some fields contain Chinese character strings. Unfortunately, I don't know the encoding of this input file. I read the input CSV and, using selected fields from it, generate an HTML file and another CSV file as output.

While reading the input CSV, I tried every encoding from the list at http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html that mentions Chinese in its description. I found that if I use

InputStreamReader read = new InputStreamReader(filepath,"GB18030");

for reading the CSV and

OutputStreamWriter osW=new OutputStreamWriter(objBufferedOutputStream,"UTF-16");

for writing the HTML and CSV, my output doesn't show weird characters.

But there are two problems:

  1. The output shows strings that are altogether different from the input! Even though my code does no processing on any string, the output strings are not found in any field of the input CSV.

For example, my input has the Chinese string 陈真珍 in field number 8, but my output HTML has something like 闄堢湡鐝� in the place corresponding to input field number 8.

  2. As you can see, there is a question mark, i.e. the Unicode replacement character, in the output 闄堢湡鐝�

Please help me trace where the mistake could be.

PS: I also checked Google Translate and found that the input string 陈真珍 means something like Chen Zhen Zhen,

and its corresponding output string 闄堢湡鐝� means something like Yaobaoyujue. So the meaning differs as well as the representation of the characters.

Comments

That output means that your input is NOT in GB18030 encoding.
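One way to see why: the garbled string in the question is the kind of result you would expect if the file were actually UTF-8 and its bytes were decoded as GB18030. The sketch below reproduces that situation under this assumption (the class and method names are made up for illustration). 陈真珍 takes 9 bytes in UTF-8, so a decoder that pairs bytes up is left with one stray byte at the end, which would explain the replacement character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode those bytes as GB18030.
    static String decodeUtf8BytesAsGb18030(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        // Unicode escapes for 陈真珍, so the source file's own encoding does not matter.
        String original = "\u9648\u771F\u73CD";
        String mangled = decodeUtf8BytesAsGb18030(original);

        System.out.println(original + " -> " + mangled);
        // A trailing U+FFFD means the last byte could not be decoded.
        System.out.println("ends with U+FFFD: " + mangled.endsWith("\uFFFD"));
    }
}
```

If the printed string matches what appears in the output HTML, reading the input as UTF-8 instead of GB18030 is the first thing to try.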

Also: please check and double-check how you view your files: what encoding does the program use that opens the files, specifically the input file. Usually text files (and CSV files) don't come with metadata attached to them that shows their encoding, so the editors have to guess and that guess can easily be wrong.
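Rather than relying on an editor's guess, you can ask Java whether the file's bytes are even valid in a candidate encoding. The following is a sketch under assumptions (the class name, method name, and candidate list are hypothetical): it uses a CharsetDecoder configured to report errors instead of silently substituting U+FFFD. A clean decode does not prove the encoding is right, because many byte sequences are valid in several encodings, but a failure rules the candidate out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Returns true if the bytes decode under the given charset without any
    // malformed or unmappable sequences.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample bytes: 陈真珍 encoded as UTF-8 (written with Unicode escapes).
        byte[] sample = "\u9648\u771F\u73CD".getBytes(StandardCharsets.UTF_8);
        for (String name : new String[] {"UTF-8", "GB18030"}) {
            System.out.println(name + ": " + decodesCleanly(sample, Charset.forName(name)));
        }
    }
}
```

For a real file, the bytes could come from Files.readAllBytes on the input path rather than from a hard-coded sample.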

Keep the encoding consistent when reading and writing Chinese characters, since not every encoding can represent every Chinese character; GBK, for example, covers fewer characters than GB18030 or UTF-8.

Try using UTF-8 encoding to handle Chinese characters.
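As a rough sketch of what consistent handling could look like, the following reads a text file and writes it back out using UTF-8 on both sides. It assumes the input really is UTF-8; the copyAsUtf8 helper and the command-line arguments are placeholders, not part of the question's code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Copy {
    // Copies a text file line by line, decoding and encoding as UTF-8.
    static void copyAsUtf8(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Field selection or other processing would go here.
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Utf8Copy <input> <output>");
            return;
        }
        copyAsUtf8(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Files.newBufferedReader throws a MalformedInputException on bytes that are not valid in the given charset, so a wrong guess fails loudly instead of producing garbled text. For the HTML output, declaring the same encoding in the page (for example with a meta charset tag) helps the browser decode it the same way.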
