
java.io.IOException "Not a data file" after converting JSON to Avro with Avro Tools

I have a JSON file and an Avro schema file that correctly describes its structure. I then convert the JSON file into an Avro file with Avro Tools, without getting an error, like this:

java -jar .\avro-tools-1.7.7.jar fromjson --schema-file .\data.avsc .\data.json > .\data.avro

I then convert the generated Avro file back to JSON to verify that I got a valid Avro file like this:

java -jar .\avro-tools-1.7.7.jar tojson .\data.avro > .\data.json

This throws the error:

Exception in thread "main" java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
    at org.apache.avro.tool.DataFileGetMetaTool.run(DataFileGetMetaTool.java:64)
    at org.apache.avro.tool.Main.run(Main.java:84)
    at org.apache.avro.tool.Main.main(Main.java:73)

I get the same exception when running 'getschema' or 'getmeta', and also with avro-tools-1.8.2 and avro-tools-1.7.4. I also tried several different pairs of JSON and schema files that I had checked for validity.

The error is thrown here (in the Avro tools):

if (!Arrays.equals(DataFileConstants.MAGIC, magic)) {
    throw new IOException("Not a data file.");
}

It seems the generated (binary) Avro file does not start with the expected magic bytes; the first few characters differ from what a valid Avro file should contain.
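One way to confirm this is to compare the first bytes of the generated file with the Avro magic, which is the ASCII characters "Obj" followed by the byte 1. A minimal diagnostic sketch in Java (the data.avro path is assumed):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import org.apache.avro.file.DataFileConstants;

public class CheckMagic {
    public static void main(String[] args) throws Exception {
        // Read the file produced by the fromjson command and look at its header
        byte[] bytes = Files.readAllBytes(Paths.get("data.avro"));
        byte[] header = Arrays.copyOf(bytes, DataFileConstants.MAGIC.length);

        System.out.println("expected magic:  " + Arrays.toString(DataFileConstants.MAGIC)); // [79, 98, 106, 1] = 'O', 'b', 'j', 1
        System.out.println("actual header:   " + Arrays.toString(header));
        System.out.println("valid container: " + Arrays.equals(DataFileConstants.MAGIC, header));
    }
}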

I have checked all of the other Stack Overflow questions about this error, but none of them helped. I ran the commands in PowerShell on Windows 10.

See https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/#json-to-binary-avro

Anyone got an idea what the heck is going on here?

UPDATE: The conversion works if I do it on a Cloudera VM instead of on Windows. Only a few bytes at the beginning of the generated Avro files differ.


Found the cause:

Windows PowerShell's > redirection does not pass bytes through unchanged; it treats the program's output as text and re-encodes it. The re-encoding changes the magic bytes at the start of the file, which (correctly) causes the exception to be thrown.

The same commands work fine in other shells (for example cmd.exe or a Linux terminal), whose > redirection passes the bytes through unchanged.
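Another way to sidestep the shell entirely is to do the conversion with the Avro Java API and write the container file directly, so no redirection is involved. This is only a rough sketch of what the fromjson tool does internally, assuming the same data.avsc / data.json / data.avro file names:

import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

public class JsonToAvro {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("data.avsc"));

        // Decode the JSON records against the schema
        InputStream json = new FileInputStream("data.json");
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);

        // Write the Avro container file straight to disk, bypassing the shell
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("data.avro"));

        try {
            while (true) {
                writer.append(reader.read(null, decoder));
            }
        } catch (EOFException end) {
            // reached the end of the JSON input
        }

        writer.close();
        json.close();
    }
}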

Side note: simply swapping the greater-than operator for a pipe (java ... | .\data.avro) does not help; a PowerShell pipe sends output to another command, not to a file, and the data would still pass through PowerShell's string pipeline. If you want to stay in PowerShell, a reliable workaround is to let cmd.exe perform the redirection, since its > operator writes the bytes unmodified:

cmd /c "java -jar .\avro-tools-1.7.7.jar fromjson --schema-file .\data.avsc .\data.json > .\data.avro"
