Heaven rewards the diligent; there is no end to learning

hadoop

Get the sysdate -1 in Hive

Is there any way to get the current date - 1 in Hive, i.e. always yesterday's date, and in this format: 20120805? I can run my query like this to get the data for yesterday's date, as today is Aug 6th: select * from table1 where dt = '20120805'; But the table is partitioned on the date (dt) column, so I tried the date_sub function to get yesterday's date: select * from table1 where dt = date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(), 'yyyyMMdd')), 1) limit 10; That query looks for the data in all the partitions. Why? Am I doing something wrong in my query? How can I make …
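A commonly cited explanation (not stated in the excerpt) is that unix_timestamp() is non-deterministic, so Hive cannot fold the predicate into a constant and therefore cannot prune partitions. A minimal sketch, assuming Hive 1.2+ where current_date and date_format are available and evaluated at planning time:

```sql
-- Hedged sketch, not from the original question: build yesterday's date as a
-- yyyyMMdd string from current_date, which is a compile-time constant, so the
-- predicate on the dt partition column can be pruned.
SELECT *
FROM table1
WHERE dt = date_format(date_sub(current_date, 1), 'yyyyMMdd')
LIMIT 10;
```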

2021-06-15 17:35:12    Category: Q&A    hadoop   mapreduce   hive   hiveql

Flatten tuple like a bag

My dataset looks like the following: ( A, (1,2) ) ( B, (2,9) ) I would like to "flatten" the tuples in Pig, basically repeating each record for each value found in the inner tuple, so that the expected output is: ( A, 1 ) ( A, 2 ) ( B, 2 ) ( B, 9 ) I know this is possible when the tuples (1,2) and (2,9) are bags instead.
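One way to do this is to turn the inner tuple into a bag with the built-in TOBAG and then FLATTEN it. A hedged sketch; the relation and field names are assumptions:

```pig
-- Load the (key, (v1, v2)) records; the schema names are illustrative.
data = LOAD 'input' AS (key:chararray, vals:(v1:int, v2:int));

-- TOBAG turns the two tuple fields into a bag, and FLATTEN then emits
-- one output record per element of that bag.
flat = FOREACH data GENERATE key, FLATTEN(TOBAG(vals.v1, vals.v2)) AS value;

DUMP flat;
```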

2021-06-15 12:25:35    Category: Q&A    hadoop   apache-pig   flatten

Kafka Streams with lookup data on HDFS

I'm writing an application with Kafka Streams (v0.10.0.1) and would like to enrich the records I'm processing with lookup data. This data (a timestamped file) is written into an HDFS directory on a daily basis (or 2-3 times a day). How can I load it in the Kafka Streams application and join it with the actual KStream? What would be the best practice for rereading the data from HDFS when a new file arrives there? Or would it be better to switch to Kafka Connect and write the RDBMS table content to a Kafka topic that can be consumed by all the Kafka Streams application instances? Update: As suggested, Kafka …
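For reference, the enrichment pattern usually suggested for this case is to land the lookup data in a compacted Kafka topic (for example via Kafka Connect) and join against it as a GlobalKTable. A hedged Java sketch, assuming a newer Kafka Streams release than 0.10.0.1 (GlobalKTable arrived later); the topic names, types, and join logic are illustrative assumptions:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class EnrichmentTopology {
    // Hedged sketch: topic names, value types, and the join output are assumptions.
    static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> events = builder.stream("events");
        GlobalKTable<String, String> lookup = builder.globalTable("lookup-data");

        // Join each event against the lookup table; the GlobalKTable is kept
        // up to date automatically as new records arrive on the lookup topic.
        KStream<String, String> enriched = events.join(
                lookup,
                (eventKey, eventValue) -> eventKey,                    // map event -> lookup key
                (eventValue, lookupValue) -> eventValue + "|" + lookupValue);

        enriched.to("enriched-events");
        return builder;
    }
}
```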

2021-06-15 11:48:15    Category: Q&A    hadoop   apache-kafka   apache-kafka-streams   confluent-platform   apache-kafka-connect

YARN: Containers and JVM

Can someone help me understand the relation between JVMs and containers in YARN? How are JVMs created? Is it one JVM for each task? Can multiple tasks run in the same JVM at the same time? (I'm aware of ubertasking, where many tasks (maps/reduces) can run in the same JVM one after the other.) Is it one JVM per container, or multiple containers in a single JVM, or is there no relation between JVMs and containers? When the resource manager allocates containers for a job, do multiple tasks in the same job share a container when they run on the same node, or is there a separate container for each task …
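For context, the usual arrangement in MapReduce on YARN is one container per task attempt and one JVM launched inside that container, with uber mode as the exception (small jobs run their tasks sequentially inside the ApplicationMaster's JVM). A hedged illustration of the properties involved; the property names are the standard MapReduce-on-YARN settings, the values are assumptions:

```xml
<property>
  <!-- Size of the YARN container requested for each map task attempt. -->
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <!-- Heap of the single JVM launched inside that container, kept below
       the container size to leave room for non-heap memory. -->
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <!-- Uber mode: run a small job's tasks one after another inside the
       ApplicationMaster JVM instead of requesting separate containers. -->
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
```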

2021-06-15 11:48:03    Category: Q&A    java   hadoop   jvm   yarn   hadoop-2.7.2

Reading data from remote Hive on Spark over JDBC returns an empty result

I need to execute Hive queries on a remote Hive server from Spark, but for some reason I receive only the column names (without data). The data is available in the table; I checked it via HUE and a Java JDBC connection. Here is my code example: val test = spark.read .option("url", "jdbc:hive2://remote.hive.server:10000/work_base") .option("user", "user") .option("password", "password") .option("dbtable", "some_table_with_data") .option("driver", "org.apache.hive.jdbc.HiveDriver") .format("jdbc") .load() test.show() Output: +-------+ |dst.col| +-------+ +-------+ I know that data is available in this table. Scala …
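A workaround often suggested for this symptom (hedged; not part of the original excerpt) is that Spark's JDBC source quotes identifiers with double quotes, which the Hive JDBC driver then echoes back as literal strings instead of row values. Registering a dialect that quotes with backticks avoids that. A Scala sketch under that assumption:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hedged sketch: register a dialect so Spark quotes Hive identifiers with
// backticks for any jdbc:hive2 URL.
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

JdbcDialects.registerDialect(HiveDialect)
// ...then run the spark.read.format("jdbc") load from the question as before.
```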

2021-06-15 11:33:20    Category: Q&A    scala   hadoop   apache-spark   jdbc   hive

EMR vs EC2/Hadoop on AWS

I know that EC2 is more flexible but more work than EMR. However, in terms of cost, using EC2 probably requires EBS volumes attached to the EC2 instances, whereas EMR can simply stream data in from S3. So, crunching the numbers on the AWS calculator, even though for EMR one must also pay for EC2, EMR comes out cheaper than plain EC2? Am I wrong here? Of course EC2 with EBS is probably faster, but is it worth the cost? Thanks, Matt

2021-06-15 10:45:27    Category: Q&A    hadoop   amazon-web-services   amazon-ec2   emr

Trouble with Avro serialization of JSON documents with missing fields

I'm trying to use Apache Avro to enforce a schema on data exported from Elasticsearch into a lot of Avro documents in HDFS (to be queried with Drill). I'm having some trouble with Avro defaults. Given this schema: { "namespace" : "avrotest", "type" : "record", "name" : "people", "fields" : [ {"name" : "firstname", "type" : "string"}, {"name" : "age", "type" : "int", "default": -1} ] } I'd expect that a JSON document such as {"firstname" : "Jane"} would be serialized using the default value of -1 for the age field. default: A default value for this field, used when reading instances that lack …
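For what it's worth, Avro field defaults are applied by the reader during schema resolution, not when parsing arbitrary JSON that omits a field; the JSON decoder expects every field to be present. A hedged Java sketch (the schema file name is an assumption) showing the one writer-side path that does fill defaults, GenericRecordBuilder:

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class DefaultsDemo {
    public static void main(String[] args) throws IOException {
        // Hedged sketch: "people.avsc" is assumed to hold the schema from the question.
        Schema schema = new Schema.Parser().parse(new File("people.avsc"));

        // GenericRecordBuilder fills in declared defaults (here age = -1) for
        // fields that are left unset when the record is built.
        GenericRecord person = new GenericRecordBuilder(schema)
                .set("firstname", "Jane")   // "age" is unset, so the default -1 is used
                .build();

        System.out.println(person);   // {"firstname": "Jane", "age": -1}
    }
}
```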

2021-06-15 09:47:13    Category: Q&A    hadoop   serialization   schema   avro

Plugin not found in plugin repository - how to fix this when my company Nexus is down?

I am trying to build Hadoop locally, and when I run $ mvn -U clean install -Pdist -Dtar -Ptest-patch as described at http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment I get: [ERROR] Error resolving version for plugin 'org.apache.maven.plugins:maven-javadoc-plugin' from the repositories [local (/Users/me/.m2/repository), nexus (http://beefy.myorg.local:8081/nexus/content/groups/public)]: Plugin not found in any plugin repository -> [Help 1] In the console logs I see: [INFO] Apache Hadoop Distribution [INFO] Apache Hadoop Client [INFO] Apache Hadoop Mini-Cluster [INFO] [INFO] --------…
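One common cause (a hedged guess; nothing in the excerpt confirms the local setup) is a mirror entry in ~/.m2/settings.xml that routes all repositories through the company Nexus, so nothing can be resolved while it is down. Two usual ways out are building offline against the local cache (mvn -o ...) when the plugins are already cached, or narrowing the mirror so Maven Central is reached directly. A sketch of the latter; the URL is taken from the error message, the mirrorOf pattern is illustrative:

```xml
<!-- Fragment of ~/.m2/settings.xml: excluding central from the mirror lets
     Maven fall back to the default repositories when Nexus is unreachable. -->
<mirrors>
  <mirror>
    <id>nexus</id>
    <mirrorOf>external:*,!central</mirrorOf>
    <url>http://beefy.myorg.local:8081/nexus/content/groups/public</url>
  </mirror>
</mirrors>
```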

2021-06-15 09:44:24    Category: Q&A    java   maven   hadoop   nexus

How to compute sum of a field in all the rows from an alias

What I want to do is sum the values of a field across all rows in an alias. This must be simple, but somehow I can't find the answer. This is probably because what I want is a scalar value, while Pig works with datasets? I guess I could create a relation with a single row whose field is the sum? Please advise!
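A hedged sketch (the alias and field names are assumptions): grouping the whole relation with GROUP ... ALL produces a single row whose bag can be summed, which is the usual way to get a scalar-like result in Pig:

```pig
-- 'data' and 'amount' are illustrative names for the alias and the field.
grouped = GROUP data ALL;
total   = FOREACH grouped GENERATE SUM(data.amount) AS total_amount;
DUMP total;   -- one row containing the sum of 'amount' over all rows
```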

2021-06-15 08:21:13    Category: Q&A    hadoop   apache-pig

Cannot create directory /home/hadoop/hadoopinfra/hdfs/namenode/current

I get the error Cannot create directory /home/hadoop/hadoopinfra/hdfs/namenode/current while trying to install Hadoop on my local Mac. What could be the reason for this? Just for reference, I'm including my XML files below. mapred-site.xml: <configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> </configuration> hdfs-site.xml: <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.name.dir</name> <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value> </property> <property> <name>dfs.data.dir<
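In cases like this the usual suspect (a hedged guess; nothing in the excerpt confirms it) is filesystem permissions on the dfs.name.dir path rather than the XML itself. A sketch of the commonly suggested fix, assuming Hadoop runs as the hadoop user:

```sh
# Create the storage directory with a privileged user, hand it to the account
# that runs Hadoop, then reformat the NameNode so it can write its 'current'
# subdirectory.
sudo mkdir -p /home/hadoop/hadoopinfra/hdfs/namenode
sudo chown -R hadoop:hadoop /home/hadoop/hadoopinfra
hdfs namenode -format
```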

2021-06-15 07:54:39    Category: Q&A    hadoop   hdfs