
apache-spark

Cannot connect to HBase using Java

Question: I am trying to connect to HBase from Java. There is only one node, which is my own machine, but I can't seem to connect successfully. Here is my Java code: public class Test { public static void main(String[] args) throws MasterNotRunningException, ZooKeeperConnectionException, IOException, ServiceException { SparkConf conf = new SparkConf().setAppName("Test").setMaster("spark://10.239.58.111:7077"); JavaSparkContext sc = new JavaSparkContext(conf); sc.addJar("/home/cloudera/workspace/Test/target/Test-0.0.1-SNAPSHOT.jar"); Configuration hbaseConf = HBaseConfiguration.create(); hbaseConf.addResource(new Path("/usr/lib/hbase/conf/hbase-site.xml")); HTable table = new HTable(hbaseConf
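
A minimal sketch of how the connection might look with the non-deprecated HBase client API (ConnectionFactory instead of the old HTable constructor used in the question). The table name "test" and the row key "row1" are placeholders; the IP and the hbase-site.xml path are the ones from the question:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object HBaseConnectSketch {
  def main(args: Array[String]): Unit = {
    // Build an HBase configuration and point it at the cluster's hbase-site.xml
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.addResource(new Path("/usr/lib/hbase/conf/hbase-site.xml"))
    // The ZooKeeper quorum can also be set explicitly (single-node setup from the question)
    hbaseConf.set("hbase.zookeeper.quorum", "10.239.58.111")

    // ConnectionFactory replaces the deprecated `new HTable(conf, name)` pattern
    val connection = ConnectionFactory.createConnection(hbaseConf)
    try {
      val table = connection.getTable(TableName.valueOf("test")) // placeholder table name
      val result = table.get(new Get(Bytes.toBytes("row1")))     // placeholder row key
      println(s"Row found: ${!result.isEmpty}")
      table.close()
    } finally {
      connection.close()
    }
  }
}
```

If createConnection hangs or times out, the usual suspects are the ZooKeeper quorum address in hbase-site.xml and hostname resolution between the client and the single node.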

2021-06-21 13:04:30    Category: Technical Articles    java   hbase   apache-spark

Extract and Visualize Model Trees from Sparklyr

Does anyone have any advice about how to convert the tree information from sparklyr's ml_decision_tree_classifier, ml_gbt_classifier, or ml_random_forest_classifier models into a.) a format that can be understood by other R tree-related libraries and (ultimately) b.) a visualization of the trees for non-technical consumption? This would include the ability to convert back to the actual feature names from the substituted string-index values that are produced by the vector assembler. The following code is copied liberally from a sparklyr blog post for the purposes of providing an example
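
For orientation, sparklyr wraps Spark ML, and on the Spark side a fitted tree already exposes its split structure through toDebugString. The sketch below is Scala rather than R, with a tiny made-up dataset and hypothetical column names, just to show what the underlying representation looks like (feature indices still need to be mapped back to names via the assembler's input order):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TreeDebugString").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: binary label plus two numeric and one categorical feature
val df = Seq(
  (0.0, 22.0, 7.25, "male"),
  (1.0, 38.0, 71.28, "female"),
  (1.0, 26.0, 7.92, "female"),
  (0.0, 35.0, 8.05, "male")
).toDF("label", "age", "fare", "sex")

val indexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndexed")
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "fare", "sexIndexed"))
  .setOutputCol("features")
val dt = new DecisionTreeClassifier() // defaults: labelCol = "label", featuresCol = "features"

val model = new Pipeline().setStages(Array(indexer, assembler, dt)).fit(df)
val tree = model.stages.last.asInstanceOf[DecisionTreeClassificationModel]

// Prints the splits as "feature N <= value"; N is the position in the assembler's input list,
// which is how indices get translated back to real feature names
println(tree.toDebugString)
```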

2021-06-21 13:00:06    Category: Q&A    r   apache-spark   random-forest   decision-tree   sparklyr

How to reduce the verbosity of Spark's runtime output?

Question: How can I reduce the amount of trace output Spark produces at runtime? The default is too verbose; how do I turn it off, and turn it back on when I need it? Thanks. Verbose mode: scala> val la = sc.parallelize(List(12,4,5,3,4,4,6,781)) scala> la.collect 15/01/28 09:57:24 INFO SparkContext: Starting job: collect at <console>:15 15/01/28 09:57:24 INFO DAGScheduler: Got job 3 (collect at <console>:15) with 1 output ... 15/01/28 09:57:24 INFO Executor: Running task 0.0 in stage 3.0 (TID 3) 15/01/28 09:57:24 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 626 bytes result sent to driver 15/01/28 09:57:24 INFO DAGScheduler: Stage 3 (collect at <console>:15) finished in 0.002 s 15/01/28 09:57:24 INFO
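
A common way to quiet this (a sketch, assuming a shell session where sc is already defined) is to raise the log level, either programmatically or through the log4j configuration file. Note that setLogLevel only exists from Spark 1.4 onward, so the 1.2-era shell in the question would need the log4j.properties route:

```scala
// Programmatic toggle, once the SparkContext exists (Spark 1.4+)
sc.setLogLevel("WARN")   // only WARN and above from here on

val la = sc.parallelize(List(12, 4, 5, 3, 4, 4, 6, 781))
la.collect()             // no more INFO lines for each stage/task

sc.setLogLevel("INFO")   // switch the verbose output back on when needed

// For older releases: copy conf/log4j.properties.template to conf/log4j.properties
// and change   log4j.rootCategory=INFO, console   to   log4j.rootCategory=WARN, console
```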

2021-06-21 12:31:11    Category: Technical Articles    scala   apache-spark

How can I make Spark Streaming count the words in a file in a unit test?

I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala. When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C. Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is the number of words. What am I missing? Below is the unit test file, and after that I've also included the
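
One way to make such a test deterministic (a sketch, assuming it is acceptable to feed the stream from a queue of RDDs instead of a monitored directory) is to use queueStream and collect each micro-batch's counts into an in-memory map on the driver. Names and sleep times below are placeholders:

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCountTestSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCountTest")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Instead of textFileStream(dir), push test data through a queue of RDDs
    val queue = mutable.Queue[RDD[String]]()
    val lines = ssc.queueStream(queue)

    val counts = mutable.Map[String, Long]()
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .foreachRDD { rdd =>
        // Bring each micro-batch's counts back to the driver so the test can assert on them
        rdd.collect().foreach { case (w, c) => counts(w) = counts.getOrElse(w, 0L) + c }
      }

    ssc.start()
    queue += ssc.sparkContext.parallelize(Seq("hello world", "hello spark"))
    Thread.sleep(3000) // crude: give the 1s batch interval time to fire; a real test would poll
    ssc.stop(stopSparkContext = true, stopGracefully = true)

    println(counts) // expected: Map(hello -> 2, world -> 1, spark -> 1)
  }
}
```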

2021-06-21 12:20:36    Category: Q&A    java   unit-testing   apache-spark   spark-streaming

Spark application throws javax.servlet.FilterRegistration

Question: I am creating and running a Spark application locally using Scala. My build.sbt: name := "SparkDemo" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" exclude("org.apache.hadoop", "hadoop-client") libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.2.0" libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.6.0" excludeAll( ExclusionRule(organization = "org.eclipse.jetty")) libraryDependencies += "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.6.0" libraryDependencies += "org.apache.hbase" % "hbase-client" % "0.98.4-hadoop2"
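
The FilterRegistration clash typically comes from an older javax.servlet / Jetty being pulled in transitively by the Hadoop and HBase client artifacts alongside Spark's own. A hedged build.sbt sketch (same artifact versions as the question) of the kind of exclusions that are commonly suggested for this, not necessarily the exact accepted fix:

```scala
// build.sbt sketch: keep Spark's servlet/Jetty, exclude the ones Hadoop/HBase drag in
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0",
  "org.apache.spark" %% "spark-sql"  % "1.2.0",
  "org.apache.hadoop" % "hadoop-common" % "2.6.0"
    excludeAll(
      ExclusionRule(organization = "javax.servlet"),
      ExclusionRule(organization = "org.eclipse.jetty"),
      ExclusionRule(organization = "org.mortbay.jetty")
    ),
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.6.0"
    excludeAll(ExclusionRule(organization = "javax.servlet")),
  "org.apache.hbase" % "hbase-client" % "0.98.4-hadoop2"
    excludeAll(
      ExclusionRule(organization = "org.mortbay.jetty"),
      ExclusionRule(organization = "javax.servlet")
    )
)
```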

2021-06-21 12:01:47    Category: Technical Articles    scala   intellij-idea   sbt   apache-spark

Why does spark-ml ALS model returns NaN and negative numbers predictions?

Actually I'm trying to use ALS from spark-ml with implicit ratings. I noticed that some of the predictions given by my trained model are negative or NaN. Why is that?
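
In short: NaN predictions usually correspond to users or items that were absent from the training split (the cold-start case), while negative scores are expected with implicit-feedback ALS, because the output is an unbounded preference/confidence score rather than a rating. On Spark 2.2+ the NaN rows can be dropped before evaluation; below is a sketch with hypothetical column names and toy data:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ImplicitALS").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical implicit-feedback data: (userId, itemId, clicks)
val ratings = Seq(
  (1, 10, 3.0), (1, 11, 1.0), (2, 10, 5.0),
  (2, 12, 2.0), (3, 11, 4.0), (3, 12, 1.0)
).toDF("userId", "itemId", "clicks")

val als = new ALS()
  .setImplicitPrefs(true)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("clicks")
  .setColdStartStrategy("drop") // Spark 2.2+: drop rows whose user/item was unseen at fit time

val Array(train, test) = ratings.randomSplit(Array(0.8, 0.2))
val model = als.fit(train)

// Without "drop", unseen users/items predict NaN; negative values
// for the remaining rows are normal for implicit ALS and only their ranking matters.
val predictions = model.transform(test)
predictions.show()
```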

2021-06-21 11:53:29    Category: Q&A    apache-spark   pyspark   apache-spark-mllib

Spark SQL: how to cache sql query result without using rdd.cache()

Question: Is there any way to cache the result of a SQL query without using rdd.cache()? For example: output = sqlContext.sql("SELECT * From people") We can cache the result with output.cache(), but then we can no longer query it with SQL. So is there something like sqlContext.cacheTable() for caching the result? Answer 1: You should cache it with sqlContext.cacheTable("table_name"), or with the CACHE TABLE table_name SQL query. Here is an example. I have this file on HDFS: 1|Alex|alex@gmail.com 2|Paul|paul@example.com 3|John|john@yahoo.com Then the code in PySpark: people = sc.textFile('hdfs://sparkdemo:8020/people.txt') people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2])) tbl = sqlContext.inferSchema(people_t) tbl.registerTempTable(
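
For comparison, the same idea as a minimal Scala sketch (assuming a sqlContext in scope as in a spark-shell session, an already-registered "people" table, and a hypothetical "people_cached" table name):

```scala
// Register the query result as a temp table, then cache it by name
val output = sqlContext.sql("SELECT * FROM people")
output.registerTempTable("people_cached")

// Either of these caches the table so subsequent SQL against it reads from memory
sqlContext.cacheTable("people_cached")
sqlContext.sql("CACHE TABLE people_cached")

// ... SQL queries against people_cached now hit the cached data ...

// Release the memory when done
sqlContext.uncacheTable("people_cached")
```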

2021-06-21 11:45:58    Category: Technical Articles    caching   query-optimization   apache-spark

Spark remove duplicate rows from DataFrame [duplicate]

This question already has answers here: How to select the first row of each group? (7 answers) Closed 5 years ago. Assume I have a DataFrame like: val json = sc.parallelize(Seq("""{"a":1, "b":2, "c":22, "d":34}""","""{"a":3, "b":9, "c":22, "d":12}""","""{"a":1, "b":4, "c":23, "d":12}""")) val df = sqlContext.read.json(json) I want to remove duplicate rows for column "a" based on the value of column "b"; i.e., if there are duplicate rows for column "a", I want to keep the one with the larger value for "b". For the above example, after processing, I need only {"a":3, "b":9, "c":22, "d":12}
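
One common approach (a sketch, not necessarily the accepted answer in the linked duplicate) is a window partitioned by "a" and ordered by "b" descending, keeping only the first row of each partition:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

val json = sc.parallelize(Seq(
  """{"a":1, "b":2, "c":22, "d":34}""",
  """{"a":3, "b":9, "c":22, "d":12}""",
  """{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)

// For each value of "a", keep the row with the largest "b"
val w = Window.partitionBy("a").orderBy(desc("b"))
val deduped = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

deduped.show()
// Keeps {"a":1,"b":4,...} and {"a":3,"b":9,...}
```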

2021-06-21 11:40:46    Category: Q&A    scala   apache-spark   dataframe   apache-spark-sql

Getting emr-ddb-hadoop.jar to connect DynamoDB with EMR Spark

I have a DynamoDB table that I need to connect to EMR Spark SQL to run queries on the table. I got the EMR Spark Cluster with release label emr-4.6.0 and Spark 1.6.1 on it. I am referring to the document: Analyse DynamoDB Data with Spark After connecting to the master node, I run the command: spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar It gives a warning: Warning: Local jar /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar does not exist, skipping. Later, when I import the DynamoDB Input Format using import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat import org.apache
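
Once the emr-ddb-hadoop jar is actually found and put on the driver and executor classpath, the read itself follows the pattern from the AWS walkthrough the question refers to. A sketch run in spark-shell with a placeholder table name and endpoint; the class names and property keys below follow that example and should be checked against your EMR release:

```scala
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf

// Placeholder table and endpoint; substitute your own DynamoDB table and region
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.input.tableName", "myDynamoTable")
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")

// Each record comes back as (Text key, DynamoDBItemWritable item)
val items = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
println(items.count())
```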

2021-06-21 11:34:05    Category: Q&A    hadoop   amazon-web-services   apache-spark   amazon-dynamodb

YARN REST API - Spark job submission

I am trying to use the YARN REST API to submit the spark-submit jobs which I generally run via the command line. My spark-submit command line looks like this: JAVA_HOME=/usr/local/java7/ HADOOP_CONF_DIR=/etc/hadoop/conf /usr/local/spark-1.5/bin/spark-submit \ --driver-class-path "/etc/hadoop/conf" \ --class MySparkJob \ --master yarn-cluster \ --conf "spark.executor.extraClassPath=/usr/local/hadoop/client/hadoop-*" \ --conf "spark.driver.extraClassPath=/usr/local/hadoop/client/hadoop-*" \ spark-job.jar --retry false --counter 10 Reading through the YARN REST API documentation https://hadoop.apache
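
The YARN REST payload for a Spark application master is fiddly to assemble by hand. One programmatic alternative (not the REST API itself, just a sketch using SparkLauncher, which ships with Spark) that drives the same spark-submit settings from code; paths, class name, and arguments are taken from the command in the question:

```scala
import org.apache.spark.launcher.SparkLauncher

// HADOOP_CONF_DIR / JAVA_HOME can be passed via the SparkLauncher(envMap) constructor if needed
val process = new SparkLauncher()
  .setSparkHome("/usr/local/spark-1.5")
  .setAppResource("spark-job.jar")
  .setMainClass("MySparkJob")
  .setMaster("yarn-cluster")
  .setConf("spark.executor.extraClassPath", "/usr/local/hadoop/client/hadoop-*")
  .setConf("spark.driver.extraClassPath", "/usr/local/hadoop/client/hadoop-*")
  .addAppArgs("--retry", "false", "--counter", "10")
  .launch()

process.waitFor() // blocks until the spawned spark-submit child process exits
```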

2021-06-21 10:45:15    Category: Q&A    hadoop   apache-spark   yarn