Issue with quickstart introduction


Issue with quickstart introduction

Arnaud G

Hi,

 

I have compiled the latest version of CarbonData, which is compatible with HDP 2.6. I’m following the steps below, but the data never ends up in the table.

 

Start Spark Shell:

/home/ubuntu/carbondata# spark-shell --jars /home/ubuntu/carbondata/carbondata_2.11-1.2.0-SNAPSHOT-shade-hadoop2.7.2.jar

 

Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.3-8

      /_/

 

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)

Type in expressions to have them evaluated.

Type :help for more information.

 

scala>  import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.SparkSession

 

scala> import org.apache.spark.sql.CarbonSession._

import org.apache.spark.sql.CarbonSession._

 

scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("/test/carbondata/","/test/carbondata/")

17/07/26 14:58:42 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.

17/07/26 14:58:42 WARN CarbonProperties: main The enable unsafe sort value "null" is invalid. Using the default value "false

17/07/26 14:58:42 WARN CarbonProperties: main The custom block distribution value "null" is invalid. Using the default value "false

17/07/26 14:58:42 WARN CarbonProperties: main The enable vector reader value "null" is invalid. Using the default value "true

17/07/26 14:58:42 WARN CarbonProperties: main The value "null" configured for key carbon.lock.type" is invalid. Using the default value "HDFSLOCK

carbon: org.apache.spark.sql.SparkSession = org.apache.spark.sql.CarbonSession@5f7bd970

 

scala> carbon.sql("CREATE TABLE IF NOT EXISTS test_carbon(id string, name string, city string,age Int)  STORED BY 'carbondata'")

17/07/26 15:04:35 AUDIT CreateTable: [gateway-dc1r04n01][hdfs][Thread-1]Creating Table with Database name [default] and Table name [test_carbon]

17/07/26 15:04:36 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.CarbonSource. Persisting data source table `default`.`test_carbon` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

17/07/26 15:04:36 AUDIT CreateTable: [gateway-dc1][hdfs][Thread-1]Table created with Database name [default] and Table name [test_carbon]

res7: org.apache.spark.sql.DataFrame = []

 

scala> carbon.sql("describe test_carbon").show()

+--------+---------+-------+

|col_name|data_type|comment|

+--------+---------+-------+

|      id|   string|   null|

|    name|   string|   null|

|    city|   string|   null|

|     age|      int|   null|

+--------+---------+-------+

 

 

scala> carbon.sql("INSERT INTO test_carbon VALUES(1,'x1','x2',34)")

17/07/26 15:07:25 AUDIT CarbonDataRDDFactory$: [gateway-dc1][hdfs][Thread-1]Data load request has been received for table default.test_carbon

17/07/26 15:07:25 WARN CarbonDataProcessorUtil: main sort scope is set to LOCAL_SORT

17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker for task 5 sort scope is set to LOCAL_SORT

17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker for task 5 batch sort size is set to 0

17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker for task 5 sort scope is set to LOCAL_SORT

17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker for task 5 sort scope is set to LOCAL_SORT

17/07/26 15:07:25 AUDIT CarbonDataRDDFactory$: [gateway-dc1r04n01][hdfs][Thread-1]Data load is successful for default.test_carbon

res11: org.apache.spark.sql.DataFrame = []

 

scala> carbon.sql("LOAD DATA INPATH 'hdfs://xxxx/test/carbondata/sample.csv' INTO TABLE test_carbon")

17/07/26 14:59:28 AUDIT CarbonDataRDDFactory$: [gateway-dc1][hdfs][Thread-1]Data load request has been received for table default.test_table

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: main sort scope is set to LOCAL_SORT

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch worker for task 0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e] sort scope is set to LOCAL_SORT

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch worker for task 0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e] batch sort size is set to 0

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch worker for task 0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e] sort scope is set to LOCAL_SORT

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch worker for task 0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e] sort scope is set to LOCAL_SORT

17/07/26 14:59:29 AUDIT CarbonDataRDDFactory$: [gateway-dc1][hdfs][Thread-1]Data load is successful for default.test_table

res1: org.apache.spark.sql.DataFrame = []

 

 

scala> carbon.sql("Select * from test_carbon").show()

java.io.FileNotFoundException: File /test/carbondata/default/test_table/Fact/Part0/Segment_0 does not exist.

  at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1081)

  at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1059)

  at org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1004)

  at org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1000)

  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)

  at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1000)

  at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1735)

  at org.apache.carbondata.hadoop.CarbonInputFormat.getFileStatusInternal(CarbonInputFormat.java:862)

  at org.apache.carbondata.hadoop.CarbonInputFormat.getFileStatus(CarbonInputFormat.java:845)

  at org.apache.carbondata.hadoop.CarbonInputFormat.listStatus(CarbonInputFormat.java:802)

  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)

  at org.apache.carbondata.hadoop.CarbonInputFormat.getSplitsInternal(CarbonInputFormat.java:319)

  at org.apache.carbondata.hadoop.CarbonInputFormat.getTableBlockInfo(CarbonInputFormat.java:523)

  at org.apache.carbondata.hadoop.CarbonInputFormat.getSegmentAbstractIndexs(CarbonInputFormat.java:616)

  at org.apache.carbondata.hadoop.CarbonInputFormat.getDataBlocksOfSegment(CarbonInputFormat.java:441)

  at org.apache.carbondata.hadoop.CarbonInputFormat.getSplits(CarbonInputFormat.java:379)

  at org.apache.carbondata.hadoop.CarbonInputFormat.getSplits(CarbonInputFormat.java:302)

  at org.apache.carbondata.spark.rdd.CarbonScanRDD.getPartitions(CarbonScanRDD.scala:81)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)

  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)

  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2378)

  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)

  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2780)

  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2377)

  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2384)

  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2120)

  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2119)

  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2810)

  at org.apache.spark.sql.Dataset.head(Dataset.scala:2119)

  at org.apache.spark.sql.Dataset.take(Dataset.scala:2334)

  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)

 at org.apache.spark.sql.Dataset.show(Dataset.scala:638)

  at org.apache.spark.sql.Dataset.show(Dataset.scala:597)

  at org.apache.spark.sql.Dataset.show(Dataset.scala:606)

  ... 50 elided

 

I have checked the folder on HDFS: the /test/carbondata/default/test_carbon/ structure exists, but the folder is empty.


I’m pretty sure I’m missing something silly, but I have not been able to find a way to insert data into the table.

 

On another subject, I’m also trying to access this through Presto, but there the error is always: Query 20170726_145207_00005_ytsnk failed: line 1:1: Schema 'default' does not exist

 

I’m also a little bit lost: from Spark it seems that the tables are created in the Hive metastore, but the Presto plugin doesn’t seem to refer to it.

 

Thanks for reading!

 

AG

Re: Issue with quickstart introduction

Divya Gupta
Thanks for your interest in CarbonData.

The /test/carbondata/default/test_carbon/ folder is empty because the data load failed.

Inserting single or multiple rows into a CarbonData table with an INSERT ... VALUES statement is currently not supported in CarbonData. Please try loading data from a CSV file with the LOAD DATA statement, for example:
carbon.sql("LOAD DATA INPATH 'sample.csv file path' INTO TABLE test_carbon")

The CSV file can be either on the local disk or on HDFS.
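
A rough sketch of what that could look like for your test_carbon table (the hdfs://<namenode> prefix and the OPTIONS values below are assumptions; adjust them to your cluster and to the actual layout of your CSV):
```
// Sketch only: hdfs://<namenode> is a placeholder for your NameNode address.
// FILEHEADER lists the CSV columns in order and is typically only needed when
// the file has no header row; DELIMITER is spelled out here for clarity.
carbon.sql(
  """LOAD DATA INPATH 'hdfs://<namenode>/test/carbondata/sample.csv'
    |INTO TABLE test_carbon
    |OPTIONS('DELIMITER'=',', 'FILEHEADER'='id,name,city,age')""".stripMargin)
```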

Regards
Divya Gupta




Re: Issue with quickstart introduction

xuchuanyin
In reply to this post by Arnaud G
I encountered this problem before.

Please check whether the directory `/test/carbondata/default/test_carbon/` exists on your local machine.
If it exists, the problem may lie in
```
SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("/test/carbondata/","/test/carbondata/")
```

To solve this, create the CarbonSession with a store location that carries an explicit hdfs:// qualifier.
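
A minimal sketch of what that might look like (the hdfs://<namenode>:8020 prefix below is a placeholder; take the real value from fs.defaultFS in your core-site.xml):
```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// Sketch: hdfs://<namenode>:8020 is a placeholder for this cluster's fs.defaultFS.
// A fully qualified store path keeps Carbon from resolving /test/carbondata
// against the local file system.
val carbon = SparkSession.builder()
  .config(sc.getConf)
  .getOrCreateCarbonSession(
    "hdfs://<namenode>:8020/test/carbondata",   // store location
    "hdfs://<namenode>:8020/test/carbondata")   // second path, mirroring the original call
```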
Reply | Threaded
Open this post in threaded view
|

Re: Issue with quickstart introduction

Arnaud G
In reply to this post by Divya Gupta
Hi,

Thanks for your answer:

That's what I did (the line right after the INSERT statement in my first mail):

scala> carbon.sql("LOAD DATA INPATH 'hdfs://xxxx/test/carbondata/sample.csv' INTO TABLE test_carbon")

17/07/26 14:59:28 AUDIT CarbonDataRDDFactory$: [gateway-dc1][hdfs][Thread-1]Data load request has been received for table default.test_table

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: main sort scope is set to LOCAL_SORT

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch worker for task 0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e] sort scope is set to LOCAL_SORT

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch worker for task 0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e] batch sort size is set to 0

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch worker for task 0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e] sort scope is set to LOCAL_SORT

17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch worker for task 0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e] sort scope is set to LOCAL_SORT

17/07/26 14:59:29 AUDIT CarbonDataRDDFactory$: [gateway-dc1][hdfs][Thread-1]Data load is successful for default.test_table

res1: org.apache.spark.sql.DataFrame = []


I don't see any error, but still no data appears in the table. What is the best way to get more information on what is going on behind the scenes, to understand where it may have failed?


Thanks





Re: Issue with quickstart introduction

Erlu Chen
In reply to this post by Arnaud G
I think the key point is the following command.

SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("/test/carbondata/","/test/carbondata/")

It seems you specified a local path as the store location while your default FileSystem is HDFS, so Carbon cannot find this path in HDFS. Please make sure your store location matches your file system.

You can check the fs.defaultFS parameter in core-site.xml.
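
For example, a quick way to check it from the running Spark shell (a sketch; the printed value is simply whatever your cluster is configured with):
```
// Read the effective default file system from the Hadoop configuration that the
// Spark shell already loaded; this reflects fs.defaultFS from core-site.xml.
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")
println(defaultFs)  // e.g. hdfs://<namenode>:8020

// The store location passed to getOrCreateCarbonSession should live under this
// file system, e.g. s"$defaultFs/test/carbondata".
```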

Please give it a try; maybe it will solve your problem.

Regards.
Chenerlu.