[POSSIBLE BUG] Carbondata 1.1.1 inaccurate results


[POSSIBLE BUG] Carbondata 1.1.1 inaccurate results

Swapnil Shinde
Hello All
    We are seeing incorrect query results with CarbonData 1.1.1. Details are below -

Datasets used -
     Star Schema Benchmark (SSB) datasets, which are derived from TPC-H (http://www.cs.umb.edu/~poneil/StarSchemaB.PDF)
Query - 
     select cCustKey,loCustKey from customer, lineorder where loCustkey = cCustKey
How we load data -
     We tried loading the data both through the DataFrame API and through "INSERT" statements, and both ways produce incorrect results. One way of loading is shown here -


-- CREATE CUSTOMER TABLE

carbon.sql("CREATE TABLE IF NOT EXISTS customer(cCustKey Int, cName string, cAddress string, cCity string, cNation string, cRegion string, cPhone string, cMktSegment string, dummy string) STORED BY 'carbondata'")

carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/customer' INTO TABLE customer OPTIONS('DELIMITER'='\t','FILEHEADER'='cCustKey,cName,cAddress,cCity,cNation,cRegion,cPhone,cMktsegment,dummy')")

 

-- CREATE LINEORDER TABLE

carbon.sql("CREATE TABLE IF NOT EXISTS lineorder(loOrderkey bigint,loLinenumber Int,loCustkey Int,loPartkey Int,loSuppkey Int,loOrderdate Int,loOrderpriority String,loShippriority Int,loQuantity Int,loExtendedprice Int,loOrdtotalprice Int,loDiscount Int,loRevenue Int,loSupplycost Int,loTax Int,loCommitdate Int,loShipmode String,dummy String) STORED BY 'carbondata'")

carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/lineorder' INTO TABLE lineorder OPTIONS('DELIMITER'='\t','FILEHEADER'='loOrderkey,loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loOrderpriority,loShippriority,loQuantity,loExtendedprice,loOrdtotalprice,loDiscount,loRevenue,loSupplycost,loTax,loCommitdate,loShipmode,dummy')")


Results with different versions - 

   1.1.0 - Provides correct results for the above query. Validated against results from parquet.

   1.1.1 - Built from this. The join is missing many rows compared to parquet.

   1.1.1 - Built from the source code available for download. The join is missing many rows compared to parquet.

      1.2 - Built from the master branch. Generates correct results, matching parquet.


Debugging further - 

1. Row counts for both the lineorder and customer tables are the same in carbondata and parquet.

2. The key column values in carbondata and parquet match as well -

         val cd = carbon.sql("select cCustKey from customer")   // cd.distinct.count = 30,000,000

         val sp = spark.sql("select cCustKey from pcustomer")   // sp.distinct.count = 30,000,000

         cd.intersect(sp).count   // 30,000,000 (carbondata has the same keys as parquet)

 

         val cd = carbon.sql("select loCustKey from lineorder")   // cd.distinct.count = 13,365,986

         val sp = spark.sql("select loCustKey from plineorder")   // sp.distinct.count = 13,365,986

         cd.intersect(sp).count   // 13,365,986 (carbondata has the same keys as parquet)


The queries above show that the carbondata customer and lineorder tables have the same key values as their parquet counterparts.

However, when we run the join query above, carbondata returns only a very small subset of the expected rows. A filter query on any specific key also returns no results.
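
To narrow down which join rows go missing, the two joins can be compared key by key. The following is only a sketch, assuming `carbon` and `spark` are the sessions used in the snippets above and that `pcustomer` and `plineorder` are the parquet copies of the tables. The object `JoinCheck` and its helpers `custJoin` and `missingKeys` are hypothetical names introduced here for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinCheck {
  // Runs the join from this post against a given pair of tables.
  def custJoin(session: SparkSession, customer: String, lineorder: String): DataFrame =
    session.sql(
      s"select cCustKey, loCustKey from $customer, $lineorder where loCustKey = cCustKey")

  // Distinct keys present in the reference (parquet) join but absent from the carbondata join.
  def missingKeys(carbonJoin: DataFrame, parquetJoin: DataFrame): DataFrame =
    parquetJoin.select("cCustKey").distinct.except(carbonJoin.select("cCustKey").distinct)
}

// Possible usage in spark-shell (table and session names as in this post):
//   val cj = JoinCheck.custJoin(carbon, "customer", "lineorder")
//   val pj = JoinCheck.custJoin(spark, "pcustomer", "plineorder")
//   println(s"carbondata rows = ${cj.count}, parquet rows = ${pj.count}")
//   JoinCheck.missingKeys(cj, pj).show(20)
```

A sample of the missing keys can then be used as input for single-key filter queries on the carbondata tables.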


Not sure why v1.1.1 is producing incorrect results. My guess is that carbondata is skipping rows that it shouldn't in v1.1.1.
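
One way to probe the guess about skipped rows is to count the rows for a single key in the carbondata table and in its parquet copy, and compare the two numbers. This is a minimal sketch under the same assumptions as the snippets above (`carbon` and `spark` sessions, parquet tables prefixed with `p`). The object `KeyProbe` and its helper `rowsForKey` are hypothetical names, and the key value in the usage comment is only a placeholder.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object KeyProbe {
  // Counts the rows in `table` whose `column` equals `key`, using an equality filter.
  def rowsForKey(session: SparkSession, table: String, column: String, key: Int): Long =
    session.table(table).filter(col(column) === key).count()
}

// Possible usage in spark-shell; the key 12345 is a placeholder, not a value from the data:
//   val k = 12345
//   val inCarbon  = KeyProbe.rowsForKey(carbon, "lineorder", "loCustKey", k)
//   val inParquet = KeyProbe.rowsForKey(spark, "plineorder", "loCustKey", k)
//   println(s"key $k: carbondata = $inCarbon, parquet = $inParquet")
```

If the parquet count is non-zero while the carbondata count is zero for a key that the unfiltered scan does return, that would be consistent with the filter path dropping rows, although it does not by itself identify the cause.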

Any help and suggestions are very much appreciated!! Thanks in advance..



Thanks

Swapnil Shinde





 





Re: [POSSIBLE BUG] Carbondata 1.1.1 inaccurate results

Ravindra Pesala
Hi,

I verified this on 1.1.1 using TPC-H tables with 1 GB of generated data, but I got the result below. I don't have the exact schema you mentioned, so I verified with the original TPC-H schema.

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey) from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+


On parquet with the same data:

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey) from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+


Regards,
Ravindra.

On 23 August 2017 at 19:40, Swapnil Shinde <[hidden email]> wrote:
--
Thanks & Regards,
Ravi