Query regarding behaviour of sort column

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Query regarding behaviour of sort column

Pallavi Singh
This post was updated on .
Hi Community,

While working with the sort columns, I saw that if the column listed in the
sort_column happens to have a null value in the data while sorting , then
the row corresponding to that null value is eliminated from the result set.
Is this correct behavior? Ideally null values should be sorted and listed
on the top in the result set.

--
Regards | Pallavi Singh
Software Consultant

Reply | Threaded
Open this post in threaded view
|

Re: Query regarding behaviour of sort column

Pallavi Singh
Hi Community,

While working with the above problem I found two discussions regarding the sort column 

1. https://github.com/apache/carbondata/pull/635 which states : If the table need be sorted by a measure, we should use dictionary_include to add it to dimension list.

2. https://github.com/apache/carbondata/pull/757 : if a column of SORT_COLUMNS is a measure before, now this column will be created as a dimension. And this dimension is a no-dicitonary column(Better to use other direct-dictionary).

Now if the columns in my sort_column be measures then I have to add the same columns in the dictionary_include other wise in case of null value in case of sort_column column the loading fails after the first null encounter itself. 

for example like this :
CREATE TABLE test_sort_col
| (id INT,
| name STRING,
| age INT
| )STORED BY 'org.apache.carbondata.format'
| TBLPROPERTIES('SORT_COLUMNS'='id,age','DICTIONARY_INCLUDE'='id,age')
and the csv has following data :
1,Pallavi,25
2,Geetika,24
3,Prabhat,twenty six
7,Neha,25
2,Geetika,22
3,Sangeeta,26
and the load gets successful like shown below :

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| Pallavi|  25|
|  2| Geetika|  22|
|  2| Geetika|  24|
|  3| Prabhat|null|
|  3|Sangeeta|  26|
|  7|    Neha|  25|
+---+--------+----+

now if i remove the measures of the sort_column from the dictionary_include in the  query I get an error and partial data gets loaded, snapshot is provided below 

Data load request has been received for table default.test_sort_col
17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl: pool-20-thread-1 
java.lang.ClassCastException: java.lang.String cannot be cast to [B
at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:150)
at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:117)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder: [pallavi][pallavi][Thread-83]Processing unsafe inmemory rows page with size : 2
17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591] Data Load is partially success for table test_sort_col
17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$: [pallavi][pallavi][Thread-1]Data load is partially successful for default.test_sort_col
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|Pallavi| 25|
|  2|Geetika| 24|
+---+-------+---+ 

What is the correct behavior, should we add measures in sort_column to dictionary_include or should we modify the load flow to handle null values?


On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <[hidden email]> wrote:
Hi Community,

While working with the sort columns, I saw that if the column listed in the sort_column happens to have a null value in the data while sorting , then the row corresponding to that null value is eliminated from the result set. Is this correct behavior? Ideally null values should be sorted and listed on the top in the result set.

--
Regards | Pallavi Singh 
Software Consultant
Reply | Threaded
Open this post in threaded view
|

Re: Query regarding behaviour of sort column

manish gupta
Hi Pallavi,

Behavior of column should be same irrespective of whether it is dictionary, no dictionary, measure or sort column.

When you specify any numeric column as sort column, it is processed as no dictionary column.

In your case as the BAD_RECORDS_FORCE  is force so data load should not fail and consistent behavior should be displayed. However there is a bug in the code because of which some inconsistency is being encountered.

I will raise a jira for the same and track the issue.

Regards
Manish Gupta

On Wed, May 17, 2017 at 4:49 PM, Pallavi Singh <[hidden email]> wrote:
Hi Community,

While working with the above problem I found two discussions regarding the sort column 

1. https://github.com/apache/carbondata/pull/635 which states : If the table need be sorted by a measure, we should use dictionary_include to add it to dimension list.

2. https://github.com/apache/carbondata/pull/757 : if a column of SORT_COLUMNS is a measure before, now this column will be created as a dimension. And this dimension is a no-dicitonary column(Better to use other direct-dictionary).

Now if the columns in my sort_column be measures then I have to add the same columns in the dictionary_include other wise in case of null value in case of sort_column column the loading fails after the first null encounter itself. 

for example like this :
CREATE TABLE test_sort_col
| (id INT,
| name STRING,
| age INT
| )STORED BY 'org.apache.carbondata.format'
| TBLPROPERTIES('SORT_COLUMNS'='id,age','DICTIONARY_INCLUDE'='id,age')
and the csv has following data :
1,Pallavi,25
2,Geetika,24
3,Prabhat,twenty six
7,Neha,25
2,Geetika,22
3,Sangeeta,26
and the load gets successful like shown below :

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| Pallavi|  25|
|  2| Geetika|  22|
|  2| Geetika|  24|
|  3| Prabhat|null|
|  3|Sangeeta|  26|
|  7|    Neha|  25|
+---+--------+----+

now if i remove the measures of the sort_column from the dictionary_include in the  query I get an error and partial data gets loaded, snapshot is provided below 

Data load request has been received for table default.test_sort_col
17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl: pool-20-thread-1 
java.lang.ClassCastException: java.lang.String cannot be cast to [B
at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:150)
at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:117)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder: [pallavi][pallavi][Thread-83]Processing unsafe inmemory rows page with size : 2
17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591] Data Load is partially success for table test_sort_col
17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$: [pallavi][pallavi][Thread-1]Data load is partially successful for default.test_sort_col
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|Pallavi| 25|
|  2|Geetika| 24|
+---+-------+---+ 

What is the correct behavior, should we add measures in sort_column to dictionary_include or should we modify the load flow to handle null values?


On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <[hidden email]> wrote:
Hi Community,

While working with the sort columns, I saw that if the column listed in the sort_column happens to have a null value in the data while sorting , then the row corresponding to that null value is eliminated from the result set. Is this correct behavior? Ideally null values should be sorted and listed on the top in the result set.

--
Regards | Pallavi Singh 
Software Consultant