carbon data performance doubts

Swapnil Shinde
Hello All,
     I am trying CarbonData for the first time and have a few questions about improving performance:

1. What is the use of the carbon.number.of.cores property, and how is it different from Spark's executor cores?

2. The documentation says that, by default, all non-numeric columns (except complex types) become dimensions and numeric columns become measures. How are dimension and measure columns handled differently? What are the pros and cons of keeping a column as a dimension vs. a measure?

3. What is the best approach when we have an ID INT column that will be used heavily for filtering/aggregation/joins but can't be a dimension by default? The documentation says to include this kind of numeric column in "dictionary_include" or "dictionary_exclude" in the table definition so that the column is treated as a dimension. However, non-string data types are not supported in "dictionary_exclude" (link). Does that mean we have to enable dictionary encoding for such ID INT columns, and is encoding them beneficial?

4. How is the MDK generated, and how can we alter it? Is there any API to find out the MDK for a given table?

        It would be good to understand the above concepts in detail so that we can use CarbonData effectively.