Hi,

The Apache CarbonData community is pleased to announce the release of Version 1.5.0 under The Apache Software Foundation (ASF). CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter lookups on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments, it supports queries on a single table with 3 PB of data (more than 5 trillion records) with response times of less than 3 seconds!

We encourage you to use the release https://dist.apache.org/repos/dist/release/carbondata/1.5.0/ and share your feedback with the community at [hidden email]!

This release note provides information on the new features, improvements, and bug fixes of this release.

What's New in CarbonData Version 1.5.0?

The intention of CarbonData 1.5.0 was to move closer to unified analytics. We want to enable CarbonData files to be read from more engines/libraries to support various use cases. In this regard, we have added support to read CarbonData files from C++ libraries. Additionally, CarbonData files can be read using the Java SDK, the Spark FileFormat interface, Spark, and Presto.

CarbonData added multiple optimizations to reduce the store size so that queries can take advantage of less IO. Several enhancements have also been made to CarbonData's streaming support.

In this version of CarbonData, more than 150 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.

Ecosystem Integration

Support Spark 2.3.2 ecosystem integration

CarbonData now supports Spark 2.3.2, which brings many performance improvements in addition to critical bug fixes, as well as many improvements related to streaming and unification of interfaces. In version 1.5.0, CarbonData was integrated with Spark 2.3.2 so that future versions of CarbonData can add enhancements based on Spark's new and improved capabilities.

Support Hadoop 3.1.1 ecosystem integration

CarbonData now supports Hadoop 3.1.1, the latest stable Hadoop version, which brings many new features (erasure coding, federated clusters, etc.).

LightWeight Integration with Spark

CarbonData now supports the Spark FileFormat data source APIs so that CarbonData can be integrated with Spark as an external file source. This integration makes it possible to query CarbonData tables from a plain SparkSession, and it also helps applications which need standards compliance with respect to interfaces. The Spark data source APIs support file-format-level operations such as read and write; CarbonData's enhanced features, namely IUD, Alter, Compaction, Segment Management, and Streaming, are not available when CarbonData is integrated as a Spark data source through this API. A usage sketch follows.
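As a minimal Scala sketch of this file-format-level usage: the datasource short name "carbon" and the local paths below are assumptions; please check the datasource documentation shipped with the release for the exact name registered by your build.

    import org.apache.spark.sql.SparkSession

    object CarbonFileFormatExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CarbonFileFormatExample")
          .master("local[*]")
          .getOrCreate()

        val df = spark.range(0, 1000).toDF("id")

        // Write CarbonData files like any other Spark file source.
        df.write.format("carbon").save("/tmp/carbon_example")

        // Read them back through the same data source API.
        spark.read.format("carbon").load("/tmp/carbon_example").show()
      }
    }

Note that, as described above, only read/write is available on this path; table-level features such as IUD and compaction require the full CarbonData integration.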
CarbonData Core

Adaptive Encoding for Numeric Columns

CarbonData now supports adaptive encoding for numeric columns. Adaptive encoding stores each value of a column as a delta from the Min/Max value of that column, thereby reducing the effective number of bits required to store the values. This results in a smaller store size and better query performance due to less IO. Adaptive encoding has been supported for dictionary columns since version 1.1.0; it is now supported for all numeric columns. Performance improvement measurement was not complete as of 1.5.0; the results will be published along with the 1.5.1 release.

Configurable Column Size for Generating Min/Max

CarbonData generates a Min/Max index for all columns and uses it for effective pruning of data while querying. Generating Min/Max for columns with long values (like an address column) leads to increased storage size and memory footprint, thereby reducing query performance. Moreover, filters are usually not applied on such columns, so there is no need to generate these indexes; and when filters on such columns are rare, it is wiser to accept lower query performance in those scenarios than to degrade the overall performance of all other filter scenarios due to the increased index size. CarbonData now supports configuring the limit of the column width (in terms of characters) beyond which Min/Max generation is skipped. By default, Min/Max is generated for all string columns. Users who know their data schema, and hence which columns hold long values and will not be filtered upon, can configure CarbonData to exclude such columns; alternatively, the maximum length of characters up to which Min/Max is generated can be specified, so that CarbonData skips Min/Max index generation whenever the column's character length crosses the configured threshold. By default, string columns with more than 200 bytes are skipped from Min/Max index generation; since each character occupies 2 bytes in Java, columns longer than 100 characters are skipped. See the configuration sketch at the end of this section.

Support for Map Complex Data Type

CarbonData has integrated support for the map complex data type. Map data schemas defined in Avro can be stored into CarbonData tables. Map data types enable efficient lookup of data. Map complex data type support helps users store their Avro data directly, without writing logic to convert it into CarbonData-supported data types.

Support for Byte and Float Data Types

CarbonData supports the Byte and Float data types so that these data types defined in an Avro schema can be stored into CarbonData tables. Columns of Byte data type can be included in sort columns.

ZSTD Compression

ZSTD compression is now supported to compress each page of a CarbonData file. ZSTD offers a better compression ratio, thereby reducing the store size; on average, ZSTD compression reduces the store size by 20-30%. ZSTD compression is also supported for compressing the sort temp files written during data loading.
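As a configuration sketch for the two tunables described in this section: the property names below (carbon.minmax.allowed.byte.count for the Min/Max width threshold and carbon.column.compressor for the page compressor) are assumptions based on the 1.5.0 configuration reference and should be verified against the documentation for your build. The same keys can also be set in carbon.properties instead of programmatically.

    import org.apache.carbondata.core.util.CarbonProperties

    object CarbonCoreConfigExample {
      def main(args: Array[String]): Unit = {
        val props = CarbonProperties.getInstance()

        // Skip Min/Max index generation for string columns wider than
        // 100 bytes (the default is 200 bytes, i.e. 100 Java characters).
        props.addProperty("carbon.minmax.allowed.byte.count", "100")

        // Use ZSTD instead of the default Snappy to compress column pages.
        props.addProperty("carbon.column.compressor", "zstd")
      }
    }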
CarbonData SDK

SDK Supports C++ Interfaces to read CarbonData files

To enable integration with non-Java-based execution engines, CarbonData provides a C++ reader for CarbonData files. This reader can be integrated with any execution engine to query data stored in CarbonData tables without a dependency on Spark or Hadoop.

Multi-Thread Safe Writer API in SDK

To improve write performance when using the SDK, CarbonData supports multi-thread-safe writer APIs. This enables applications to write data to a single CarbonData file in parallel. Multi-thread-safe writers help in generating bigger CarbonData files, thereby avoiding the small-files problem faced on HDFS. A sketch follows.
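Below is a minimal sketch of the multi-threaded writer, calling the Java SDK from Scala. The builder method names (outputPath, withThreadSafe, buildWriterForCSVInput) are assumptions based on the 1.5.0 SDK guide; please confirm them against the sdk-guide for your exact version.

    import org.apache.carbondata.core.metadata.datatype.DataTypes
    import org.apache.carbondata.sdk.file.{CarbonWriter, Field, Schema}

    object ThreadSafeWriterExample {
      def main(args: Array[String]): Unit = {
        val fields = Array(new Field("name", DataTypes.STRING),
                           new Field("age", DataTypes.INT))
        val numThreads: Short = 4

        // withThreadSafe makes write() safe to call from multiple
        // threads, so all threads append to the same output.
        val writer = CarbonWriter.builder()
          .outputPath("/tmp/carbon_sdk_output")
          .withThreadSafe(numThreads)
          .buildWriterForCSVInput(new Schema(fields))

        val threads = (1 to numThreads.toInt).map { t =>
          new Thread(new Runnable {
            override def run(): Unit = {
              var i = 0
              while (i < 10000) {
                // Each row is a String[] matching the schema above.
                writer.write(Array("name_" + t + "_" + i, String.valueOf(20 + t)))
                i += 1
              }
            }
          })
        }
        threads.foreach(_.start())
        threads.foreach(_.join())
        writer.close()
      }
    }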
Streaming

StreamSQL supports Kafka as streaming source

StreamSQL DDL now supports specifying Kafka as a streaming source. With this support, users no longer need to write a custom application to ingest streaming data from Kafka into CarbonData; they can simply specify 'format' as 'kafka' in the CREATE TABLE DDL.

StreamSQL supports JSON records from Kafka/socket streaming sources

StreamSQL can now accept JSON as the data format, in addition to CSV. This likewise saves users from writing custom applications to ingest streaming data into CarbonData. A DDL sketch covering both features follows.
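The sketch below embeds the StreamSQL DDL in spark.sql calls. The table property names (streaming, format, record_format, subscribe, kafka.bootstrap.servers), the CREATE STREAM/STMPROPERTIES syntax, and the getOrCreateCarbonSession entry point follow the 1.5.0 streaming guide as best recalled; treat them as assumptions and verify against the guide.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.CarbonSession._

    object StreamSqlExample {
      def main(args: Array[String]): Unit = {
        // StreamSQL DDL is parsed by CarbonData's session, so a
        // CarbonSession is created here (1.5.0 style).
        val spark = SparkSession.builder()
          .appName("StreamSqlExample")
          .master("local[*]")
          .getOrCreateCarbonSession("/tmp/carbon_store")

        // Source table: Kafka topic carrying JSON records.
        spark.sql(
          """CREATE TABLE kafka_source(id INT, name STRING)
            |STORED AS carbondata
            |TBLPROPERTIES(
            |  'streaming'='source',
            |  'format'='kafka',
            |  'kafka.bootstrap.servers'='localhost:9092',
            |  'subscribe'='demo_topic',
            |  'record_format'='json')""".stripMargin)

        // Sink table: a streaming CarbonData table.
        spark.sql(
          """CREATE TABLE carbon_sink(id INT, name STRING)
            |STORED AS carbondata
            |TBLPROPERTIES('streaming'='true')""".stripMargin)

        // Start a continuous ingest job from the source into the sink.
        spark.sql(
          """CREATE STREAM ingest_job ON TABLE carbon_sink
            |STMPROPERTIES('trigger'='ProcessingTime', 'interval'='10 seconds')
            |AS SELECT * FROM kafka_source""".stripMargin)
      }
    }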
Min/Max Index Support for Streaming Segment

CarbonData now supports generating Min/Max indexes for streaming segments so that filter pruning is more efficient, improving query performance. CarbonData serves queries fast thanks to the Min/Max indexes built at various levels; adding Min/Max index support to streaming segments enables CarbonData to serve queries over them with the same performance as over other, columnar segments.

Debugging and Maintenance enhancements

Data Summary Tool

CarbonData now ships a CLI tool to retrieve statistical information from each CarbonData file. It can list various parameters such as the number of blocklets, pages, encoding types, and Min/Max indexes. This tool is useful for identifying the reason a block/blocklet was selected during pruning; looking at the Min/Max indexes, users can easily decide on a blocklet size that avoids false positives. Scan performance benchmarking is also supported: users can measure the time taken to scan each blocklet for a particular column. An invocation sketch follows.
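The tool can also be invoked programmatically, as sketched below. The class name (org.apache.carbondata.tool.CarbonCli) and the flags (-cmd, -p, -a, -c) are assumptions based on the CLI documentation; verify them with the tool's help output for your build.

    import org.apache.carbondata.tool.CarbonCli

    object SummaryToolExample {
      def main(args: Array[String]): Unit = {
        // Print summary statistics (blocklets, pages, encodings,
        // Min/Max indexes) for files under the given store path.
        CarbonCli.main(Array("-cmd", "summary", "-p", "/tmp/carbon_example", "-a"))

        // Benchmark the scan time per blocklet for one column.
        CarbonCli.main(Array("-cmd", "benchmark", "-p", "/tmp/carbon_example", "-c", "name"))
      }
    }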
Other Improvements

Behavioral Changes

Renaming of Table Names

Earlier, renaming a CarbonData table renamed it in the Hive metastore and also renamed the table folder on HDFS. Now, the table is renamed only in the Hive metastore.

Changed Configuration Default Values

New Configuration Parameters

Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12341006

Sub-task
Bug
New Feature
Improvement
Task
Thanks & Regards,
Ravindra