Today Apache ORC became a top-level project at the Apache Software Foundation. This is a major step forward for the project and reflects its momentum.
Back in January 2013, we created the ORC file format as part of an initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. We added it as a feature of Hive for two reasons:
- To ensure that it would be well integrated with Hive
- To ensure that storing data in ORC format would be as simple as adding “STORED AS ORC” to your table definition (see the sketch after this list).
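As an illustration, here is a minimal sketch of issuing such a table definition through Hive's JDBC driver. The HiveServer2 address, table name, and columns are all hypothetical; the only ORC-specific piece is the STORED AS ORC clause.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateOrcTable {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver (hive-jdbc must be on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Placeholder HiveServer2 address and table definition.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement()) {
      // The only ORC-specific part is the STORED AS ORC clause at the end.
      stmt.execute("CREATE TABLE page_views (url STRING, views BIGINT) "
          + "STORED AS ORC");
    }
  }
}
```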
Over the last two years, many of the features we’ve added to Hive, such as vectorization, ACID, predicate pushdown, and LLAP, have supported ORC first, with support for other storage formats following later.
The growing use and acceptance of ORC has encouraged additional Hadoop execution engines, such as Apache Pig, MapReduce, Cascading, and Apache Spark, to support reading and writing ORC. However, depending on the large Hive jar that contains ORC pulls in many of the other projects that Hive depends on. To better support these non-Hive users, we decided to split ORC off from Hive and make it a separate project. This will not only allow us to keep supporting Hive, but also to provide a much more streamlined jar, documentation, and help for users outside of Hive.
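To give a sense of what that streamlined jar looks like to a non-Hive user, here is a minimal sketch of reading an ORC file with the standalone Java API (the org.apache.orc package that the split produced). The file path is a placeholder, and exact class names may vary between releases.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class ReadOrc {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Open the ORC file; "example.orc" is a placeholder path.
    Reader reader = OrcFile.createReader(new Path("example.orc"),
        OrcFile.readerOptions(conf));
    // Rows come back in vectorized batches, one column vector per field.
    RecordReader rows = reader.rows();
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    while (rows.nextBatch(batch)) {
      System.out.println("read a batch of " + batch.size + " rows");
    }
    rows.close();
  }
}
```

Nothing here requires the full Hive jar on the classpath, which is the point of the split.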
Although Hadoop and its ecosystem are largely written in Java, there are many applications in other languages that would like to natively access ORC files in HDFS. Hortonworks, HP, and Microsoft are developing a pure C++ ORC reader and writer that enables C++ applications to read and write ORC files efficiently without Java. That code will also be moved into Apache ORC and released together with the Java implementation.