ORC adds Gopal Vijayaraghavan to PMC

On behalf of the Apache ORC Project Management Committee (PMC), it gives me great pleasure to announce that Gopal Vijayaraghavan has joined the PMC. Gopal has done an amazing job at speeding up ORC in many ways.

Please join me in welcoming Gopal to the ORC PMC!

Congratulations Gopal!

ORC adds new committers

As part of the removal of the ORC code base from Hive, the ORC PMC has offered to make any existing Hive committers into ORC committers. The new ORC committers coming from Hive are:

  • Aihua Xu
  • Ashutosh Chauhan
  • Chaoyu Tang
  • Chinna Rao Lalam
  • Daniel Dai
  • Eugene Koifman
  • Ferdinand Xu
  • Jason Dere
  • Jesus Camacho Rodriguez
  • Lars Francke
  • Matthew McCline
  • Mithun Radhakrishnan
  • Pengcheng Xiong
  • Rajesh Balamohan
  • Rui Li
  • Sergio Pena
  • Siddharth Seth
  • Vaibhav Gumashta
  • Wei Zheng
  • Yongzhi Chen

ORC 1.2.3 Released

The ORC team is excited to announce the release of ORC v1.2.3. This release fixes some bugs in the Java schema evolution code.

The new features of ORC 1.2:

  • ORC-54 Evolve schemas based on field name rather than index
  • ORC-84 Create a separate java tool module.
  • ORC-77 and ORC-81 Implement LZO and LZ4 compression codecs.
  • ORC-92 Add support for nested column id selection in C++
  • ORC-69 Add batch option support in orc-scan tools.

Important fixes:

  • HIVE-14214 ORC schema evolution and predicate push down do not work together.

Known issues:

  • ORC-40 Predicate push down is not implemented in C++.

ORC 1.2.2 Released

The ORC team is excited to announce the release of ORC v1.2.2.

The new features of ORC 1.2:

  • ORC-54 Evolve schemas based on field name rather than index
  • ORC-84 Create a separate java tool module.
  • ORC-77 and ORC-81 Implement LZO and LZ4 compression codecs.
  • ORC-92 Add support for nested column id selection in C++
  • ORC-69 Add batch option support in orc-scan tools.

Important fixes:

  • HIVE-14214 ORC schema evolution and predicate push down do not work together.

Known issues:

  • ORC-40 Predicate push down is not implemented in C++.

ORC 1.2.1 Released

The ORC team is excited to announce the release of ORC v1.2.1.

The new features of ORC 1.2:

  • ORC-54 Evolve schemas based on field name rather than index
  • ORC-84 Create a separate java tool module.
  • ORC-77 and ORC-81 Implement LZO and LZ4 compression codecs.
  • ORC-92 Add support for nested column id selection in C++
  • ORC-69 Add batch option support in orc-scan tools.

Important fixes:

  • HIVE-14214 ORC schema evolution and predicate push down do not work together.

Known issues:

  • ORC-40 Predicate push down is not implemented in C++.

ORC 1.2.0 Released

The ORC team is excited to announce the release of ORC v1.2.0.

The new features of ORC 1.2:

  • ORC-54 Evolve schemas based on field name rather than index
  • ORC-84 Create a separate java tool module.
  • ORC-77 and ORC-81 Implement LZO and LZ4 compression codecs.
  • ORC-92 Add support for nested column id selection in C++
  • ORC-69 Add batch option support in orc-scan tools.

Important fixes:

  • HIVE-14214 ORC schema evolution and predicate push down do not work together.

Known issues:

  • ORC-40 Predicate push down is not implemented in C++.

  • ORC-101 Bloom filters for string and decimal use inconsistent encoding

ORC 1.1.2 Released

The ORC team is excited to announce the release of ORC v1.1.2. This release contains the Java reader and writer and the native C++ ORC reader and tools.

The major new features in ORC 1.1 are:

  • ORC-1 Copy the Java ORC code from Hive.
  • ORC-10 Fix the C++ reader to correctly read timestamps from timezones with different daylight savings rules.
  • ORC-52 Add mapred and mapreduce connectors.

Known issues:

  • HIVE-14214 Schema evolution and predicate pushdown don’t work together.

  • ORC-40 Predicate push down is not implemented in C++.

  • ORC-101 Bloom filters for string and decimal use inconsistent encoding

File format benchmark

I gave a talk at Hadoop Summit San Jose 2016 about a file format benchmark that I’ve contributed as ORC-72. The benchmark focuses on real data sets that are publicly available. The data sets represent a wide variety of use cases:

  • NYC Taxi Data - very dense data with mostly numeric types
  • Github Archives - very sparse data with a lot of complex structure
  • Sales - a real production schema from a sales table with a synthetic generator

The benchmarks look at a set of three very common use cases:

  • Full table scan - read all columns and rows
  • Column projection - read some columns, but all of the rows
  • Column projection and predicate push down - read some columns and some rows
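To make the difference between the three access patterns concrete, here is a minimal Python sketch over a tiny in-memory column store. It is an illustration only: it does not use the ORC API and is not the ORC-72 benchmark code, and the sample table, column names, and function names are all hypothetical. Real readers typically use stored statistics to skip whole blocks of rows, while this sketch filters row by row for simplicity.

```python
# Illustrative in-memory column store; NOT the ORC API or the ORC-72 benchmark.
# A "table" maps column name -> list of values (hypothetical sample data).
table = {
    "trip_id": [1, 2, 3, 4],
    "passengers": [1, 3, 2, 5],
    "fare": [7.5, 12.0, 30.25, 9.0],
}


def full_scan(table):
    """Full table scan: read all columns and all rows, as row tuples."""
    return list(zip(*table.values()))


def project(table, columns):
    """Column projection: read only the named columns, but every row."""
    return list(zip(*(table[c] for c in columns)))


def project_and_filter(table, columns, predicate_column, predicate):
    """Projection plus predicate: read the named columns for matching rows only."""
    # Evaluate the predicate against a single column to find the row positions to keep.
    keep = [i for i, v in enumerate(table[predicate_column]) if predicate(v)]
    # Touch only the requested columns, and only at the kept positions.
    return [tuple(table[c][i] for c in columns) for i in keep]
```

For example, `project_and_filter(table, ["trip_id", "fare"], "passengers", lambda p: p >= 3)` reads two of the three columns and returns only the rows with at least three passengers.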

You can see the slides here:

File Format Benchmarks: Avro, JSON, ORC, & Parquet

ORC 1.1.1 Released

The ORC team is excited to announce the release of ORC v1.1.1. This release contains the Java reader and writer and the native C++ ORC reader and tools.

The major new features in ORC 1.1 are:

  • ORC-1 Copy the Java ORC code from Hive.
  • ORC-10 Fix the C++ reader to correctly read timestamps from timezones with different daylight savings rules.
  • ORC-52 Add mapred and mapreduce connectors.

Known issues:

  • HIVE-14214 Schema evolution and predicate pushdown don’t work together.

  • ORC-40 Predicate push down is not implemented in C++.

  • ORC-101 Bloom filters for string and decimal use inconsistent encoding

ORC 1.1.0 Released

The ORC team is excited to announce the release of ORC v1.1.0. This release contains the Java reader and writer and the native C++ ORC reader and tools.

The major new features in ORC 1.1 are:

  • ORC-1 Copy the Java ORC code from Hive.
  • ORC-10 Fix the C++ reader to correctly read timestamps from timezones with different daylight savings rules.
  • ORC-52 Add mapred and mapreduce connectors.

Known issues:

  • HIVE-14214 Schema evolution and predicate pushdown don’t work together.

  • ORC-40 Predicate push down is not implemented in C++.

  • ORC-101 Bloom filters for string and decimal use inconsistent encoding

ORC 1.0.0 Released

The ORC team is excited to announce the release of ORC v1.0.0. This release contains the native C++ ORC reader and some tools.

The major features:

  • Portable pure C++ ORC reader
  • The C++ reader is known to work on:
    • CentOS and RHEL 5, 6, and 7
    • Debian 6 and 7
    • Ubuntu 12 and 14
    • Mac OS 10.10 and 10.11
  • A file-contents command that prints the contents of the file as json records.
  • A file-metadata command that prints the metadata of the file.
  • Docker files for building and testing on various Linux distributions.
  • Memory estimation for the reader.

Known issues:

  • ORC-1 We are still working on moving the Java reader and writer out of Hive’s code base and thus it is not included here.

  • ORC-10 When moving ORC files between timezones, different daylight savings rules will cause timestamps to shift in the C++ reader.

  • ORC-40 Predicate push down is not implemented in C++.

ORC adds Aliaksei Sandryhaila to PMC

On behalf of the Apache ORC Project Management Committee (PMC), it gives me great pleasure to announce that Aliaksei Sandryhaila has joined the Apache ORC PMC. He has done a lot of good work on ORC and I’m looking forward to more.

Please join me in welcoming Aliaksei to the ORC PMC!

Congratulations Aliaksei!

ORC adopts new logo

The ORC project has adopted a new logo. We hope you like it.

orc logo

Other great options included a big white hand on a black shield.

ORC adds 7 committers

The ORC project management committee today added seven new committers for their work on ORC. Welcome all!

  • Gunther Hagleitner
  • Aliaksei Sandryhaila
  • Sergey Shelukhin
  • Gopal Vijayaraghavan
  • Stephen Walkauskas
  • Kevin Wilfong
  • Xuefu Zhang

ORC becomes an Apache Top Level Project

Today Apache ORC became a top level project at the Apache Software Foundation. This is a major step forward for the project and reflects its momentum.

Back in January 2013, we created ORC files as part of the initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. We added it as a feature of Hive for two reasons:

  1. To ensure that it would be well integrated with Hive
  2. To ensure that storing data in ORC format would be as simple as adding “stored as ORC” to your table definition.

In the last two years, many of the features that we’ve added to Hive, such as vectorization, ACID, predicate push down and LLAP, support ORC first, and follow up with other storage formats later.

The growing use and acceptance of ORC has encouraged additional Hadoop execution engines, such as Apache Pig, Map-Reduce, Cascading, and Apache Spark, to support reading and writing ORC. However, there are concerns that depending on the large Hive jar that contains ORC also pulls in many other projects that Hive depends on. To better support these non-Hive users, we decided to split off from Hive and become a separate project. This will allow us not only to support Hive, but also to provide a much more streamlined jar, documentation, and help for users outside of Hive.

Although Hadoop and its ecosystem are largely written in Java, there are a lot of applications in other languages that would like to natively access ORC files in HDFS. Hortonworks, HP, and Microsoft are developing a pure C++ ORC reader and writer that enables C++ applications to read and write ORC files efficiently without Java. That code will also be moved into Apache ORC and released together with the Java implementation.

ORC Twitter

The official @ApacheOrc Twitter account pushes announcements about ORC. If you give a talk about ORC, let us know and we'll tweet it out and add it to the news section of the website.