If your company or tool uses ORC, please let us know so that we can update this page.
ORC files have long supported reading and writing through Hadoop’s MapReduce, but with the ORC 1.1.0 release it is now easier than ever without pulling in Hive’s exec jar and all of its dependencies. OrcStruct now also implements WritableComparable and can be serialized through the MapReduce shuffle.
Apache Spark has added support for reading and writing ORC files, including column projection and predicate push down.
Apache Arrow supports reading and writing the ORC file format.
Apache Flink supports the ORC format in its Table API for reading and writing ORC files.
Apache Iceberg supports ORC as a data file format, so Iceberg tables can store their data as ORC files.
Apache Druid provides an ORC extension to ingest and understand the Apache ORC data format.
Apache Hive was the original use case and home for ORC. ORC’s strong type system, advanced compression, column projection, predicate push down, and vectorization support make it the best-performing format for your data in Hive.
Apache Impala supports reading from ORC-format Hive tables by leveraging the ORC C++ library.
Apache Gobblin supports writing data to ORC files by leveraging Apache Hive’s SerDe library.
Apache NiFi is adding support for writing ORC files.
Apache Pig added support for reading and writing ORC files in Pig 0.14.0.
EEL is a Scala BigData API that supports reading and writing data in various file formats and storage systems, including to and from ORC. It is designed as an in-process, low-level API for manipulating data. Data is lazily streamed from source to sink using standard Scala operations such as map, flatMap, and filter, which makes EEL especially suited to ETL-style applications. EEL supports ORC predicate and projection pushdowns and correctly handles conversions from other formats, including complex types such as maps, lists, and nested structs. A typical use case would be to extract data from JDBC to ORC files housed in HDFS, or directly into Hive tables backed by the ORC file format.
With more than 300 PB of data, Facebook was an early adopter of ORC and quickly put it into production.
LinkedIn uses the ORC file format with the Apache Iceberg metadata catalog and Apache Gobblin to provide its data customers with high query performance.
Trino (formerly Presto SQL)
The Trino team has done significant work integrating ORC into their SQL engine.
Timber adopted ORC for its S3-based logging platform, which stores petabytes of log data. ORC has been key in ensuring a fast, cost-effective strategy for persisting and querying that data.
HPE Vertica has contributed significantly to the ORC C++ library. ORC is a significant part of Vertica SQL-on-Hadoop (VSQLoH) which brings the performance, reliability and standards compliance of the Vertica Analytic Database to the Hadoop ecosystem.