I gave a talk at Hadoop Summit San Jose 2016 about a file format benchmark that I’ve contributed as ORC-72. The benchmark focuses on real data sets that are publicly available. The data sets represent a wide variety of use cases:

  • NYC Taxi Data - very dense data with mostly numeric types
  • GitHub Archives - very sparse data with a lot of complex structure
  • Sales - synthetic data generated from a real production sales table schema

The benchmarks cover three very common use cases:

  • Full table scan - read all columns and rows
  • Column projection - read some columns, but all of the rows
  • Column projection and predicate push down - read some columns and some rows (see the sketch after this list)
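To make the last two cases concrete, here is a minimal sketch of what column projection and predicate push down look like with the ORC core Java reader. This is only an illustration of the concepts, not the benchmark code from ORC-72; the file name taxi.orc and the column fare_amount are hypothetical, and the schema is assumed to be flat so that field i maps to column id i + 1.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class ProjectedRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical file name; any ORC file with a flat schema works.
    Reader reader = OrcFile.createReader(new Path("taxi.orc"),
        OrcFile.readerOptions(conf));
    TypeDescription schema = reader.getSchema();

    // Column projection: mark only the root struct and the columns we need.
    // Assumes a flat schema, where field i has column id i + 1.
    boolean[] include = new boolean[schema.getMaximumId() + 1];
    include[0] = true;
    include[schema.getFieldNames().indexOf("fare_amount") + 1] = true;

    // Predicate push down: describe the filter so the reader can skip
    // row groups whose statistics cannot possibly match.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
        .lessThan("fare_amount", PredicateLeaf.Type.FLOAT, 10.0)
        .end()
        .build();

    Reader.Options options = new Reader.Options(conf)
        .include(include)
        .searchArgument(sarg, new String[]{"fare_amount"});

    // Read only the selected columns and rows, batch by batch.
    VectorizedRowBatch batch = schema.createRowBatch();
    RecordReader rows = reader.rows(options);
    while (rows.nextBatch(batch)) {
      // batch.size rows of the projected columns are available here.
    }
    rows.close();
  }
}
```

Because columnar formats like ORC and Parquet keep min/max statistics per row group, the search argument lets the reader skip entire row groups that cannot satisfy the predicate, which is why the third use case can be so much cheaper than a full table scan.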

You can see the slides here:

File Format Benchmarks: Avro, JSON, ORC, & Parquet