File format benchmark

28 Jun 2016 omalley

I gave a talk at Hadoop Summit San Jose 2016 about a file format benchmark that I’ve contributed as ORC-72. The benchmark focuses on real data sets that are publicly available. The data sets represent a wide variety of use cases:

NYC Taxi Data - very dense data with mostly numeric types
GitHub Archives - very sparse data with a lot of complex structure
Sales - a real production schema from a sales table with a synthetic generator

The benchmarks look at a set of three very common use cases:

Full table scan - read all columns and rows
Column projection - read some columns, but all of the rows
Column projection and predicate pushdown - read some columns and some rows
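
To make the three access patterns concrete, here is a minimal pure-Python sketch over a toy columnar layout. It is not the benchmark code from ORC-72 and does not use any real file format API: the `Stripe` class, the function names, and the taxi-like column names are all hypothetical. The statistics are also recomputed on the fly here, whereas I'm assuming a real columnar format would read them from stored metadata.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """A horizontal slice of a table, stored column by column (hypothetical layout)."""
    columns: dict  # column name -> list of values, all lists the same length

    def min_max(self, name):
        # Assumption: a real format would store these statistics in file
        # metadata rather than recompute them from the values.
        values = self.columns[name]
        return min(values), max(values)


def full_scan(stripes):
    """Full table scan: read all columns and all rows."""
    for stripe in stripes:
        names = list(stripe.columns)
        for row in zip(*(stripe.columns[n] for n in names)):
            yield dict(zip(names, row))


def project(stripes, names):
    """Column projection: read only the named columns, but every row."""
    for stripe in stripes:
        for row in zip(*(stripe.columns[n] for n in names)):
            yield dict(zip(names, row))


def project_with_pushdown(stripes, names, filter_col, lower_bound):
    """Projection plus predicate pushdown for `filter_col >= lower_bound`.

    Stripes whose maximum is below the bound are skipped entirely.
    Returns the matching rows and the number of stripes actually read.
    """
    rows = []
    stripes_read = 0
    for stripe in stripes:
        _, highest = stripe.min_max(filter_col)
        if highest < lower_bound:
            continue  # no row in this stripe can match
        stripes_read += 1
        for i, value in enumerate(stripe.columns[filter_col]):
            if value >= lower_bound:
                rows.append({n: stripe.columns[n][i] for n in names})
    return rows, stripes_read


# Made-up sample data, loosely shaped like taxi trips.
stripes = [
    Stripe({"trip_id": [1, 2, 3], "fare": [5.0, 7.5, 6.0], "passengers": [1, 2, 1]}),
    Stripe({"trip_id": [4, 5, 6], "fare": [20.0, 8.0, 31.5], "passengers": [3, 1, 2]}),
]

all_rows = list(full_scan(stripes))
fares = list(project(stripes, ["fare"]))
expensive, stripes_read = project_with_pushdown(
    stripes, ["trip_id", "fare"], "fare", 15.0
)
```

In this sketch the first stripe is never read for the filtered query, because its largest fare is below the bound. That kind of skipping is what the third use case is meant to exercise.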

You can see the slides here:

File Format Benchmarks: Avro, JSON, ORC, & Parquet