Using Core C++

The C++ Core ORC API reads and writes ORC files into its own orc::ColumnVectorBatch vectorized classes.

Vectorized Row Batch

Data is passed to ORC as instances of orc::ColumnVectorBatch that contain the data for a batch of rows. The focus is on speed and accessing the data fields directly. numElements is the number of rows. ColumnVectorBatch is the parent type of the different kinds of columns and has some fields that are shared across all of the column types. In particular, the hasNulls flag indicates whether there is any null in this column for this batch. For columns where hasNulls == true, the notNull buffer holds false for each value that is null.

namespace orc {
  struct ColumnVectorBatch {
    uint64_t numElements;
    DataBuffer<char> notNull;
    bool hasNulls;
    ...
  };
}
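The null convention can be illustrated with a minimal sketch. MiniLongBatch and sumNonNull are hypothetical names, and std::vector stands in for orc::DataBuffer so that the example builds without the ORC library; it is not the library's own code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for orc::LongVectorBatch: std::vector replaces
// orc::DataBuffer so the sketch builds without the ORC library.
struct MiniLongBatch {
  uint64_t numElements = 0;
  std::vector<char> notNull;   // notNull[i] == 0 means row i is null
  bool hasNulls = false;
  std::vector<int64_t> data;
};

// Sums the non-null values. notNull is consulted only when hasNulls is
// set, so batches without nulls skip the per-row check.
int64_t sumNonNull(const MiniLongBatch& batch) {
  assert(batch.data.size() >= batch.numElements);
  assert(!batch.hasNulls || batch.notNull.size() >= batch.numElements);
  int64_t sum = 0;
  for (uint64_t i = 0; i < batch.numElements; ++i) {
    if (!batch.hasNulls || batch.notNull[i]) {
      sum += batch.data[i];
    }
  }
  return sum;
}
```

Checking hasNulls first matters because the contents of notNull are only meaningful when the flag is set.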

The subtypes of ColumnVectorBatch are:

ORC Type     ColumnVectorBatch
array        ListVectorBatch
binary       StringVectorBatch
bigint       LongVectorBatch
boolean      LongVectorBatch
char         StringVectorBatch
date         LongVectorBatch
decimal      Decimal64VectorBatch, Decimal128VectorBatch
double       DoubleVectorBatch
float        DoubleVectorBatch
int          LongVectorBatch
map          MapVectorBatch
smallint     LongVectorBatch
string       StringVectorBatch
struct       StructVectorBatch
timestamp    TimestampVectorBatch
tinyint      LongVectorBatch
uniontype    UnionVectorBatch
varchar      StringVectorBatch

LongVectorBatch handles all of the integer types (boolean, bigint, date, int, smallint, and tinyint). The data is represented as a buffer of int64_t where each value is sign-extended as necessary.

  struct LongVectorBatch: public ColumnVectorBatch {
    DataBuffer<int64_t> data;
    ...
  };

TimestampVectorBatch handles timestamp values. The data is represented as two buffers of int64_t for seconds and nanoseconds respectively. Note that we always assume the data is in the GMT timezone; therefore it is the user's responsibility to convert wall clock time from the local timezone to GMT.

  struct TimestampVectorBatch: public ColumnVectorBatch {
    DataBuffer<int64_t> data;
    DataBuffer<int64_t> nanoseconds;
    ...
  };
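As a rough sketch of how the two buffers combine, the helper below turns one row's seconds (from data) and nanoseconds into milliseconds since the Unix epoch. toEpochMillis is a hypothetical name, and the sketch assumes the nanosecond part is always in the range [0, 999999999], which is worth verifying against your ORC version.

```cpp
#include <cassert>
#include <cstdint>

// Combines one row's seconds and nanoseconds into milliseconds since the
// Unix epoch (GMT). Assumes nanos is in [0, 999999999], so the nanosecond
// part always moves the result forward in time, even for negative seconds.
int64_t toEpochMillis(int64_t seconds, int64_t nanos) {
  assert(nanos >= 0 && nanos < 1000000000);
  return seconds * 1000 + nanos / 1000000;
}
```

For row r of a batch this would be called as toEpochMillis(batch->data[r], batch->nanoseconds[r]).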

DoubleVectorBatch handles all of the floating point types (double and float). The data is represented as a buffer of doubles.

  struct DoubleVectorBatch: public ColumnVectorBatch {
    DataBuffer<double> data;
    ...
  };

Decimal64VectorBatch handles decimal columns with precision no greater than 18. Decimal128VectorBatch handles the others. The data is represented as a buffer of int64_t and orc::Int128 respectively.

  struct Decimal64VectorBatch: public ColumnVectorBatch {
    DataBuffer<int64_t> values;
    ...
  };

  struct Decimal128VectorBatch: public ColumnVectorBatch {
    DataBuffer<Int128> values;
    ...
  };
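The values buffers hold unscaled integers, so rendering a decimal also needs the column's scale. The sketch below assumes the scale is available alongside the values (the ORC headers appear to expose precision and scale fields on the decimal batches, but treat that as an assumption to check). decimalToString is a hypothetical helper and covers only the int64_t case.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Renders an unscaled decimal value as text, e.g. (12345, 2) -> "123.45".
std::string decimalToString(int64_t unscaled, int32_t scale) {
  assert(scale >= 0);
  const bool negative = unscaled < 0;
  // Negate in unsigned arithmetic so INT64_MIN does not overflow.
  const uint64_t magnitude = negative ? 0 - static_cast<uint64_t>(unscaled)
                                      : static_cast<uint64_t>(unscaled);
  std::string digits = std::to_string(magnitude);
  const size_t fraction = static_cast<size_t>(scale);
  // Left-pad with zeros so at least one digit precedes the decimal point.
  if (digits.size() <= fraction) {
    digits = std::string(fraction - digits.size() + 1, '0') + digits;
  }
  if (fraction > 0) {
    digits.insert(digits.size() - fraction, 1, '.');
  }
  return negative ? "-" + digits : digits;
}
```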

StringVectorBatch handles all of the binary types (binary, char, string, and varchar). The data is represented as a char* buffer, and a length buffer.

  struct StringVectorBatch: public ColumnVectorBatch {
    DataBuffer<char*> data;
    DataBuffer<int64_t> length;
    ...
  };
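Because each value is described by a pointer and a length, reading one means pairing the two buffers. The sketch below uses a hypothetical stand-in, MiniStringBatch, with std::vector in place of orc::DataBuffer. It also assumes the values are not NUL-terminated, which is why the length is used instead of strlen.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for orc::StringVectorBatch.
struct MiniStringBatch {
  uint64_t numElements = 0;
  std::vector<char*> data;       // start of each value
  std::vector<int64_t> length;   // byte length of each value
};

// Copies row `row` into an owning std::string. The length buffer is used
// rather than strlen because values are assumed not to be NUL-terminated.
std::string valueAt(const MiniStringBatch& batch, uint64_t row) {
  assert(row < batch.numElements);
  return std::string(batch.data[row],
                     static_cast<size_t>(batch.length[row]));
}
```

Copying into a std::string is the simple choice here; code that only inspects the bytes can work on the pointer and length directly.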

StructVectorBatch handles the struct columns and represents the data as a vector of pointers to ColumnVectorBatch, one for each field.

  struct StructVectorBatch: public ColumnVectorBatch {
    std::vector<ColumnVectorBatch*> fields;
    ...
  };

UnionVectorBatch handles the union columns. It uses tags to indicate which subtype holds the value for each row, and offsets to indicate the position of that value in the child batch of that subtype. An individual ColumnVectorBatch is used for each subtype.

  struct UnionVectorBatch: public ColumnVectorBatch {
    DataBuffer<unsigned char> tags;
    DataBuffer<uint64_t> offsets;
    std::vector<ColumnVectorBatch*> children;
    ...
  };
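The lookup through tags and offsets can be sketched as follows. MiniUnionBatch is a hypothetical stand-in in which each child batch is simplified to a vector of strings; in the real structure each child is a ColumnVectorBatch of the appropriate subtype.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for orc::UnionVectorBatch. Each child batch is
// simplified to a vector of strings to keep the sketch short.
struct MiniUnionBatch {
  uint64_t numElements = 0;
  std::vector<unsigned char> tags;   // which child holds row i
  std::vector<uint64_t> offsets;     // position of row i inside that child
  std::vector<std::vector<std::string>> children;
};

// Resolves row `row`: the tag selects the child, the offset selects the
// value within that child.
const std::string& unionValueAt(const MiniUnionBatch& batch, uint64_t row) {
  assert(row < batch.numElements);
  const std::vector<std::string>& child = batch.children[batch.tags[row]];
  assert(batch.offsets[row] < child.size());
  return child[batch.offsets[row]];
}
```

Each child holds only the rows tagged for it, so the offset is generally different from the row number.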

ListVectorBatch handles the array columns and represents the data as a buffer of integers for the offsets and a ColumnVectorBatch for the children values.

  struct ListVectorBatch: public ColumnVectorBatch {
    DataBuffer<int64_t> offsets;
    std::unique_ptr<ColumnVectorBatch> elements;
    ...
  };
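A minimal sketch of how the offsets delimit each row's list, assuming the convention that row i owns the child values in the range [offsets[i], offsets[i + 1]), so that offsets has numElements + 1 entries. MiniListBatch is a hypothetical stand-in with the child batch simplified to a vector of integers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for orc::ListVectorBatch with integer elements.
struct MiniListBatch {
  uint64_t numElements = 0;
  std::vector<int64_t> offsets;    // assumed to hold numElements + 1 entries
  std::vector<int64_t> elements;   // flattened child values of all rows
};

// Returns a copy of the list stored in row `row`, taken from the half-open
// range [offsets[row], offsets[row + 1]) of the child values.
std::vector<int64_t> listAt(const MiniListBatch& batch, uint64_t row) {
  assert(row < batch.numElements);
  assert(batch.offsets.size() >= batch.numElements + 1);
  return std::vector<int64_t>(
      batch.elements.begin() + batch.offsets[row],
      batch.elements.begin() + batch.offsets[row + 1]);
}
```

Under the same assumption, an empty list shows up as two equal consecutive offsets, and the offsets of a MapVectorBatch would index its keys and values the same way.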

MapVectorBatch handles the map columns and represents the data as a buffer of integers for the offsets and two ColumnVectorBatches for the keys and values.

  struct MapVectorBatch: public ColumnVectorBatch {
    DataBuffer<int64_t> offsets;
    std::unique_ptr<ColumnVectorBatch> keys;
    std::unique_ptr<ColumnVectorBatch> elements;
    ...
  };

Writing ORC Files

To write an ORC file, you need to include OrcFile.hh and define the schema; then use orc::OutputStream and orc::WriterOptions to create an orc::Writer with the desired filename. This example sets the required schema parameter, but there are many other options to control the ORC writer.

std::unique_ptr<OutputStream> outStream =
  writeLocalFile("my-file.orc");
std::unique_ptr<Type> schema(
  Type::buildTypeFromString("struct<x:int,y:int>"));
WriterOptions options;
std::unique_ptr<Writer> writer =
  createWriter(*schema, outStream.get(), options);

Now you need to create a row batch, set the data, and write it to the file as the batch fills up. When the file is done, close the Writer.

uint64_t batchSize = 1024, rowCount = 10000;
std::unique_ptr<ColumnVectorBatch> batch =
  writer->createRowBatch(batchSize);
StructVectorBatch *root =
  dynamic_cast<StructVectorBatch *>(batch.get());
LongVectorBatch *x =
  dynamic_cast<LongVectorBatch *>(root->fields[0]);
LongVectorBatch *y =
  dynamic_cast<LongVectorBatch *>(root->fields[1]);

uint64_t rows = 0;
for (uint64_t i = 0; i < rowCount; ++i) {
  x->data[rows] = i;
  y->data[rows] = i * 3;
  rows++;

  if (rows == batchSize) {
    root->numElements = rows;
    x->numElements = rows;
    y->numElements = rows;

    writer->add(*batch);
    rows = 0;
  }
}

if (rows != 0) {
  root->numElements = rows;
  x->numElements = rows;
  y->numElements = rows;

  writer->add(*batch);
  rows = 0;
}

writer->close();

Reading ORC Files

To read ORC files, include OrcFile.hh and create an orc::Reader that contains the metadata about the file. There are a few options to the orc::Reader, but far fewer than the writer and none of them are required. The reader has methods for getting the number of rows, schema, compression, etc. from the file.

std::unique_ptr<InputStream> inStream =
  readLocalFile("my-file.orc");
ReaderOptions options;
std::unique_ptr<Reader> reader =
  createReader(std::move(inStream), options);

To get the data, create an orc::RowReader object. By default, the RowReader reads all rows and all columns, but there are options to control the data that is read.

RowReaderOptions rowReaderOptions;
std::unique_ptr<RowReader> rowReader =
  reader->createRowReader(rowReaderOptions);
std::unique_ptr<ColumnVectorBatch> batch =
  rowReader->createRowBatch(1024);

With an orc::RowReader the user can ask for the next batch until there are no more left. The reader will stop the batch at certain boundaries, so the returned batch may not be full, but it will always contain some rows.

while (rowReader->next(*batch)) {
  for (uint64_t r = 0; r < batch->numElements; ++r) {
    // ... process row r from batch
  }
}