Class RunLengthIntegerWriterV2

java.lang.Object
org.apache.orc.impl.RunLengthIntegerWriterV2
All Implemented Interfaces:
IntegerWriter

public class RunLengthIntegerWriterV2 extends Object implements IntegerWriter

A writer that performs lightweight compression over a sequence of integers.

There are four types of lightweight integer compression:

  • SHORT_REPEAT
  • DIRECT
  • PATCHED_BASE
  • DELTA

The description and format of each type are given below.

SHORT_REPEAT: Used for short repeated integer sequences.

  • 1 byte header
    • 2 bits for encoding type
    • 3 bits for bytes required for repeating value
    • 3 bits for repeat count (the stored value is the run length minus MIN_REPEAT)
  • Blob - repeat value (fixed bytes)
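The SHORT_REPEAT layout above can be sketched as follows. This is a minimal illustration, not the writer's actual code: the class and method names are hypothetical, and it assumes that the encoding type code is 0, that the byte-width field stores (width - 1), that MIN_REPEAT is 3, that the value is non-negative, and that the repeat value is written most significant byte first.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of a SHORT_REPEAT run; not the org.apache.orc implementation.
final class ShortRepeatSketch {
    static final int MIN_REPEAT = 3;             // assumption: shortest run this encoding accepts
    static final int ENCODING_SHORT_REPEAT = 0;  // assumption: 2-bit encoding type code

    // Encodes a non-negative value repeated runLength times (3 to 10 under the assumptions above).
    static byte[] encode(long value, int runLength) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        if (runLength < MIN_REPEAT || runLength > MIN_REPEAT + 7) {
            throw new IllegalArgumentException("run length does not fit in 3 bits");
        }
        // Number of bytes needed to hold the value (at least 1).
        int bits = 64 - Long.numberOfLeadingZeros(value);
        int numBytes = Math.max(1, (bits + 7) / 8);

        // Header: 2 bits type | 3 bits (numBytes - 1) | 3 bits (runLength - MIN_REPEAT).
        int header = (ENCODING_SHORT_REPEAT << 6)
                | ((numBytes - 1) << 3)
                | (runLength - MIN_REPEAT);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header);
        // Blob: the repeat value in a fixed number of bytes, most significant byte first (assumed).
        for (int i = numBytes - 1; i >= 0; i--) {
            out.write((int) (value >>> (8 * i)) & 0xff);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : encode(10000L, 5)) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim()); // prints "0a 27 10"
    }
}
```

Under these assumptions, five repetitions of 10000 take three bytes: one header byte and two bytes for the value.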

DIRECT: Used for random integer sequences whose required bit width does not vary much.

  • 2-byte header (1st byte)
    • 2 bits for encoding type
    • 5 bits for fixed bit width of values in blob
    • 1 bit for storing MSB of run length
  • 2nd byte
    • 8 bits for lower run length bits
  • Blob - stores the direct values using a fixed bit width. The length of the data blob is (fixed width * run length) bits.
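As a rough sketch of how the two DIRECT header bytes might be packed, consider the following. The names are hypothetical. It assumes the encoding type code is 1 and that the 9-bit run length field stores (run length - 1). The 5-bit width field is passed in as an already-encoded code (`widthCode`), since the mapping from bit widths to codes is not described here. Rounding the blob up to whole bytes is also an assumption.

```java
// Hypothetical sketch of the 2-byte DIRECT header; not the org.apache.orc implementation.
final class DirectHeaderSketch {
    static final int ENCODING_DIRECT = 1; // assumption: 2-bit encoding type code

    // widthCode: the 5-bit code for the fixed bit width (0 to 31).
    // runLength: number of values in the run (1 to 512 under the assumption below).
    static byte[] header(int widthCode, int runLength) {
        if (widthCode < 0 || widthCode > 31) {
            throw new IllegalArgumentException("width code does not fit in 5 bits");
        }
        if (runLength < 1 || runLength > 512) {
            throw new IllegalArgumentException("run length does not fit in 9 bits");
        }
        int stored = runLength - 1; // assumption: the field holds run length minus 1
        // 1st byte: 2 bits type | 5 bits width code | 1 bit MSB of the run length field.
        int first = (ENCODING_DIRECT << 6) | (widthCode << 1) | (stored >>> 8);
        // 2nd byte: the lower 8 bits of the run length field.
        int second = stored & 0xff;
        return new byte[] {(byte) first, (byte) second};
    }

    // Blob size: (fixed width * run length) bits, rounded up to whole bytes (assumed padding).
    static int blobBytes(int fixedWidthBits, int runLength) {
        return (fixedWidthBits * runLength + 7) / 8;
    }

    public static void main(String[] args) {
        byte[] h = header(15, 4);
        System.out.println(String.format("%02x %02x", h[0] & 0xff, h[1] & 0xff)); // prints "5e 03"
        System.out.println(blobBytes(16, 4)); // prints "8"
    }
}
```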

PATCHED_BASE: Used for random integer sequences whose required bit width varies beyond a threshold.

  • 4-byte header (1st byte)
    • 2 bits for encoding type
    • 5 bits for fixed bit width of values in blob
    • 1 bit for storing MSB of run length
  • 2nd byte
    • 8 bits for lower run length bits
  • 3rd byte
    • 3 bits for bytes required to encode base value
    • 5 bits for patch width
  • 4th byte
    • 3 bits for patch gap width
    • 5 bits for patch length
  • Base value - Stored using a fixed number of bytes. If the MSB is set, the base value is negative; otherwise it is positive. The length of the base value is (base width * 8) bits.
  • Data blob - Base-reduced values are stored using a fixed bit width. The length of the data blob is (fixed width * run length) bits.
  • Patch blob - A list of gap and patch value pairs. Each entry in the patch list is (patch width + patch gap width) bits long. The gap between subsequent elements to be patched is stored in the upper part of the entry, whereas the patch value is stored in the lower part. The length of the patch blob is ((patch width + patch gap width) * patch length) bits.
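The base value and patch entry layouts described above can be illustrated with a small sketch. All names here are hypothetical. The sketch assumes the base value bytes are stored most significant byte first, and that a patch entry fits in a single `long`.

```java
// Hypothetical sketch of PATCHED_BASE pieces; not the org.apache.orc implementation.
final class PatchedBaseSketch {

    // Decodes a sign-magnitude base value: MSB set means negative.
    // Assumption: bytes are ordered most significant first; raw.length is 1 to 8.
    static long decodeBase(byte[] raw) {
        long v = 0;
        for (byte b : raw) {
            v = (v << 8) | (b & 0xff);
        }
        long signMask = 1L << (raw.length * 8 - 1);
        if ((v & signMask) != 0) {
            v = -(v & ~signMask); // clear the sign bit, then negate the magnitude
        }
        return v;
    }

    // Packs one patch list entry: gap in the upper bits, patch value in the lower bits.
    static long packEntry(int gap, long patch, int patchWidth) {
        if (patchWidth < 1 || patchWidth > 63) {
            throw new IllegalArgumentException("patch width out of range for this sketch");
        }
        if (patch < 0 || patch >= (1L << patchWidth)) {
            throw new IllegalArgumentException("patch does not fit in patch width");
        }
        return ((long) gap << patchWidth) | patch;
    }

    static int gapOf(long entry, int patchWidth) {
        return (int) (entry >>> patchWidth);
    }

    static long patchOf(long entry, int patchWidth) {
        return entry & ((1L << patchWidth) - 1);
    }

    // Length of the patch blob in bits: (patch width + patch gap width) * patch length.
    static long patchBlobBits(int patchWidth, int gapWidth, int patchLength) {
        return (long) (patchWidth + gapWidth) * patchLength;
    }

    public static void main(String[] args) {
        long entry = packEntry(3, 5L, 4);
        System.out.println(entry);                  // prints "53"
        System.out.println(patchBlobBits(4, 2, 3)); // prints "18"
    }
}
```

For example, with a 4-bit patch width, a gap of 3 and a patch value of 5 pack into the entry 0b11_0101, and both fields can be recovered with `gapOf` and `patchOf`.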

DELTA: Used for monotonically increasing or decreasing sequences, sequences with fixed delta values, or long repeated sequences.

  • 2-byte header (1st byte)
    • 2 bits for encoding type
    • 5 bits for fixed bit width of values in blob
    • 1 bit for storing MSB of run length
  • 2nd byte
    • 8 bits for lower run length bits
  • Base value - zigzag encoded value written as varint
  • Delta base - zigzag encoded value written as varint
  • Delta blob - only positive values. Monotonicity and ordering are decided based on the signs of the base value and delta base.
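The "zigzag encoded value written as varint" step used for the base value and delta base can be sketched as below. The class and method names are hypothetical. The sketch assumes the common base-128 varint layout (7 payload bits per byte, least significant group first, high bit set on every byte except the last), which may differ in detail from what the writer actually emits.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of zigzag + varint encoding; not the org.apache.orc implementation.
final class ZigZagVarintSketch {

    // Zigzag maps signed values to unsigned ones: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigzag(long v) {
        return (v << 1) ^ (v >> 63);
    }

    // Base-128 varint: emit 7 bits at a time, low group first (assumed layout).
    static byte[] varint(long u) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((u & ~0x7fL) != 0) {
            out.write((int) ((u & 0x7f) | 0x80)); // continuation bit set
            u >>>= 7;
        }
        out.write((int) u); // final group, continuation bit clear
        return out.toByteArray();
    }

    static byte[] signedVarint(long v) {
        return varint(zigzag(v));
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : signedVarint(150L)) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim()); // prints "ac 02"
    }
}
```

Zigzag keeps small negative numbers small after encoding, so a delta base such as -2 occupies a single varint byte.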