ORC Java configuration

Configuration properties

Key Default Notes
orc.stripe.size 67108864 Define the default ORC stripe size, in bytes.
orc.stripe.row.count 2147483647 This value limits the row count in one stripe. The number of rows per stripe falls in the range (0, "orc.stripe.row.count" + max(batchSize, "orc.rows.between.memory.checks")).
orc.block.size 268435456 Define the default file system block size for ORC files.
orc.create.index true Should the ORC writer create indexes as part of the file.
orc.row.index.stride 10000 Define the default ORC index stride in number of rows. (Stride is the number of rows an index entry represents.)
orc.compress.size 262144 Define the default ORC buffer size, in bytes.
orc.base.delta.ratio 8 The ratio of base writer and delta writer in terms of STRIPE_SIZE and BUFFER_SIZE.
orc.block.padding true Define whether stripes should be padded to the HDFS block boundaries.
orc.compress ZSTD Define the default compression codec for ORC files. It can be one of NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD, or BROTLI.
orc.write.format 0.12 Define the version of the file to write. Possible values are 0.11 and 0.12. If this parameter is not defined, ORC will use the run length encoding (RLE) introduced in Hive 0.12.
orc.buffer.size.enforce false Defines whether to enforce ORC compression buffer size.
orc.encoding.strategy SPEED Define the encoding strategy to use while writing data. Changing this only affects the lightweight encoding for integers. This flag does not change the compression level of higher-level compression codecs (like ZLIB).
orc.compression.strategy SPEED Define the compression strategy to use while writing data. This changes the compression level of higher-level compression codecs (like ZLIB).
orc.compression.zstd.level 3 Define the compression level to use with the ZStandard codec while writing data. The valid range is 1 to 22.
orc.compression.zstd.windowlog 0 Set the maximum allowed back-reference distance for the ZStandard codec, expressed as a power of 2.
orc.block.padding.tolerance 0.05 Define the tolerance for block padding as a decimal fraction of stripe size (for example, the default value 0.05 is 5% of the stripe size). For the defaults of a 64MB ORC stripe and 256MB HDFS blocks, the default block padding tolerance of 5% reserves a maximum of 3.2MB for padding within the 256MB block. In that case, if the available size within the block is more than 3.2MB, a new smaller stripe will be inserted to fit within that space. This makes sure that no written stripe crosses block boundaries and causes remote reads within a node-local task.
orc.bloom.filter.fpp 0.01 Define the default false positive probability for bloom filters.
orc.use.zerocopy false Use zerocopy reads with ORC. (This requires Hadoop 2.3 or later.)
orc.skip.corrupt.data false If the ORC reader encounters corrupt data, this value determines whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception.
orc.tolerate.missing.schema true Writers earlier than HIVE-4243 may have inaccurate schema metadata. This setting enables best-effort schema evolution rather than rejecting mismatched schemas.
orc.memory.pool 0.5 Maximum fraction of heap that can be used by ORC file writers.
orc.dictionary.key.threshold 0.8 If the number of distinct keys in a dictionary is greater than this fraction of the total number of non-null rows, turn off dictionary encoding. Use 1 to always use dictionary encoding.
orc.dictionary.early.check true If enabled, the dictionary check happens after the first row index stride (default 10,000 rows); otherwise, it happens before writing the first stripe. In both cases, the decision to use a dictionary or not is retained thereafter.
orc.dictionary.implementation rbtree The implementation of the dictionary used for string-type column encoding. The choices are: rbtree - use a red-black tree as the dictionary implementation. hash - use a hash table as the dictionary implementation.
orc.bloom.filter.columns List of columns to create bloom filters for when writing.
orc.bloom.filter.write.version utf8 (Deprecated) Which version of the bloom filters should we write. The choices are: original - writes two versions of the bloom filters for use by both old and new readers. utf8 - writes just the new bloom filters.
orc.bloom.filter.ignore.non-utf8 false Should the reader ignore the obsolete non-UTF8 bloom filters.
orc.max.file.length 9223372036854775807 The maximum size of the file to read for finding the file tail. This is primarily used for streaming ingest to read intermediate footers while the file is still open.
orc.mapred.input.schema null The schema that the user desires to read. The values are interpreted using TypeDescription.fromString.
orc.mapred.map.output.key.schema null The schema of the MapReduce shuffle key. The values are interpreted using TypeDescription.fromString.
orc.mapred.map.output.value.schema null The schema of the MapReduce shuffle value. The values are interpreted using TypeDescription.fromString.
orc.mapred.output.schema null The schema that the user desires to write. The values are interpreted using TypeDescription.fromString.
orc.include.columns null The list of comma-separated column ids that should be read, with 0 being the first column, 1 being the next, and so on.
orc.kryo.sarg null The kryo and base64 encoded SearchArgument for predicate pushdown.
orc.kryo.sarg.buffer 8192 The kryo buffer size for SearchArgument for predicate pushdown.
orc.sarg.column.names null The list of column names for the SearchArgument.
orc.force.positional.evolution false Require schema evolution to match the top level columns using position rather than column names. This provides backwards compatibility with Hive 2.1.
orc.force.positional.evolution.level 1 Require schema evolution to match the defined number of levels of columns using position rather than column names. This provides backwards compatibility with Hive 2.1.
orc.rows.between.memory.checks 5000 How often should MemoryManager check the memory sizes? Measured in rows added to all of the writers. The valid range is [1, 10000] and is primarily meant for testing. Setting this too low may negatively affect performance. Use orc.stripe.row.count instead if the value is larger than orc.stripe.row.count.
orc.overwrite.output.file false A boolean flag to enable overwriting of the output file if it already exists.
orc.schema.evolution.case.sensitive true A boolean flag to determine if the comparison of field names in schema evolution is case-sensitive.
orc.sarg.to.filter false A boolean flag to determine if a SArg is allowed to become a filter.
orc.filter.use.selected false A boolean flag to determine if the selected vector is supported by the reading application. If false, the output of the ORC reader must have the filter reapplied to avoid using unset values in the unselected rows. If unsure, leave this as false.
orc.filter.plugin false Enables the use of plugin filters during read. Plugin filters are discovered via the service org.apache.orc.filter.PluginFilterService; if multiple filters are found, they are combined using AND. The order of application is non-deterministic, and the filter functionality should not depend on the order of application.
orc.filter.plugin.allowlist * A list of comma-separated class names. If specified it restricts the PluginFilters to just these classes as discovered by the PluginFilterService. The default of * allows all discovered classes and an empty string would not allow any plugins to be applied.
orc.write.variable.length.blocks false A boolean flag whether the ORC writer should write variable length HDFS blocks.
orc.column.encoding.direct Comma-separated list of columns for which dictionary encoding is to be skipped.
orc.max.disk.range.chunk.limit 2147482623 When reading stripes >2GB, specify max limit for the chunk size.
orc.min.disk.seek.size 0 When determining contiguous reads, gaps within this size are read contiguously and not seeked. The default value of zero disables this optimization.
orc.min.disk.seek.size.tolerance 0.0 Define the tolerance for extra bytes read as a result of orc.min.disk.seek.size. If the (bytesRead - bytesNeeded) / bytesNeeded is greater than this threshold then extra work is performed to drop the extra bytes from memory after the read.
orc.encrypt null The list of keys and columns to encrypt with.
orc.mask null The masks to apply to the encrypted columns.
orc.key.provider hadoop The kind of KeyProvider to use for encryption.
orc.proleptic.gregorian false Should we read and write dates & times using the proleptic Gregorian calendar instead of the hybrid Julian-Gregorian calendar? Hive before 3.1 and Spark before 3.0 used the hybrid calendar.
orc.proleptic.gregorian.default false This value controls whether pre-ORC-27 files are using the hybrid or the proleptic calendar. Only Hive 3.1 and the C++ library wrote using the proleptic calendar, so hybrid is the default.
orc.row.batch.size 1024 The number of rows to include in an ORC vectorized reader batch. The value should be carefully chosen to minimize overhead and avoid OOMs when reading data.
orc.row.child.limit 32768 The maximum number of child elements to buffer before the ORC row writer writes the batch to the file.
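
Any key from the table can be set on a Hadoop Configuration before a reader or writer is constructed, and many writer-side keys can also be set per file through OrcFile.WriterOptions (the OrcConf enum in orc-core exposes the same keys programmatically). The following is a minimal writer-side sketch, not a prescribed recipe: the output path, schema, column names, and class name are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Keys from the table, set directly on the Configuration; they apply to
    // any reader or writer created from this conf.
    conf.set("orc.compress", "ZSTD");
    conf.set("orc.compression.zstd.level", "3");
    conf.setBoolean("orc.overwrite.output.file", true);

    // Illustrative schema and output path (assumptions, not from the docs).
    TypeDescription schema = TypeDescription.fromString("struct<x:bigint,y:bigint>");

    // Writer-side keys can also be set per file through WriterOptions, which
    // override the Configuration values for this writer only.
    Writer writer = OrcFile.createWriter(new Path("/tmp/example.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .stripeSize(64L * 1024 * 1024)     // orc.stripe.size
            .rowIndexStride(10000)             // orc.row.index.stride
            .compress(CompressionKind.ZSTD)    // orc.compress
            .bloomFilterColumns("x")           // orc.bloom.filter.columns
            .bloomFilterFpp(0.01));            // orc.bloom.filter.fpp

    // Write a few rows using the vectorized batch API (default batch is 1024 rows).
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector x = (LongColumnVector) batch.cols[0];
    LongColumnVector y = (LongColumnVector) batch.cols[1];
    for (long r = 0; r < 10; ++r) {
      int row = batch.size++;
      x.vector[row] = r;
      y.vector[row] = r * 2;
    }
    writer.addRowBatch(batch);
    writer.close();
  }
}
```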
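
The reader-side keys (orc.sarg.to.filter, orc.row.batch.size, and the orc.kryo.sarg / orc.sarg.column.names pair used to carry a SearchArgument into MapReduce jobs) relate to predicate pushdown. Below is a small read-side sketch under the same assumptions as above (file path, schema, and the filtered column "x" are illustrative).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class OrcReadConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Reader-side keys from the table, set on the Configuration.
    conf.setBoolean("orc.sarg.to.filter", true);   // allow the SArg to become a filter
    conf.setInt("orc.row.batch.size", 1024);       // orc.row.batch.size

    Reader reader = OrcFile.createReader(new Path("/tmp/example.orc"),
        OrcFile.readerOptions(conf));

    // A SearchArgument describes the predicate to push down; stripes and row
    // groups whose statistics cannot satisfy x < 5 may be skipped.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
        .lessThan("x", PredicateLeaf.Type.LONG, 5L)
        .end()
        .build();

    TypeDescription readSchema = TypeDescription.fromString("struct<x:bigint,y:bigint>");
    RecordReader rows = reader.rows(
        reader.options()
            .schema(readSchema)
            .searchArgument(sarg, new String[]{"x"}));

    VectorizedRowBatch batch = readSchema.createRowBatch();
    while (rows.nextBatch(batch)) {
      System.out.println("read " + batch.size + " rows");
    }
    rows.close();
  }
}
```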