ORC Java configuration

Configuration properties

Key Default Notes
orc.stripe.size 67108864 Define the default ORC stripe size, in bytes.
orc.stripe.row.count 2147483647 This value limits the row count in one stripe. The number of rows per stripe falls in the range (0, "orc.stripe.row.count" + max(batchSize, "orc.rows.between.memory.checks")).
orc.block.size 268435456 Define the default file system block size for ORC files.
orc.create.index true Should the ORC writer create indexes as part of the file.
orc.row.index.stride 10000 Define the default ORC index stride in number of rows. (Stride is the number of rows an index entry represents.)
orc.compress.size 262144 Define the default ORC buffer size, in bytes.
orc.base.delta.ratio 8 The ratio of base writer and delta writer in terms of STRIPE_SIZE and BUFFER_SIZE.
orc.block.padding true Define whether stripes should be padded to the HDFS block boundaries.
orc.compress ZSTD Define the default compression codec for ORC files. It can be one of NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD, or BROTLI.
orc.write.format 0.12 Define the version of the file to write. Possible values are 0.11 and 0.12. If this parameter is not defined, ORC will use the run length encoding (RLE) introduced in Hive 0.12.
orc.buffer.size.enforce false Defines whether to enforce ORC compression buffer size.
orc.encoding.strategy SPEED Define the encoding strategy to use while writing data. Changing this only affects the lightweight encoding for integers. This flag does not change the compression level of higher-level compression codecs (like ZLIB).
orc.compression.strategy SPEED Define the compression strategy to use while writing data. This changes the compression level of higher-level compression codecs (like ZLIB).
orc.compression.zstd.level 3 Define the compression level to use with the ZStandard codec while writing data. The valid range is 1 to 22.
orc.compression.zstd.windowlog 0 Set the maximum allowed back-reference distance for the ZStandard codec, expressed as a power of 2.
orc.block.padding.tolerance 0.05 Define the tolerance for block padding as a decimal fraction of stripe size (for example, the default value 0.05 is 5% of the stripe size). For the defaults of a 64MB ORC stripe and 256MB HDFS blocks, the default block padding tolerance of 5% reserves a maximum of 3.2MB for padding within the 256MB block. In that case, if the available size within the block is more than 3.2MB, a new smaller stripe will be inserted to fit within that space. This makes sure that no written stripe crosses block boundaries and causes remote reads within a node-local task.
orc.bloom.filter.fpp 0.01 Define the default false positive probability for bloom filters.
orc.use.zerocopy false Use zerocopy reads with ORC. (This requires Hadoop 2.3 or later.)
orc.skip.corrupt.data false If the ORC reader encounters corrupt data, this value determines whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception.
orc.tolerate.missing.schema true Writers earlier than HIVE-4243 may have inaccurate schema metadata. This setting enables best-effort schema evolution rather than rejecting mismatched schemas.
orc.memory.pool 0.5 Maximum fraction of heap that can be used by ORC file writers.
orc.dictionary.key.threshold 0.8 If the number of distinct keys in a dictionary is greater than this fraction of the total number of non-null rows, turn off dictionary encoding. Use 1 to always use dictionary encoding.
orc.dictionary.early.check true If enabled, the dictionary check happens after the first row index stride (default 10,000 rows); otherwise, it happens before writing the first stripe. In both cases, the decision to use a dictionary or not is retained thereafter.
orc.dictionary.implementation rbtree The implementation of the dictionary used for string-type column encoding. The choices are: rbtree - use a red-black tree as the dictionary implementation. hash - use a hash table as the dictionary implementation.
orc.bloom.filter.columns List of columns to create bloom filters for when writing.
orc.bloom.filter.write.version utf8 (Deprecated) Which version of the bloom filters should we write. The choices are: original - writes two versions of the bloom filters for use by both old and new readers. utf8 - writes just the new bloom filters.
orc.bloom.filter.ignore.non-utf8 false Should the reader ignore the obsolete non-UTF8 bloom filters.
orc.max.file.length 9223372036854775807 The maximum size of the file to read for finding the file tail. This is primarily used for streaming ingest to read intermediate footers while the file is still open.
orc.mapred.input.schema null The schema that the user desires to read. The values are interpreted using TypeDescription.fromString.
orc.mapred.map.output.key.schema null The schema of the MapReduce shuffle key. The values are interpreted using TypeDescription.fromString.
orc.mapred.map.output.value.schema null The schema of the MapReduce shuffle value. The values are interpreted using TypeDescription.fromString.
orc.mapred.output.schema null The schema that the user desires to write. The values are interpreted using TypeDescription.fromString.
orc.include.columns null The list of comma-separated column ids that should be read, with 0 being the first column, 1 being the next, and so on.
orc.kryo.sarg null The kryo and base64 encoded SearchArgument for predicate pushdown.
orc.kryo.sarg.buffer 8192 The kryo buffer size for SearchArgument for predicate pushdown.
orc.sarg.column.names null The list of column names for the SearchArgument.
orc.force.positional.evolution false Require schema evolution to match the top level columns using position rather than column names. This provides backwards compatibility with Hive 2.1.
orc.force.positional.evolution.level 1 Require schema evolution to match the defined number of levels of columns using position rather than column names. This provides backwards compatibility with Hive 2.1.
orc.rows.between.memory.checks 5000 How often should MemoryManager check the memory sizes? Measured in rows added to all of the writers. The valid range is [1, 10000] and is primarily meant for testing. Setting this too low may negatively affect performance. Use orc.stripe.row.count instead if the value is larger than orc.stripe.row.count.
orc.overwrite.output.file false A boolean flag to enable overwriting of the output file if it already exists.
orc.schema.evolution.case.sensitive true A boolean flag to determine if the comparison of field names in schema evolution is case-sensitive.
orc.sarg.to.filter false A boolean flag to determine if a SArg is allowed to become a filter.
orc.filter.use.selected false A boolean flag to determine if the selected vector is supported by the reading application. If false, the output of the ORC reader must have the filter reapplied to avoid using unset values in the unselected rows. If unsure, leave this as false.
orc.filter.plugin false Enables the use of plugin filters during read. Plugin filters are discovered via the service org.apache.orc.filter.PluginFilterService; if multiple filters are found, they are combined using AND. The order of application is non-deterministic, and the filter functionality should not depend on the order of application.
orc.filter.plugin.allowlist * A list of comma-separated class names. If specified it restricts the PluginFilters to just these classes as discovered by the PluginFilterService. The default of * allows all discovered classes and an empty string would not allow any plugins to be applied.
orc.write.variable.length.blocks false A boolean flag whether the ORC writer should write variable length HDFS blocks.
orc.column.encoding.direct Comma-separated list of columns for which dictionary encoding is to be skipped.
orc.max.disk.range.chunk.limit 2147482623 When reading stripes >2GB, specify max limit for the chunk size.
orc.min.disk.seek.size 0 When determining contiguous reads, gaps within this size are read contiguously and not seeked. The default value of zero disables this optimization.
orc.min.disk.seek.size.tolerance 0.0 Define the tolerance for extra bytes read as a result of orc.min.disk.seek.size. If the (bytesRead - bytesNeeded) / bytesNeeded is greater than this threshold then extra work is performed to drop the extra bytes from memory after the read.
orc.encrypt null The list of keys and columns to encrypt with.
orc.mask null The masks to apply to the encrypted columns.
orc.key.provider hadoop The kind of KeyProvider to use for encryption.
orc.proleptic.gregorian false Should we read and write dates & times using the proleptic Gregorian calendar instead of the hybrid Julian-Gregorian calendar? Hive before 3.1 and Spark before 3.0 used the hybrid calendar.
orc.proleptic.gregorian.default false This value controls whether pre-ORC-27 files are using the hybrid or the proleptic calendar. Only Hive 3.1 and the C++ library wrote using the proleptic calendar, so hybrid is the default.
orc.row.batch.size 1024 The number of rows to include in an ORC vectorized reader batch. The value should be carefully chosen to minimize overhead and avoid OOMs when reading data.
orc.row.child.limit 32768 The maximum number of child elements to buffer before the ORC row writer writes the batch to the file.
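
Any key from the table can be set on a Hadoop Configuration before a reader or writer is constructed, and many writer-side keys can also be set per file through OrcFile.WriterOptions (the OrcConf enum in orc-core exposes the same keys programmatically). The following is a minimal writer-side sketch, not a prescribed recipe: the output path, schema, column names, and class name are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Keys from the table, set directly on the Configuration; they apply to
    // any reader or writer created from this conf.
    conf.set("orc.compress", "ZSTD");
    conf.set("orc.compression.zstd.level", "3");
    conf.setBoolean("orc.overwrite.output.file", true);

    // Illustrative schema and output path (assumptions, not from the docs).
    TypeDescription schema = TypeDescription.fromString("struct<x:bigint,y:bigint>");

    // Writer-side keys can also be set per file through WriterOptions, which
    // override the Configuration values for this writer only.
    Writer writer = OrcFile.createWriter(new Path("/tmp/example.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .stripeSize(64L * 1024 * 1024)     // orc.stripe.size
            .rowIndexStride(10000)             // orc.row.index.stride
            .compress(CompressionKind.ZSTD)    // orc.compress
            .bloomFilterColumns("x")           // orc.bloom.filter.columns
            .bloomFilterFpp(0.01));            // orc.bloom.filter.fpp

    // Write a few rows using the vectorized batch API (default batch is 1024 rows).
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector x = (LongColumnVector) batch.cols[0];
    LongColumnVector y = (LongColumnVector) batch.cols[1];
    for (long r = 0; r < 10; ++r) {
      int row = batch.size++;
      x.vector[row] = r;
      y.vector[row] = r * 2;
    }
    writer.addRowBatch(batch);
    writer.close();
  }
}
```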
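
The reader-side keys (orc.sarg.to.filter, orc.row.batch.size, and the orc.kryo.sarg / orc.sarg.column.names pair used to carry a SearchArgument into MapReduce jobs) relate to predicate pushdown. Below is a small read-side sketch under the same assumptions as above (file path, schema, and the filtered column "x" are illustrative).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class OrcReadConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Reader-side keys from the table, set on the Configuration.
    conf.setBoolean("orc.sarg.to.filter", true);   // allow the SArg to become a filter
    conf.setInt("orc.row.batch.size", 1024);       // orc.row.batch.size

    Reader reader = OrcFile.createReader(new Path("/tmp/example.orc"),
        OrcFile.readerOptions(conf));

    // A SearchArgument describes the predicate to push down; stripes and row
    // groups whose statistics cannot satisfy x < 5 may be skipped.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
        .lessThan("x", PredicateLeaf.Type.LONG, 5L)
        .end()
        .build();

    TypeDescription readSchema = TypeDescription.fromString("struct<x:bigint,y:bigint>");
    RecordReader rows = reader.rows(
        reader.options()
            .schema(readSchema)
            .searchArgument(sarg, new String[]{"x"}));

    VectorizedRowBatch batch = readSchema.createRowBatch();
    while (rows.nextBatch(batch)) {
      System.out.println("read " + batch.size + " rows");
    }
    rows.close();
  }
}
```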