| Key | Default | Notes |
| --- | --- | --- |
| orc.stripe.size | 67108864 | Define the default ORC stripe size, in bytes. |
| orc.stripe.row.count | 2147483647 | This value limits the row count in one stripe. The number of rows in a stripe falls in the range (0, "orc.stripe.row.count" + max(batchSize, "orc.rows.between.memory.checks")). |
| orc.block.size | 268435456 | Define the default file system block size for ORC files. |
| orc.create.index | true | Should the ORC writer create indexes as part of the file. |
| orc.row.index.stride | 10000 | Define the default ORC index stride, in number of rows. (Stride is the number of rows an index entry represents.) |
| orc.compress.size | 262144 | Define the default ORC buffer size, in bytes. |
| orc.base.delta.ratio | 8 | The ratio of base writer and delta writer in terms of STRIPE_SIZE and BUFFER_SIZE. |
| orc.block.padding | true | Define whether stripes should be padded to the HDFS block boundaries. |
| orc.compress | ZSTD | Define the default compression codec for ORC files. It can be NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD, or BROTLI. (A writer configuration sketch follows this table.) |
| orc.write.format | 0.12 | Define the version of the file to write. Possible values are 0.11 and 0.12. If this parameter is not defined, ORC will use the run length encoding (RLE) introduced in Hive 0.12. |
| orc.buffer.size.enforce | false | Defines whether to enforce the ORC compression buffer size. |
| orc.encoding.strategy | SPEED | Define the encoding strategy to use while writing data. Changing this will only affect the lightweight encoding for integers. This flag will not change the compression level of higher-level compression codecs (like ZLIB). |
| orc.compression.strategy | SPEED | Define the compression strategy to use while writing data. This changes the compression level of higher-level compression codecs (like ZLIB). |
| orc.compression.zstd.level | 3 | Define the compression level to use with the ZStandard codec while writing data. The valid range is 1 to 22. |
| orc.compression.zstd.windowlog | 0 | Set the maximum allowed back-reference distance for the ZStandard codec, expressed as a power of 2. |
| orc.block.padding.tolerance | 0.05 | Define the tolerance for block padding as a decimal fraction of the stripe size (for example, the default value 0.05 is 5% of the stripe size). For the defaults of a 64MB ORC stripe and 256MB HDFS blocks, the default block padding tolerance of 5% will reserve a maximum of 3.2MB for padding within the 256MB block. In that case, if the available size within the block is more than 3.2MB, a new smaller stripe will be inserted to fit within that space. This ensures that no written stripe crosses block boundaries and causes remote reads within a node-local task. |
| orc.bloom.filter.fpp | 0.01 | Define the default false positive probability for bloom filters. |
| orc.use.zerocopy | false | Use zerocopy reads with ORC. (This requires Hadoop 2.3 or later.) |
| orc.skip.corrupt.data | false | If the ORC reader encounters corrupt data, this value determines whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception. |
| orc.tolerate.missing.schema | true | Writers earlier than HIVE-4243 may have inaccurate schema metadata. This setting enables best-effort schema evolution rather than rejecting mismatched schemas. |
| orc.memory.pool | 0.5 | Maximum fraction of heap that can be used by ORC file writers. |
| orc.dictionary.key.threshold | 0.8 | If the number of distinct keys in a dictionary is greater than this fraction of the total number of non-null rows, turn off dictionary encoding. Use 1 to always use dictionary encoding. |
| orc.dictionary.early.check | true | If enabled, the dictionary check happens after the first row index stride (default 10000 rows); otherwise it happens before the first stripe is written. In both cases, the decision whether to use a dictionary is retained thereafter. |
| orc.dictionary.implementation | rbtree | The implementation of the dictionary used for string-type column encoding. The choices are: rbtree - use a red-black tree as the implementation for the dictionary; hash - use a hash table as the implementation for the dictionary. |
| orc.bloom.filter.columns |  | List of columns to create bloom filters for when writing. |
| orc.bloom.filter.write.version | utf8 | (Deprecated) Which version of the bloom filters should we write. The choices are: original - writes two versions of the bloom filters for use by both old and new readers; utf8 - writes just the new bloom filters. |
| orc.bloom.filter.ignore.non-utf8 | false | Should the reader ignore the obsolete non-UTF8 bloom filters. |
| orc.max.file.length | 9223372036854775807 | The maximum size of the file to read for finding the file tail. This is primarily used for streaming ingest to read intermediate footers while the file is still open. |
| orc.mapred.input.schema | null | The schema that the user desires to read. The values are interpreted using TypeDescription.fromString. |
| orc.mapred.map.output.key.schema | null | The schema of the MapReduce shuffle key. The values are interpreted using TypeDescription.fromString. |
| orc.mapred.map.output.value.schema | null | The schema of the MapReduce shuffle value. The values are interpreted using TypeDescription.fromString. |
| orc.mapred.output.schema | null | The schema that the user desires to write. The values are interpreted using TypeDescription.fromString. |
| orc.include.columns | null | The list of comma-separated column ids that should be read, with 0 being the first column, 1 being the next, and so on. |
| orc.kryo.sarg | null | The Kryo- and Base64-encoded SearchArgument for predicate pushdown. |
| orc.kryo.sarg.buffer | 8192 | The Kryo buffer size for the SearchArgument used in predicate pushdown. |
| orc.sarg.column.names | null | The list of column names for the SearchArgument. |
| orc.force.positional.evolution | false | Require schema evolution to match the top-level columns using position rather than column names. This provides backwards compatibility with Hive 2.1. |
| orc.force.positional.evolution.level | 1 | Require schema evolution to match columns up to the defined number of levels using position rather than column names. This provides backwards compatibility with Hive 2.1. |
| orc.rows.between.memory.checks | 5000 | How often should MemoryManager check the memory sizes? Measured in rows added to all of the writers. The valid range is [1, 10000] and it is primarily meant for testing. Setting this too low may negatively affect performance. Use orc.stripe.row.count instead if the value is larger than orc.stripe.row.count. |
| orc.overwrite.output.file | false | A boolean flag to enable overwriting of the output file if it already exists. |
| orc.schema.evolution.case.sensitive | true | A boolean flag to determine whether the comparison of field names in schema evolution is case sensitive. |
| orc.sarg.to.filter | false | A boolean flag to determine whether a SearchArgument is allowed to become a filter. |
| orc.filter.use.selected | false | A boolean flag to determine whether the selected vector is supported by the reading application. If false, the output of the ORC reader must have the filter reapplied to avoid using unset values in the unselected rows. If unsure, please leave this as false. |
| orc.filter.plugin | false | Enables the use of plugin filters during read. The plugin filters are discovered against the service org.apache.orc.filter.PluginFilterService; if multiple filters are determined, they are combined using AND. The order of application is non-deterministic and the filter functionality should not depend on the order of application. |
| orc.filter.plugin.allowlist | * | A list of comma-separated class names. If specified, it restricts the plugin filters to just these classes as discovered by the PluginFilterService. The default of * allows all discovered classes, and an empty string does not allow any plugins to be applied. |
| orc.write.variable.length.blocks | false | A boolean flag indicating whether the ORC writer should write variable-length HDFS blocks. |
| orc.column.encoding.direct |  | Comma-separated list of columns for which dictionary encoding is to be skipped. |
| orc.max.disk.range.chunk.limit | 2147482623 | When reading stripes >2GB, specify the maximum limit for the chunk size. |
| orc.min.disk.seek.size | 0 | When determining contiguous reads, gaps within this size are read contiguously rather than seeked over. The default value of zero disables this optimization. |
| orc.min.disk.seek.size.tolerance | 0.0 | Define the tolerance for extra bytes read as a result of orc.min.disk.seek.size. If (bytesRead - bytesNeeded) / bytesNeeded is greater than this threshold, extra work is performed to drop the extra bytes from memory after the read. |
| orc.encrypt | null | The list of keys and columns to encrypt with. |
| orc.mask | null | The masks to apply to the encrypted columns. |
| orc.key.provider | hadoop | The kind of KeyProvider to use for encryption. |
| orc.proleptic.gregorian | false | Should we read and write dates and times using the proleptic Gregorian calendar instead of the hybrid Julian-Gregorian calendar? Hive before 3.1 and Spark before 3.0 used hybrid. |
| orc.proleptic.gregorian.default | false | This value controls whether pre-ORC 27 files are using the hybrid or the proleptic calendar. Only Hive 3.1 and the C++ library wrote using the proleptic calendar, so hybrid is the default. |
| orc.row.batch.size | 1024 | The number of rows to include in an ORC vectorized reader batch. The value should be carefully chosen to minimize overhead and avoid OOMs when reading data. |
| orc.row.child.limit | 32768 | The maximum number of child elements to buffer before the ORC row writer writes the batch to the file. |