The ORC team is excited to announce the release of ORC v2.0.0.

New Feature and Notable Changes:

  • ORC-998: Refactor compression output buffer within OutStream for better portability
  • ORC-1088: Suport ZSTD_JNI and columnn compress to set compression level
  • ORC-1100: Support vcpkg
  • ORC-1251: Use Hadoop Vectored IO
  • ORC-1387: [C++] Support schema evolution from decimal to numeric/decimal
  • ORC-1440: Check for protobuf config based module
  • ORC-1463: Support brotli codec
  • ORC-1507: Use Zulu JDK distribution and switch from 21-ea to 21
  • ORC-1512: Drop Java 8/11 and make Java 17 by default
  • ORC-1531: Create orc-format module and repo
  • ORC-1545: Use orc-format 1.0.0-SNAPSHOT
  • ORC-1546: Use orc-format 1.0.0-alpha
  • ORC-1547: Spin-off ORC Format
  • ORC-1551: Use orc-format 1.0.0-beta
  • ORC-1572: Use Apache ORC Format 1.0.0
  • ORC-1585: [C++] Add orc-format_ep as a dependency of orc

Improvements:

  • ORC-1459: Mark DataBuffer::size() and DataBuffer::capacity() as const
  • ORC-1460: specification: Clarify how dictionary entries are sorted
  • ORC-1461: Mark Int128::getHighBits() and Int128::getLowBits() as const
  • ORC-1472: Replace deprecated method in TestMurmur3.java
  • ORC-1479: Enhance example usage message to use Uber jar
  • ORC-1481: [C++] Better error message when TZDB is unavailable
  • ORC-1504: Add lower bound check in get API for DynamicIntArray
  • ORC-1506: Replacing deprecated valueOf() with recommended forNumber()
  • ORC-1509: Auto grant contributor role to first-time contributors
  • ORC-1520: Remove JDK 8 settings from pom
  • ORC-1567: Add the -ignoreExtension configuration to the sizes and count commands of orc-tools
  • ORC-1570: Add supportVectoredIO API to HadoopShimsCurrent and use it
  • ORC-1571: Supports displaying raw data size in the meta command of orc-tools
  • ORC-1577: Use ZSTD as the default compression
  • ORC-1580: Change default DataBuffer constructor to use reserve instead of resize
  • ORC-1595: Add a short-cut to skip tiny inputs for ZstdCodec.compress
  • ORC-1596: Remove redundant Zstd.isError JNI usage
  • ORC-1597: Set bloom filter fpp to 1%
  • ORC-1600: Reduce getStaticMemoryManager sync block in OrcFile
  • ORC-1601: Reduce get HadoopShims sync block in HadoopShimsFactory
  • ORC-1610: Reduce the number of hash computation in CuckooSetBytes
  • ORC-1613: Zstd decompression supports direct buffer
  • ORC-1631: Supports summary output in sizes command
  • ORC-1637: [C++] Port conan recipe from upstream conan center
  • ORC-1638: Avoid System.exit(0) in count command
  • ORC-1639: [C++] Reduce unnecessary compiler flags in CMake
  • ORC-1641: Remove sourceFileExcludes from maven-javadoc-plugin
  • ORC-1642: Avoid System.exit(0) in scan command
  • ORC-1593: Set orc.compression.zstd.level to 3 by default

Bug Fixes:

  • ORC-634: Fix the json output for double NaN and infinite
  • ORC-1455: [C++] Fix build failure on non-x86 with unused macro in CpuInfoUtil.cc
  • ORC-1473: Zero-copy zeroCopyReadRanges and releaseBuffer bugs
  • ORC-1476: Maven build fail with unsupported platform: protoc-3.17.3-osx-aarch_64.exe
  • ORC-1480: [C++] Build failed when the BUILD_CPP_ENABLE_METRICS is ON
  • ORC-1500: [C++] The partition field does not support English special characters
  • ORC-1528: When using the orc.min.disk.seek.size configuration to read extremely large ORC files, a java.nio.BufferOverflowException may occur.
  • ORC-1553: Reading information from Row group, where there are 0 records of SArg column
  • ORC-1563: Fix orc.bloom.filter.fpp default value and orc.compress notes of Spark and Hive config docs
  • ORC-1568: Use readDiskRanges if orc.use.zerocopy is enabled
  • ORC-1575: Use ASF Archive URL instead Download URL
  • ORC-1578: Fix SparkBenchmark according to SPARK-40918
  • ORC-1588: Fix incorrect Decimal assert in LeafFilterFactory
  • ORC-1602: [C++] limit compression block size

Tasks:

  • ORC-1422: Setting version to 2.0.0-SNAPSHOT
  • ORC-1434: Remove org.apache.hadoop from dependabot.yml
  • ORC-1484: Use JIRA_ACCESS_TOKEN in merge_orc_pr.py
  • ORC-1485: Enable checkstyle checks for test classes
  • ORC-1486: Fix checkstyle violations for tests in orc-core module
  • ORC-1492: Fix checkstyle violations for tests in mapreduce, tools, bench modules
  • ORC-1496: Use iterator to suggest backporting branches
  • ORC-1515: Skip publishing orc-example module
  • ORC-1516: Fix minor typo in comments in IOUtils
  • ORC-1518: Remove findbugs folders
  • ORC-1529: Fix minor typos in pom.xml
  • ORC-1530: Rename variables in RecordReaderUtils.ChunkReader#create
  • ORC-1535: Remove generated Java docs from source tree
  • ORC-1536: Remove hive-storage-api link from maven-javadoc-plugin
  • ORC-1540: Remove MacOS 11 from GitHub Action CI
  • ORC-1542: Use Pattern Matching for instanceof (JEP-394)
  • ORC-1549: Update libhdfspp.tar.gz by adding #include <cstdint>
  • ORC-1569: Remove HadoopShimsPre2_3, HadoopShimsPre2_6, HadoopShimsPre2_7 classes
  • ORC-1579: Add ASF Generative Tooling Guidance to PR template
  • ORC-1591: Lower log level from INFO to DEBUG in *ReaderImpl/WriterImpl/PhysicalFsWriter
  • ORC-1592: Suppress KeyProvider missing log
  • ORC-1598: Close reader in orc-examples
  • ORC-1604: Deprecate non-utf8 bloom filter for Java writer

Tests:

  • ORC-1003: Recover java-examples-test
  • ORC-1409: Add stream order description in ORC spec.
  • ORC-1432: Add MacOS 13 GitHub Action Job
  • ORC-1474: Replaced deprecated getMinimum/Maximum in TestColumnStatistics
  • ORC-1475: [C++] ConvertColumnReader.TestConvertNumericToStringVariant fails when compiled with unsigned char
  • ORC-1477: Remove unused imports from Test classes
  • ORC-1478: Add Unit Test for org.apache.orc.impl.DynamicIntArray
  • ORC-1510: Fix package for TestOrcUtils and add more test cases
  • ORC-1541: Add Ubuntu 24.04 LTS Docker Test
  • ORC-1555: Simplify fedora37 docker image
  • ORC-1556: Add Rocky Linux 9 Docker Test
  • ORC-1557: Add GitHub Action CI for Docker Test
  • ORC-1558: Remove ubuntu22_jdk=21 and ubuntu22_jdk=21_cc=clang test combinations from docker/os-list.txt
  • ORC-1574: Update GitHub Action YAML files in branch-2.0
  • ORC-1586: Fix IllegalAccessError when SparkBenchmark runs on JDK17
  • ORC-1607: Fix testDoubleNaNAndInfinite to use TestFileDump.checkOutput
  • ORC-1614: Set ByteBuffer limit in TestBrotli test
  • ORC-1618: Disable building tests for snappy
  • ORC-1619: Add MacOS 14 to GitHub Action
  • ORC-1620: Add Apple Silicon Test Coverage
  • ORC-1621: Switch to oraclelinux9 from rocky9
  • ORC-1623: Use directOut.put(out) instead of directOut.put(out.array()) in TestZstd test
  • ORC-1630: Test using VectoredIO of hadoop to read ORC
  • ORC-1632: Add test for count command
  • ORC-1633: Add test for sizes command
  • ORC-1643: Add test for scan command

Build and dependency changes:

  • ORC-870: Unpin and upgrade jmh to 1.37
  • ORC-1423: Bump build-helper-maven-plugin to 3.4.0
  • ORC-1424: Bump maven-assembly-plugin to 3.6.0
  • ORC-1425: Bump checkstyle to 10.11.0
  • ORC-1427: Use Hadoop 3.3.5 in tools module
  • ORC-1429: Upgrade Maven to 3.8.8
  • ORC-1430: Use Hadoop 3.3.5 shaded clients
  • ORC-1431: Use parquet to 1.13.1 in bench module
  • ORC-1437: Bump checkstyle to 10.12.0
  • ORC-1438: Bump auto-service to 1.1.0
  • ORC-1439: Bump guava to 32.0.0-jre
  • ORC-1442: Update guava to 32.0.1
  • ORC-1445: Bump snappy-java to 1.1.10.1 in bench module
  • ORC-1448: Bump auto-service to 1.1.1
  • ORC-1456: Update Hadoop to 3.3.6
  • ORC-1466: Bump junit to 5.10.0
  • ORC-1467: Upgrade commons-lang3 to 3.13.0
  • ORC-1468: Bump opencsv to 5.8
  • ORC-1469: Update guava to 32.1.2
  • ORC-1470: Update maven-shade-plugin to 3.5.0
  • ORC-1493: Bump byte-buddy to 1.14.6
  • ORC-1502: Upgrade Maven to 3.9.4
  • ORC-1508: Upgrade slf4j to 2.0.9
  • ORC-1513: Upgrade snappy to 1.1.10.4
  • ORC-1514: Remove zookeeper runtime dependency
  • ORC-1517: Bump snappy-java to 1.1.10.5 in bench module
  • ORC-1521: Bump com.google.guava:guava to 32.1.3-jre
  • ORC-1522: Bump commons-cli:commons-cli to 1.6.0
  • ORC-1523: Bump maven-checkstyle-plugin to 3.3.1
  • ORC-1524: Bump maven-shade-plugin to 3.5.1
  • ORC-1526: Bump spotbugs-maven-plugin to 4.8.1.0
  • ORC-1527: Bump junit to 5.10.1
  • ORC-1533: Upgrade commons-lang3 to 3.14.0
  • ORC-1534: Upgrade build-helper-maven-plugin to 3.5.0
  • ORC-1537: Unpin and upgrade spotless to 2.41.0
  • ORC-1538: Unpin and upgrade maven-dependency-plugin to 3.6.1
  • ORC-1543: Bump spotless-maven-plugin to 2.41.1
  • ORC-1544: Unpin and upgrade protobuf-java to 3.25.1
  • ORC-1550: Upgrade Maven to 3.9.6
  • ORC-1562: Bump com.google.guava:guava to 33.0.0-jre
  • ORC-1565: Bump slf4j.version to 2.0.10
  • ORC-1566: Make Brotli dependency as optional
  • ORC-1576: Upgrade spark.jackson.version to 2.15.2 in bench module
  • ORC-1581: Bump slf4j.version to 2.0.11
  • ORC-1582: Bump protobuf-java to 3.25.2
  • ORC-1605: Upgrade brotli4j to 1.16.0
  • ORC-1616: Upgrade aircompressor to 0.26
  • ORC-1624: Upgrade Spark to 3.5.1
  • ORC-1626: Upgrade Mockito to 5.10 and byte-buddy to 1.14.11
  • ORC-1627: Unpin scala-library
  • ORC-1628: Bump protobuf-java to 3.25.3

Documentations:

  • ORC-994: Fix javadoc so that it doesn’t put files into the source tree
  • ORC-1471: Updated README.md to use maven 3.8.8
  • ORC-1491: Update Python documentation with PyArrow 13.0.0 and Dask 2023.8.1
  • ORC-1503: Update README.md to use maven 3.9.4
  • ORC-1552: Update README.md to use maven 3.9.6
  • ORC-1564: Add Java ORC configuration documentation
  • ORC-1584: Remove README about Proto subdirectory
  • ORC-1587: Fix usage command of SparkBenchmark document
  • ORC-1599: Add zstd compression level and windowlog in Java configuration documentation
  • ORC-1612: Document available encodings at orc.compress
  • ORC-1625: Switch to oraclelinux9 from rocky9 in README