Java Tools

In addition to the C++ tools, there is an ORC tools jar that packages several useful utilities and the necessary Java dependencies (including Hadoop) into a single package. The Java ORC tool jar supports both the local file system and HDFS.

The subcommands for the tools are:

  • check (since ORC 2.0.1) - check the index of the specified column
  • convert (since ORC 1.4) - convert CSV/JSON/ORC files to ORC
  • count (since ORC 1.6) - recursively find *.orc and print the number of rows
  • data - print the data of an ORC file
  • json-schema (since ORC 1.4) - determine the schema of JSON documents
  • key (since ORC 1.5) - print information about the encryption keys
  • merge (since ORC 2.0.1) - merge multiple ORC files into a single ORC file
  • meta - print the metadata of an ORC file
  • scan (since ORC 1.3) - scan the data for benchmarking
  • sizes (since ORC 1.7.2) - list size on disk of each column
  • version (since ORC 1.6) - print the version of this ORC tool

The command line looks like:

% java -jar orc-tools-X.Y.Z-uber.jar <sub-command> <args>

Java Check

The check command can check whether the specified value of the column specified by multiple ORC files can be filtered.

Check statistics and bloom filter index on x column.

% java -jar orc-tools-X.Y.Z-uber.jar check --type predicate /path/to/example.orc --values 1234 --values 5566 --column x

Check statistics on x column.

% java -jar orc-tools-X.Y.Z-uber.jar check --type stat /path/to/example.orc --values 1234 --values 5566 --column x

Check bloom filter index on x column.

% java -jar orc-tools-X.Y.Z-uber.jar check --type bloom-filter /path/to/example.orc --values 1234 --values 5566 --column x

Java Convert

The convert command reads several CSV/JSON/ORC files and converts them into a single ORC file.

-b,--bloomFilterColumns <columns>
Comma separated values of column names for which bloom filter is to be created. By default, no bloom filters will be created.
-e,--escape <escape>
Sets CSV escape character
-h,--help
Print help
-H,--header <header>
Sets CSV header lines
-n,--null <null>
Sets CSV null string
-o,--output <filename>
Sets the output ORC filename, which defaults to output.orc
-O,--overwrite
If the file already exists, it will be overwritten
-q,--quote <quote>
Sets CSV quote character
-s,--schema <schema>
Sets the schema for the ORC file. By default, the schema is automatically discovered.
-S,--separator <separator>
Sets CSV separator character
-t,--timestampformat <timestampformat>
Sets timestamp Format

The automatic JSON schema discovery is equivalent to the json-schema tool below.

Java Count

The count command recursively find *.orc and print the number of rows.

Java Data

The data command prints the data in an ORC file as a JSON document. Each record is printed as a JSON object on a line. Each record is annotated with the fieldnames and a JSON representation that depends on the field’s type.

-h,--help
Print help
-n,--lines <LINES>
Sets lines of data to be printed

Java JSON Schema

The JSON Schema discovery tool processes a set of JSON documents and produces a schema that encompasses all of the records in all of the documents. It works by computing the enclosing type and promoting it to include all of the observed values.

-f,--flat
Print the schema as a list of flat types for each subfield
-h,--help
Print help
-p,--pretty
Pretty print the schema
-t,--table
Print the schema as a Hive table declaration

Java Key

The key command prints the information about the encryption keys.

-h,--help
Print help
-o,--output <output>
Output filename

Java Meta

The meta command prints the metadata about the given ORC file and is equivalent to the Hive ORC File Dump command.

--backup-path <path>
when used with –recover specifies the path where the recovered file is written (default: /tmp)
--column-type
Print the column id, name and type of each column
-d,--data
Should the data be printed
-h,--help
Print help
-j,--json
Format the output in JSON
-p,--pretty
Pretty print the output
-r,--rowindex <ids>
Print the row indexes for the comma separated list of column ids
--recover
Skip over corrupted values in the ORC file
--skip-dump
Skip dumping the metadata
-t,--timezone
Print the timezone of the writer

An example of the output is given below:

% java -jar orc-tools-X.Y.Z-uber.jar meta examples/TestOrcFile.test1.orc
Processing data file examples/TestOrcFile.test1.orc [length: 1711]
Structure for examples/TestOrcFile.test1.orc
File Version: 0.12 with HIVE_8732
Rows: 2
Compression: ZLIB
Compression size: 10000
Type: struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,
long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,
middle:struct<list:array<struct<int1:int,string1:string>>>,list:array<
struct<int1:int,string1:string>>,map:map<string,struct<int1:int,string1:
string>>>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 2 hasNull: false
    Column 1: count: 2 hasNull: false true: 1
    Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
    Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
    Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
    Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
    Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
    Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
    Column 8: count: 2 hasNull: false sum: 5
    Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
    Column 10: count: 2 hasNull: false
    Column 11: count: 2 hasNull: false
    Column 12: count: 4 hasNull: false
    Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
    Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
    Column 15: count: 2 hasNull: false
    Column 16: count: 5 hasNull: false
    Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
    Column 18: count: 5 hasNull: false min: bad max: in sum: 15
    Column 19: count: 2 hasNull: false
    Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
    Column 21: count: 2 hasNull: false
    Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
    Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12

File Statistics:
  Column 0: count: 2 hasNull: false
  Column 1: count: 2 hasNull: false true: 1
  Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
  Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
  Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
  Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
  Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
  Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
  Column 8: count: 2 hasNull: false sum: 5
  Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
  Column 10: count: 2 hasNull: false
  Column 11: count: 2 hasNull: false
  Column 12: count: 4 hasNull: false
  Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
  Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
  Column 15: count: 2 hasNull: false
  Column 16: count: 5 hasNull: false
  Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
  Column 18: count: 5 hasNull: false min: bad max: in sum: 15
  Column 19: count: 2 hasNull: false
  Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
  Column 21: count: 2 hasNull: false
  Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
  Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12

Stripes:
  Stripe: offset: 3 data: 243 rows: 2 tail: 199 index: 570
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 22
    Stream: column 2 section ROW_INDEX start: 36 length 26
    Stream: column 3 section ROW_INDEX start: 62 length 27
    Stream: column 4 section ROW_INDEX start: 89 length 30
    Stream: column 5 section ROW_INDEX start: 119 length 28
    Stream: column 6 section ROW_INDEX start: 147 length 34
    Stream: column 7 section ROW_INDEX start: 181 length 34
    Stream: column 8 section ROW_INDEX start: 215 length 21
    Stream: column 9 section ROW_INDEX start: 236 length 30
    Stream: column 10 section ROW_INDEX start: 266 length 11
    Stream: column 11 section ROW_INDEX start: 277 length 16
    Stream: column 12 section ROW_INDEX start: 293 length 11
    Stream: column 13 section ROW_INDEX start: 304 length 24
    Stream: column 14 section ROW_INDEX start: 328 length 31
    Stream: column 15 section ROW_INDEX start: 359 length 16
    Stream: column 16 section ROW_INDEX start: 375 length 11
    Stream: column 17 section ROW_INDEX start: 386 length 32
    Stream: column 18 section ROW_INDEX start: 418 length 30
    Stream: column 19 section ROW_INDEX start: 448 length 16
    Stream: column 20 section ROW_INDEX start: 464 length 37
    Stream: column 21 section ROW_INDEX start: 501 length 11
    Stream: column 22 section ROW_INDEX start: 512 length 24
    Stream: column 23 section ROW_INDEX start: 536 length 37
    Stream: column 1 section DATA start: 573 length 5
    Stream: column 2 section DATA start: 578 length 6
    Stream: column 3 section DATA start: 584 length 9
    Stream: column 4 section DATA start: 593 length 11
    Stream: column 5 section DATA start: 604 length 12
    Stream: column 6 section DATA start: 616 length 11
    Stream: column 7 section DATA start: 627 length 15
    Stream: column 8 section DATA start: 642 length 8
    Stream: column 8 section LENGTH start: 650 length 6
    Stream: column 9 section DATA start: 656 length 8
    Stream: column 9 section LENGTH start: 664 length 6
    Stream: column 11 section LENGTH start: 670 length 6
    Stream: column 13 section DATA start: 676 length 7
    Stream: column 14 section DATA start: 683 length 6
    Stream: column 14 section LENGTH start: 689 length 6
    Stream: column 14 section DICTIONARY_DATA start: 695 length 10
    Stream: column 15 section LENGTH start: 705 length 6
    Stream: column 17 section DATA start: 711 length 25
    Stream: column 18 section DATA start: 736 length 18
    Stream: column 18 section LENGTH start: 754 length 8
    Stream: column 19 section LENGTH start: 762 length 6
    Stream: column 20 section DATA start: 768 length 15
    Stream: column 20 section LENGTH start: 783 length 6
    Stream: column 22 section DATA start: 789 length 6
    Stream: column 23 section DATA start: 795 length 15
    Stream: column 23 section LENGTH start: 810 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT
    Encoding column 2: DIRECT
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2
    Encoding column 5: DIRECT_V2
    Encoding column 6: DIRECT
    Encoding column 7: DIRECT
    Encoding column 8: DIRECT_V2
    Encoding column 9: DIRECT_V2
    Encoding column 10: DIRECT
    Encoding column 11: DIRECT_V2
    Encoding column 12: DIRECT
    Encoding column 13: DIRECT_V2
    Encoding column 14: DICTIONARY_V2[2]
    Encoding column 15: DIRECT_V2
    Encoding column 16: DIRECT
    Encoding column 17: DIRECT_V2
    Encoding column 18: DIRECT_V2
    Encoding column 19: DIRECT_V2
    Encoding column 20: DIRECT_V2
    Encoding column 21: DIRECT
    Encoding column 22: DIRECT_V2
    Encoding column 23: DIRECT_V2

File length: 1711 bytes
Padding length: 0 bytes
Padding ratio: 0%
______________________________________________________________________

Java Scan

The scan command reads the contents of the file without printing anything. It is primarily intendend for benchmarking the Java reader without including the cost of printing the data out.

-h,--help
Print help
-s,--schema
Print schema
-v,--verbose
Print exceptions

Java Sizes

The sizes command lists size on disk of each column. The output contains not only the raw data of the table, but also the size of metadata such as padding, stripeFooter, fileFooter, stripeIndex and stripeData.

% java -jar orc-tools-X.Y.Z-uber.jar sizes examples/my-file.orc
Percent  Bytes/Row  Name
  98.45  2.62       y
  0.81   0.02       _file_footer
  0.30   0.01       _index
  0.25   0.01       x
  0.19   0.01       _stripe_footer
______________________________________________________________________

Java Merge

The merge command can merge multiple ORC files that all have the same schema into a single ORC file.

% java -jar orc-tools-X.Y.Z-uber.jar merge --output /path/to/merged.orc /path/to/input_orc/
______________________________________________________________________

Java Version

The version command prints the version of this ORC tool.

Next