Class WriterImpl

java.lang.Object
org.apache.orc.impl.WriterImpl
All Implemented Interfaces:
Closeable, AutoCloseable, WriterInternal, MemoryManager.Callback, Writer
Direct Known Subclasses:
WriterImplV2

public class WriterImpl extends Object implements WriterInternal, MemoryManager.Callback
An ORC file writer. The file is divided into stripes, which are the natural unit of work when reading. Each stripe is buffered in memory until the buffered data reaches the stripe size, and then it is written out, broken down by columns. Each column is written by a TreeWriter that is specific to that column's type. TreeWriters may have child TreeWriters that handle the sub-types. Each TreeWriter writes its column's data as a set of streams.
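The buffering behaviour described above can be pictured with a small model. This is a simplified sketch, not ORC's implementation: `ToyStripeBuffer`, its fields, and the per-row byte estimate are hypothetical names invented for illustration. It buffers rows until the estimated size reaches the stripe size and then flushes them as one stripe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; not part of org.apache.orc.
class ToyStripeBuffer {
    private final long stripeSize;      // flush threshold in bytes
    private long bufferedBytes = 0;     // estimated bytes currently buffered
    private long bufferedRows = 0;      // rows in the current, unflushed stripe
    private final List<Long> stripeRowCounts = new ArrayList<>();

    ToyStripeBuffer(long stripeSize) {
        this.stripeSize = stripeSize;
    }

    // Buffer one row; flush once the estimate reaches the stripe size.
    void addRow(long estimatedRowBytes) {
        bufferedBytes += estimatedRowBytes;
        bufferedRows++;
        if (bufferedBytes >= stripeSize) {
            flushStripe();
        }
    }

    // Write whatever is still buffered as a final, possibly short, stripe.
    void close() {
        if (bufferedRows > 0) {
            flushStripe();
        }
    }

    // Row count of every stripe flushed so far.
    List<Long> stripeRowCounts() {
        return stripeRowCounts;
    }

    private void flushStripe() {
        stripeRowCounts.add(bufferedRows);
        bufferedBytes = 0;
        bufferedRows = 0;
    }

    public static void main(String[] args) {
        ToyStripeBuffer buffer = new ToyStripeBuffer(100);
        for (int i = 0; i < 25; i++) {
            buffer.addRow(10);  // ten 10-byte rows reach the 100-byte threshold
        }
        buffer.close();
        System.out.println(buffer.stripeRowCounts());
    }
}
```

With a 100-byte stripe size and 25 rows of 10 bytes each, the model produces two full stripes and one short stripe on close, so `main` prints `[10, 10, 5]`.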

This class is unsynchronized, like most Stream objects, so the creation of the writer through OrcFile and all access to a single instance must happen on a single thread.

There are no known cases today where creation and access happen on different threads.

Caveat: the MemoryManager is created when the WriterOptions are created, so that step has to be confined to a single thread as well.
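An application that wants to catch accidental cross-thread use early could wrap its writer calls with a small ownership check. The sketch below is one possible approach under stated assumptions: `SingleThreadGuard` is a hypothetical helper that ORC does not provide, and the application would have to call `check()` itself before touching the writer.

```java
// Hypothetical helper; ORC itself does not ship this class.
class SingleThreadGuard {
    // The thread that constructed the guard is treated as the owner.
    private final Thread owner = Thread.currentThread();

    // Throws if the caller is not the thread that created the guard.
    void check() {
        Thread current = Thread.currentThread();
        if (current != owner) {
            throw new IllegalStateException(
                "accessed from " + current.getName()
                    + " but owned by " + owner.getName());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SingleThreadGuard guard = new SingleThreadGuard();
        guard.check();  // same thread: passes silently

        Thread other = new Thread(() -> {
            try {
                guard.check();
            } catch (IllegalStateException e) {
                System.out.println("rejected: " + e.getMessage());
            }
        }, "other-thread");
        other.start();
        other.join();
    }
}
```

Creating the guard on the same thread that builds the WriterOptions would cover the MemoryManager caveat as well.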

  • Constructor Details

  • Method Details

    • getEstimatedBufferSize

      public static int getEstimatedBufferSize(long stripeSize, int numColumns, int bs)
    • increaseCompressionSize

      public void increaseCompressionSize(int newSize)
      Description copied from interface: WriterInternal
      Increase the buffer size for this writer. This function is internal only and should only be called by the ORC file merger.
      Specified by:
      increaseCompressionSize in interface WriterInternal
      Parameters:
      newSize - the new buffer size.
    • createCodec

      public static CompressionCodec createCodec(CompressionKind kind)
    • checkMemory

      public boolean checkMemory(double newScale) throws IOException
      Description copied from interface: MemoryManager.Callback
      The scale factor for the stripe size has changed, and thus the writer should adjust its desired size appropriately.
      Specified by:
      checkMemory in interface MemoryManager.Callback
      Parameters:
      newScale - the current scale factor for memory allocations
      Returns:
      true if the writer was over the limit
      Throws:
      IOException
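The callback contract above can be illustrated with a toy model. This sketch assumes that the writer's memory limit is its desired stripe size multiplied by the scale factor and that a writer over the limit flushes its buffer; `ToyScaledWriter` and its members are hypothetical, and the real class may differ in detail.

```java
// Hypothetical toy model of a writer reacting to a memory scale change; not ORC code.
class ToyScaledWriter {
    private final long stripeSize;  // desired stripe size in bytes
    private long memoryLimit;       // stripe size scaled by the latest factor
    private long bufferedBytes = 0;
    private int flushes = 0;

    ToyScaledWriter(long stripeSize) {
        this.stripeSize = stripeSize;
        this.memoryLimit = stripeSize;
    }

    void buffer(long bytes) {
        bufferedBytes += bytes;
    }

    // Apply the new scale; flush and return true if the buffer is over the limit.
    boolean checkMemory(double newScale) {
        memoryLimit = Math.round(stripeSize * newScale);
        if (bufferedBytes > memoryLimit) {
            bufferedBytes = 0;  // stands in for writing the stripe out
            flushes++;
            return true;
        }
        return false;
    }

    int flushes() {
        return flushes;
    }

    public static void main(String[] args) {
        ToyScaledWriter writer = new ToyScaledWriter(1000);
        writer.buffer(600);
        System.out.println(writer.checkMemory(1.0));  // limit 1000, 600 buffered
        System.out.println(writer.checkMemory(0.5));  // limit 500, 600 buffered
    }
}
```

In this model the first call prints `false` and the second prints `true`, because halving the scale drops the limit below the 600 buffered bytes.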
    • getSchema

      public TypeDescription getSchema()
      Description copied from interface: Writer
      Get the schema for this writer
      Specified by:
      getSchema in interface Writer
      Returns:
      the file schema
    • addUserMetadata

      public void addUserMetadata(String name, ByteBuffer value)
      Description copied from interface: Writer
      Add arbitrary metadata to the ORC file. This may be called at any point until the Writer is closed. If the same key is passed a second time, the second value replaces the first.
      Specified by:
      addUserMetadata in interface Writer
      Parameters:
      name - a key to label the data with.
      value - the contents of the metadata.
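The replace-on-duplicate-key behaviour matches ordinary map semantics. The following sketch models it with a plain `TreeMap`; `ToyUserMetadata` and its helper methods are hypothetical and say nothing about how the writer actually stores the values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of key/value user metadata; not ORC code.
class ToyUserMetadata {
    private final Map<String, ByteBuffer> entries = new TreeMap<>();

    // A second call with the same name replaces the earlier value.
    void addUserMetadata(String name, ByteBuffer value) {
        entries.put(name, value.duplicate());
    }

    // Decode a stored value as UTF-8 text, or return null if the key is absent.
    String getAsString(String name) {
        ByteBuffer value = entries.get(name);
        return value == null
            ? null
            : StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    int size() {
        return entries.size();
    }

    // Convenience helper for building UTF-8 values.
    static ByteBuffer utf8(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        ToyUserMetadata metadata = new ToyUserMetadata();
        metadata.addUserMetadata("owner", utf8("alice"));
        metadata.addUserMetadata("owner", utf8("bob"));
        System.out.println(metadata.getAsString("owner"));
    }
}
```

After both calls the model holds a single `owner` entry, and `main` prints `bob`.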
    • addRowBatch

      public void addRowBatch(org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch batch) throws IOException
      Description copied from interface: Writer
      Add a row batch to the ORC file.
      Specified by:
      addRowBatch in interface Writer
      Parameters:
      batch - the rows to add
      Throws:
      IOException
    • close

      public void close() throws IOException
      Description copied from interface: Writer
      Flush all of the buffers and close the file. No methods on this writer should be called afterwards.
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Specified by:
      close in interface Writer
      Throws:
      IOException
    • getRawDataSize

      public long getRawDataSize()
      The raw data size is computed when the file footer is written, so the value is available only after the writer is closed.
      Specified by:
      getRawDataSize in interface Writer
      Returns:
      raw data size
    • getNumberOfRows

      public long getNumberOfRows()
      The row count is updated when stripes are flushed. To get an accurate row count, call this method after the writer is closed.
      Specified by:
      getNumberOfRows in interface Writer
      Returns:
      row count
    • writeIntermediateFooter

      public long writeIntermediateFooter() throws IOException
      Description copied from interface: Writer
      Write an intermediate footer on the file such that if the file is truncated to the returned offset, it would be a valid ORC file.
      Specified by:
      writeIntermediateFooter in interface Writer
      Returns:
      the offset that would be a valid end location for an ORC file
      Throws:
      IOException
    • appendStripe

      public void appendStripe(byte[] stripe, int offset, int length, StripeInformation stripeInfo, OrcProto.StripeStatistics stripeStatistics) throws IOException
      Description copied from interface: Writer
      Fast stripe append to an ORC file. This interface is used to merge ORC files quickly. When merging, the caller passes each stripe of the file being merged in binary form, along with its stripe information and stripe statistics. After appending the last stripe of a file, use appendUserMetadata() to append any user metadata. This form supports only files with no column encryption. Use Writer.appendStripe(byte[], int, int, StripeInformation, StripeStatistics[]) for files with encryption.
      Specified by:
      appendStripe in interface Writer
      Parameters:
      stripe - stripe as byte array
      offset - offset within byte array
      length - length of stripe within byte array
      stripeInfo - stripe information
      stripeStatistics - unencrypted stripe statistics
      Throws:
      IOException
    • appendStripe

      public void appendStripe(byte[] stripe, int offset, int length, StripeInformation stripeInfo, StripeStatistics[] stripeStatistics) throws IOException
      Description copied from interface: Writer
      Fast stripe append to an ORC file. This interface is used to merge ORC files quickly. When merging, the caller passes each stripe of the file being merged in binary form, along with its stripe information and stripe statistics. After appending the last stripe of a file, use Writer.addUserMetadata(String, ByteBuffer) to append any user metadata.
      Specified by:
      appendStripe in interface Writer
      Parameters:
      stripe - stripe as byte array
      offset - offset within byte array
      length - length of stripe within byte array
      stripeInfo - stripe information
      stripeStatistics - stripe statistics, with the last one being for the unencrypted data and the others being for each encryption variant.
      Throws:
      IOException
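Both appendStripe overloads identify the stripe by a byte array plus an offset and a length, so the stripe does not have to start at index 0 of the array. The sketch below shows that convention using only the JDK; `StripeRange` and `view` are hypothetical names, and how a caller obtains the bytes from the source file is outside its scope.

```java
import java.nio.ByteBuffer;

// Hypothetical helper illustrating the (array, offset, length) convention; not ORC code.
class StripeRange {
    // View the bytes [offset, offset + length) of the array without copying.
    // ByteBuffer.wrap throws IndexOutOfBoundsException for an invalid range.
    static ByteBuffer view(byte[] stripe, int offset, int length) {
        return ByteBuffer.wrap(stripe, offset, length).slice();
    }

    public static void main(String[] args) {
        byte[] buffer = {0, 1, 2, 3, 4, 5, 6, 7};
        ByteBuffer stripe = view(buffer, 2, 4);  // covers values 2, 3, 4, 5
        System.out.println(stripe.remaining() + " bytes, first = " + stripe.get(0));
    }
}
```

Validating the range before handing the array to a writer is a cheap way to catch an offset or length that was computed against the wrong buffer.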
    • appendUserMetadata

      public void appendUserMetadata(List<OrcProto.UserMetadataItem> userMetadata)
      Description copied from interface: Writer
      Update the current user metadata with a list of new values.
      Specified by:
      appendUserMetadata in interface Writer
      Parameters:
      userMetadata - user metadata
    • getStatistics

      public ColumnStatistics[] getStatistics()
      Description copied from interface: Writer
      Get the statistics about the columns in the file. The output reflects the time at which the method is called: it uses all of the data written so far to provide the statistics. Note that invoking this method has a cost, so it should be used judiciously.
      Specified by:
      getStatistics in interface Writer
      Returns:
      the information about the columns
    • getStripes

      public List<StripeInformation> getStripes() throws IOException
      Description copied from interface: Writer
      Get the stripe information about the file. The output of this is based on the time at which it is called. It shall return stripes that have been completed. After the writer is closed this shall give the complete stripe information.
      Specified by:
      getStripes in interface Writer
      Returns:
      stripe information
      Throws:
      IOException
    • getCompressionCodec

      public CompressionCodec getCompressionCodec()
    • estimateMemory

      public long estimateMemory()
      Description copied from interface: Writer
      Estimate the memory currently used by the writer to buffer the stripe. This method helps the write engine control the refresh policy of the ORC writer.
      Specified by:
      estimateMemory in interface Writer
      Returns:
      the number of bytes
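A write engine could use the estimate to decide when to roll over to a new file or otherwise refresh its writer. The sketch below is a minimal example of such a policy; `ToyFlushPolicy`, `shouldRoll`, and the threshold are hypothetical, and the `LongSupplier` stands in for a method reference such as `writer::estimateMemory`.

```java
import java.util.function.LongSupplier;

// Hypothetical policy class; the threshold and names are assumptions.
class ToyFlushPolicy {
    private final long maxBufferedBytes;

    ToyFlushPolicy(long maxBufferedBytes) {
        this.maxBufferedBytes = maxBufferedBytes;
    }

    // Returns true once the estimated buffered memory reaches the threshold.
    boolean shouldRoll(LongSupplier memoryEstimate) {
        return memoryEstimate.getAsLong() >= maxBufferedBytes;
    }

    public static void main(String[] args) {
        ToyFlushPolicy policy = new ToyFlushPolicy(64L * 1024 * 1024);
        // Fixed numbers stand in for live estimates from a writer.
        System.out.println(policy.shouldRoll(() -> 10L * 1024 * 1024));
        System.out.println(policy.shouldRoll(() -> 80L * 1024 * 1024));
    }
}
```

Passing the estimate as a supplier keeps the policy independent of any particular writer, which also makes it easy to test with fixed numbers.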