public class TikaTextExtractor extends TextExtractor
TextExtractor that uses the Apache Tika library.
This extractor will automatically discover all of the Tika Parser implementations that are defined in
META-INF/services/org.apache.tika.parser.Parser text files accessible via the current classloader and that contain
the class names of the Parser implementations (one class name per line in each file).
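
The discovery mechanism described above relies on plain-text provider files on the classpath. The following is a minimal sketch of how such files could be located and read using only the JDK; it is not the code Tika or ModeShape actually runs. The class name `ServiceFileSketch` and both helper methods are hypothetical, and the handling of `#` comments follows the general Java service-provider file convention rather than anything stated on this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
public class ServiceFileSketch {

    /** Parses the content of one provider file: one class name per line, '#' starts a comment. */
    static List<String> parseClassNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\r?\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the comment part
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    /** Collects the class names from every META-INF/services/<serviceName> file visible to the loader. */
    static List<String> discover(ClassLoader loader, String serviceName) throws IOException {
        List<String> names = new ArrayList<>();
        Enumeration<URL> files = loader.getResources("META-INF/services/" + serviceName);
        while (files.hasMoreElements()) {
            URL file = files.nextElement();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(file.openStream(), StandardCharsets.UTF_8))) {
                String content = reader.lines().collect(Collectors.joining("\n"));
                names.addAll(parseClassNames(content));
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        // Prints nothing unless Tika parser jars are on the classpath.
        for (String name : discover(loader, "org.apache.tika.parser.Parser")) {
            System.out.println(name);
        }
    }
}
```

Running `main` with Tika parser jars on the classpath should list the Parser class names those jars declare; without them the list is simply empty.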
This text extractor can be configured in a ModeShape configuration by specifying several optional properties:
package files are excluded, though explicitly setting any excluded MIME types will
override these defaults.

Nested classes inherited from TextExtractor: TextExtractor.BinaryOperation<T>, TextExtractor.Context, TextExtractor.Output

| Modifier and Type | Field and Description |
|---|---|
| protected static Set<org.apache.tika.mime.MediaType> | DEFAULT_EXCLUDED_MIME_TYPES: The MIME types that are excluded by default. |
| protected static Logger | LOGGER |
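
The override rule stated above (explicitly configured exclusions replace the defaults) can be sketched as follows. This is an assumption-laden illustration, not the extractor's actual logic: the class `ExcludedTypesSketch`, its methods, and the two sample MIME types are hypothetical stand-ins, since the real contents of DEFAULT_EXCLUDED_MIME_TYPES are not listed on this page. Plain strings are used instead of Tika's MediaType to keep the sketch free of dependencies.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only.
public class ExcludedTypesSketch {

    // Illustrative values only; the real default exclusions are not given in this document.
    static final Set<String> DEFAULT_EXCLUDED = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("application/zip", "application/x-tar")));

    /** Returns the configured exclusions if any were set, otherwise the defaults. */
    static Set<String> effectiveExclusions(Set<String> configured) {
        if (configured == null || configured.isEmpty()) {
            return DEFAULT_EXCLUDED;
        }
        return Collections.unmodifiableSet(new HashSet<>(configured));
    }

    /** Checks a MIME type (case-insensitively) against the effective exclusions. */
    static boolean isExcluded(String mimeType, Set<String> configured) {
        return effectiveExclusions(configured).contains(mimeType.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> custom = Collections.singleton("image/png");
        System.out.println(isExcluded("application/zip", null));   // defaults apply
        System.out.println(isExcluded("application/zip", custom)); // defaults replaced
        System.out.println(isExcluded("image/png", custom));       // explicit exclusion
    }
}
```

The design choice shown here is replacement rather than merging: once any exclusion is configured, the sample defaults no longer apply.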
| Constructor and Description |
|---|
| TikaTextExtractor(): No-arg constructor is required because this is instantiated by reflection. |
| Modifier and Type | Method and Description |
|---|---|
| void | extractFrom(Binary binary, TextExtractor.Output output, TextExtractor.Context context): Extract text from the given Binary, using the given output to record the results. |
| protected Set<org.apache.tika.mime.MediaType> | getExcludedMediaTypes() |
| protected Set<org.apache.tika.mime.MediaType> | getIncludedMediaTypes() |
| protected Set<org.apache.tika.mime.MediaType> | getParserSupportedMediaTypes() |
| protected org.apache.tika.parser.DefaultParser | initialize(): This class lazily initializes the DefaultParser instance. |
| protected org.apache.tika.metadata.Metadata | prepareMetadata(Binary binary, TextExtractor.Context context): Creates a new Tika metadata object used by the parser. |
| protected void | setWriteLimit(Integer writeLimit): Sets the write limit for the Tika parser, representing the maximum number of characters that should be extracted by the Tika parser. |
| boolean | supportsMimeType(String mimeType): Determine if this extractor is capable of processing content with the supplied MIME type. |
| String | toString() |
Methods inherited from TextExtractor: getExcludedMimeTypes, getIncludedMimeTypes, getName, logger, processStream, setLogger, setName

protected static final Logger LOGGER

protected static final Set<org.apache.tika.mime.MediaType> DEFAULT_EXCLUDED_MIME_TYPES
The MIME types that are excluded by default.

public TikaTextExtractor()
No-arg constructor is required because this is instantiated by reflection.

public boolean supportsMimeType(String mimeType)
Description copied from class TextExtractor: Determine if this extractor is capable of processing content with the supplied MIME type.
supportsMimeType in class TextExtractor
Parameters:
mimeType - the MIME type; never null

public void extractFrom(Binary binary, TextExtractor.Output output, TextExtractor.Context context) throws Exception
Description copied from class TextExtractor: Extract text from the given Binary, using the given output to record the results.
extractFrom in class TextExtractor
Parameters:
binary - the binary value that can be used in the extraction process; never null
output - the output from the sequencing operation; never null
context - the context for the sequencing operation; never null
Throws:
Exception - if there is a problem during the extraction process

protected final org.apache.tika.metadata.Metadata prepareMetadata(Binary binary, TextExtractor.Context context) throws IOException, RepositoryException
Creates a new Tika metadata object used by the parser.
Parameters:
binary - an org.modeshape.jcr.api.Binary instance of the content being parsed
context - the extraction context; may not be null
Returns:
Metadata instance.
Throws:
IOException - if auto-detecting the MIME type via Tika fails
RepositoryException - if there is an error obtaining the MIME type of the binary parameter

protected org.apache.tika.parser.DefaultParser initialize()
This class lazily initializes the DefaultParser instance.
Returns:
parser

protected void setWriteLimit(Integer writeLimit)
Sets the write limit for the Tika parser, representing the maximum number of characters that should be extracted by the Tika parser.
Parameters:
writeLimit - an Integer which represents the write limit; may be null
See Also:
BodyContentHandler.BodyContentHandler(int)

protected Set<org.apache.tika.mime.MediaType> getExcludedMediaTypes()

protected Set<org.apache.tika.mime.MediaType> getIncludedMediaTypes()

protected Set<org.apache.tika.mime.MediaType> getParserSupportedMediaTypes()
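
To make the write limit described for setWriteLimit concrete, here is a dependency-free sketch of a character sink that keeps at most a fixed number of characters. It only illustrates the idea of capping extracted text; it is not Tika's BodyContentHandler, which may behave differently once the limit is reached (for example by signalling an error rather than silently dropping characters). The names `WriteLimitSketch`, `LimitedWriter`, and `copyWithLimit` are hypothetical.

```java
import java.io.Writer;

// Hypothetical helper, for illustration only.
public class WriteLimitSketch {

    /** A Writer that retains at most 'limit' characters and drops everything after that. */
    static final class LimitedWriter extends Writer {
        private final StringBuilder buffer = new StringBuilder();
        private final int limit;

        LimitedWriter(int limit) {
            if (limit < 0) {
                throw new IllegalArgumentException("limit must be >= 0");
            }
            this.limit = limit;
        }

        @Override
        public void write(char[] chars, int offset, int length) {
            int remaining = limit - buffer.length();
            if (remaining > 0) {
                buffer.append(chars, offset, Math.min(length, remaining));
            }
        }

        @Override
        public void flush() {
            // nothing to flush; characters are held in memory
        }

        @Override
        public void close() {
            // nothing to release
        }

        boolean limitReached() {
            return buffer.length() >= limit;
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    /** Copies text through a LimitedWriter and returns what was retained. */
    static String copyWithLimit(String text, int limit) {
        LimitedWriter writer = new LimitedWriter(limit);
        writer.write(text.toCharArray(), 0, text.length());
        return writer.toString();
    }

    public static void main(String[] args) {
        System.out.println(copyWithLimit("hello world", 5));
    }
}
```

A cap like this bounds memory use when extracting text from very large binaries, at the cost of indexing only the leading portion of the content.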
Copyright © 2008-2014 JBoss, a division of Red Hat. All Rights Reserved.