@since Oak 1.7.0
Work in progress. Not to be used on production setups
Oak 1.7 adds indexing tooling as part of the oak-run index command. The operations supported by this command are described below.
The index command can connect to different NodeStores via various connection options, which are documented separately. The examples below assume a setup consisting of SegmentNodeStore and FileDataStore. Use the connection options appropriate for your setup.
By default the tool writes its output files to the directory indexing-result, referred to below as the output directory.
Unless specified otherwise, all operations connect to the repository in read-only mode.
All the commands support the following common options
Also refer to the help output, available via the -h option, for other options.
java -jar oak-run*.jar index --fds-path=/path/to/datastore /path/to/segmentstore/ --index-info
Generates a report consisting of various stats related to indexes present in the given repository. The generated report is stored by default in <output dir>/index-info.txt
Supported for all index types
java -jar oak-run*.jar index --fds-path=/path/to/datastore /path/to/segmentstore/ --index-definitions
The --index-definitions operation dumps the index definitions in JSON format to the file <output dir>/index-definitions.json. The JSON file contains the index definitions keyed by index path.
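Because the dump is keyed by index path, it is easy to post-process. The following is a minimal sketch, assuming the file is plain JSON and that each definition carries a "type" property as in the sample definition later in this document. The function name paths_by_type and the sample data are illustrative and are not part of oak-run.

```python
import json
from collections import defaultdict


def paths_by_type(definitions_json: str) -> dict:
    """Group index paths by the value of their "type" property."""
    definitions = json.loads(definitions_json)
    grouped = defaultdict(list)
    for index_path, definition in definitions.items():
        # Definitions without a "type" property are collected under "unknown".
        grouped[definition.get("type", "unknown")].append(index_path)
    return {index_type: sorted(paths) for index_type, paths in grouped.items()}


# Illustrative sample; a real run would read <output dir>/index-definitions.json.
sample = json.dumps({
    "/oak:index/lucene": {"type": "lucene", "async": "async"},
    "/oak:index/uuid": {"type": "property"},
})
print(paths_by_type(sample))
```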
Supported for all index types
java -jar oak-run*.jar index --fds-path=/path/to/datastore /path/to/segmentstore/ --index-dump
The --index-dump operation dumps the index content to the output directory. The output directory contains one folder for each index. Each folder has a property file index-details.txt which contains the indexPath.
Supported for only Lucene indexes.
java -jar oak-run*.jar index --fds-path=/path/to/datastore /path/to/segmentstore/ --index-consistency-check
The --index-consistency-check operation performs a consistency check against the indexes. It supports 2 levels
It generates a report in <output dir>/index-consistency-check-report.txt
Supported for only Lucene indexes.
The reindex operation supports 2 modes of indexing
Supported for only Lucene indexes.
If the indexes being reindexed have fulltext indexing enabled, refer to Tika Setup for the steps to adapt the command to include Tika support for text extraction.
Out-of-band indexing has the following phases
If the index being reindexed involves fulltext indexing and the repository has binary content, it is recommended to perform text pre-extraction first. This ensures that the costly text extraction is done before the actual indexing, so that indexing does not perform text extraction in the critical path.
Go to CheckpointMBean and create a checkpoint with a long enough lifetime, for example 10 days. To do so, invoke CheckpointMBean#createCheckpoint with 864000000 as the lifetime argument.
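The value 864000000 corresponds to 10 days if the lifetime is expressed in milliseconds, which is the assumption made here. The sketch below only shows the arithmetic for choosing a different lifetime; the helper name checkpoint_lifetime_ms is hypothetical.

```python
from datetime import timedelta


def checkpoint_lifetime_ms(days: int) -> int:
    """Return the given number of days expressed in milliseconds.

    Assumes the checkpoint lifetime argument is in milliseconds.
    """
    # Integer division of two timedeltas yields an exact int.
    return timedelta(days=days) // timedelta(milliseconds=1)


print(checkpoint_lifetime_ms(10))  # 864000000
```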
In this step we perform the actual indexing via oak-run, which connects to the repository in read-only mode.
java -jar oak-run*.jar index --reindex \
    --index-paths=/oak:index/indexName \
    --checkpoint=0fd2a388-de87-47d3-8f30-e86b1cf0a081 \
    --fds-path=/path/to/datastore /path/to/segmentstore/
The following options can be used here
If the index does not support fulltext indexing, then you can omit the BlobStore details.
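When the reindex step is scripted, the command can be assembled as an argument list so that the BlobStore option is added only when needed. This is a minimal sketch that uses only the options shown in the command above. The function name reindex_command and the jar file name are placeholders, and the sketch only builds the command without running it.

```python
import shlex
from typing import List, Optional


def reindex_command(jar: str, segment_store: str, index_path: str,
                    checkpoint: str, fds_path: Optional[str] = None) -> List[str]:
    """Build the argument list for an out-of-band reindex run."""
    cmd = ["java", "-jar", jar, "index", "--reindex",
           f"--index-paths={index_path}",
           f"--checkpoint={checkpoint}"]
    # The FileDataStore path is only needed for indexes with fulltext indexing.
    if fds_path is not None:
        cmd.append(f"--fds-path={fds_path}")
    cmd.append(segment_store)
    return cmd


# Placeholder values taken from the example command in this document.
cmd = reindex_command("oak-run.jar", "/path/to/segmentstore/",
                      "/oak:index/indexName",
                      "0fd2a388-de87-47d3-8f30-e86b1cf0a081",
                      fds_path="/path/to/datastore")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to subprocess.run as is, which avoids shell quoting issues with the index path.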
As a last step we need to import the index back into the repository. This can be done in one of the following ways
In this mode we import the index using oak-run
java -jar oak-run*.jar index --index-import --read-write \
    --index-import-dir=<index dir> \
    --fds-path=/path/to/datastore /path/to/segmentstore
Here “index dir” is the directory which contains the index files created in step #3. Check the logs from previous command for the directory path.
This mode should only be used when the repository is on Oak version 1.7+, as oak-run connects to the repository in read-write mode.
Online indexing automates some of the manual steps which are required for out-of-band indexing.
This mode should only be used when the repository is on Oak version 1.7+, as oak-run connects to the repository in read-write mode.
In this step we configure oak-run to connect to the repository in read-write mode and let it perform all other steps, i.e. checkpoint creation, indexing and import.
java -jar oak-run*.jar index --reindex --index-paths=/oak:index/lucene --read-write --fds-path=/path/to/datastore /path/to/segmentstore
@since Oak 1.7.5
The index tooling supports updating existing index definitions and adding new ones to existing setups. This is done by passing the path of a JSON file which contains the index definitions.
java -jar oak-run*.jar index --reindex --index-paths=/oak:index/newAssetIndex \
    --index-definitions-file=index-definitions.json \
    --fds-path=/path/to/datastore /path/to/segmentstore
Where index-definitions.json has the following structure
{
  "/oak:index/newAssetIndex": {
    "evaluatePathRestrictions": true,
    "compatVersion": 2,
    "type": "lucene",
    "async": "async",
    "jcr:primaryType": "oak:QueryIndexDefinition",
    "indexRules": {
      "jcr:primaryType": "nt:unstructured",
      "dam:Asset": {
        "jcr:primaryType": "nt:unstructured",
        "properties": {
          "jcr:primaryType": "nt:unstructured",
          "valid": {
            "name": "valid",
            "propertyIndex": true,
            "jcr:primaryType": "nt:unstructured",
            "notNullCheckEnabled": true
          },
          "mimetype": {
            "name": "mimetype",
            "analyzed": true,
            "jcr:primaryType": "nt:unstructured"
          }
        }
      }
    }
  }
}
Some points to note about this JSON file
* Each key of the top-level object is an index path.
* The value of each key is the complete index definition.
* If the index path is not present in the existing repository, a new index is created.
* For a new index, the parent path structure must already exist in the repository. For example, if a new index is being created at /content/en/oak:index/contentIndex, then the path up to /content/en/oak:index must already exist.
* If this option is used with online indexing, ensure that the oak-run version matches the Oak version used by the target repository.
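These rules can be checked before handing the file to oak-run. The following is a rough sketch, not an oak-run feature: it assumes that index paths are absolute (start with "/") and that a complete definition is always a JSON object. The function names check_definitions and required_parent are hypothetical. Whether the parent path actually exists still has to be verified against the repository itself.

```python
import json
import posixpath


def check_definitions(definitions_json: str) -> list:
    """Return a list of structural problems found in an index-definitions file."""
    definitions = json.loads(definitions_json)
    if not isinstance(definitions, dict):
        return ["top level must be a JSON object keyed by index path"]
    problems = []
    for index_path, definition in definitions.items():
        # Assumption: index paths are absolute repository paths.
        if not index_path.startswith("/"):
            problems.append(f"{index_path}: key is not an absolute index path")
        if not isinstance(definition, dict):
            problems.append(f"{index_path}: value is not a definition object")
    return problems


def required_parent(index_path: str) -> str:
    """Return the path that must already exist before a new index is created."""
    return posixpath.dirname(index_path)


print(required_parent("/content/en/oak:index/contentIndex"))  # /content/en/oak:index
print(check_definitions('{"/oak:index/newAssetIndex": {"type": "lucene"}}'))  # []
```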
You can also use the JSON file generated by Oakutils. It needs to be modified to conform to the above structure, i.e. the whole definition must be enclosed under the intended index path key.
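The wrapping step itself is mechanical. The sketch below assumes the generated file holds a single bare definition object. The function name wrap_definition and the sample definition are illustrative only.

```python
import json


def wrap_definition(index_path: str, definition: dict) -> str:
    """Enclose a bare index definition under its intended index path key."""
    return json.dumps({index_path: definition}, indent=2)


# Illustrative bare definition, not actual generator output.
bare = {
    "jcr:primaryType": "oak:QueryIndexDefinition",
    "type": "lucene",
    "async": "async",
    "compatVersion": 2,
}
print(wrap_definition("/oak:index/newAssetIndex", bare))
```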
In general, index definitions do not need any special encoding of values, as index definitions in Oak mostly use only the String, Long and Double types. However, if the index refers to binary config such as a Tika config, then the binary data needs to be encoded. Refer to the next section for more details.
This option is supported in both online and out-of-band indexing.
For more details refer to OAK-6471
Some of the standard types used in Oak, such as names and blobs, are not directly supported by JSON. Those need to be encoded in a specific format.
Below are the encoding rules
If the indexes being reindexed have fulltext indexing enabled, then you need to include the Tika library in the classpath. This is required even if pre-extraction is used, so that any new binary added after pre-extraction was done can still be indexed.
First download the tika-app jar from Tika downloads. You should be able to use version 1.15 with the Oak 1.7.4 jar.
Then modify the index command as shown below. The rest of the arguments remain the same as documented before.
java -cp oak-run.jar:tika-app-1.15.jar org.apache.jackrabbit.oak.run.Main index