MarkLogic Query and Search
Searching and Querying MarkLogic
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.
How does MarkLogic compare to a relational database?
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. MarkLogic collections group documents together in any way you want. Collections are like views in a relational database, which group together a filtered set of rows from one or more tables, but collections do not join and merge document content and views cannot arbitrarily include any row from any table. A document in MarkLogic can have a flat structure like a relational table, or it can contain complex nestings of data. Each nesting of data in a document represents another table in a relational database. For example, a single document in MarkLogic may represent many relational tables, and how they are nested represents their one-to-one, many-to-one, and many-to-many relationships.
Both MarkLogic and relational databases use indexes to resolve queries without having to scan through all documents or rows. MarkLogic indexes are designed to find any word, phrase, value, element, or element structure — no matter where it occurs in document structure, no matter how many times it occurs within a document, and no matter what type of document contains it. For example, you can search for and return a <title> element and MarkLogic will return the contents of all <title> elements it finds — even if <title> elements are in different types of documents. This is similar to doing a union query across multiple tables in relational databases.
MarkLogic indexes also let you search for words, phrases, values, elements, or element structures in very specific locations, such as in only certain types of documents or within certain document structures, or in specific collections or folders. For example, you can search for and return all <title> elements that are found in the header section of poetry documents in the Shakespeare and Tennyson collections.
MarkLogic is not designed for joining documents like a relational database. A relational join searches for matching values across two or more tables and then merges matching rows into a single row. This can be done in MarkLogic, but you have to write advanced code to iterate through query results to match values across different document types, and merge matching documents into a single document. Also since MarkLogic documents are often nested structures, you can't simply merge them, you have to copy nested content from one document into a specific nested location in another.
Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in all the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single row look ups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows with 4 IOs per row, the database will do the same amount of IO if it reads in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of rows in a table. On analytical data warehouses, full table scans are always preferred.
In addition, a relational database often directly scans through index data rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is index and table records are stored sequentially on disk. This often makes it faster to take advantage of the speed of sequential disk IO and directly load all index or table rows into RAM and process them in RAM; as opposed to walking a b-tree structure and doing multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.
Why do MarkLogic queries have to use Indexes?
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.
How does MarkLogic scale horizontally?
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)
A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following
- It validates the query request from the client
- It parses and optimizes the query
- It sends the query to all servers that have forests in the database
- Each server sends it to its forests
- Each forest sends it to each stand
- Each stand uses its indexes to find matching documents
- All stands and forests query the indexes in parallel
- Forests combine the matching document IDs from all their stands
- Forests sort the matching document IDs (in relevance score order or value order)
- When complete, each forest returns a sorted iterator to the initiating server
- When each forest has returned an iterator
- It walks the sorted iterators from all the forests and combines the results in sorted
- It retrieves matching documents from its Expanded Tree Cache or from the forests when documents are not cached
- forests retrieve documents from their Compressed Tree Cache or from disk when documents are not cached
- It filters returned documents to remove false positive matches
- It optionally transforms documents
- It creates new documents by extracting matching elements, changing structure, changing file format, etc.
- It returns all matching (optionally transformed) documents to the requester
- It saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests
The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the serialized evaluation process would compete with the parallel index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the Expanded Tree Cache to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their Compressed Tree Cache to reduce disk IO.
How do you optimize a MarkLogic cluster to run queries faster?
- E-nodes and d-nodes can be scaled independently to match the load.
- You can optimize an e-node by minimizing its Compressed Tree Cache and maximizing its Expanded Tree Cache. An e-node needs a big Expanded Tree Cache so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a Compressed Tree Cache because it doesn't have any data.
- You can optimize a d-node by doing the opposite of the e-node: maximize the Compressed Tree Cache and minimize the Expanded Tree Cache. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
- You can optimize a query by making it unfiltered.
- This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
- You can optimize a query by eliminating false positives.
- False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
- False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index. It indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element. MarkLogic compromises on indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. The only way to know if a document is an exact match, MarkLogic must open the document and verify it.
- For example, suppose a query searches for the following path:
books/poetry/poemThe structure index will match documents containing books/poetry and poetry/poem. The resulting documents are a probable match, but could contain false positives, such as
- You can eliminate false positive structures queries by giving elements unique names. They are unique no matter where in they are in the structure and they are more self-descriptive. Instead of naming an element "line" which can mean many things, you can use the precise name "poemTextLine".
- You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
- It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
- It is expensive to process documents on the initiating server and the requester.
- It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
- The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subract them from matching document IDs returned by other parts of the query.
- The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.
How does a MarkLogic query compare to map/reduce?
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a map process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then reduced at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.
Term, Range, and Semantic Indexes
MarkLogic indexes are not B-Tree indexes. MarkLogic uses term, range, and semantic indexes.
- A term index associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
- A range index is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
- A semantic index is similar to a range index but it is optimized to store triples (three pieces of data: subject, object, and predicate).
Term indexes are fast. Give the index a term, and with one look up, it returns all documents that contain the term. Term indexes are memory mapped files. They run in RAM unless you run out of RAM and then firmware in the CPU efficiently swaps them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to contain all indexes in memory.
Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.
When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.
When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at sort merging lists of numbers.
MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.
What specific indexes does MarkLogic have?
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents as if they contained only:
- Words: a flat set of words without structure, such as
["Mary", "had", "a", "little", "lamb"]
- Phrases: a flat set of phrases without structure, such as
["Mary had", "had a", "a little", "little lamb"]
- Elements: a flat set of elements without values, such as
["poem", "text", "line"]
- Element-values: a flat set of elements with string values, such as
"line": "Mary had a little lamb"
- Element-words: a flat set of elements with words, such as
"line": ["Mary", "had", "a", "little", "lamb"]
- Element-phrases: a flat set of elements with phrases, such as
"line": ["Mary had", "had a", "a little", "little lamb"]
- Element-value lexicon: a flat set of elements with typed values, such as
or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.
- XPath-structure-value lexicon: a flat set of hierarchical elements with typed values, such as
or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.
- Xpath-structure: hierarchical structures without values, such as
- Xpath-structure-words: hierarchical structures with words, such as
poem.text.line: ["Mary", "had", "a", "little", "lamb"]
- Xpath-structure-phrases: hierarchical structures with phrases, such as
poem.text.line: ["Mary had", "had a", "a little", "little lamb"]
- Xpath-structure-values: hierarchical structures with values, such as
poem.text.line: "Mary had a little lamb"
- RDF document links to other documents, such as
- RDF abstract links to abstract concepts, such as
- RDF data links to data, such as
MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.
Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.
Search Functions in MarkLogic
Executing a Search
- cts:search is the most important function because it executes a search and returns matching nodes.
- You must include an XPath expression to define which nodes to search within and to return.
- The search returns nodes, which can be documents, elements, attributes, or text. Thus, a search doesn't have to return entire documents. Depending on the scope of your XPath statement, it may return entire documents, or one matching node from each document, or many matching nodes from each document.
- You must include a cts query expression to filter the contents of nodes selected by the XPath expression. This allows you to search for specific content within specific nodes and return only those nodes that contain that content.
- MarkLogic limits search results to words, phrases, values, and structures in the query that occur within the nodes specified by the XPath expression.
- You must include an XPath expression to define which nodes to search within and to return.
- If you want to return entire documents and search within specific elements and attributes, then the XPath expression should specify the root element and the query should use the element and element-attribute queries to limit where the search occurs.
- If you want to return all the nodes selected by the XPath expression, you can use the empty query expression: cts:and-query(()).
- You may limit the search to specific forests. You may set options like unfiltered, faceted, unchecked, quality weight, relevance scoring method, and relevance trace.
- Use Cases:
- A search can return entire documents whose contents match the XPath and query expressions.
- A search can extract and return titles from documents as long as the contents of the titles match the XPath and query expressions.
- A search can extract and return all the paragraphs from documents as long as the contents of the paragraphs match the XPath and query expressions.
- There is no functional difference between searching entire documents and elements because an element has the same structure as document. Both may contain complex nested elements or both may simply contain a value. In both cases, MarkLogic retrieves entire documents. In the case of elements, it extracts the requested elements from each retrieved document.
- Searching attributes, text nodes, and childless elements is similar to searching documents and elements with children — except the cts query cannot look for nested structures.
- cts:contains is the second most important function because it returns true when a cts query finds at least one matching node in the nodes returned by an XPath expression (or a sequence of values). In other words, it returns true when at least one of the specified nodes contains the precise combination of words, phrases, values, structures, and/or triples in a cts query.
- NOTE: in the second example for cts:not-query, it says cts:contains forces the constraint to happen in the filtering stage of the query. Is this always true for cts:contains, or is it only true in the example?
Constructing a Query
You compose a query using cts query constructor functions. You then execute the query by passing it into cts:search, cts:contains, search:search, and lexicon functions. MarkLogic has 30+ query constructors. They all end in "-query".
Composite Query Constructors
The composite query constructors build up new queries from other queries, and queries can be nested within queries.
- cts:and-query returns a query that intersects two or more sub-queries.
- You may optionally specify an "ordered" option that requires document contents to match sub-queries in the order they are listed; for example if the sub-queries are "To be" and "or not to be", then the query will match documents that have the phrase "To be" before "or not to be". The default is "unordered", which allows each sub-query to match the content in any order that it occurs in the document.
- If you pass in the empty sequence, such as cts:and-query(()), then it will match every document in the database.
- cts:or-query returns a query that unions two or more sub-queries.
- cts:not-query returns a query that subtracts one or more sub-queries.
- It works by taking the list of all documents in the database and subtracting the documents returned from its sub-queries. It returns all remaining documents. It does not filter documents returned from its sub-queries because filtering happens after all index resolution has completed, and cts:not-query executes during index resolution. Thus, only if its sub-queries return accurate results (with no false-positives), then the results of cts:not-query are accurate. Thus, it only produces accurate results when all its sub-queries are resolvable completely from indexes with no filtering required.
- Warning: cts:not-query can be inaccurate. It can be missing some documents. This happens when one or more of its sub-queries require filtering (i.e. can't be resolved accurately using only indexes). When sub-queries require filtering, they may return false-positive matches. A false-positive match occurs when the indexes cannot determine for sure if a document matches; so MarkLogic includes the document just in case. As the last step in a query, MarkLogic opens all potentially matching documents and filters out false-positive matches. Thus, false-positives are not a match, they are a failed potential match. False positives can be prevented by writing sub-queries that are resolved completely by the indexes so that there is no need for filtering.
- Because cts:not-query runs during index resolution (before filtering occurs), it subtracts out any false-positive matches returned by its sub-queries. False-positives should not be subtracted out. But MarkLogic doesn't know which documents are false-positives until filtering, and it can't do filtering during cts:not-query because it is in the middle of doing index resolution. Thus, MarkLogic subtracts out false-positives and this causes cts:not-query to return too few documents. If we prevent sub-queries from returning false-positives, cts:not-query returns accurate results.
Training Resources for MarkLogic Search
- "Grokking the cts API" is a great overview of MarkLogic's search and query capabilities.