TechWiki - User contributions [en]

What is MarkLogic?

2014-11-24T16:15:06Z

Bowersmt: Updated text


== Describing MarkLogic ==
[http://www.marklogic.com/what-is-marklogic/ MarkLogic] is a [http://www.marklogic.com/what-is-marklogic/inside-marklogic/ '''REST application server''', '''document database''', and '''search engine''']. It is REST through and through. It is built specifically for hypertext documents, links, metadata, URIs, MIME types, and HTTP. It is schema-agnostic because it is automatically aware of the independent structure of each of its JSON and XML documents. It is search-centric because it can search for ''any combination'' of words, values, structures, and links within and across documents. It scales horizontally to hundreds of servers within and between data centers while maintaining ACID-compliant transactions.

MarkLogic has the following [http://www.marklogic.com/what-is-marklogic/enterprise-nosql/ enterprise features]:
* [http://www.marklogic.com/resources/java-developers-guide/ Java APIs]
* [http://en.wikipedia.org/wiki/Node.js Node.js APIs]
* [http://developer.marklogic.com/learn/2009-07-search-api-walkthrough Search] and [http://developer.marklogic.com/learn/arch/search-and-indexing Query]
* [http://www.marklogic.com/blog/can-you-pass-the-acid-test/ ACID Transactions]
* [http://www.marklogic.com/resources/marklogic-high-availability-and-disaster-recovery/resource_download/datasheets/ High Availability] and [http://docs.marklogic.com/guide/database-replication/dbrep_intro#chapter Disaster Recovery]
* [http://www.marklogic.com/resources/marklogic-flexible-replication/resource_download/datasheets/ Replication within and across data centers]
* [http://docs.marklogic.com/guide/admin/security#chapter Government-grade Security]
* [https://docs.marklogic.com/guide/cluster.pdf Scalability] and [https://docs.marklogic.com/guide/admin/database-rebalancing#chapter Elasticity]
* [https://docs.marklogic.com/guide/ec2/managing#chapter On-premise or Cloud Deployment (especially AWS)]
* [http://www.marklogic.com/resources/marklogic-and-hadoop/resource_download/datasheets/ Hadoop for Storage and Compute]
* [http://www.marklogic.com/resources/marklogic-semantics-mlw14/ Semantics]

== What is REST? ==
REST is an architectural style that uses a uniform resource identifier ('''URI''') and a web protocol ('''HTTP/HTTPS''') to request and transfer a representation ('''MIME media type''') of the state of a resource ('''document''') at a point in time from a server to a client.

REST was coined and defined by Roy Fielding in his
[http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm dissertation, ''Architectural Styles and the Design of Network-based Software Architectures'']. REST standardizes and documents the patterns used in the world wide web: self-documenting hypertext with data and metadata (HTML), stateless request/response communication protocol (HTTP/HTTPS), resource locators (URL), multiple representations per URL (MIME types), and downloading code on demand to process resources (JavaScript, CSS, etc.).

REST consists of three main concepts:
* (RE) Representation
* (S) State
* (T) Transfer

== Why is MarkLogic RESTful? ==

=== '''Representation and MarkLogic''' ===
* A representation is a document that represents a resource. It has three requirements: MIME media type, URI, and hypertext.

* '''MIME type:''' Each resource can be represented by one or more types of documents. The MIME media type defines the representation a client requests. A client may request the resource to be represented as JSON, XML, HTML, PNG, PDF, etc.
**'''MarkLogic''' stores any type of document and assigns the appropriate [http://developer.marklogic.com/blog/document-formats-part2 MIME type] to it. It can fully index, query, search, and process JSON, XML, and RDF documents. It meets the REST requirement of being able to transform JSON, XML, and RDF documents into any other MIME type. It knows how to execute JavaScript, XQuery, XSLT, and SPARQL documents. It knows how [http://docs.marklogic.com/guide/cpf/default#chapter to transform into XHTML] the content, formatting, and structure of Microsoft Word, PowerPoint, Excel, textual PDF, DocBook, and CSS documents. (It can also use Microsoft Office to create, edit, and manage content in MarkLogic.) It knows how to [https://docs.marklogic.com/guide/search-dev/binary-document-metadata extract metadata and text from over 138 types of binary documents], such as raster images, vector images, videos, archive files, database files, encoded emails, presentations, spreadsheets, word-processing documents, and text formats.
**Few NoSQL databases use MIME types to identify the media type of each document. Most NoSQL databases support only one type of data and it is usually proprietary: columnar, BSON, binary, etc. Most cannot transform from one MIME type to another.

* '''URI:''' A resource is identified by a globally unique identifier (URI). '''MarkLogic''' identifies each document with a [http://developer.marklogic.com/try/rest/page2 unique URI]. A document in MarkLogic is like a row in a table in a schema in a relational database. A URI is liberating. It provides random access to any resource anywhere. It is like being able to retrieve any row in any table in any schema of a relational database without having to know what table and schema the row is stored in.
**Navigating a URL hierarchy is fundamental to REST. MarkLogic understands the hierarchy within a URI, which is represented by slashes ("/"), as in https://www.lds.org/scriptures/ot/gen/1. MarkLogic treats the items between the slashes as folders, so the URI of each document automatically places it in a folder hierarchy. In the example URI above, the document for Genesis chapter 1 is located in the Genesis folder, which is located in the Old Testament folder, which is located in the Scriptures folder on lds.org. '''MarkLogic''' indexes the documents in each folder and its subfolders. This makes it fast and easy to retrieve any or all documents in any folder and/or its subfolders.
**Few NoSQL databases use the URI as the primary key for their documents or data. They also don't index the URI hierarchically to filter documents by folder and subfolder.
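
The folder behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MarkLogic's actual index; the function names and the sample URIs other than the Genesis one are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def ancestor_folders(uri):
    """Return every folder implied by the slashes in a URI path, shallowest first."""
    parts = [p for p in urlparse(uri).path.split("/") if p]
    # The last part is the document itself; everything before it is a folder.
    return ["/" + "/".join(parts[:i]) + "/" for i in range(1, len(parts))]


def build_folder_index(uris):
    """Map each folder to the document URIs in it or in any of its subfolders."""
    index = defaultdict(set)
    for uri in uris:
        for folder in ancestor_folders(uri):
            index[folder].add(uri)
    return index


uris = [
    "https://www.lds.org/scriptures/ot/gen/1",
    "https://www.lds.org/scriptures/ot/gen/2",
    "https://www.lds.org/scriptures/ot/ex/1",
    "https://www.lds.org/scriptures/nt/matt/1",
]
folder_index = build_folder_index(uris)
genesis_docs = sorted(folder_index["/scriptures/ot/gen/"])
old_testament_docs = sorted(folder_index["/scriptures/ot/"])
```

Because every document is registered under each of its ancestor folders, retrieving a folder together with its subfolders is a single lookup.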

* '''Hypertext:''' A document should contain data that represents the resource. The data should be '''human readable and self-documenting''', like JSON, XML, RDF, and HTML. It should be '''linked data''' (i.e. the "hyper" in "hypertext"). A document should contain '''metadata links''' about the resource, such as RDF. It should contain '''action links''' to define what further actions can be done with the resource. It should contain '''related links''' to related resources, such as images, audio, video, related documents, etc. Each related link should define what actions can be done with the referenced resource, such as downloading it, displaying a link to it, or executing a command against it.
**'''Hypertext''' or '''hypermedia''' documents must have all these features. Hyperlinks are what the "hyper" in hypertext and hypermedia refers to. You can't have REST without ''metadata links'' to define what the data means, ''action links'' to know how to work with the resource, and ''related links'' to connect resources. It should all be human readable and self-documenting so a developer does not have to read documentation to know how to interact with a REST web service and its documents.
** '''MarkLogic''' meets all the requirements for hypertext representation. It is designed around MIME types, URIs, and Linked data. It stores documents with their MIME types as [http://www.w3schools.com/json/ JSON], [http://www.w3schools.com/xml/ XML], [http://www.w3schools.com/webservices/ws_rdf_intro.asp RDF], [http://www.w3schools.com/html/ HTML], etc. These documents are human-readable and self-documenting, which MarkLogic leverages to recognize and index each document's data, data structure, metadata links, action links, and related links. This makes it easy to '''search''', '''query''', '''transform''', and '''deliver''' hypertext documents. MarkLogic can also store any type of binary document and deliver it as a related resource, such as an image, video, or audio. MarkLogic is designed to process simple links and RDF links using [http://en.wikipedia.org/wiki/SPARQL SPARQL] and [https://docs.marklogic.com/xinc XInclude]. MarkLogic can represent links in many formats: [https://docs.marklogic.com/xp XPointer], [https://docs.marklogic.com/guide/semantics/loading#id_97709 RDF/XML], [https://docs.marklogic.com/guide/semantics/loading#id_79194 RDF/JSON], [https://docs.marklogic.com/guide/semantics/loading#id_73211 Turtle], [https://docs.marklogic.com/guide/semantics/loading#id_70596 N-Triples], [https://docs.marklogic.com/guide/semantics/loading#id_61596 N-Quads], and [https://docs.marklogic.com/guide/semantics/loading#id_74485 TriG].
**No other NoSQL database natively indexes and fully processes all the document types required for REST hypertext: JSON, XML, RDF, HTML, CSS, and JavaScript.

=== '''State and MarkLogic''' ===
* State in REST exists on clients and in documents, data, and state machine data stored in servers. It does not exist in the communications protocol or in session data. It often exists in caches.

* '''Server:''' All information needed to process a request must be presented in the request and processed against documents in the database. State must only be in the request and in database documents: state cannot be anywhere else, such as in a session cache. The documents in the server define the state of the server. A REST server should explicitly create a state machine that defines the acceptable actions that can be transacted against documents in specific contexts.
**A REST transaction occurs at a '''point in time'''. The state of the data in the request is unchanging, but the state of the documents and state machines in the database are often changing. Since request state and database state are both required to process the request and since shifting state creates unpredictable results, a REST transaction should run at a point in time with unchanging state. Only an ACID-compliant database can ensure consistent state because it isolates each transaction from every other transaction. The only time REST does not need an ACID-compliant database is when database documents do not change, database state machines do not change, or when clients can live with the resultant level of unreliable and unpredictable results.
**'''MarkLogic''' meets all the requirements for REST state. Its web services are stateless: there is no session cache. It is ACID compliant. It ensures each transaction occurs at a point in time and is isolated from all other transactions. This ensures consistent processing during a transaction -- even across billions of documents. MarkLogic is an [http://en.wikipedia.org/wiki/Multiversion_concurrency_control MVCC] database which provides transaction isolation without slowing the performance of reads -- even when documents being read are being modified simultaneously by other transactions. (Also, like any other ACID database, when multiple updates and deletes compete for the same documents, they will impact each other's performance because change has to be serialized.)
** Most other NoSQL databases are not ACID compliant. They are only suitable for REST services when their documents or data do not change or when the rate of change is slow enough or dispersed enough that it creates an acceptable level of unreliability and unpredictability.
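
The point-in-time idea behind multiversion concurrency control can be illustrated with a toy Python store. This is a minimal sketch of the general MVCC technique, not MarkLogic's implementation; the class, its methods, and the sample document are hypothetical.

```python
class VersionedStore:
    """Toy multiversion store: every write adds a version stamped with a commit timestamp."""

    def __init__(self):
        self._versions = {}  # uri -> list of (commit_timestamp, content), oldest first
        self._clock = 0      # last commit timestamp issued

    def write(self, uri, content):
        """Commit a new version of a document and return its commit timestamp."""
        self._clock += 1
        self._versions.setdefault(uri, []).append((self._clock, content))
        return self._clock

    def snapshot(self):
        """Return the timestamp a new read transaction should run at."""
        return self._clock

    def read(self, uri, timestamp):
        """Return the newest version committed at or before the timestamp, or None."""
        visible = [c for ts, c in self._versions.get(uri, []) if ts <= timestamp]
        return visible[-1] if visible else None


store = VersionedStore()
store.write("/docs/a.json", {"status": "draft"})
read_ts = store.snapshot()                        # a reader starts here
store.write("/docs/a.json", {"status": "final"})  # another transaction commits later
seen_by_reader = store.read("/docs/a.json", read_ts)
seen_by_new_reader = store.read("/docs/a.json", store.snapshot())
```

The first reader keeps seeing the draft version for as long as it reads at its own timestamp, and it never blocks the writer; a reader that starts later sees the final version.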

* '''Client:''' A REST client, such as a web browser or spider, locally maintains transactional state, such as what to do in response to documents and result codes that are returned from server transactions. An application exists only in the client -- not in the server (although, a server may deliver application code to a client, such as when a web server downloads HTML, CSS, and JavaScript to a browser). Client application code decides when to execute web service calls and it ties the results together to accomplish its purpose. This allows multiple authorized applications to reuse web services for a variety of purposes.
**The server helps the client know what web service calls are available by providing action links with each response. Action links are contextual and the context is based on the application account, user account, database documents, links within and between documents, the server state machine, etc. Through action links, the server can inform a client application what web service calls are permissible in any given context.
**'''MarkLogic''' meets the needs of client applications through its built-in ability to process and send action links to clients based on context. MarkLogic supports RDF triples and SPARQL, which enables context to be defined across applications, users, document state, links between documents, server state machine, etc. MarkLogic processes triples very quickly, which enables context to scale to billions of documents, millions of users, etc.
**'''MarkLogic''' also uses application and user permissions to filter which documents are returned to clients. MarkLogic does this automatically and transparently by adding security filtering constraints into every search and query. This ensures no account can access unauthorized documents. This is fast because all security permissions are built into MarkLogic's indexes, which allows document-level security to scale across billions of documents.
** Most other NoSQL databases do not provide government-grade, document-level security and they also do not support RDF triples and SPARQL.

Because MarkLogic can provide both the web service and database in one server, it is easy to use the state of the documents and the

=== '''Transfer and MarkLogic''' ===
* '''Transfer''' in REST is a communication protocol that enables a client to send a hypertext request to a server and receive back a hypertext response. The transfer must be a stateless request/response protocol. It must have human-readable, self-descriptive headers. The headers must contain metadata about the request, such as the URI of the requested resource, the MIME type of the resource, and the action to perform on the resource (such as GET, PUT, POST, PATCH, or DELETE). [http://www.w3schools.com/tags/ref_httpmethods.asp '''HTTP'''] (Hypertext Transfer Protocol) and '''HTTPS''' (secure HTTP) are the protocols REST was defined around, as the "hypertext" in their names suggests.
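
The request metadata described above (resource URI, MIME type, and action) can be seen by building a request object in Python without sending it. This is a sketch under assumptions: the <tt>/v1/documents</tt> path and its <tt>uri</tt> parameter follow the MarkLogic REST Client API, while the host, port, and document URI are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions: a MarkLogic REST app server on localhost:8000 and a document
# stored at the URI below. Both are placeholders for illustration.
base_url = "http://localhost:8000/v1/documents"
document_uri = "/scriptures/ot/gen/1.json"

request = Request(
    base_url + "?" + urlencode({"uri": document_uri}),
    headers={"Accept": "application/json"},  # MIME type of the representation wanted
    method="GET",                            # action to perform on the resource
)

# Nothing is sent here; the request object only shows what would go on the wire.
request_line = f"{request.get_method()} {request.selector} HTTP/1.1"
accept_header = request.get_header("Accept")
```

Everything the server needs is carried in the request itself, which is what makes the exchange stateless.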

* All '''MarkLogic''' communication is through HTTP and HTTPS REST services (except for its SQL JDBC feature). This includes all internal cluster communication. MarkLogic provides out-of-the-box REST interfaces for manipulating resources (insert, update, delete, query, search, transform, etc.) and administering MarkLogic (REST app servers, databases, indexes, clusters, etc.). MarkLogic makes it very easy to create custom REST services because everything in MarkLogic is built around REST and because it provides simple and powerful application server APIs.

== MarkLogic Links ==
*[[MarkLogic]]
*[[MarkLogic Query and Search]]
*[[MarkLogic Training Resources]]
*[[Installing MarkLogic]]

[[Category: MarkLogic]]

MarkLogic Query and Search

2014-11-20T04:38:17Z

Bowersmt: Fixed category

== Searching and Querying MarkLogic ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. MarkLogic collections group documents together in any way you want. Collections are like views in a relational database, which group together a filtered set of rows from one or more tables, but collections do not join and merge document content and views cannot arbitrarily include any row from any table. A document in MarkLogic can have a flat structure like a relational table, or it can contain complex nestings of data. Each nesting of data in a document represents another table in a relational database. For example, a single document in MarkLogic may represent many relational tables, and how they are nested represents their one-to-one, many-to-one, and many-to-many relationships.

Both MarkLogic and relational databases use indexes to resolve queries without having to scan through all documents or rows. MarkLogic indexes are designed to find any word, phrase, value, element, or element structure — no matter where it occurs in document structure, no matter how many times it occurs within a document, and no matter what type of document contains it. For example, you can search for and return a <tt><title></tt> element and MarkLogic will return the contents of all <tt><title></tt> elements it finds — even if <tt><title></tt> elements are in different types of documents. This is similar to doing a union query across multiple tables in relational databases.

MarkLogic indexes also let you search for words, phrases, values, elements, or element structures in very specific locations, such as in only certain types of documents or within certain document structures, or in specific collections or folders. For example, you can search for and return all <tt><title></tt> elements that are found in the header section of poetry documents in the Shakespeare and Tennyson collections.

MarkLogic is not designed for joining documents like a relational database. A relational join searches for matching values across two or more tables and then merges matching rows into a single row. This can be done in MarkLogic, but you have to write advanced code to iterate through query results, match values across different document types, and merge matching documents into a single document. Also, since MarkLogic documents are often nested structures, you can't simply merge them; you have to copy nested content from one document into a specific nested location in another.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single-row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows with 4 IOs per row, the database will do the same amount of IO if it reads in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of rows in a table. On analytical data warehouses, full table scans are always preferred.
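
The 25% rule of thumb follows directly from the IO counts given above. A small Python calculation makes the break-even point explicit; the function names and the table size are made up for illustration, and the model ignores caching and the speed difference between sequential and random IO.

```python
def index_lookup_ios(total_rows, fraction_matched, ios_per_row=4):
    """Random IOs needed to fetch the matching rows through a B-Tree index."""
    return total_rows * fraction_matched * ios_per_row


def full_scan_ios(total_rows):
    """IOs needed to read every row once (1 IO per row, as in the text)."""
    return total_rows


def break_even_fraction(ios_per_row=4):
    """Fraction of rows at which index lookups cost as much IO as a full scan."""
    return 1 / ios_per_row


rows = 1_000_000
at_25_percent = index_lookup_ios(rows, 0.25)  # same IO count as a full scan
at_1_percent = index_lookup_ios(rows, 0.01)   # far cheaper than a full scan
```

With 5 IOs per row instead of 4, the break-even point drops to 20% of the rows, so the more expensive each index lookup is, the sooner a full scan wins.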

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This works because index and table records are stored sequentially on disk. It is often faster to take advantage of the speed of sequential disk IO, load all index or table rows directly into RAM, and process them there than to walk a B-Tree structure and do multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:

# It validates the query request from the client
# It parses and optimizes the query
# It sends the query to all servers that have forests in the database
## Each server sends it to its forests
## Each forest sends it to each stand
## Each stand uses its indexes to find matching documents
### All stands and forests query the indexes in parallel
## Forests combine the matching document IDs from all their stands
## Forests sort the matching document IDs (in relevance score order or value order)
## When complete, each forest returns a sorted iterator to the initiating server
# When each forest has returned an iterator, it walks the sorted iterators from all the forests and combines the results in sorted order
# It retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
## Forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
# It filters returned documents to remove false positive matches
# It optionally transforms documents
## It creates new documents by extracting matching elements, changing structure, changing file format, etc.
# It returns all matching (optionally transformed) documents to the requester
# It saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests
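
The fan-out and merge steps above can be sketched in Python. This is a simplified model, not MarkLogic code: the forest and stand data are invented, the forests are queried one after another rather than in parallel, and results are ordered by document ID rather than by relevance.

```python
import heapq

# Hypothetical cluster: forest name -> list of stands; each stand maps a term
# to the document IDs that contain it (a tiny term index).
forests = {
    "forest-1": [{"lamb": [3, 9]}, {"lamb": [1], "mary": [1, 4]}],
    "forest-2": [{"lamb": [2, 8], "mary": [8]}],
    "forest-3": [{"mary": [5]}],
}


def query_forest(stands, term):
    """Each stand answers from its own index; the forest combines and sorts the IDs."""
    matches = []
    for stand_index in stands:
        matches.extend(stand_index.get(term, []))
    return iter(sorted(matches))  # the forest hands back a sorted iterator


def run_query(term):
    """Initiating server: fan the query out, then merge the sorted iterators."""
    iterators = [query_forest(stands, term) for stands in forests.values()]
    return list(heapq.merge(*iterators))


lamb_ids = run_query("lamb")
mary_ids = run_query("mary")
```

Because every forest returns an already sorted iterator, the initiating server only has to do a cheap k-way merge instead of re-sorting all results.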

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized ''evaluation process would compete with the ''parallel ''index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache ''to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
* E-nodes and d-nodes can be scaled independently to match the load.
* You can optimize an e-node by minimizing its ''Compressed Tree Cache ''and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache ''so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache ''because it doesn't have any data.
* You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache ''and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
* You can optimize a query by making it unfiltered.
** This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
* You can optimize a query by eliminating false positives.
** False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
** False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index. It indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element. MarkLogic compromises by indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
** For example, suppose a query searches for the following path: <tt>books/poetry/poem</tt>. The structure index will match documents containing <tt>books/poetry</tt> and <tt>poetry/poem</tt>. The resulting documents are probable matches, but they could include false positives, such as a document containing <tt>books/poetry/fantasy/poetry/poem</tt>.
** You can eliminate false positives in structure queries by giving elements unique names. Such names are unambiguous no matter where they occur in the structure, and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name <tt>"poemTextLine"</tt>.
* You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
** It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
** It is expensive to process documents on the initiating server and the requester.
** It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
** The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
** The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.
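
The false-positive example in the list above (<tt>books/poetry/poem</tt>) can be reproduced with a small Python model of a parent/child structure index. It is a sketch of the idea only; representing a document as a list of element paths is an assumption made for illustration, and the function names are hypothetical.

```python
def parent_child_pairs(path):
    """Split an element path into the parent/child pairs a structure index would store."""
    steps = path.split("/")
    return set(zip(steps, steps[1:]))


def index_match(query_path, document_paths):
    """Unfiltered test: every parent/child pair in the query occurs somewhere in the document."""
    indexed = set()
    for path in document_paths:
        indexed |= parent_child_pairs(path)
    return parent_child_pairs(query_path) <= indexed


def filtered_match(query_path, document_paths):
    """Filtering step: open the document and check that the exact path really occurs."""
    query_steps = query_path.split("/")
    n = len(query_steps)
    for path in document_paths:
        steps = path.split("/")
        if any(steps[i:i + n] == query_steps for i in range(len(steps) - n + 1)):
            return True
    return False


query = "books/poetry/poem"
true_match = ["books/poetry/poem"]
false_positive = ["books/poetry/fantasy/poetry/poem"]
```

For the second document, <tt>index_match</tt> reports a match because both pairs are present, while <tt>filtered_match</tt> rejects it. That gap is the work the filtering step does on the initiating server.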

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map ''process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced ''at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses ''term'', ''range'', and ''semantic ''indexes.

* A '''term index '''associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
* A '''range index '''is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
* A '''semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).
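
A minimal Python sketch of the first two index types described above. The sample documents and the <tt>lineSequence</tt> values are invented, and real MarkLogic indexes are far more compact than Python dictionaries.

```python
from collections import defaultdict

documents = {
    1: "Mary had a little lamb",
    2: "The lamb was sure to go",
    3: "Mary went to school",
}

# Term index: term -> IDs of the documents that contain it.
term_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        term_index[word].add(doc_id)

# Range index sketch over a hypothetical "lineSequence" value per document:
# one mapping from value to document IDs, one from document ID to values.
line_sequence = {1: 1, 2: 7, 3: 4}
value_to_docs = defaultdict(set)
doc_to_values = defaultdict(set)
for doc_id, value in line_sequence.items():
    value_to_docs[value].add(doc_id)
    doc_to_values[doc_id].add(value)


def docs_in_range(low, high):
    """Document IDs whose indexed value lies in [low, high]."""
    return {d for v, docs in value_to_docs.items() if low <= v <= high for d in docs}


docs_with_lamb = term_index["lamb"]
docs_1_to_5 = docs_in_range(1, 5)
```

The two-way mapping is what lets a range index answer both "which documents have this value?" and "which values does this document have?".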

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory-mapped files. They run in RAM unless you run out of RAM, in which case the operating system pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to contain all indexes in memory.

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at sort merging lists of numbers.

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

=== What specific indexes does MarkLogic have? ===
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for ''documents ''as if they contained only:

* '''Words''': a flat set of words without structure, such as <tt>["Mary", "had", "a", "little", "lamb"]</tt>
* '''Phrases''': a flat set of phrases without structure, such as <tt>["Mary had", "had a", "a little", "little lamb"]</tt>
* '''Elements''': a flat set of elements without values, such as <tt>["poem", "text", "line"]</tt>
* '''Element-values''': a flat set of elements with string values, such as <tt>"line": "Mary had a little lamb"</tt>
* '''Element-words''': a flat set of elements with words, such as <tt>"line": ["Mary", "had", "a", "little", "lamb"]</tt>
* '''Element-phrases''': a flat set of elements with phrases, such as <tt>"line": ["Mary had", "had a", "a little", "little lamb"]</tt>
* '''Element-value lexicon''': a flat set of elements with typed values, such as <tt>"lineSequence": 3</tt>, or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.
* '''XPath-structure-value lexicon''': a flat set of hierarchical elements with typed values, such as <tt>poem.text.line.lineSequence: 3</tt>, or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.
* '''XPath-structure''': hierarchical structures without values, such as <tt>poem.text.line</tt>
* '''XPath-structure-words''': hierarchical structures with words, such as <tt>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</tt>
* '''XPath-structure-phrases''': hierarchical structures with phrases, such as <tt>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</tt>
* '''XPath-structure-values''': hierarchical structures with values, such as <tt>poem.text.line: "Mary had a little lamb"</tt>
* '''RDF document links''' to other documents, such as <tt>{"subject": "[http://example.org/documents/thisDoc http://example.org/documents/thisDoc]", "predicate": "[http://example.org/predicates/relatesSomehowTo http://example.org/predicates/relatesSomehowTo]", "object": "[http://example.org/documents/thatDoc http://example.org/documents/thatDoc]"}</tt>
* '''RDF abstract links to abstract concepts''', such as <tt>{"subject": "[http://example.org/documents/thisDoc http://example.org/documents/thisDoc]", "predicate": "[http://example.org/predicates/relatesSomehowTo http://example.org/predicates/relatesSomehowTo]", "object": "[http://example.org/documents/someConceptWithNoDocumentAtTheURI http://example.org/documents/someConceptWithNoDocumentAtTheURI]"}</tt>
* '''RDF data links to data''', such as <tt>{"subject": "[http://example.org/documents/thisDoc http://example.org/documents/thisDoc]", "predicate": "[http://example.org/predicates/ageInYears http://example.org/predicates/ageInYears]", "object": 12}</tt>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.
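As a rough illustration of the first few "views" in the list above, the following Python sketch derives the word, phrase, element-word, and element-phrase views from one line of text. The helper names are hypothetical, the split on whitespace is a simplifying assumption, and two-word phrases are used only because that is what the examples above show; MarkLogic's actual tokenization and index layout are not described here.

```python
def words(text):
    """Word view: a flat list of words with no structure."""
    return text.split()

def phrases(text):
    """Phrase view: every pair of adjacent words."""
    w = words(text)
    return [f"{a} {b}" for a, b in zip(w, w[1:])]

line = "Mary had a little lamb"

word_view = words(line)
phrase_view = phrases(line)
# Element-scoped views keep the element name alongside the same content.
element_word_view = {"line": word_view}
element_phrase_view = {"line": phrase_view}

print(word_view)
print(phrase_view)
```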

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.

== Search Functions in MarkLogic ==
=== Executing a Search ===
* [http://docs.marklogic.com/cts:search cts:search] is the most important function because it executes a search and returns matching nodes.
** '''You must include an XPath expression''' to define which nodes to search within and to return.
*** The search returns nodes, which can be documents, elements, attributes, or text. Thus, a search doesn't have to return entire documents. Depending on the scope of your XPath statement, it may return entire documents, or one matching node from each document, or many matching nodes from each document.
** '''You must include a cts query expression''' to filter the contents of nodes selected by the XPath expression. This allows you to search for specific content within specific nodes and return only those nodes that contain that content.
*** MarkLogic limits search results to words, phrases, values, and structures in the query that occur within the nodes specified by the XPath expression.
*** If you want to return entire documents and search within specific elements and attributes, then the XPath expression should specify the root element and the query should use the element and element-attribute queries to limit where the search occurs.
*** If you want to return all the nodes selected by the XPath expression, you can use the empty query expression: <tt>cts:and-query(())</tt>.
** You may limit the search to specific forests. You may set options like unfiltered, faceted, unchecked, quality weight, relevance scoring method, and relevance trace.
** Use Cases:
*** A search can return entire documents whose contents match the XPath and query expressions.
*** A search can extract and return titles from documents as long as the contents of the titles match the XPath and query expressions.
*** A search can extract and return all the paragraphs from documents as long as the contents of the paragraphs match the XPath and query expressions.
** Notes:
*** There is no functional difference between searching entire documents and searching elements because an element has the same structure as a document. Both may contain complex nested elements or both may simply contain a value. In both cases, MarkLogic retrieves entire documents. In the case of elements, it extracts the requested elements from each retrieved document.
*** Searching attributes, text nodes, and childless elements is similar to searching documents and elements with children, except that the cts query cannot look for nested structures.
* [http://docs.marklogic.com/cts:contains cts:contains] is the second most important function because it returns true when a cts query finds at least one matching node in the nodes returned by an XPath expression (or a sequence of values). In other words, it returns true when at least one of the specified nodes ''contains'' the precise combination of words, phrases, values, structures, and/or triples in a cts query.
* NOTE: in the second example for [http://docs.marklogic.com/cts:not-query cts:not-query], it says cts:contains forces the constraint to happen in the filtering stage of the query. Is this always true for cts:contains, or is it only true in the example?
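The split of responsibilities described above (a path expression selects which nodes to search and return, and a query filters those nodes by content) can be imitated in Python. This is a conceptual sketch only and does not use the cts API: <tt>search</tt> and <tt>contains</tt> are hypothetical stand-ins, ElementTree's limited path syntax stands in for XPath, the "query" is a single case-sensitive word, and no indexes are involved.

```python
import xml.etree.ElementTree as ET

def search(documents, path, word):
    """Return the nodes selected by `path` whose text contains `word`."""
    hits = []
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        for node in root.findall(path):
            # Collect the text of the node and of all its descendants.
            tokens = " ".join(node.itertext()).split()
            if word in tokens:
                hits.append(node)
    return hits

def contains(documents, path, word):
    """True when at least one selected node contains `word`."""
    return len(search(documents, path, word)) > 0

DOCS = [
    "<poem><title>Mary's Pet</title>"
    "<line>Mary had a little lamb</line>"
    "<line>Its fleece was white</line></poem>",
    "<poem><title>The Star</title>"
    "<line>Twinkle twinkle little star</line></poem>",
]

whole_docs = search(DOCS, ".", "lamb")        # path "." selects each root element
matching_lines = search(DOCS, "line", "little")
print([node.tag for node in whole_docs])
print([node.text for node in matching_lines])
```

With the path <tt>.</tt> the sketch returns whole documents; with the path <tt>line</tt> it returns only the matching <tt>line</tt> elements, which mirrors the use cases listed above.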

=== Constructing a Query ===
You compose a query using cts query constructor functions. You then execute the query by passing it into <tt>cts:search, cts:contains, search:search</tt>, and lexicon functions. MarkLogic has 30+ query constructors. They all end in "-query".

==== Composite Query Constructors ====
The composite query constructors build up new queries from other queries, and queries can be nested within queries.

* [http://docs.marklogic.com/cts:and-query cts:and-query] returns a query that intersects two or more sub-queries.
** You may optionally specify an "ordered" option that requires document contents to match sub-queries in the order they are listed; for example if the sub-queries are "To be" and "or not to be", then the query will match documents that have the phrase "To be" before "or not to be". The default is "unordered", which allows each sub-query to match the content in any order that it occurs in the document.
** If you pass in the empty sequence, such as <tt>cts:and-query(())</tt>, then it will match every document in the database.
* [http://docs.marklogic.com/cts:or-query cts:or-query] returns a query that unions two or more sub-queries.
* [http://docs.marklogic.com/cts:not-query cts:not-query] returns a query that matches every document its sub-query does not match.
** It works by taking the list of all documents in the database and subtracting the documents returned by its sub-query. It returns all remaining documents. It does not filter the documents returned by its sub-query because filtering happens after all index resolution has completed, and <tt>cts:not-query</tt> executes during index resolution. Thus, its results are accurate only when its sub-query can be resolved completely from the indexes, with no filtering required and no false positives.
** '''Warning''': <tt>cts:not-query</tt> can be inaccurate: it can be missing some documents. This happens when its sub-query requires filtering (i.e. can't be resolved accurately using only indexes). A sub-query that requires filtering may return false-positive matches. A false-positive match occurs when the indexes cannot determine for sure whether a document matches, so MarkLogic includes the document just in case. As the last step in a query, MarkLogic opens all potentially matching documents and filters out false-positive matches. A false positive is therefore not a match; it is a potential match that failed verification. False positives can be prevented by writing sub-queries that are resolved completely by the indexes so that there is no need for filtering.
** Because <tt>cts:not-query</tt> runs during index resolution (before filtering occurs), it subtracts out any false-positive matches returned by its sub-query. False positives should not be subtracted out, but MarkLogic doesn't know which documents are false positives until filtering, and it can't filter during <tt>cts:not-query</tt> because it is in the middle of index resolution. Thus, MarkLogic subtracts out false positives, which causes <tt>cts:not-query</tt> to return too few documents. If the sub-query returns no false positives, <tt>cts:not-query</tt> returns accurate results.
* [http://docs.marklogic.com/cts:and-not-query cts:and-not-query]
* [http://docs.marklogic.com/cts:element-query cts:element-query]
* [http://docs.marklogic.com/cts:document-fragment-query cts:document-fragment-query]
* [http://docs.marklogic.com/cts:locks-query cts:locks-query]
* [http://docs.marklogic.com/cts:properties-query cts:properties-query]
* [http://docs.marklogic.com/cts:near-query cts:near-query]
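The <tt>cts:not-query</tt> warning in the list above can be made concrete with plain Python sets. This is a toy model, not MarkLogic code: the document IDs are invented, and <tt>not_query</tt> is a hypothetical helper that only mimics the subtraction performed during index resolution.

```python
def not_query(all_ids, sub_query_ids):
    """Index resolution for NOT: subtract whatever the sub-query returned."""
    return all_ids - sub_query_ids

all_ids = {1, 2, 3, 4}
true_matches = {1}         # documents that really match the sub-query
index_candidates = {1, 2}  # what the indexes return; document 2 is a false positive

# What the NOT produces when the sub-query needs filtering:
from_indexes = not_query(all_ids, index_candidates)
# What it should have produced:
correct = not_query(all_ids, true_matches)

# Document 2 was subtracted before filtering could clear it,
# and later filtering cannot add it back.
missing = correct - from_indexes
print(from_indexes, correct, missing)
```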

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Category: MarkLogic]]

MarkLogic Query and Search

2014-11-20T04:37:15Z

Bowersmt:

MarkLogic Query and Search

2014-11-19T07:01:10Z

Bowersmt: Updated cts:not-query

[[MarkLogic|« Back to MarkLogic]]

== Searching and Querying MarkLogic ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. MarkLogic collections group documents together in any way you want. Collections are like views in a relational database, which group together a filtered set of rows from one or more tables, but collections do not join and merge document content and views cannot arbitrarily include any row from any table. A document in MarkLogic can have a flat structure like a relational table, or it can contain complex nestings of data. Each nesting of data in a document represents another table in a relational database. For example, a single document in MarkLogic may represent many relational tables, and how they are nested represents their one-to-one, many-to-one, and many-to-many relationships.

Both MarkLogic and relational databases use indexes to resolve queries without having to scan through all documents or rows. MarkLogic indexes are designed to find any word, phrase, value, element, or element structure, no matter where it occurs in document structure, no matter how many times it occurs within a document, and no matter what type of document contains it. For example, you can search for and return a <title> element and MarkLogic will return the contents of all <title> elements it finds, even if the <title> elements are in different types of documents. This is similar to doing a union query across multiple tables in relational databases.

MarkLogic indexes also let you search for words, phrases, values, elements, or element structures in very specific locations, such as in only certain types of documents or within certain document structures, or in specific collections or folders. For example, you can search for and return all <title> elements that are found in the header section of poetry documents in the Shakespeare and Tennyson collections.

MarkLogic is not designed for joining documents like a relational database. A relational join searches for matching values across two or more tables and then merges matching rows into a single row. This can be done in MarkLogic, but you have to write advanced code to iterate through query results to match values across different document types, and merge matching documents into a single document. Also since MarkLogic documents are often nested structures, you can't simply merge them, you have to copy nested content from one document into a specific nested location in another.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: 4-5 total IOs per row. B-Tree indexes work best for single-row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of the rows at 4 IOs per row, the database does the same amount of IO as it would by reading in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of the rows in a table. On analytical data warehouses, full table scans are generally preferred.
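The 25% rule of thumb follows from simple arithmetic, sketched below in Python. The function names and the table size are hypothetical; the costs (4 IOs per row through the index, 1 IO per row for a full scan) are the figures used in the paragraph above, and the sketch ignores the extra advantage of sequential over random IO.

```python
def index_lookup_ios(total_rows, fraction_accessed, ios_per_row=4):
    """IOs needed to fetch a fraction of the rows one at a time through a B-Tree."""
    return total_rows * fraction_accessed * ios_per_row

def full_scan_ios(total_rows):
    """IOs needed to read every row once, at 1 IO per row."""
    return total_rows

def break_even_fraction(ios_per_row=4):
    """Fraction of rows at which index lookups cost as much as a full scan."""
    return 1 / ios_per_row

total = 1_000_000  # hypothetical table size
print(break_even_fraction())          # fraction where the two costs meet
print(index_lookup_ios(total, 0.25))  # same IO count as the full scan
print(full_scan_ios(total))
```

Above the break-even fraction the index path costs more IO than simply reading everything, which is the reasoning behind preferring a full scan.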

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is because index and table records are stored sequentially on disk, which often makes it faster to take advantage of the speed of sequential disk IO, load all index or table rows directly into RAM, and process them there, as opposed to walking a B-Tree structure and doing multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:
#It validates the query request from the client
#It parses and optimizes the query
#It sends the query to all servers that have forests in the database
##Each server sends it to its forests
##Each forest sends it to each stand
##Each stand uses its indexes to find matching documents
###All stands and forests query the indexes in parallel
##Forests combine the matching document IDs from all their stands
##Forests sort the matching document IDs (in relevance score order or value order)
##When complete, each forest returns a sorted iterator to the initiating server
#When each forest has returned an iterator, it walks the sorted iterators from all the forests and combines the results in sorted order
#It retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
##Forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
#It filters returned documents to remove false positive matches
#It optionally transforms documents
##It creates new documents by extracting matching elements, changing structure, changing file format, etc.
#It returns all matching (optionally transformed) documents to the requester
#It saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests
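The scatter-gather shape of the steps above can be sketched in Python. This is a simplified model under several assumptions, not MarkLogic's implementation: each "forest" is a hypothetical in-memory dictionary from a term to (score, document ID) pairs, threads stand in for separate servers, results are ordered by relevance score, and filtering, transformation, and caching are left out.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def query_forest(forest, term):
    """A forest resolves the term from its own index and sorts its hits by score."""
    hits = forest.get(term, [])
    return sorted(hits, key=lambda hit: hit[0], reverse=True)

def run_query(forests, term, page_size):
    """Initiating server: fan out to all forests, then merge their sorted results."""
    with ThreadPoolExecutor() as pool:
        per_forest = list(pool.map(lambda f: query_forest(f, term), forests))
    # Walk the sorted per-forest results together, highest score first.
    merged = heapq.merge(*per_forest, key=lambda hit: hit[0], reverse=True)
    # Only the first page of results is materialized.
    return [doc_id for _score, doc_id in islice(merged, page_size)]

FORESTS = [
    {"lamb": [(0.9, "doc1"), (0.2, "doc4")]},
    {"lamb": [(0.7, "doc2")]},
    {"lamb": [(0.5, "doc3"), (0.95, "doc5")]},
]

print(run_query(FORESTS, "lamb", page_size=3))
```

Because every forest hands back an already sorted list, the initiating server only has to merge, not sort, which keeps its serialized part of the work small.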

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized'' evaluation process would compete with the ''parallel'' index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache'' to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
*E-nodes and d-nodes can be scaled independently to match the load.
*You can optimize an e-node by minimizing its ''Compressed Tree Cache'' and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache'' so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache'' because it doesn't have any data.
*You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache'' and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
*You can optimize a query by making it unfiltered.
**This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
*You can optimize a query by eliminating false positives.
**False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
**False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index that indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element, so it compromises by indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
**For example, suppose a query searches for the following path: <code>books/poetry/poem</code>. The structure index will match documents containing <code>books/poetry</code> and <code>poetry/poem</code>. The resulting documents are probable matches, but they could include false positives, such as a document containing <code>books/poetry/fantasy/poetry/poem</code>.
**You can eliminate false positives in structure queries by giving elements unique names. Such names are unique no matter where they are in the structure, and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name "poemTextLine".
*You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
**It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
**It is expensive to process documents on the initiating server and the requester.
**It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
**The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
**The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.
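The <code>books/poetry/poem</code> example in the list above can be reproduced with a small Python model of a parent/child structure index. The data layout and function names are hypothetical: each document is reduced to its root-to-leaf element paths, the "index" stores only parent/child pairs, and the filter step checks the real paths.

```python
def parent_child_pairs(paths):
    """The structure 'index': every (parent, child) pair found in a document."""
    return {(p[i], p[i + 1]) for p in paths for i in range(len(p) - 1)}

def index_candidates(docs, query_path):
    """Index resolution: documents that contain every pair in the query path."""
    needed = set(zip(query_path, query_path[1:]))
    return {doc_id for doc_id, paths in docs.items()
            if needed <= parent_child_pairs(paths)}

def contains_path(paths, query_path):
    """Filtering: open the document and look for the exact contiguous path."""
    n = len(query_path)
    return any(tuple(p[i:i + n]) == tuple(query_path)
               for p in paths for i in range(len(p) - n + 1))

def filtered_matches(docs, query_path):
    return {doc_id for doc_id in index_candidates(docs, query_path)
            if contains_path(docs[doc_id], query_path)}

DOCS = {
    "a": [("books", "poetry", "poem")],
    "b": [("books", "poetry", "fantasy", "poetry", "poem")],
    "c": [("books", "prose", "essay")],
}
QUERY = ("books", "poetry", "poem")

print(index_candidates(DOCS, QUERY))  # includes the false positive "b"
print(filtered_matches(DOCS, QUERY))  # only the exact match remains
```

Renaming the inner element (for example to a hypothetical "fantasyPoetry") would remove the shared pair and, with it, the false positive, which is the point of the unique-names advice above.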

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map'' process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced'' at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses '''term''', '''range''', and '''semantic''' indexes.
*'''A term index''' associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
*'''A range index''' is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
*'''A semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).
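A minimal Python sketch of the first two index types follows. It is an illustration of the idea only: the function names and sample data are hypothetical, words are split on whitespace and lowercased as a simplifying assumption, and MarkLogic's on-disk formats are not modeled.

```python
import bisect
from collections import defaultdict

def build_term_index(docs):
    """Term index: each word maps to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def build_range_index(values_by_doc):
    """Range index: value -> document IDs (kept sorted) plus document ID -> value."""
    value_to_docs = sorted((value, doc_id) for doc_id, value in values_by_doc.items())
    doc_to_value = dict(values_by_doc)
    return value_to_docs, doc_to_value

def range_lookup(value_to_docs, low, high):
    """Document IDs whose value lies in the inclusive range [low, high]."""
    keys = [value for value, _doc_id in value_to_docs]
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return [doc_id for _value, doc_id in value_to_docs[start:end]]

DOCS = {1: "Mary had a little lamb", 2: "The lamb was white", 3: "Mary went to school"}
term_index = build_term_index(DOCS)
print(term_index["lamb"])  # one lookup returns every document containing the word

value_to_docs, doc_to_value = build_range_index({1: 3, 2: 7, 3: 5})
print(range_lookup(value_to_docs, 4, 8))  # documents with a value between 4 and 8
print(doc_to_value[2])                    # the reverse direction: document to value
```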

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory-mapped files. They run in RAM unless you run out of RAM, in which case the operating system pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to contain all indexes in memory.

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at merging sorted lists of numbers.

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

=== What specific indexes does MarkLogic have? ===
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents '''as if they contained only:'''

*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>

*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>

*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>

*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>

*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>

*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>

*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.

*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.

*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>

*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>

*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>

*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>

*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>

*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>

*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.
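Composability can be pictured as set algebra over document IDs. The Python sketch below is a conceptual model, not the cts API: the constructor names are hypothetical, each query is a function that returns a set of IDs, and the tiny word index is invented for the example.

```python
def word_query(term):
    """Leaf query: look the term up in the word index."""
    return lambda index, all_ids: set(index.get(term, set()))

def and_query(*queries):
    """Intersect the results of the sub-queries (no sub-queries matches everything)."""
    def run(index, all_ids):
        result = set(all_ids)
        for query in queries:
            result &= query(index, all_ids)
        return result
    return run

def or_query(*queries):
    """Union the results of the sub-queries."""
    def run(index, all_ids):
        result = set()
        for query in queries:
            result |= query(index, all_ids)
        return result
    return run

def not_query(query):
    """Everything in the database minus what the sub-query matched."""
    return lambda index, all_ids: set(all_ids) - query(index, all_ids)

INDEX = {"mary": {1, 2}, "lamb": {1, 3}, "wolf": {3, 4}}
ALL_IDS = {1, 2, 3, 4}

# Queries nest freely: lamb AND NOT wolf.
nested = and_query(word_query("lamb"), not_query(word_query("wolf")))
print(nested(INDEX, ALL_IDS))
print(or_query(word_query("mary"), nested)(INDEX, ALL_IDS))
```

Because every constructor returns something with the same shape as its inputs, any query can be used as a sub-query of any other.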

== Search Functions in MarkLogic ==
=== Executing a Search ===
*'''<code>[http://docs.marklogic.com/cts:search cts:search]</code>''' is the most important function because it executes a search and returns matching nodes.
**'''You must include an XPath expression''' to define which nodes to ''search'' within and to ''return''.
***The search returns ''nodes'', which can be documents, elements, attributes, or text. Thus, a search doesn't have to return entire documents. Depending on the scope of your XPath statement, it may return entire documents, or one matching node from each document, or many matching nodes from each document.
**'''You must include a cts query expression''' to filter the contents of ''nodes'' selected by the XPath expression. This allows you to search for specific content within specific nodes and return only those nodes that contain that content.
***MarkLogic limits search results to words, phrases, values, and structures in the query that occur within the nodes specified by the XPath expression.
***If you want to return entire documents and search within specific elements and attributes, then the XPath expression should specify the root element and the query should use the element and element-attribute queries to limit where the search occurs.
***If you want to return all the nodes selected by the XPath expression, you can use the empty query expression: <code>cts:and-query(())</code>.
**You may limit the search to specific forests. You may set options like unfiltered, faceted, unchecked, quality weight, relevance scoring method, and relevance trace.
**Use Cases:
***A search can return entire documents whose contents match the XPath and query expressions.
***A search can extract and return titles from documents as long as the contents of the titles match the XPath and query expressions.
***A search can extract and return all the paragraphs from documents as long as the contents of the paragraphs match the XPath and query expressions.
**Notes:
***There is no functional difference between searching entire documents and searching elements because an element has the same structure as a document. Both may contain complex nested elements or both may simply contain a value. In both cases, MarkLogic retrieves entire documents. In the case of elements, it extracts the requested elements from each retrieved document.
***Searching attributes, text nodes, and childless elements is similar to searching documents and elements with children, except that the cts query cannot look for nested structures.

*'''<code>[http://docs.marklogic.com/cts:contains cts:contains]</code>''' is the second most important function because it returns true when a cts query finds at least one matching node in the nodes returned by an XPath expression (or a sequence of values). In other words, it returns true when at least one of the specified nodes '''contains''' the precise combination of words, phrases, values, structures, and/or triples in a cts query.
**NOTE: in the second example for <code>[http://docs.marklogic.com/cts:not-query cts:not-query]</code>, it says cts:contains forces the constraint to happen in the filtering stage of the query. Is this always true for cts:contains, or is it only true in the example?

=== Constructing a Query ===
You compose a query using cts query constructor functions. You then execute the query by passing it into <code>cts:search, cts:contains, search:search</code>, and lexicon functions. MarkLogic has 30+ query constructors. They all end in '''"-query"'''.

==== Composite Query Constructors ====
The composite query constructors build up new queries from other queries, and queries can be nested within queries.

*'''<code>[http://docs.marklogic.com/cts:and-query cts:and-query]</code>''' returns a query that intersects two or more sub-queries.
**You may optionally specify an "ordered" option that requires document contents to match sub-queries in the order they are listed; for example if the sub-queries are "To be" and "or not to be", then the query will match documents that have the phrase "To be" ''before'' "or not to be". The default is "unordered", which allows each sub-query to match the content in any order that it occurs in the document.
**If you pass in the empty sequence, such as <code>cts:and-query(())</code>, then it will match every document in the database.

*'''<code>[http://docs.marklogic.com/cts:or-query cts:or-query]</code>''' returns a query that unions two or more sub-queries.

*'''<code>[http://docs.marklogic.com/cts:not-query cts:not-query]</code>''' returns a query that matches every document its sub-query does not match.
**It works by taking the list of all documents in the database and subtracting the documents returned by its sub-query. It returns all remaining documents. It does not filter the documents returned by its sub-query because filtering happens after all index resolution has completed, and <code>cts:not-query</code> executes during index resolution. Thus, its results are accurate only when its sub-query can be resolved completely from the indexes, with no filtering required and no false positives.
**'''Warning:''' <code>cts:not-query</code> can be inaccurate: it can be missing some documents. This happens when its sub-query requires filtering (i.e. can't be resolved accurately using only indexes). A sub-query that requires filtering may return false-positive matches. A false-positive match occurs when the indexes cannot determine for sure whether a document matches, so MarkLogic includes the document just in case. As the last step in a query, MarkLogic opens all potentially matching documents and filters out false-positive matches. A false positive is therefore not a match; it is a potential match that failed verification. False positives can be prevented by writing sub-queries that are resolved completely by the indexes so that there is no need for filtering.
***Because <code>cts:not-query</code> runs during index resolution (before filtering occurs), it subtracts out any false-positive matches returned by its sub-query. False positives should not be subtracted out, but MarkLogic doesn't know which documents are false positives until filtering, and it can't filter during <code>cts:not-query</code> because it is in the middle of index resolution. Thus, MarkLogic subtracts out false positives, which causes <code>cts:not-query</code> to return too few documents. If the sub-query returns no false positives, <code>cts:not-query</code> returns accurate results.

*'''<code>[http://docs.marklogic.com/cts:and-not-query cts:and-not-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:element-query cts:element-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:document-fragment-query cts:document-fragment-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:locks-query cts:locks-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:properties-query cts:properties-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:near-query cts:near-query]</code>'''
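
The warning about <code>cts:not-query</code> above can be illustrated with a toy model. This is only a sketch in Python, not MarkLogic code: document IDs are plain integers, and the names <code>ALL_DOCS</code>, <code>index_candidates</code>, <code>true_matches</code>, and <code>not_query</code> are hypothetical. It assumes a sub-query whose index resolution returns one false positive (document 4).

```python
# Toy model (not the MarkLogic API): index resolution works on sets of document IDs.
ALL_DOCS = {1, 2, 3, 4, 5}

# Hypothetical sub-query: the indexes report documents 2 and 4 as candidates,
# but only document 2 really matches. Document 4 is a false positive that
# filtering would normally remove.
index_candidates = {2, 4}
true_matches = {2}


def not_query(all_docs, sub_query_candidates):
    """Subtract the sub-query's candidates from all documents (index resolution only)."""
    return all_docs - sub_query_candidates


index_result = not_query(ALL_DOCS, index_candidates)  # what index resolution yields
expected = ALL_DOCS - true_matches                    # what an exact answer would be
missing = expected - index_result                     # documents wrongly left out

print(sorted(index_result), sorted(expected), sorted(missing))
```

In this model the false positive (document 4) is subtracted along with the real match, so it is missing from the result. Filtering afterwards cannot bring it back, because filtering only removes documents.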

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Category: MarkLogic]]

MarkLogic Query and Search

2014-11-19T06:18:40Z

Bowersmt: Updated cts:not-query

[[MarkLogic|« Back to MarkLogic]]

== Searching and Querying MarkLogic ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. MarkLogic collections group documents together in any way you want. Collections are like views in a relational database, which group together a filtered set of rows from one or more tables, but collections do not join and merge document content and views cannot arbitrarily include any row from any table. A document in MarkLogic can have a flat structure like a relational table, or it can contain complex nestings of data. Each nesting of data in a document represents another table in a relational database. For example, a single document in MarkLogic may represent many relational tables, and how they are nested represents their one-to-one, many-to-one, and many-to-many relationships.

Both MarkLogic and relational databases use indexes to resolve queries without having to scan through all documents or rows. MarkLogic indexes are designed to find any word, phrase, value, element, or element structure -- no matter where it occurs in document structure, no matter how many times it occurs within a document, and no matter what type of document contains it. For example, you can search for and return a <title> element and MarkLogic will return the contents of all <title> elements it finds — even if <title> elements are in different types of documents. This is similar to doing a union query across multiple tables in relational databases.

MarkLogic indexes also let you search for words, phrases, values, elements, or element structures in very specific locations, such as in only certain types of documents or within certain document structures, or in specific collections or folders. For example, you can search for and return all <title> elements that are found in the header section of poetry documents in the Shakespeare and Tennyson collections.

MarkLogic is not designed for joining documents like a relational database. A relational join searches for matching values across two or more tables and then merges matching rows into a single row. This can be done in MarkLogic, but you have to write advanced code to iterate through query results to match values across different document types, and merge matching documents into a single document. Also since MarkLogic documents are often nested structures, you can't simply merge them, you have to copy nested content from one document into a specific nested location in another.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows with 4 IOs per row, the database does the same amount of IO as if it read in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of rows in a table. On analytical data warehouses, full table scans are always preferred.
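
The 25% rule of thumb can be checked with simple arithmetic. The Python sketch below is illustrative only: the function names and the table size are assumptions, and it uses the low end of the 4-5 IOs per row quoted above.

```python
def index_lookup_ios(total_rows, percent_of_rows, ios_per_row=4):
    """IOs needed to fetch a percentage of rows one at a time through a B-Tree index."""
    return total_rows * percent_of_rows * ios_per_row // 100


def full_scan_ios(total_rows):
    """IOs needed to read every row once (1 IO per row)."""
    return total_rows


rows = 1_000_000  # hypothetical table size
index_ios = index_lookup_ios(rows, 25)
scan_ios = full_scan_ios(rows)

print(index_ios, scan_ios)
```

At 25% of the rows the two IO counts are equal. Above that, the index path costs more IOs, and those IOs are random rather than sequential.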

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is because index and table records are stored sequentially on disk. This often makes it faster to take advantage of the speed of sequential disk IO and directly load all index or table rows into RAM and process them in RAM, as opposed to walking a B-Tree structure and doing multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:
#It validates the query request from the client
#It parses and optimizes the query
#It sends the query to all servers that have forests in the database
##Each server sends it to its forests
##Each forest sends it to each stand
##Each stand uses its indexes to find matching documents
###All stands and forests query the indexes in parallel
##Forests combine the matching document IDs from all their stands
##Forests sort the matching document IDs (in relevance score order or value order)
##When complete, each forest returns a sorted iterator to the initiating server
#When each forest has returned an iterator, it walks the sorted iterators from all the forests and combines the results in sorted order
#It retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
##Forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
#It filters returned documents to remove false positive matches
#It optionally transforms documents
##It creates new documents by extracting matching elements, changing structure, changing file format, etc.
#It returns all matching (optionally transformed) documents to the requester
#It saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests
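
The step where the initiating server walks the sorted iterators from all the forests can be sketched as a k-way merge. This is a minimal Python illustration, not MarkLogic's implementation. The forest contents, scores, and the name <code>merge_forest_results</code> are made up for the example, and each forest is assumed to return its hits already sorted by descending relevance score.

```python
import heapq

# Hypothetical per-forest results: (relevance score, document ID),
# each list already sorted from highest to lowest score.
forest_a = [(0.9, "a1"), (0.4, "a2")]
forest_b = [(0.8, "b1"), (0.7, "b2"), (0.1, "b3")]
forest_c = [(0.5, "c1")]


def merge_forest_results(*forest_iterators):
    """Lazily merge already-sorted iterators into one descending-score stream."""
    return heapq.merge(*forest_iterators, key=lambda hit: hit[0], reverse=True)


merged = list(merge_forest_results(iter(forest_a), iter(forest_b), iter(forest_c)))
first_page = [doc_id for _score, doc_id in merged[:3]]

print(first_page)
```

Because the merge is lazy, a server that only needs the first page of results does not have to consume every iterator to the end.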

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized'' evaluation process would compete with the ''parallel'' index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache'' to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
*E-nodes and d-nodes can be scaled independently to match the load.
*You can optimize an e-node by minimizing its ''Compressed Tree Cache'' and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache'' so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache'' because it doesn't have any data.
*You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache'' and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
*You can optimize a query by making it unfiltered.
**This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
*You can optimize a query by eliminating false positives.
**False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
**False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index. It indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element. MarkLogic compromises on indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
**For example, suppose a query searches for the following path: <code>books/poetry/poem</code>. The structure index will match documents containing <code>books/poetry</code> and <code>poetry/poem</code>. The resulting documents are a probable match, but could contain false positives, such as <code>books/poetry/fantasy/poetry/poem</code>.
**You can eliminate false positives in structure queries by giving elements unique names. They are unique no matter where they are in the structure and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name "poemTextLine".
*You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
**It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
**It is expensive to process documents on the initiating server and the requester.
**It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
**The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
**The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.
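
The structure-index false positive described above (<code>books/poetry/poem</code> versus <code>books/poetry/fantasy/poetry/poem</code>) can be reproduced with a small model. This Python sketch assumes, as the text does, that the index stores only parent/child pairs. The function names and the representation of a document as a list of element paths are hypothetical.

```python
def parent_child_pairs(path):
    """Break a path like 'a/b/c' into its parent/child pairs: {('a','b'), ('b','c')}."""
    steps = path.split("/")
    return set(zip(steps, steps[1:]))


def index_candidate(doc_paths, query_path):
    """Index resolution: every parent/child pair of the query occurs somewhere in the document."""
    indexed = set()
    for path in doc_paths:
        indexed |= parent_child_pairs(path)
    return parent_child_pairs(query_path) <= indexed


def filter_match(doc_paths, query_path):
    """Filtering: open the document and verify the query path occurs as consecutive steps."""
    wanted = query_path.split("/")
    for path in doc_paths:
        steps = path.split("/")
        for start in range(len(steps) - len(wanted) + 1):
            if steps[start:start + len(wanted)] == wanted:
                return True
    return False


query = "books/poetry/poem"
doc_exact = ["books/poetry/poem"]
doc_false_positive = ["books/poetry/fantasy/poetry/poem"]

print(index_candidate(doc_exact, query), filter_match(doc_exact, query))
print(index_candidate(doc_false_positive, query), filter_match(doc_false_positive, query))
```

Both documents pass the index check, but only the first survives filtering. With unique element names such as "poemTextLine", the pairs in the query are less likely to be satisfied by unrelated parts of a document.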

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map'' process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced'' at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses '''term''', '''range''', and '''semantic''' indexes.
*'''A term index''' associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
*'''A range index''' is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
*'''A semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory mapped files. They stay in RAM as long as there is enough of it; when RAM runs short, the operating system's virtual memory pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to contain all indexes in memory.
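
A term index is essentially an inverted index. The following Python sketch shows the idea with an in-memory dictionary. The sample documents and the name <code>build_term_index</code> are assumptions for illustration, and the on-disk format MarkLogic actually uses is not shown.

```python
from collections import defaultdict

# Hypothetical documents keyed by document ID.
docs = {
    1: "Mary had a little lamb",
    2: "The lamb was white",
    3: "Mary went to school",
}


def build_term_index(documents):
    """Map each lower-cased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


term_index = build_term_index(docs)

lamb_docs = term_index["lamb"]                           # one lookup, all documents
mary_and_lamb = term_index["mary"] & term_index["lamb"]  # combine terms by intersecting ID sets

print(sorted(lamb_docs), sorted(mary_and_lamb))
```

A single dictionary lookup returns every document for a term, and combining terms is just set arithmetic on lists of document IDs.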

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at sort merging lists of numbers.

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

=== What specific indexes does MarkLogic have? ===
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents '''as if they contained only:'''

*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>

*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>

*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>

*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>

*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>

*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>

*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.

*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.

*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>

*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>

*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>

*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>

*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>

*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>

*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.
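
To make the word, phrase, and element-word views listed above concrete, here is a small Python sketch that derives them from one line of text. The helper names are hypothetical, and the two-word phrase length simply follows the examples in the list.

```python
def words(text):
    """Flat list of words, as in the 'Words' view."""
    return text.split()


def phrases(text):
    """Flat list of adjacent two-word phrases, as in the 'Phrases' view."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


line = "Mary had a little lamb"

word_view = words(line)
phrase_view = phrases(line)
element_words_view = {"line": words(line)}      # 'Element-words' view
element_phrases_view = {"line": phrases(line)}  # 'Element-phrases' view

print(word_view)
print(phrase_view)
```

Each view throws away some information (structure, order, or values) so that a lookup against it stays cheap.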

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.

== Search Functions in MarkLogic ==
=== Executing a Search ===
*'''<code>[http://docs.marklogic.com/cts:search cts:search]</code>''' is the most important function because it executes a search and returns matching nodes.
**'''You must include an XPath expression''' to define which nodes to ''search'' within and to ''return''.
***The search returns ''nodes'', which can be documents, elements, attributes, or text. Thus, a search doesn't have to return entire documents. Depending on the scope of your XPath statement, it may return entire documents, or one matching node from each document, or many matching nodes from each document.
**'''You must include a cts query expression''' to filter the contents of ''nodes'' selected by the XPath expression. This allows you to search for specific content within specific nodes and return only those nodes that contain that content.
***MarkLogic limits search results to words, phrases, values, and structures in the query that occur within the nodes specified by the XPath expression.
***If you want to return entire documents and search within specific elements and attributes, then the XPath expression should specify the root element and the query should use the element and element-attribute queries to limit where the search occurs.
***If you want to return all the nodes selected by the XPath expression, you can use the empty query expression: <code>cts:and-query(())</code>.
**You may limit the search to specific forests. You may set options like unfiltered, faceted, unchecked, quality weight, relevance scoring method, and relevance trace.
**Use Cases:
***A search can return entire documents whose contents match the XPath and query expressions.
***A search can extract and return titles from documents as long as the contents of the titles match the XPath and query expressions.
***A search can extract and return all the paragraphs from documents as long as the contents of the paragraphs match the XPath and query expressions.
**Notes:
***There is no functional difference between searching entire documents and elements because an element has the same structure as a document. Both may contain complex nested elements or both may simply contain a value. In both cases, MarkLogic retrieves entire documents. In the case of elements, it extracts the requested elements from each retrieved document.
***Searching attributes, text nodes, and childless elements is similar to searching documents and elements with children — except the cts query cannot look for nested structures.

*'''<code>[http://docs.marklogic.com/cts:contains cts:contains]</code>''' is the second most important function because it returns true when a cts query finds at least one matching node in the nodes returned by an XPath expression (or a sequence of values). In other words, it returns true when at least one of the specified nodes '''contains''' the precise combination of words, phrases, values, structures, and/or triples in a cts query.
**NOTE: in the second example for <code>[http://docs.marklogic.com/cts:not-query cts:not-query]</code>, it says cts:contains forces the constraint to happen in the filtering stage of the query. Is this always true for cts:contains, or is it only true in the example?
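
The relationship between the XPath expression and the cts query described above can be mimicked in a few lines of Python. This is a toy model only, not the cts API: <code>search</code>, <code>contains</code>, and <code>word_query</code> are hypothetical stand-ins written with the standard <code>xml.etree.ElementTree</code> module. The path selects which nodes are searched and returned, and the query decides which of those nodes are kept.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents.
DOCS = [
    ET.fromstring("<poem><title>Mary's Lamb</title><line>Mary had a little lamb</line></poem>"),
    ET.fromstring("<poem><title>The Tyger</title><line>Tyger Tyger, burning bright</line></poem>"),
]


def word_query(word):
    """Stand-in for a word query: a predicate that tests one node's text content."""
    def matches(node):
        text = " ".join(node.itertext()).lower()
        return word.lower() in text.split()
    return matches


def search(docs, path, query):
    """Return the nodes selected by `path` whose content satisfies `query`."""
    hits = []
    for doc in docs:
        nodes = [doc] if path == "." else doc.findall(path)
        hits.extend(node for node in nodes if query(node))
    return hits


def contains(nodes, query):
    """Return True when at least one node satisfies `query`."""
    return any(query(node) for node in nodes)


line_hits = search(DOCS, "line", word_query("lamb"))  # returns matching <line> nodes only
doc_hits = search(DOCS, ".", word_query("lamb"))      # returns whole matching documents
has_tyger = contains(DOCS, word_query("tyger"))

print([node.text for node in line_hits], len(doc_hits), has_tyger)
```

The same query returns a single element or a whole document depending only on the path, which mirrors the use cases listed above. Unlike this model, MarkLogic resolves the query against its indexes first rather than scanning every document.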

=== Constructing a Query ===
You compose a query using cts query constructor functions. You then execute the query by passing it into <code>cts:search, cts:contains, search:search</code>, and lexicon functions. MarkLogic has 30+ query constructors. They all end in '''"-query"'''.

==== Composite Query Constructors ====
The composite query constructors build up new queries from other queries, and queries can be nested within queries.

*'''<code>[http://docs.marklogic.com/cts:and-query cts:and-query]</code>''' returns a query that intersects two or more sub-queries.
**You may optionally specify an "ordered" option that requires document contents to match sub-queries in the order they are listed; for example if the sub-queries are "To be" and "or not to be", then the query will match documents that have the phrase "To be" ''before'' "or not to be". The default is "unordered", which allows each sub-query to match the content in any order that it occurs in the document.
**If you pass in the empty sequence, such as <code>cts:and-query(())</code>, then it will match every document in the database.

*'''<code>[http://docs.marklogic.com/cts:or-query cts:or-query]</code>''' returns a query that unions two or more sub-queries.

*'''<code>[http://docs.marklogic.com/cts:not-query cts:not-query]</code>''' returns a query that subtracts one or more sub-queries.
**It works by taking the list of documents in the database and subtracting documents returned from its sub-queries. It returns all remaining documents. It does not filter documents returned from its sub-queries because filtering happens after all index resolution has completed, and <code>cts:not-query</code> executes during index resolution.
**'''Warning:''' cts:not-query can be inaccurate. It can be missing documents when any of its sub-queries requires filtering to be accurate. It only produces accurate results when all its sub-queries are resolvable completely from indexes with no filtering required.
***When sub-queries require filtering, they may return false-positive matches. A false-positive match occurs when the indexes cannot determine for sure if a document matches so MarkLogic includes it just in case. As the last step in a query, MarkLogic opens all the potentially matching documents returned by the indexes and filters out any false-positive matches.
***When <code>cts:not-query</code> subtracts its sub-queries from the documents in the database, it also subtracts false-positive matches because it runs during index resolution and before filtering occurs. If there are false-positive matches, then <code>cts:not-query</code> won't include them. If the sub-queries return inaccurate false-positive matches, then they get subtracted out of the results, which creates inaccurate results. When the sub-queries return accurate results (with no false-positives), then the results of <code>cts:not-query</code> are accurate.

*'''<code>[http://docs.marklogic.com/cts:and-not-query cts:and-not-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:element-query cts:element-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:document-fragment-query cts:document-fragment-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:locks-query cts:locks-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:properties-query cts:properties-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:near-query cts:near-query]</code>'''
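
The way composite constructors nest can be modelled as set operations on document IDs. The Python sketch below is illustrative and is not the cts API: <code>TERM_INDEX</code>, <code>word</code>, <code>and_query</code>, and <code>or_query</code> are hypothetical names, and the empty-union behaviour of <code>or_query()</code> is a property of this model.

```python
# Toy model: each query evaluates to a set of matching document IDs.
ALL_DOCS = frozenset({1, 2, 3, 4})
TERM_INDEX = {"to": {1, 2}, "be": {1, 3}, "not": {3, 4}}  # hypothetical term index


def word(term):
    """Leaf query: documents that contain the term."""
    return frozenset(TERM_INDEX.get(term, ()))


def and_query(*sub_queries):
    """Intersect sub-queries; with no sub-queries, every document matches."""
    result = ALL_DOCS
    for matches in sub_queries:
        result &= matches
    return result


def or_query(*sub_queries):
    """Union sub-queries; with no sub-queries, nothing matches in this model."""
    result = frozenset()
    for matches in sub_queries:
        result |= matches
    return result


nested = and_query(word("to"), or_query(word("be"), word("not")))
everything = and_query()

print(sorted(nested), sorted(everything))
```

The empty <code>and_query()</code> matching every document corresponds to the <code>cts:and-query(())</code> behaviour noted above.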

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Category: MarkLogic]]

MarkLogic Query and Search

2014-11-19T04:49:10Z

Bowersmt: Clarified how cts:search works

[[MarkLogic|« Back to MarkLogic]]

== Searching and Querying MarkLogic ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. MarkLogic collections group documents together in any way you want. Collections are like views in a relational database, which group together a filtered set of rows from one or more tables, but collections do not join and merge document content and views cannot arbitrarily include any row from any table. A document in MarkLogic can have a flat structure like a relational table, or it can contain complex nestings of data. Each nesting of data in a document represents another table in a relational database. For example, a single document in MarkLogic may represent many relational tables, and how they are nested represents their one-to-one, many-to-one, and many-to-many relationships.

Both MarkLogic and relational databases use indexes to resolve queries without having to scan through all documents or rows. MarkLogic indexes are designed to find any word, phrase, value, element, or element structure -- no matter where it occurs in document structure, no matter how many times it occurs within a document, and no matter what type of document contains it. For example, you can search for and return a <title> element and MarkLogic will return the contents of all <title> elements it finds — even if <title> elements are in different types of documents. This is similar to doing a union query across multiple tables in relational databases.

MarkLogic indexes also let you search for words, phrases, values, elements, or element structures in very specific locations, such as in only certain types of documents or within certain document structures, or in specific collections or folders. For example, you can search for and return all <title> elements that are found in the header section of poetry documents in the Shakespeare and Tennyson collections.

MarkLogic is not designed for joining documents like a relational database. A relational join searches for matching values across two or more tables and then merges matching rows into a single row. This can be done in MarkLogic, but you have to write advanced code to iterate through query results to match values across different document types, and merge matching documents into a single document. Also since MarkLogic documents are often nested structures, you can't simply merge them, you have to copy nested content from one document into a specific nested location in another.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows with 4 IOs per row, the database does the same amount of IO as if it read in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of rows in a table. On analytical data warehouses, full table scans are always preferred.

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is because index and table records are stored sequentially on disk. This often makes it faster to take advantage of the speed of sequential disk IO and directly load all index or table rows into RAM and process them in RAM, as opposed to walking a B-Tree structure and doing multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:
#It validates the query request from the client
#It parses and optimizes the query
#It sends the query to all servers that have forests in the database
##Each server sends it to its forests
##Each forest sends it to each stand
##Each stand uses its indexes to find matching documents
###All stands and forests query the indexes in parallel
##Forests combine the matching document IDs from all their stands
##Forests sort the matching document IDs (in relevance score order or value order)
##When complete, each forest returns a sorted iterator to the initiating server
#When each forest has returned an iterator, it walks the sorted iterators from all the forests and combines the results in sorted order
#It retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
##Forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
#It filters returned documents to remove false positive matches
#It optionally transforms documents
##It creates new documents by extracting matching elements, changing structure, changing file format, etc.
#It returns all matching (optionally transformed) documents to the requester
#It saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests
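The scatter-gather steps above can be sketched in Python. This is a simplified illustration, not MarkLogic's implementation: the function names, the <code>(score, document ID)</code> tuples, and the sample data are assumptions made for the example.

```python
import heapq
from itertools import islice


def forest_results(stand_hits):
    """Combine (score, doc_id) hits from a forest's stands and return an
    iterator sorted by descending relevance score."""
    combined = [hit for stand in stand_hits for hit in stand]
    return iter(sorted(combined, key=lambda hit: -hit[0]))


def merge_forests(forest_iterators, page_size):
    """Walk the sorted iterators from all forests, merge them in score order,
    and stop as soon as one page of results has been collected."""
    merged = heapq.merge(*forest_iterators, key=lambda hit: -hit[0])
    return list(islice(merged, page_size))


# Two hypothetical forests, each holding two stands of index hits.
forest_a = forest_results([[(0.9, "doc1"), (0.2, "doc4")], [(0.5, "doc7")]])
forest_b = forest_results([[(0.7, "doc2")], [(0.6, "doc3"), (0.1, "doc9")]])

first_page = merge_forests([forest_a, forest_b], page_size=3)
print(first_page)
```

Because each forest returns an iterator that is already sorted, the initiating server only has to consume as many entries as the requested page needs.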

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized'' evaluation process would compete with the ''parallel'' index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache'' to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
*E-nodes and d-nodes can be scaled independently to match the load.
*You can optimize an e-node by minimizing its ''Compressed Tree Cache'' and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache'' so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache'' because it doesn't have any data.
*You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache'' and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
*You can optimize a query by making it unfiltered.
**This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
*You can optimize a query by eliminating false positives.
**False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
**False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index. It indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element. MarkLogic compromises by indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
**For example, suppose a query searches for the following path: <code>books/poetry/poem</code>. The structure index will match documents containing <code>books/poetry</code> and <code>poetry/poem</code>. The resulting documents are a probable match, but could contain false positives, such as <code>books/poetry/fantasy/poetry/poem</code>.
**You can eliminate false positives in structure queries by giving elements unique names. Such names are unique no matter where they are in the structure, and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name "poemTextLine".
*You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
**It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
**It is expensive to process documents on the initiating server and the requester.
**It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
**The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
**The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.
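The false positive described above can be reproduced with a toy model of a parent/child structure index. This Python sketch is only an illustration of the idea: the function names are hypothetical, and a document is assumed to be a plain list of slash-separated element paths.

```python
def parent_child_pairs(paths):
    """Index only the parent/child element pairs found in a document's paths."""
    pairs = set()
    for path in paths:
        steps = path.split("/")
        pairs.update(zip(steps, steps[1:]))
    return pairs


def index_candidate(doc_paths, query_path):
    """Unfiltered check: every parent/child pair in the query is in the index."""
    steps = query_path.split("/")
    return set(zip(steps, steps[1:])) <= parent_child_pairs(doc_paths)


def filtered_match(doc_paths, query_path):
    """Filtering step: open the document and verify the full path really occurs."""
    return any(p == query_path or p.startswith(query_path + "/") for p in doc_paths)


query = "books/poetry/poem"
true_match = ["books/poetry/poem"]
false_positive = ["books/poetry/fantasy/poetry/poem"]

for doc in (true_match, false_positive):
    print(doc, index_candidate(doc, query), filtered_match(doc, query))
```

Both documents pass the index check, but only the first survives filtering. With a unique element name in place of the repeated <code>poetry</code>, the second document would no longer pass the index check at all.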

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map'' process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced'' at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses '''term''', '''range''', and '''semantic''' indexes.
*'''A term index''' associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
*'''A range index''' is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
*'''A semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory mapped files. They stay in RAM unless you run out of RAM, in which case the operating system pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to contain all indexes in memory.
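A term index is essentially an inverted index. The following Python sketch shows the idea with an in-memory dictionary; the document IDs, sample texts, and function name are assumptions for illustration and say nothing about MarkLogic's on-disk format.

```python
from collections import defaultdict


def build_term_index(documents):
    """Map each lowercased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


documents = {
    1: "Mary had a little lamb",
    2: "The lamb was white",
    3: "Mary went to school",
}
index = build_term_index(documents)

# One lookup returns every document that contains the term.
print(sorted(index["lamb"]))
print(sorted(index["mary"]))
```

The lookup cost does not depend on how many documents are in the database, only on finding the term's entry and reading its list of document IDs.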

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at sort merging lists of numbers.

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

=== What specific indexes does MarkLogic have? ===
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents '''as if they contained only:'''

*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>

*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>

*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>

*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>

*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>

*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>

*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.

*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.

*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>

*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>

*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>

*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>

*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>

*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>

*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>
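Several of the flat views listed above can be derived from a single element with a few lines of Python. This is a sketch only: the helper names are hypothetical, words are split on whitespace, and phrases are assumed to be two-word sequences as in the examples above.

```python
def words(text):
    """Flat list of words without structure."""
    return text.split()


def phrases(text):
    """Flat list of two-word phrases built from adjacent words."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def index_views(element, text):
    """Return the different 'altered views' of one element and its text."""
    return {
        "words": words(text),
        "phrases": phrases(text),
        "elements": [element],
        "element-values": {element: text},
        "element-words": {element: words(text)},
        "element-phrases": {element: phrases(text)},
    }


views = index_views("line", "Mary had a little lamb")
for name, view in views.items():
    print(name, view)
```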

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.
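Composability can be illustrated by evaluating nested AND, OR, and NOT expressions against sets of document IDs. In this Python sketch the tuple-based query format, the <code>evaluate</code> function, and the sample index are all assumptions made for the example; they are not the cts API.

```python
def evaluate(query, index, all_docs):
    """Recursively evaluate a nested query tuple into a set of document IDs."""
    op, *args = query
    if op == "word":
        return set(index.get(args[0], set()))
    results = [evaluate(sub, index, all_docs) for sub in args]
    if op == "and":
        # With no sub-queries, an AND matches every document.
        return set(all_docs).intersection(*results)
    if op == "or":
        return set().union(*results)
    if op == "not":
        return set(all_docs) - results[0]
    raise ValueError(f"unknown operator: {op}")


index = {"mary": {1, 3}, "lamb": {1, 2}, "school": {3}}
all_docs = {1, 2, 3}

mary_not_school = ("and", ("word", "mary"), ("not", ("word", "school")))
lamb_or_school = ("or", ("word", "lamb"), ("word", "school"))

print(evaluate(mary_not_school, index, all_docs))
print(evaluate(lamb_or_school, index, all_docs))
print(evaluate(("and",), index, all_docs))
```

Each sub-query resolves to a set of document IDs, so any expression can be nested inside any other and the combination is still plain set arithmetic.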

== Search Functions in MarkLogic ==
=== Executing a Search ===
*'''<code>[http://docs.marklogic.com/cts:search cts:search]</code>''' is the most important function because it executes a search and returns matching nodes.
**'''You must include an XPath expression''' to define which nodes to ''search'' within and to ''return''.
***The search returns ''nodes'', which can be documents, elements, attributes, or text. Thus, a search doesn't have to return entire documents. Depending on the scope of your XPath statement, it may return entire documents, or one matching node from each document, or many matching nodes from each document.
**'''You must include a cts query expression''' to filter the contents of ''nodes'' selected by the XPath expression. This allows you to search for specific content within specific nodes and return only those nodes that contain that content.
***MarkLogic limits search results to words, phrases, values, and structures in the query that occur within the nodes specified by the XPath expression.
***If you want to return entire documents and search within specific elements and attributes, then the XPath expression should specify the root element and the query should use the element and element-attribute queries to limit where the search occurs.
***If you want to return all the nodes selected by the XPath expression, you can use the empty query expression: <code>cts:and-query(())</code>.
**You may limit the search to specific forests. You may set options like unfiltered, faceted, unchecked, quality weight, relevance scoring method, and relevance trace.
**Use Cases:
***A search can return entire documents whose contents match the XPath and query expressions.
***A search can extract and return titles from documents as long as the contents of the titles match the XPath and query expressions.
***A search can extract and return all the paragraphs from documents as long as the contents of the paragraphs match the XPath and query expressions.
**Notes:
***There is no functional difference between searching entire documents and elements because an element has the same structure as a document. Both may contain complex nested elements or both may simply contain a value. In both cases, MarkLogic retrieves entire documents. In the case of elements, it extracts the requested elements from each retrieved document.
***Searching attributes, text nodes, and childless elements is similar to searching documents and elements with children — except the cts query cannot look for nested structures.

*'''<code>[http://docs.marklogic.com/cts:contains cts:contains]</code>''' is the second most important function because it returns true when a cts query finds at least one matching node in the nodes returned by an XPath expression (or a sequence of values). In other words, it returns true when at least one of the specified nodes '''contains''' the precise combination of words, phrases, values, structures, and/or triples in a cts query.

=== Constructing a Query ===
You compose a query using cts query constructor functions. You then execute the query by passing it into <code>cts:search</code>, <code>cts:contains</code>, <code>search:search</code>, and lexicon functions. MarkLogic has 30+ query constructors. They all end in '''"-query"'''.

==== Composite Query Constructors ====
The composite query constructors build up new queries from other queries, and queries can be nested within queries.

*'''<code>[http://docs.marklogic.com/cts:and-query cts:and-query]</code>''' returns a query that intersects two or more sub-queries.
**You may optionally specify an "ordered" option that requires document contents to match sub-queries in the order they are listed; for example if the sub-queries are "To be" and "or not to be", then the query will match documents that have the phrase "To be" ''before'' "or not to be". The default is "unordered", which allows each sub-query to match the content in any order that it occurs in the document.
**If you pass in the empty sequence, such as <code>cts:and-query(())</code>, then it will match every document in the database.

*'''<code>[http://docs.marklogic.com/cts:or-query cts:or-query]</code>''' returns a query that unions two or more sub-queries.

*'''<code>[http://docs.marklogic.com/cts:not-query cts:not-query]</code>''' returns a query that matches whatever its sub-query does ''not'' match.

*'''<code>[http://docs.marklogic.com/cts:and-not-query cts:and-not-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:element-query cts:element-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:document-fragment-query cts:document-fragment-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:locks-query cts:locks-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:properties-query cts:properties-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:near-query cts:near-query]</code>'''

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Category: MarkLogic]]

MarkLogic Query and Search

2014-11-18T15:21:28Z

Bowersmt: Minor changes

[[MarkLogic|« Back to MarkLogic]]

== Searching and Querying MarkLogic ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. MarkLogic collections group documents together in any way you want. Collections are like views in a relational database, which group together a filtered set of rows from one or more tables, but collections do not join and merge document content and views cannot arbitrarily include any row from any table. A document in MarkLogic can have a flat structure like a relational table, or it can contain complex nestings of data. Each nesting of data in a document represents another table in a relational database. For example, a single document in MarkLogic may represent many relational tables, and how they are nested represents their one-to-one, many-to-one, and many-to-many relationships.

Both MarkLogic and relational databases use indexes to resolve queries without having to scan through all documents or rows. MarkLogic indexes are designed to find any word, phrase, value, element, or element structure -- no matter where it occurs in document structure, no matter how many times it occurs within a document, and no matter what type of document contains it. For example, you can search for and return a <title> element and MarkLogic will return the contents of all <title> elements it finds — even if <title> elements are in different types of documents. This is similar to doing a union query across multiple tables in relational databases.

MarkLogic indexes also let you search for words, phrases, values, elements, or element structures in very specific locations, such as in only certain types of documents or within certain document structures, or in specific collections or folders. For example, you can search for and return all <title> elements that are found in the header section of poetry documents in the Shakespeare and Tennyson collections.

MarkLogic is not designed for joining documents like a relational database. A relational join searches for matching values across two or more tables and then merges matching rows into a single row. This can be done in MarkLogic, but you have to write advanced code to iterate through query results, match values across different document types, and merge matching documents into a single document. Also, because MarkLogic documents are often nested structures, you can't simply merge them; you have to copy nested content from one document into a specific nested location in another.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows at 4 IOs per row, the database does the same amount of IO as it would reading in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of rows in a table. In analytical data warehouses, full table scans are usually preferred.

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is because index and table records are stored sequentially on disk. It is often faster to take advantage of the speed of sequential disk IO by loading all index or table rows directly into RAM and processing them there than to walk a B-Tree structure and do multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:
#It validates the query request from the client
#It parses and optimizes the query
#It sends the query to all servers that have forests in the database
##Each server sends it to its forests
##Each forest sends it to each stand
##Each stand uses its indexes to find matching documents
###All stands and forests query the indexes in parallel
##Forests combine the matching document IDs from all their stands
##Forests sort the matching document IDs (in relevance score order or value order)
##When complete, each forest returns a sorted iterator to the initiating server
#When each forest has returned an iterator, it walks the sorted iterators from all the forests and combines the results in sorted order
#It retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
##Forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
#It filters returned documents to remove false positive matches
#It optionally transforms documents
##It creates new documents by extracting matching elements, changing structure, changing file format, etc.
#It returns all matching (optionally transformed) documents to the requester
#It saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized'' evaluation process would compete with the ''parallel'' index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache'' to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
*E-nodes and d-nodes can be scaled independently to match the load.
*You can optimize an e-node by minimizing its ''Compressed Tree Cache'' and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache'' so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache'' because it doesn't have any data.
*You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache'' and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
*You can optimize a query by making it unfiltered.
**This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
*You can optimize a query by eliminating false positives.
**False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
**False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index. It indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element. MarkLogic compromises by indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
**For example, suppose a query searches for the following path: <code>books/poetry/poem</code>. The structure index will match documents containing <code>books/poetry</code> and <code>poetry/poem</code>. The resulting documents are a probable match, but could contain false positives, such as <code>books/poetry/fantasy/poetry/poem</code>.
**You can eliminate false positives in structure queries by giving elements unique names. Such names are unique no matter where they are in the structure, and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name "poemTextLine".
*You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
**It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
**It is expensive to process documents on the initiating server and the requester.
**It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
**The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
**The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map'' process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced'' at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses '''term''', '''range''', and '''semantic''' indexes.
*'''A term index''' associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
*'''A range index''' is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
*'''A semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory mapped files. They stay in RAM unless you run out of RAM, in which case the operating system pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to contain all indexes in memory.

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at merging sorted lists of numbers.
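
As a rough sketch of why merging per-stand results is cheap, the following uses Python's standard <code>heapq.merge</code> to combine already-sorted lists of document IDs. The stand contents are made-up numbers, and <code>merge_stand_results</code> is a hypothetical name; MarkLogic's internal merge is not shown in this article.

```python
import heapq


def merge_stand_results(stand_results):
    """Merge already-sorted doc ID lists into one sorted list in a single pass."""
    return list(heapq.merge(*stand_results))


# Hypothetical matches from three stands; each list is already sorted.
stand_a = [3, 17, 42]
stand_b = [5, 21, 99]
stand_c = [1, 40]
final_ids = merge_stand_results([stand_a, stand_b, stand_c])
```

Because each input is already sorted, the merge only ever compares the current head of each list, so no full re-sort is needed.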

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

=== What specific indexes does MarkLogic have? ===
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents '''as if they contained only:'''

*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>

*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>

*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>

*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>

*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>

*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>

*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.

*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.

*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>

*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>

*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>

*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>

*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>

*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>

*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>
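
The word, phrase, and element-word views in the list above can be mimicked with a few lines of code. This is only an illustration of the "altered view" idea, assuming whitespace tokenization and two-word phrases; the function names are hypothetical and MarkLogic's real tokenizer is more sophisticated.

```python
def words(text):
    """Flat list of words, as in the Words view."""
    return text.split()


def phrases(text, size=2):
    """Flat list of adjacent-word phrases, as in the Phrases view."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def element_words(element_name, text):
    """Words scoped to the element that contains them, as in the Element-words view."""
    return {element_name: words(text)}


line = "Mary had a little lamb"
```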

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.
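
One way to picture composability is to treat every query as a function that returns a set of document IDs, so that AND, OR, and NOT become set intersection, union, and difference. The sketch below is a toy model under that assumption. The Python functions <code>word_query</code>, <code>and_query</code>, <code>or_query</code>, and <code>not_query</code> are hypothetical and only loosely named after the cts constructors; they are not the MarkLogic API.

```python
def word_query(word):
    """Leaf query: the documents whose term list contains the word."""
    return lambda index, all_ids: set(index.get(word, ()))


def and_query(*queries):
    """Intersect sub-queries; with no sub-queries it matches every document."""
    def run(index, all_ids):
        result = set(all_ids)
        for query in queries:
            result &= query(index, all_ids)
        return result
    return run


def or_query(*queries):
    """Union the results of the sub-queries."""
    def run(index, all_ids):
        result = set()
        for query in queries:
            result |= query(index, all_ids)
        return result
    return run


def not_query(query):
    """Every document that the sub-query does not match."""
    return lambda index, all_ids: set(all_ids) - query(index, all_ids)


# Hypothetical term index and the full set of document IDs.
index = {"mary": [1, 3], "lamb": [1, 2], "school": [3]}
all_ids = {1, 2, 3}

# "mary" AND NOT ("school" OR "wolf"), nested three levels deep.
query = and_query(word_query("mary"),
                  not_query(or_query(word_query("school"), word_query("wolf"))))
matches = sorted(query(index, all_ids))
```

Because every constructor returns the same kind of object, any query can be nested inside any other, which is the property the text calls composability.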

== Search Functions in MarkLogic ==
=== Executing a Search ===
*'''<code>[http://docs.marklogic.com/cts:search cts:search]</code>''' is the most important function because it executes a search and returns matching nodes.
**'''You must include an XPath expression''' to define which nodes to ''search'' and ''return''.
**The search returns ''nodes'', which can be documents, elements, attributes, or text. Thus, a search doesn't have to return entire documents. Depending on the scope of your XPath statement, it may return entire documents, or one matching node from each document, or many matching nodes from each document.
**'''You must include a cts query expression''' to filter the contents of ''nodes'' selected by the XPath expression. This allows you to search for specific content within specific nodes and return only those nodes that contain that content. If you want to return all the nodes selected by the XPath expression, you can use the empty query expression: <code>cts:and-query(())</code>.
**You may limit the search to specific forests. You may set options like unfiltered, faceted, unchecked, quality weight, relevance scoring method, and relevance trace.
**Use Cases:
***A search can return entire documents whose contents match the XPath and query expressions.
***A search can extract and return titles from documents as long as the contents of the titles match the XPath and query expressions.
***A search can extract and return all the paragraphs from documents as long as the contents of the paragraphs match the XPath and query expressions.
**Notes:
***There is no functional difference between searching entire documents and elements because an element has the same structure as a document. Both may contain complex nested elements or both may simply contain a value. In both cases, MarkLogic retrieves entire documents. In the case of elements, it extracts the requested elements from each retrieved document.
***Searching attributes, text nodes, and childless elements is similar to searching documents and elements with children — except the cts query cannot look for nested structures.

*'''<code>[http://docs.marklogic.com/cts:contains cts:contains]</code>''' is the second most important function because it returns true when a cts query finds at least one matching node in the nodes returned by an XPath expression (or a sequence of values). In other words, it returns true when at least one of the specified nodes '''contains''' the precise combination of words, phrases, values, structures, and/or triples in a cts query.

=== Constructing a Query ===
You compose a query using cts query constructor functions. You then execute the query by passing it into <code>cts:search, cts:contains, search:search</code>, and lexicon functions. MarkLogic has 30+ query constructors. They all end in '''"-query"'''.

==== Composite Query Constructors ====
The composite query constructors build up new queries from other queries, and queries can be nested within queries.

*'''<code>[http://docs.marklogic.com/cts:and-query cts:and-query]</code>''' returns a query that intersects two or more sub-queries. You may optionally specify an "ordered" option that requires document contents to match sub-queries in the order they are listed; for example if the sub-queries are "To be" and "or not to be", then the query will match documents that have the phrase "To be" ''before'' "or not to be". The default is "unordered", which allows each sub-query to match the content in any order that it occurs in the document. If you pass in the empty sequence, such as <code>cts:and-query(())</code>, then it will match every document in the database.

*'''<code>[http://docs.marklogic.com/cts:or-query cts:or-query]</code>''' returns a query that unions two or more sub-queries.

*'''<code>[http://docs.marklogic.com/cts:not-query cts:not-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:and-not-query cts:and-not-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:element-query cts:element-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:document-fragment-query cts:document-fragment-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:locks-query cts:locks-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:properties-query cts:properties-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:near-query cts:near-query]</code>'''

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Category: MarkLogic]]

MarkLogic Query and Search

2014-11-15T22:54:25Z

Bowersmt: Added more information about searching

[[MarkLogic|« Back to MarkLogic]]

== Searching and Querying MarkLogic ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. MarkLogic collections group documents together in any way you want. Collections are like views in a relational database, which group together a filtered set of rows from one or more tables, but collections do not join and merge document content and views cannot arbitrarily include any row from any table. A document in MarkLogic can have a flat structure like a relational table, or it can contain complex nestings of data. Each nesting of data in a document represents another table in a relational database. For example, a single document in MarkLogic may represent many relational tables, and how they are nested represents their one-to-one, many-to-one, and many-to-many relationships.

Both MarkLogic and relational databases use indexes to resolve queries without having to scan through all documents or rows. MarkLogic indexes are designed to find any word, phrase, value, element, or element structure -- no matter where it occurs in document structure, no matter how many times it occurs within a document, and no matter what type of document contains it. For example, you can search for and return a <title> element and MarkLogic will return the contents of all <title> elements it finds — even if <title> elements are in different types of documents. This is similar to doing a union query across multiple tables in relational databases.

MarkLogic indexes also let you search for words, phrases, values, elements, or element structures in very specific locations, such as in only certain types of documents or within certain document structures, or in specific collections or folders. For example, you can search for and return all <title> elements that are found in the header section of poetry documents in the Shakespeare and Tennyson collections.

MarkLogic is not designed for joining documents like a relational database. A relational join searches for matching values across two or more tables and then merges matching rows into a single row. This can be done in MarkLogic, but you have to write advanced code to iterate through query results to match values across different document types, and merge matching documents into a single document. Also since MarkLogic documents are often nested structures, you can't simply merge them, you have to copy nested content from one document into a specific nested location in another.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows with 4 IOs per row, the database will do the same amount of IO if it reads in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of rows in a table. On analytical data warehouses, full table scans are always preferred.
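
The 25% rule of thumb follows directly from the IO counts given above. A small calculation makes the break-even point explicit, assuming the figures from the text (about 4 IOs per row through the index, 1 IO per row for a scan); the function names are hypothetical.

```python
def index_lookup_ios(total_rows, fraction, ios_per_row=4):
    """IOs needed to fetch a fraction of the rows through a B-Tree index."""
    return total_rows * fraction * ios_per_row


def full_scan_ios(total_rows):
    """IOs needed to read every row once (1 IO per row, as in the text)."""
    return total_rows


def break_even_fraction(ios_per_row=4):
    """Fraction of rows at which index lookups cost as much IO as a full scan."""
    return 1 / ios_per_row
```

With 4 IOs per row the break-even fraction is 1/4, and with 5 IOs per row it drops to 1/5, so the threshold falls as index lookups get more expensive. The calculation counts IOs only; it does not model the additional advantage that scan IOs are sequential.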

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is because index and table records are stored sequentially on disk. This often makes it faster to take advantage of the speed of sequential disk IO and directly load all index or table rows into RAM and process them in RAM, as opposed to walking a B-Tree structure and doing multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:
#It validates the query request from the client
#It parses and optimizes the query
#It sends the query to all servers that have forests in the database
##Each server sends it to its forests
##Each forest sends it to each stand
##Each stand uses its indexes to find matching documents
###All stands and forests query the indexes in parallel
##Forests combine the matching document IDs from all their stands
##Forests sort the matching document IDs (in relevance score order or value order)
##When complete, each forest returns a sorted iterator to the initiating server
#When every forest has returned an iterator, it walks the sorted iterators from all the forests and combines the results in sorted order
#It retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
##Forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
#It filters returned documents to remove false positive matches
#It optionally transforms documents
##It creates new documents by extracting matching elements, changing structure, changing file format, etc.
#It returns all matching (optionally transformed) documents to the requester
#It saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests
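
The gather side of the steps above can be sketched as follows: each forest hands back a sorted iterator, and the initiating server lazily merges them, filters out false positives, and stops once a page is full. This is a simplified model, not MarkLogic code; the forest contents, scores, and the names <code>forest_results</code>, <code>gather</code>, and <code>is_true_match</code> are all assumptions made for the illustration.

```python
import heapq
from itertools import islice


def forest_results(scored_docs):
    """Simulate one forest: yield (score, doc_id) pairs, highest relevance first."""
    return iter(sorted(scored_docs, reverse=True))


def gather(forest_iterators, page_size, is_true_match):
    """Merge the forests' sorted iterators, drop false positives, return one page."""
    merged = heapq.merge(*forest_iterators, reverse=True)  # lazy, highest score first
    verified = (doc_id for score, doc_id in merged if is_true_match(doc_id))
    return list(islice(verified, page_size))


# Hypothetical relevance scores returned by two forests.
forest_1 = forest_results([(0.9, "a.xml"), (0.4, "d.xml")])
forest_2 = forest_results([(0.7, "b.xml"), (0.6, "c.xml")])

# Pretend that filtering reveals b.xml to be a false positive.
page = gather([forest_1, forest_2], page_size=3,
              is_true_match=lambda doc_id: doc_id != "b.xml")
```

Because the merge is lazy, the initiating server only pulls as many entries from each forest iterator as it needs to fill the requested page.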

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized'' evaluation process would compete with the ''parallel'' index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache'' to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
*E-nodes and d-nodes can be scaled independently to match the load.
*You can optimize an e-node by minimizing its ''Compressed Tree Cache'' and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache'' so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache'' because it doesn't have any data.
*You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache'' and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
*You can optimize a query by making it unfiltered.
**This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
*You can optimize a query by eliminating false positives.
**False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
**False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index that records relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element, so it compromises by indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
**For example, suppose a query searches for the following path: <code>books/poetry/poem</code>. The structure index will match documents containing <code>books/poetry</code> and <code>poetry/poem</code>. The resulting documents are a probable match, but could contain false positives, such as <code>books/poetry/fantasy/poetry/poem</code>.
**You can eliminate false positives in structure queries by giving elements unique names. Such names are unique no matter where they are in the structure, and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name "poemTextLine".
*You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
**It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
**It is expensive to process documents on the initiating server and the requester.
**It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
**The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
**The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.
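
The false-positive example in the list above (<code>books/poetry/poem</code> versus <code>books/poetry/fantasy/poetry/poem</code>) can be reproduced with a toy structure index that stores only parent/child pairs. This is a sketch of the idea as described here, not of MarkLogic's actual index format; all function names are hypothetical.

```python
def parent_child_pairs(path):
    """Break a path such as 'a/b/c' into its parent/child pairs: (a, b), (b, c)."""
    steps = path.split("/")
    return set(zip(steps, steps[1:]))


def index_matches(doc_paths, query_path):
    """Index resolution: every parent/child pair of the query occurs in the document."""
    indexed = set()
    for path in doc_paths:
        indexed |= parent_child_pairs(path)
    return parent_child_pairs(query_path) <= indexed


def filter_matches(doc_paths, query_path):
    """Filtering: open the document and confirm the full path occurs contiguously."""
    query = query_path.split("/")
    for path in doc_paths:
        steps = path.split("/")
        for i in range(len(steps) - len(query) + 1):
            if steps[i:i + len(query)] == query:
                return True
    return False


query_path = "books/poetry/poem"
true_match = ["books/poetry/poem"]
false_positive = ["books/poetry/fantasy/poetry/poem"]
```

Both documents pass <code>index_matches</code>, but only the first passes <code>filter_matches</code>. An unfiltered query would return both, which is why eliminating false positives at the modeling stage matters.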

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map'' process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced'' at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses '''term''', '''range''', and '''semantic''' indexes.
*'''A term index''' associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
*'''A range index''' is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
*'''A semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).
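
The "double term index" description of a range index can be sketched as two maps kept side by side: value to document IDs, and document ID to values, plus a sorted list of values for range lookups. The class below is an illustrative model only, with hypothetical names, and says nothing about how MarkLogic lays the index out in memory or on disk.

```python
import bisect
from collections import defaultdict


class RangeIndex:
    """Toy range index: value -> doc IDs and doc ID -> values."""

    def __init__(self):
        self._values = []                        # sorted distinct values
        self._value_to_docs = defaultdict(set)   # value -> doc IDs
        self._doc_to_values = defaultdict(set)   # doc ID -> values

    def add(self, doc_id, value):
        if value not in self._value_to_docs:
            bisect.insort(self._values, value)   # keep the value list sorted
        self._value_to_docs[value].add(doc_id)
        self._doc_to_values[doc_id].add(value)

    def docs_in_range(self, low, high):
        """Doc IDs holding any value v with low <= v <= high."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        found = set()
        for value in self._values[start:end]:
            found |= self._value_to_docs[value]
        return sorted(found)

    def values_for(self, doc_id):
        """All indexed values of one document, without opening the document."""
        return sorted(self._doc_to_values.get(doc_id, ()))


# Hypothetical typed values, e.g. a lineSequence element.
range_index = RangeIndex()
for doc_id, value in [(1, 3), (2, 7), (3, 12), (1, 9)]:
    range_index.add(doc_id, value)
```

The first map answers "which documents have a value between 5 and 10?", and the second answers "which values does document 1 hold?", which is the direction needed for sorting by value.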

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory-mapped files. They stay in RAM while RAM is available; when it runs out, the operating system pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to hold all indexes in memory.

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at merging sorted lists of numbers.

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

=== What specific indexes does MarkLogic have? ===
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents '''as if they contained only:'''

*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>

*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>

*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>

*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>

*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>

*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>

*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.

*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.

*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>

*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>

*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>

*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>

*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>

*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>

*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>
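
The three kinds of RDF links in the list above are all plain subject/predicate/object triples, so one pattern-matching routine can query them. The sketch below uses the example URIs from the list; <code>match_triples</code> is a hypothetical helper that scans a list, whereas a real semantic index would avoid the scan.

```python
BASE = "http://example.org"

# The three example triples from the list, as (subject, predicate, object) tuples.
triples = [
    (f"{BASE}/documents/thisDoc", f"{BASE}/predicates/relatesSomehowTo",
     f"{BASE}/documents/thatDoc"),
    (f"{BASE}/documents/thisDoc", f"{BASE}/predicates/relatesSomehowTo",
     f"{BASE}/documents/someConceptWithNoDocumentAtTheURI"),
    (f"{BASE}/documents/thisDoc", f"{BASE}/predicates/ageInYears", 12),
]


def match_triples(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


links = match_triples(triples, predicate=f"{BASE}/predicates/relatesSomehowTo")
ages = match_triples(triples, obj=12)
```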

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.

== Search Functions in MarkLogic ==
=== Executing a Search ===
*'''<code>[http://docs.marklogic.com/cts:search cts:search]</code>''' is the most important function because it executes a search and returns matching nodes.
**'''You must include an XPath expression''' to define which nodes to ''search'' and ''return''.
**The search returns ''nodes'', which can be documents, elements, attributes, or text. Thus, a search doesn't have to return entire documents. Depending on the scope of your XPath statement, it may return entire documents, or one matching node from each document, or many matching nodes from each document.
**'''You must include a cts query expression''' to filter the contents of ''nodes'' selected by the XPath expression. This allows you to search for specific content within specific nodes and return only those nodes that contain that content. If you want to return all the nodes selected by the XPath expression, you can use the empty query expression: <code>cts:and-query(())</code>.
**You may limit the search to specific forests. You may set options like unfiltered, faceted, unchecked, quality weight, relevance scoring method, and relevance trace.
**Use Cases:
***A search can return entire documents whose contents match the XPath and query expressions.
***A search can extract and return titles from documents as long as the contents of the titles match the XPath and query expressions.
***A search can extract and return all the paragraphs from documents as long as the contents of the paragraphs match the XPath and query expressions.
**Notes:
***There is no functional difference between searching entire documents and elements because an element has the same structure as a document. Both may contain complex nested elements or both may simply contain a value. In both cases, MarkLogic retrieves entire documents. In the case of elements, it extracts the requested elements from each retrieved document.
***Searching attributes, text nodes, and childless elements is similar to searching documents and elements with children — except the cts query cannot look for nested structures.

*'''<code>[http://docs.marklogic.com/cts:contains cts:contains]</code>''' is the second most important function because it returns true when a cts query finds at least one matching node in the nodes returned by an XPath expression (or a sequence of values). In other words, it returns true when at least one of the specified nodes '''contains''' the precise combination of words, phrases, values, structures, and/or triples in a cts query.

=== Constructing a Query ===
You compose a query using cts query constructor functions. You then execute the query by passing it into <code>cts:search, cts:contains, search:search</code>, and lexicon functions. MarkLogic has 30+ query constructors. They all end in '''"-query"'''.

==== Composite Query Constructors ====
The composite query constructors build up new queries from other queries, and queries can be nested within queries.

*'''<code>[http://docs.marklogic.com/cts:and-query cts:and-query]</code>''' returns a query that intersects two or more sub-queries.

*'''<code>[http://docs.marklogic.com/cts:or-query cts:or-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:not-query cts:not-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:and-not-query cts:and-not-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:element-query cts:element-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:document-fragment-query cts:document-fragment-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:locks-query cts:locks-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:properties-query cts:properties-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:near-query cts:near-query]</code>'''

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Category: MarkLogic]]

MarkLogic Query and Search

2014-11-15T18:02:20Z

Bowersmt: Added more info on contains()

[[MarkLogic|« Back to MarkLogic]]

== Searching and Querying MarkLogic ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. MarkLogic collections group documents together, and this is a little bit like views in a relational database, which group together a filtered set of rows from one or more tables. A document in MarkLogic can have a flat structure like a relational table, or it can contain complex nestings of data. Each nesting of data in a document represents another table in a relational database. For example, a single document in MarkLogic may represent many relational tables and its nested structure represents their one-to-one, many-to-one, and many-to-many relationships. Both MarkLogic and relational databases use indexes to resolve queries without having to scan through rows or documents.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows at 4 IOs per row, the database does the same amount of IO as it would by reading in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of rows in a table. In analytical data warehouses, full table scans are usually preferred.
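The break-even arithmetic above can be checked with a minimal sketch. The function names and the table size are hypothetical; the 4 IOs per indexed row and 1 IO per scanned row are the rule-of-thumb figures from the paragraph, not measurements.

```python
def index_lookup_ios(total_rows: int, fraction: float, ios_per_row: int = 4) -> float:
    """Random IOs needed to fetch `fraction` of the rows through a B-Tree index."""
    return total_rows * fraction * ios_per_row


def full_scan_ios(total_rows: int) -> int:
    """IOs needed to read every row once, assuming 1 IO per row."""
    return total_rows


rows = 1_000_000  # hypothetical table size
for fraction in (0.01, 0.25, 0.50):
    via_index = index_lookup_ios(rows, fraction)
    via_scan = full_scan_ios(rows)
    # At equal IO counts the scan still wins, because its IO is sequential.
    cheaper = "index" if via_index < via_scan else "scan"
    print(f"{fraction:.0%} of rows: index={via_index:,.0f} IOs, scan={via_scan:,} IOs -> {cheaper}")
```

At 25% the two IO counts are equal, which is where the rule of thumb places the crossover.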

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is because index and table records are stored sequentially on disk. It is often faster to take advantage of sequential disk IO by loading all index or table rows into RAM and processing them there than to walk a B-Tree structure and do multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:
#it validates the query request from the client
#it parses and optimizes the query
#it sends the query to all servers that have forests in the database
##each server sends it to its forests
##each forest sends it to each stand
##each stand uses its indexes to find matching documents
###all stands and forests query the indexes in parallel
##forests combine the matching document IDs from all their stands
##forests sort the matching document IDs (in relevance score order or value order)
##when complete, each forest returns a sorted iterator to the initiating server
#when each forest has returned an iterator, it walks the sorted iterators from all the forests and combines the results in sorted order
#it retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
##forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
#it filters returned documents to remove false positive matches
#it optionally transforms documents
##it creates new documents by extracting matching elements, changing structure, changing file format, etc.
#it returns all matching (optionally transformed) documents to the requester
#it saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests
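The gather step, in which the initiating server walks the sorted iterators returned by the forests, can be sketched as a k-way merge. This is only an illustration under stated assumptions: the forests, scores, and document IDs are made up, and <code>heapq.merge</code> stands in for whatever MarkLogic does internally.

```python
import heapq
from itertools import islice
from typing import Iterator, List, Tuple

# A match is a (relevance score, document ID) pair.
Match = Tuple[float, int]


def forest_results(matches: List[Match]) -> Iterator[Match]:
    """Stand-in for a forest: sort local matches by descending score, return an iterator."""
    return iter(sorted(matches, key=lambda m: -m[0]))


def merge_forests(iterators: List[Iterator[Match]]) -> Iterator[Match]:
    """Lazily walk the sorted iterators, yielding matches in global score order."""
    return heapq.merge(*iterators, key=lambda m: -m[0])


# Three hypothetical forests, each with its own matches.
forest_a = forest_results([(0.9, 101), (0.4, 102)])
forest_b = forest_results([(0.7, 201), (0.2, 202)])
forest_c = forest_results([(0.8, 301)])

# Only the first page of results is pulled from the merged iterator.
first_page = [doc_id for _, doc_id in islice(merge_forests([forest_a, forest_b, forest_c]), 3)]
print(first_page)  # [101, 301, 201]
```

Because the merge is lazy, the initiating server only has to advance each iterator as far as the requested page needs.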

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized'' evaluation process would compete with the ''parallel'' index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache'' to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
*E-nodes and d-nodes can be scaled independently to match the load.
*You can optimize an e-node by minimizing its ''Compressed Tree Cache'' and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache'' so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache'' because it doesn't have any data.
*You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache'' and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
*You can optimize a query by making it unfiltered.
**This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
*You can optimize a query by eliminating false positives.
**False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
**False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index, which indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element, so it compromises by indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
**For example, suppose a query searches for the following path: <code>books/poetry/poem</code>. The structure index will match documents containing <code>books/poetry</code> and <code>poetry/poem</code>. The resulting documents are probable matches, but they could include false positives, such as <code>books/poetry/fantasy/poetry/poem</code>.
**You can eliminate false positives in structure queries by giving elements unique names. Such names are unique no matter where they are in the structure, and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name "poemTextLine".
*You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
**It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
**It is expensive to process documents on the initiating server and the requester.
**It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
**The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
**The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.
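The false-positive case described above (<code>books/poetry/poem</code> versus <code>books/poetry/fantasy/poetry/poem</code>) can be reproduced with a small model. This is a sketch, assuming a structure index that stores only parent/child pairs; the function names are hypothetical and do not correspond to any MarkLogic API.

```python
from typing import List, Set, Tuple


def parent_child_pairs(paths: List[str]) -> Set[Tuple[str, str]]:
    """Index only the parent/child pairs found in a document's element paths."""
    pairs: Set[Tuple[str, str]] = set()
    for path in paths:
        steps = path.split("/")
        pairs.update(zip(steps, steps[1:]))
    return pairs


def index_match(query_path: str, doc_paths: List[str]) -> bool:
    """Index resolution: every parent/child pair of the query is in the index."""
    return parent_child_pairs([query_path]) <= parent_child_pairs(doc_paths)


def filter_match(query_path: str, doc_paths: List[str]) -> bool:
    """Filtering: open the document and check that the steps occur consecutively."""
    query = query_path.split("/")
    for path in doc_paths:
        steps = path.split("/")
        for start in range(len(steps) - len(query) + 1):
            if steps[start:start + len(query)] == query:
                return True
    return False


query = "books/poetry/poem"
true_match = ["books/poetry/poem"]
false_positive = ["books/poetry/fantasy/poetry/poem"]

print(index_match(query, true_match), filter_match(query, true_match))          # True True
print(index_match(query, false_positive), filter_match(query, false_positive))  # True False
```

The second document passes the index check because both <code>books/poetry</code> and <code>poetry/poem</code> occur in it, but only the filtering step notices that they are not adjacent.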

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map'' process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced'' at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses '''term''', '''range''', and '''semantic''' indexes.
*'''A term index''' associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
*'''A range index''' is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
*'''A semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).
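As a rough illustration of the first two index types, the sketch below models a term index as a map from word to document IDs and a range index as the pair of maps described above. The documents, the <code>lineSequence</code> values, and all names are hypothetical; the real on-disk structures are not shown in this article.

```python
from collections import defaultdict

# Hypothetical documents: doc ID -> (text, typed lineSequence value).
docs = {
    1: ("Mary had a little lamb", 1),
    2: ("Its fleece was white as snow", 2),
    3: ("Mary went to school", 5),
}

# Term index: word -> set of doc IDs that contain the word.
term_index = defaultdict(set)
# Range index, modeled as two maps: value -> doc IDs and doc ID -> values.
value_to_docs = defaultdict(set)
doc_to_values = defaultdict(set)

for doc_id, (text, line_sequence) in docs.items():
    for word in text.lower().split():
        term_index[word].add(doc_id)
    value_to_docs[line_sequence].add(doc_id)
    doc_to_values[doc_id].add(line_sequence)


def docs_in_range(low: int, high: int) -> set:
    """Doc IDs whose indexed value falls within [low, high]."""
    return {d for value, ids in value_to_docs.items() if low <= value <= high for d in ids}


print(sorted(term_index["mary"]))   # [1, 3]
print(sorted(docs_in_range(2, 5)))  # [2, 3]
```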

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory mapped files. They stay in RAM unless the server runs short of RAM, in which case the operating system pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to hold all indexes in memory.

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at merging sorted lists of numbers.

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

=== What specific indexes does MarkLogic have? ===
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents '''as if they contained only:'''

*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>

*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>

*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>

*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>

*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>

*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>

*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.

*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.

*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>

*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>

*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>

*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>

*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>

*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>

*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>
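The word and phrase views in the list above can be derived from a line of text with a few lines of code. This is a sketch only: it assumes whitespace tokenization and two-word phrases, which is what the examples show, and the function names are hypothetical.

```python
from typing import Dict, List


def words(text: str) -> List[str]:
    """Flat list of words, split on whitespace."""
    return text.split()


def phrases(text: str) -> List[str]:
    """Flat list of two-word phrases built from adjacent words."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


line = "Mary had a little lamb"

# Element-scoped views keep the element name next to its words or phrases.
element_words: Dict[str, List[str]] = {"line": words(line)}
element_phrases: Dict[str, List[str]] = {"line": phrases(line)}

print(words(line))    # ['Mary', 'had', 'a', 'little', 'lamb']
print(phrases(line))  # ['Mary had', 'had a', 'a little', 'little lamb']
```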

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.
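Composability can be pictured as a small expression tree whose leaves look up document IDs in a term index and whose inner nodes combine the resulting sets. The classes and the <code>evaluate</code> function below are hypothetical and only mimic the idea; they are not the cts API.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Word:
    term: str


@dataclass(frozen=True)
class And:
    parts: tuple


@dataclass(frozen=True)
class Or:
    parts: tuple


@dataclass(frozen=True)
class Not:
    part: object


def evaluate(query, index: Dict[str, Set[int]], all_docs: Set[int]) -> Set[int]:
    """Resolve a nested query to a set of doc IDs using set operations."""
    if isinstance(query, Word):
        return set(index.get(query.term, set()))
    if isinstance(query, And):
        result = set(all_docs)  # an empty And matches every document
        for part in query.parts:
            result &= evaluate(part, index, all_docs)
        return result
    if isinstance(query, Or):
        result: Set[int] = set()
        for part in query.parts:
            result |= evaluate(part, index, all_docs)
        return result
    if isinstance(query, Not):
        return set(all_docs) - evaluate(query.part, index, all_docs)
    raise TypeError(f"unsupported query node: {query!r}")


# Hypothetical term index over three documents.
index = {"mary": {1, 3}, "lamb": {1}, "school": {3}}
all_docs = {1, 2, 3}

nested = And((Word("mary"), Not(Word("school"))))
either = Or((Word("lamb"), Word("school")))

print(evaluate(nested, index, all_docs))          # {1}
print(sorted(evaluate(either, index, all_docs)))  # [1, 3]
```

Because every node returns a set of document IDs, any query can be nested inside any other without special cases.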

== Search Functions in MarkLogic ==
=== Executing a Search ===
*'''<code>[http://docs.marklogic.com/cts:search cts:search]</code>''' is the most important function because it executes a search and returns matching nodes.
**'''You must include an XPath expression''' to define which nodes to ''search'' and ''return''.
**The search returns ''nodes'', which can be documents, elements, attributes, or text. Thus, a search doesn't have to return entire documents. Depending on the scope of your XPath statement, it may return entire documents, or one matching node from each document, or many matching nodes from each document.
**'''You must include a cts query expression''' to filter the contents of ''nodes'' selected by the XPath expression. This allows you to search for specific content within specific nodes and return only those nodes that contain that content. If you want to return all the nodes selected by the XPath expression, you can use the empty query expression: <code>cts:and-query(())</code>.
**You may limit the search to specific forests. You may set options like unfiltered, faceted, unchecked, quality weight, relevance scoring method, and relevance trace.
**Use Cases:
***A search can return entire documents whose contents match the XPath and query expressions.
***A search can extract and return titles from documents as long as the contents of the titles match the XPath and query expressions.
***A search can extract and return all the paragraphs from documents as long as the contents of the paragraphs match the XPath and query expressions.
**Notes:
***There is no functional difference between searching entire documents and searching elements because an element has the same structure as a document. Both may contain complex nested elements, or both may simply contain a value. In both cases, MarkLogic retrieves entire documents. In the case of elements, it extracts the requested elements from each retrieved document.
***Searching attributes, text nodes, and childless elements is similar to searching documents and elements with children, except that the cts query cannot look for nested structures.

*'''<code>[http://docs.marklogic.com/cts:contains cts:contains]</code>''' is the second most important function because it returns true when a cts query finds at least one matching node in the nodes returned by an XPath expression (or a sequence of values). In other words, it returns true when at least one of the specified nodes '''contains''' the precise combination of words, phrases, values, structures, and/or triples in a cts query.

=== Constructing a Query ===
You compose a query using cts query constructor functions. You then execute the query by passing it into <code>cts:search, cts:contains, search:search</code>, and lexicon functions. MarkLogic has 30+ query constructors. They all end in '''"-query"'''.

==== Composite Query Constructors ====
The composite query constructors build up new queries from other queries, and queries can be nested within queries.

*'''<code>[http://docs.marklogic.com/cts:and-query cts:and-query]</code>''' returns a query that intersects two or more sub-queries.

*'''<code>[http://docs.marklogic.com/cts:or-query cts:or-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:not-query cts:not-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:and-not-query cts:and-not-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:element-query cts:element-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:document-fragment-query cts:document-fragment-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:locks-query cts:locks-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:properties-query cts:properties-query]</code>'''

*'''<code>[http://docs.marklogic.com/cts:near-query cts:near-query]</code>'''

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Category: MarkLogic]]

MarkLogic Query and Search

2014-11-15T07:12:09Z

Bowersmt: Added some search functions

[[MarkLogic|« Back to MarkLogic]]

== Searching and Querying MarkLogic ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. Both use indexes to resolve queries without having to scan through rows or documents.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows at 4 IOs per row, the database does the same amount of IO as it would by reading in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of rows in a table. In analytical data warehouses, full table scans are usually preferred.

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is because index and table records are stored sequentially on disk. It is often faster to take advantage of sequential disk IO by loading all index or table rows into RAM and processing them there than to walk a B-Tree structure and do multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:
#it validates the query request from the client
#it parses and optimizes the query
#it sends the query to all servers that have forests in the database
##each server sends it to its forests
##each forest sends it to each stand
##each stand uses its indexes to find matching documents
###all stands and forests query the indexes in parallel
##forests combine the matching document IDs from all their stands
##forests sort the matching document IDs (in relevance score order or value order)
##when complete, each forest returns a sorted iterator to the initiating server
#when each forest has returned an iterator, it walks the sorted iterators from all the forests and combines the results in sorted order
#it retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
##forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
#it filters returned documents to remove false positive matches
#it optionally transforms documents
##it creates new documents by extracting matching elements, changing structure, changing file format, etc.
#it returns all matching (optionally transformed) documents to the requester
#it saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized'' evaluation process would compete with the ''parallel'' index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache'' to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
*E-nodes and d-nodes can be scaled independently to match the load.
*You can optimize an e-node by minimizing its ''Compressed Tree Cache'' and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache'' so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache'' because it doesn't have any data.
*You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache'' and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
*You can optimize a query by making it unfiltered.
**This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
*You can optimize a query by eliminating false positives.
**False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
**False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index, which indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element, so it compromises by indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
**For example, suppose a query searches for the following path: <code>books/poetry/poem</code>. The structure index will match documents containing <code>books/poetry</code> and <code>poetry/poem</code>. The resulting documents are probable matches, but they could include false positives, such as <code>books/poetry/fantasy/poetry/poem</code>.
**You can eliminate false positives in structure queries by giving elements unique names. Such names are unique no matter where they are in the structure, and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name "poemTextLine".
*You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
**It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
**It is expensive to process documents on the initiating server and the requester.
**It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
**The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
**The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map'' process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced'' at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses '''term''', '''range''', and '''semantic''' indexes.
*'''A term index''' associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
*'''A range index''' is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
*'''A semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory mapped files. They stay in RAM unless the server runs short of RAM, in which case the operating system pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to hold all indexes in memory.

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at merging sorted lists of numbers.

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

=== What specific indexes does MarkLogic have? ===
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents '''as if they contained only:'''

*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>

*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>

*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>

*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>

*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>

*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>

*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.

*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.

*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>

*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>

*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>

*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>

*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>

*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>

*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.
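A few of the views listed above can be derived from one line of text as follows. This is a rough sketch: the helper names <code>words</code> and <code>phrases</code> are hypothetical, phrases are assumed to be two-word sequences as in the examples, and a real indexer would also handle punctuation, case, and stemming.

```python
def words(text):
    """Flat list of words without structure."""
    return text.split()

def phrases(text):
    """Flat list of two-word phrases built from adjacent words."""
    w = text.split()
    return [f"{a} {b}" for a, b in zip(w, w[1:])]

line = "Mary had a little lamb"

word_view = words(line)
phrase_view = phrases(line)
element_value_view = {"line": line}            # element with its string value
element_word_view = {"line": words(line)}      # element with its words
path_value_view = {"poem.text.line": line}     # hierarchical structure with value

print(word_view)
print(phrase_view)
```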

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.
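Composability can be modeled by treating every query as a function that returns a set of document IDs, so that AND, OR, and NOT become set operations. The sketch below is an analogy rather than the cts API: <code>term</code>, <code>and_</code>, <code>or_</code>, and <code>not_</code> are hypothetical helpers, and the index is invented sample data.

```python
def term(word):
    """Leaf query: documents whose index entry contains the word."""
    return lambda index, all_ids: set(index.get(word, ()))

def and_(*queries):
    """Intersect sub-queries; with no sub-queries, every document matches."""
    def run(index, all_ids):
        result = set(all_ids)
        for q in queries:
            result &= q(index, all_ids)
        return result
    return run

def or_(*queries):
    """Union of sub-queries."""
    def run(index, all_ids):
        result = set()
        for q in queries:
            result |= q(index, all_ids)
        return result
    return run

def not_(query):
    """Every document the sub-query does not match."""
    return lambda index, all_ids: set(all_ids) - query(index, all_ids)

# Hypothetical term index: word -> document IDs
index = {"mary": {1, 3}, "lamb": {1, 2}, "snow": {2}}
all_ids = {1, 2, 3}

# (mary OR snow) AND NOT lamb
nested = and_(or_(term("mary"), term("snow")), not_(term("lamb")))
result = nested(index, all_ids)
print(sorted(result))
```

Because each constructor returns the same kind of object it accepts, queries can be nested to any depth.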

== Search Functions in MarkLogic ==
=== Executing a Search ===
*'''<code>[http://docs.marklogic.com/cts:search cts:search]</code>''' is the most important function because it executes a search and returns matching nodes.
**'''You must include an XPath expression''' to define which nodes to ''search'' and ''return''.
**The search returns ''nodes'', which can be documents, elements, attributes, text, processing instructions, comments, and namespaces. Thus, a search doesn't have to return entire documents. Depending on the scope of your XPath statement, it may return entire documents, or one matching node from each document, or many matching nodes from each document.
**'''You must include a cts query expression''' to filter the contents of ''nodes'' selected by the XPath expression. This allows you to search for specific content within specific nodes and return only those nodes that contain that content. If you want to return all the nodes selected by the XPath expression, you can use the empty query expression: <code>cts:and-query(())</code>.
**You may limit the search to specific forests. You may set options like unfiltered, faceted, unchecked, quality weight, relevance scoring method, and relevance trace.
**Use Cases:
***A search can return entire documents whose contents match the XPath and query expressions.
***A search can extract and return titles from documents as long as the contents of the titles match the XPath and query expressions.
***A search can extract and return all the paragraphs from documents as long as the contents of the paragraphs match the XPath and query expressions.

*'''<code>[http://docs.marklogic.com/cts:contains cts:contains]</code>''' is the second most important function because it returns true when any of a sequence of documents or values contains words, phrases, values, structures or triples specified by a cts query.
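The two-part shape of a search (a path expression that selects nodes, then a query that filters them) can be imitated with Python's standard XML library. This is an analogy only and does not call MarkLogic: <code>search</code> and <code>word_query</code> are hypothetical helpers, the sample poem is invented, and the always-true predicate plays the role that the empty query expression plays in the text above.

```python
import xml.etree.ElementTree as ET

POEM = """<poem>
  <title>Mary's Lamb</title>
  <text>
    <line>Mary had a little lamb</line>
    <line>Its fleece was white as snow</line>
  </text>
</poem>"""

def search(root, path, predicate):
    """Select nodes with a path expression, then keep those the query accepts."""
    return [node for node in root.findall(path) if predicate(node)]

def word_query(word):
    """Query that accepts a node whose text contains the word."""
    return lambda node: word.lower() in "".join(node.itertext()).lower().split()

root = ET.fromstring(POEM)

# Return only the <line> nodes that contain "lamb", not the whole document.
hits = search(root, ".//line", word_query("lamb"))

# Return every node the path selects by using a query that accepts everything.
all_lines = search(root, ".//line", lambda node: True)

print([node.text for node in hits])
```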

=== Constructing a Query ===
You compose a query using cts query constructor functions. You then execute the query by passing it into <code>cts:search, cts:contains, search:search</code>, and lexicon functions. MarkLogic has 30+ query constructors. They all end in '''"-query"'''.

==== Composite Query Constructors ====
The composite query constructors build up new queries from other queries, and queries can be nested within queries.

*'''<code>[http://docs.marklogic.com/cts:and-query cts:and-query]</code>''' returns a query that intersects two or more sub-queries.

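*'''<code>[http://docs.marklogic.com/cts:or-query cts:or-query]</code>''' returns a query that unions two or more sub-queries.

*'''<code>[http://docs.marklogic.com/cts:not-query cts:not-query]</code>''' returns a query that matches whatever its sub-query does not match.

*'''<code>[http://docs.marklogic.com/cts:and-not-query cts:and-not-query]</code>''' returns a query that matches its first sub-query but not its second.

*'''<code>[http://docs.marklogic.com/cts:element-query cts:element-query]</code>''' returns a query that constrains a sub-query to the content of a specific element.

*'''<code>[http://docs.marklogic.com/cts:document-fragment-query cts:document-fragment-query]</code>''' returns a query that matches a sub-query against document fragments.

*'''<code>[http://docs.marklogic.com/cts:locks-query cts:locks-query]</code>''' returns a query that matches a sub-query against locks fragments.

*'''<code>[http://docs.marklogic.com/cts:properties-query cts:properties-query]</code>''' returns a query that matches a sub-query against properties fragments.

*'''<code>[http://docs.marklogic.com/cts:near-query cts:near-query]</code>''' returns a query that matches when its sub-queries match within a specified distance of each other.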

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Category: MarkLogic]]

Installing MarkLogic

2014-11-14T08:55:27Z

Bowersmt: Added back to MarkLogic

[[MarkLogic|« Back to MarkLogic]]

== Installing MarkLogic on Windows ==
(These instructions are for Windows XP, Vista, and 7. Installation and server start-up may differ slightly for other operating systems.)

# [http://developer.marklogic.com/products Download] the latest version of the MarkLogic Server for your operating system.
#* Upon clicking on the download link, you will need to agree to MarkLogic's Terms of Use.
# Execute the MarkLogic Server installer.
#* Choose a "Typical" setup. This will take about 5 minutes.
# After installation is complete, start the MarkLogic Server.
#* (Windows Only) Go to Start > All Programs > MarkLogic Server > Start MarkLogic Server
#** Important! Right-click on "Start MarkLogic Server" and select "Run as administrator". Otherwise, the server may not start.
#** If you get the error message "The application failed to initialize properly...", then see [http://developer.marklogic.com/products/marklogic-server/requirements MarkLogic Server 4.x System Requirements] to download and install the necessary dll for your operating system. Then try starting the server again.
#** If start-up succeeds, you will not see any message.
# Go to the MarkLogic administration console:
#* (Windows Only) Go to Start > All Programs > MarkLogic Server > Admin MarkLogic Server.
#* A browser window will open and you will be prompted to enter a license key.
#** If you do not yet have a license, do one of the following:
#*** (Employees Only) Request a license key by emailing [mailto:DL-ICS-MARKLOGIC DL-ICS-MARKLOGIC].
#**** Enter the licensee and the license key, then click "OK". The MarkLogic Server will restart.
#*** (Non-employees Only) You may request a free license for community development by clicking the "Free" button and then entering the required information in the supplied form.
#**** '''Licensee'''. This is '''your''' name.
#**** '''Company'''. ''Do NOT put the name of the Church or any of its business entities in this field.'' You may put "Home" or "Community" in this field if you will be using the MarkLogic Server for personal or community development.
#**** '''Email'''. This is your personal email address.
#**** Choose "Select Community License".
#**** Make a copy of your license information and then click "OK". The MarkLogic Server will restart.
#* Accept the license agreement by scrolling to the bottom of the page and clicking "Accept". The MarkLogic Server will restart.
#* You will now be prompted to install the initial databases and application servers. Click "OK" to continue. When installation is complete, the server will restart.
#* When prompted to enter an Admin username and password, enter "admin" as the user and "admin" as the password. Confirm the password and click "OK" to continue.
#* When prompted, enter the user name and password you supplied. You will then be redirected to the MarkLogic Server administration console.
# Configure an XDBC server
#* Expand "Groups", then expand "Default", and then select "App Servers".
#*; [[File:marklogic-admin-step1.png]] 
#* Click the tab "Create XDBC"
#** Supply an XDBC server name.
#*** For the Stack Petstore project, enter "stack-petstore".
#** Supply a value for the module directory root.
#*** For the Stack Petstore project, enter "/".
#** Supply a value for the port the XDBC server will listen to for connections.
#*** For the Stack Petstore project, enter "8010".
#** For the modules field, choose "Modules" from the list of options.
#** Leave all other fields as they are, and click "OK" at the top of the page.
#*; [[File:marklogic-admin-step2.png]] 
# To try out your MarkLogic installation, go to [http://localhost:8000 <nowiki>http://localhost:8000</nowiki>] and log in with your admin username.

[[Category:MarkLogic]]

MarkLogic Query and Search

2014-11-14T08:54:41Z

Bowersmt: Added Back to MarkLogic

What is MarkLogic?

2014-11-14T08:53:55Z

Bowersmt: Fixed heading

[[MarkLogic|« Back to MarkLogic]]

== Describing MarkLogic ==
[http://www.marklogic.com/what-is-marklogic/ MarkLogic] is a [http://www.marklogic.com/what-is-marklogic/inside-marklogic/ '''REST application server''', '''document database''', and '''search engine''']. It is REST through and through. It is built specifically for hypertext documents, links, metadata, URIs, MIME types, and HTTP. It is schema-agnostic because it is automatically aware of the independent structure of each of its JSON and XML documents. It is search-centric because it can search for ''any combination'' of words, values, structures, and links within and across documents. It scales horizontally to hundreds of servers within and between data centers while maintaining ACID-compliant transactions.

MarkLogic has the following [http://www.marklogic.com/what-is-marklogic/enterprise-nosql/ enterprise features]:
* [http://www.marklogic.com/resources/java-developers-guide/ Java APIs]
* [http://en.wikipedia.org/wiki/Node.js Node.js APIs]
* [http://developer.marklogic.com/learn/2009-07-search-api-walkthrough Search] and [http://developer.marklogic.com/learn/arch/search-and-indexing Query]
* [http://www.marklogic.com/blog/can-you-pass-the-acid-test/ ACID Transactions]
* [http://www.marklogic.com/resources/marklogic-high-availability-and-disaster-recovery/resource_download/datasheets/ High Availability] and [http://docs.marklogic.com/guide/database-replication/dbrep_intro#chapter Disaster Recovery]
* [http://www.marklogic.com/resources/marklogic-flexible-replication/resource_download/datasheets/ Replication within and across data centers]
* [http://docs.marklogic.com/guide/admin/security#chapter Government-grade Security]
* [https://docs.marklogic.com/guide/cluster.pdf Scalability] and [https://docs.marklogic.com/guide/admin/database-rebalancing#chapter Elasticity]
* [https://docs.marklogic.com/guide/ec2/managing#chapter On-premise or Cloud Deployment (especially AWS)]
* [http://www.marklogic.com/resources/marklogic-and-hadoop/resource_download/datasheets/ Hadoop for Storage and Compute]
* [http://www.marklogic.com/resources/marklogic-semantics-mlw14/ Semantics]

== What is REST? ==
REST is an architectural style that uses a uniform resource identifier ('''URI''') and a web protocol ('''HTTP/HTTPS''') to request and transfer a representation ('''MIME media type''') of the state of a resource ('''document''') at a point in time from a server to a client.

REST was coined and defined by Roy Fielding in his
[http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm dissertation, ''Architectural Styles and the Design of Network-based Software Architectures'']. REST standardizes and documents the patterns used in the world wide web: self-documenting hypertext with data and metadata (HTML), stateless request/response communication protocol (HTTP/HTTPS), resource locators (URL), multiple representations per URL (MIME types), and downloading code on demand to process resources (JavaScript, CSS, etc.).

REST consists of three main concepts:
* (RE) Representation
* (S) State
* (T) Transfer

== Why is MarkLogic RESTful? ==

=== '''Representation and MarkLogic''' ===
* A representation is a document that represents a resource. It has three requirements: MIME media type, URI, and hypertext.

* '''MIME type:''' Each resource can be represented by one or more types of documents. The MIME media type defines the representation a client requests. A client may request the resource to be represented as JSON, XML, HTML, PNG, PDF, etc.
**'''MarkLogic''' stores any type of document and assigns the appropriate [http://developer.marklogic.com/blog/document-formats-part2 MIME type] to it. It can fully index, query, search, and process JSON, XML, and RDF documents. It meets the REST requirement of being able to transform JSON, XML, and RDF documents into any other MIME type. It knows how to execute JavaScript, XQuery, XSLT, and SPARQL documents. It knows how [http://docs.marklogic.com/guide/cpf/default#chapter to transform into XHTML] the content, formatting, and structure of Microsoft Word, PowerPoint, Excel, textual PDF, DocBook, and CSS documents. (It can also use Microsoft Office to create, edit, and manage content in MarkLogic.) It knows how to [https://docs.marklogic.com/guide/search-dev/binary-document-metadata extract metadata and text from over 138 types of binary documents], such as raster images, vector images, videos, archive files, database files, encoded emails, presentations, spreadsheets, word-processing documents, text formats.
**Few NoSQL databases use MIME types to identify the media type of each document. Most NoSQL databases support only one type of data and it is usually proprietary: columnar, BSON, binary, etc. Most cannot transform from one MIME type to another.

* '''URI:''' A resource is identified by a globally unique identifier (URI). '''MarkLogic''' identifies each document with a [http://developer.marklogic.com/try/rest/page2 unique URI]. A document in MarkLogic is like a row in a table in a schema in a relational database. A URI is liberating. It provides random access to any resource anywhere. It is like being able to retrieve any row in any table in any schema of a relational database without having to know what table and schema the row is stored in.
**Navigating URL hierarchy is fundamental to REST. MarkLogic understands the hierarchy within a URI, which is represented by slashes "/", such as https://www.lds.org/scriptures/ot/gen/1. MarkLogic treats the items between the slashes as folders. The URI of each document automatically defines a folder hierarchy and places the document in it. In the example URI above, the document for Genesis chapter 1 is located in the Genesis folder, which is located in the Old Testament folder, which is located in the Scriptures folder on lds.org. '''MarkLogic''' indexes the documents in each folder and its subfolders. This makes it fast and easy to retrieve any or all documents in any folder and/or its subfolders.
**Few NoSQL databases use the URI as the primary key for their documents or data. They also don't index the URI hierarchically to filter documents by folder and subfolder.
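The folder hierarchy implied by a URI can be sketched with the standard library. This is an illustrative model, not MarkLogic's directory index: <code>ancestor_folders</code> and <code>build_folder_index</code> are hypothetical helpers, and the second and third URIs are invented to give the first one siblings.

```python
from collections import defaultdict
from urllib.parse import urlparse

def ancestor_folders(uri):
    """Return every folder that contains the document, outermost first."""
    parts = [p for p in urlparse(uri).path.split("/") if p]
    return ["/" + "/".join(parts[:i]) + "/" for i in range(1, len(parts))]

def build_folder_index(uris):
    """Map each folder to the documents in it and in its subfolders."""
    index = defaultdict(list)
    for uri in uris:
        for folder in ancestor_folders(uri):
            index[folder].append(uri)
    return index

uris = [
    "https://www.lds.org/scriptures/ot/gen/1",
    "https://www.lds.org/scriptures/ot/gen/2",     # hypothetical sibling
    "https://www.lds.org/scriptures/nt/matt/1",    # hypothetical sibling
]

folder_index = build_folder_index(uris)
print(ancestor_folders(uris[0]))
print(folder_index["/scriptures/ot/"])
```

With such an index, retrieving everything under a folder is one lookup rather than a scan of all URIs.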

* '''Hypertext:''' A document should contain data that represents the resource. The data should be '''human readable and self-documenting''', like JSON, XML, RDF, and HTML. It should be '''linked data''' (i.e. the "hyper" in "hypertext"). A document should contain '''metadata links''' about the resource, such as RDF. It should contain '''action links''' to define what further actions can be done with the resource. It should contain '''related links''' to related resources, such as images, audio, video, related documents, etc. Each related link should define what actions can be done with the referenced resource, such as download it, display a link to it, execute a command against it, etc.
**'''Hypertext''' or '''hypermedia''' documents must have all these features. Hyperlinks are what the "hyper" in hypertext and hypermedia refers to. You can't have REST without ''metadata links'' to define what the data means, ''action links'' to know how to work with the resource, and ''related links'' to connect resources. It should all be human readable and self-documenting so a developer does not have to read documentation to know how to interact with a REST web service and its documents.
** '''MarkLogic''' meets all the requirements for hypertext representation. It is designed around MIME types, URIs, and Linked data. It stores documents with their MIME types as [http://www.w3schools.com/json/ JSON], [http://www.w3schools.com/xml/ XML], [http://www.w3schools.com/webservices/ws_rdf_intro.asp RDF], [http://www.w3schools.com/html/ HTML], etc. These documents are human-readable and self-documenting, which MarkLogic leverages to recognize and index each document's data, data structure, metadata links, action links, and related links. This makes it easy to '''search''', '''query''', '''transform''', and '''deliver''' hypertext documents. MarkLogic can also store any type of binary document and deliver it as a related resource, such as an image, video, or audio. MarkLogic is designed to process simple links and RDF links using [http://en.wikipedia.org/wiki/SPARQL SPARQL] and [https://docs.marklogic.com/xinc XInclude]. MarkLogic can represent links in many formats: [https://docs.marklogic.com/xp XPointer], [https://docs.marklogic.com/guide/semantics/loading#id_97709 RDF/XML], [https://docs.marklogic.com/guide/semantics/loading#id_79194 RDF/JSON], [https://docs.marklogic.com/guide/semantics/loading#id_73211 Turtle], [https://docs.marklogic.com/guide/semantics/loading#id_70596 N-Triples], [https://docs.marklogic.com/guide/semantics/loading#id_61596 N-Quads], and [https://docs.marklogic.com/guide/semantics/loading#id_74485 TriG].
**No other NoSQL database natively indexes and fully processes all the document types required for REST hypertext: JSON, XML, RDF, HTML, CSS, and JavaScript.

=== '''State and MarkLogic''' ===
* State in REST exists on the client and in server documents. It does not exist in the communications protocol or in the server as cached session data.

* '''Server:''' All information needed to process a request must be presented in the request and processed against documents in the database. State must only be in the request and in database documents: state cannot be anywhere else, such as in a session cache. The documents in the server define the state of the server. A REST server should explicitly create a state machine that defines the acceptable actions that can be transacted against documents in specific contexts.
**A REST transaction occurs at a '''point in time'''. The state of the data in the request is unchanging, but the state of the documents and state machines in the database is often changing. Since request state and database state are both required to process the request and since shifting state creates unpredictable results, a REST transaction should run at a point in time with unchanging state. Only an ACID-compliant database can ensure consistent state because it isolates each transaction from every other transaction. The only time REST does not need an ACID-compliant database is when database documents do not change, database state machines do not change, or when clients can live with the resulting unreliability and unpredictability.
**'''MarkLogic''' meets all the requirements for REST state. Its web services are stateless: there is no session cache. It is ACID compliant. It ensures each transaction occurs at a point in time and is isolated from all other transactions. This ensures consistent processing during a transaction -- even across billions of documents. MarkLogic is an [http://en.wikipedia.org/wiki/Multiversion_concurrency_control MVCC] database which provides transaction isolation without slowing the performance of reads -- even when documents being read are being modified simultaneously by other transactions. (Also, like any other ACID database, when multiple updates and deletes compete for the same documents, they will impact each other's performance because change has to be serialized.)
** Most other NoSQL databases are not ACID compliant. They are only suitable for REST services when their documents or data do not change or when the rate of change is slow enough or dispersed enough that it creates an acceptable level of unreliability and unpredictability.
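Point-in-time reads under multiversion concurrency control can be sketched as follows. This is a deliberately small model, not MarkLogic's storage engine: the class <code>VersionedStore</code>, its logical clock, and its method names are hypothetical. Each write adds a new version instead of overwriting, so a reader holding an earlier timestamp keeps seeing a consistent state without blocking writers.

```python
class VersionedStore:
    """Toy multiversion store: every write creates a new timestamped version."""

    def __init__(self):
        self.clock = 0       # logical timestamp, incremented on each write
        self.versions = {}   # uri -> list of (timestamp, content), oldest first

    def write(self, uri, content):
        self.clock += 1
        self.versions.setdefault(uri, []).append((self.clock, content))
        return self.clock

    def snapshot(self):
        """Timestamp a reader can use for a point-in-time view."""
        return self.clock

    def read(self, uri, at):
        """Return the newest version written at or before the given timestamp."""
        visible = [content for ts, content in self.versions.get(uri, []) if ts <= at]
        return visible[-1] if visible else None

store = VersionedStore()
store.write("/poems/1", "first draft")
reader_time = store.snapshot()          # a read transaction starts here
store.write("/poems/1", "second draft") # a concurrent update arrives

old_view = store.read("/poems/1", reader_time)
new_view = store.read("/poems/1", store.snapshot())
print(old_view, "|", new_view)
```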

* '''Client:''' A REST client, such as a web browser or spider, locally maintains transactional state, such as what to do in response to documents and result codes that are returned from server transactions. An application exists only in the client -- not in the server (although, a server may deliver application code to a client, such as when a web server downloads HTML, CSS, and JavaScript to a browser). Client application code decides when to execute web service calls and it ties the results together to accomplish its purpose. This allows multiple authorized applications to reuse web services for a variety of purposes.
**The server helps the client know what web service calls are available by providing action links with each response. Action links are contextual and the context is based on the application account, user account, database documents, links within and between documents, the server state machine, etc. Through action links, the server can inform a client application what web service calls are permissible in any given context.
**'''MarkLogic''' meets the needs of client applications through its built-in ability to process and send action links to clients based on context. MarkLogic supports RDF triples and SPARQL, which enables context to be defined across applications, users, document state, links between documents, server state machine, etc. MarkLogic processes triples very quickly, which enables context to scale to billions of documents, millions of users, etc.
**'''MarkLogic''' also uses application and user permissions to filter which documents are returned to clients. MarkLogic does this automatically and transparently by adding security filtering constraints into every search and query. This ensures no account can access unauthorized documents. This is fast because all security permissions are built into MarkLogic's indexes -- which allows document-level security to scale across billions of documents.
** Most other NoSQL databases do not provide government-grade, document-level security and they also do not support RDF triples and SPARQL.
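The idea of adding a security constraint to every search can be sketched like this. The function <code>secure_search</code>, the <code>read_roles</code> field, and the sample documents are assumptions for illustration; the point is only that the permission check is combined with the caller's query inside the search itself rather than left to the caller.

```python
def secure_search(documents, user_roles, predicate):
    """Return URIs that match the query AND that the user's roles may read."""
    return [
        uri
        for uri, doc in sorted(documents.items())
        if doc["read_roles"] & user_roles and predicate(doc)
    ]

# Hypothetical documents with per-document read permissions
documents = {
    "/a": {"text": "Mary had a little lamb", "read_roles": {"editor"}},
    "/b": {"text": "The lamb was sure to go", "read_roles": {"admin"}},
    "/c": {"text": "White as snow", "read_roles": {"editor", "admin"}},
}

def contains_lamb(doc):
    return "lamb" in doc["text"].lower().split()

editor_hits = secure_search(documents, {"editor"}, contains_lamb)
admin_hits = secure_search(documents, {"admin"}, contains_lamb)
print(editor_hits, admin_hits)
```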

Because MarkLogic can provide both the web service and database in one server, it is easy to use the state of the documents and the

=== '''Transfer and MarkLogic''' ===
* '''Transfer''' in REST is a communication protocol that enables a client to send a hypertext request to a server and receive back a hypertext response. The transfer must be stateless and be a request/response protocol. It must have human readable, self-descriptive headers. The header must contain metadata about the request, such as the requested resource URI, the MIME type of the resource, and the action to perform on the resource (such as GET, PUT, POST, PATCH, and DELETE). [http://www.w3schools.com/tags/ref_httpmethods.asp '''HTTP'''] (Hypertext Transfer Protocol) and '''HTTPS''' (secure HTTP) meet these requirements: REST describes the architecture of the web that HTTP serves, and Roy Fielding was also a principal author of the HTTP/1.1 specification.

* All '''MarkLogic''' communication is through HTTP and HTTPS REST services (except for its SQL JDBC feature). This includes all internal cluster communication. MarkLogic provides out-of-the-box REST interfaces for manipulating resources (insert, update, delete, query, search, transform, etc.) and administering MarkLogic (REST app servers, databases, indexes, clusters, etc). MarkLogic makes it very easy to create custom REST services because everything in MarkLogic is built around REST and because it provides simple and powerful application server APIs.
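The request metadata described above (resource URI, requested MIME type, and action) can be seen by building an HTTP request with Python's standard library. This sketch only constructs the request and does not send it. The helper <code>build_request</code> and the document path <code>/example/doc.json</code> are hypothetical and are not documented MarkLogic endpoints.

```python
from urllib.request import Request

def build_request(uri, method="GET", accept="application/json"):
    """Build a stateless request: everything the server needs is in the request."""
    return Request(uri, method=method, headers={"Accept": accept})

# Hypothetical resource URI on a local server
req = build_request("http://localhost:8000/example/doc.json")

print(req.get_method())          # the action to perform on the resource
print(req.full_url)              # the resource URI
print(req.get_header("Accept"))  # the representation (MIME type) requested

# The same resource can be requested in a different representation.
xml_req = build_request("http://localhost:8000/example/doc.json", accept="application/xml")
```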

== MarkLogic Links ==
*[[MarkLogic]]
*[[MarkLogic Query and Search]]
*[[MarkLogic Training Resources]]
*[[Installing MarkLogic]]

[[Category: MarkLogic]]

MarkLogic Training Resources

2014-11-14T08:52:42Z

Bowersmt: Fixed heading

[[MarkLogic|« Back to MarkLogic]]

== MarkLogic Provided Training ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities, which are its most important feature. Everything you do in MarkLogic should be focused around how MarkLogic indexes, searches, and queries documents.

* [http://developer.marklogic.com/try/ninja/index Try MarkLogic] is the best way to play with MarkLogic's search and query features. Without installing anything, you can run queries and searches against an existing database in the cloud.

* [http://developer.marklogic.com/learn MarkLogic Tutorials] guide you step-by-step through each major MarkLogic feature. Some tutorials take as little as five minutes, and others take up to an hour or so.

* [https://mlu.marklogic.com/registration/ Live Training] is available from MarkLogic at no cost. A live instructor will show you how to build MarkLogic applications, administer MarkLogic, use Semantics (triples), and apply MarkLogic fundamentals (like search and queries).

[[Category:MarkLogic]]

MarkLogic Training Resources

2014-11-14T08:51:55Z

Bowersmt: Added back to marklogic link

[[MarkLogic|« Back to MarkLogic]]

== MarkLogic Training Resources ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities, which are its most important feature. Everything you do in MarkLogic should be focused around how MarkLogic indexes, searches, and queries documents.

* [http://developer.marklogic.com/try/ninja/index Try MarkLogic] is the best way to play with MarkLogic's search and query features. Without installing anything, you can run queries and searches against an existing database in the cloud.

* [http://developer.marklogic.com/learn MarkLogic Tutorials] guide you step-by-step through each major MarkLogic feature. Some tutorials take as little as five minutes, and others take up to an hour or so.

* [https://mlu.marklogic.com/registration/ Live Training] is available from MarkLogic at no cost. A live instructor will show you how to build MarkLogic applications, administer MarkLogic, use Semantics (triples), and apply MarkLogic fundamentals (like search and queries).

[[Category:MarkLogic]]

What is MarkLogic?

2014-11-14T08:50:59Z

Bowersmt: Added back to marklogic

[[MarkLogic|« Back to MarkLogic]]

== What is MarkLogic? ==
[http://www.marklogic.com/what-is-marklogic/ MarkLogic] is a [http://www.marklogic.com/what-is-marklogic/inside-marklogic/ '''REST application server''', '''document database''', and '''search engine''']. It is REST through and through. It is built specifically for hypertext documents, links, metadata, URIs, MIME types, and HTTP. It is schema-agnostic because it is automatically aware of the independent structure of each of its JSON and XML documents. It is search-centric because it can search for ''any combination'' of words, values, structures, and links within and across documents. It scales horizontally to hundreds of servers within and between data centers while maintaining ACID-compliant transactions.

MarkLogic has the following [http://www.marklogic.com/what-is-marklogic/enterprise-nosql/ enterprise features]:
* [http://www.marklogic.com/resources/java-developers-guide/ Java APIs]
* [http://en.wikipedia.org/wiki/Node.js Node.js APIs]
* [http://developer.marklogic.com/learn/2009-07-search-api-walkthrough Search] and [http://developer.marklogic.com/learn/arch/search-and-indexing Query]
* [http://www.marklogic.com/blog/can-you-pass-the-acid-test/ ACID Transactions]
* [http://www.marklogic.com/resources/marklogic-high-availability-and-disaster-recovery/resource_download/datasheets/ High Availability] and [http://docs.marklogic.com/guide/database-replication/dbrep_intro#chapter Disaster Recovery]
* [http://www.marklogic.com/resources/marklogic-flexible-replication/resource_download/datasheets/ Replication within and across data centers]
* [http://docs.marklogic.com/guide/admin/security#chapter Government-grade Security]
* [https://docs.marklogic.com/guide/cluster.pdf Scalability] and [https://docs.marklogic.com/guide/admin/database-rebalancing#chapter Elasticity]
* [https://docs.marklogic.com/guide/ec2/managing#chapter On-premise or Cloud Deployment (especially AWS)]
* [http://www.marklogic.com/resources/marklogic-and-hadoop/resource_download/datasheets/ Hadoop for Storage and Compute]
* [http://www.marklogic.com/resources/marklogic-semantics-mlw14/ Semantics]

== What is REST? ==
REST is an architectural style that uses a uniform resource identifier ('''URI''') and a web protocol ('''HTTP/HTTPS''') to request and transfer a representation ('''MIME media type''') of the state of a resource ('''document''') at a point in time from a server to a client.

REST was coined and defined by Roy Fielding in his
[http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm dissertation, ''Architectural Styles and the Design of Network-based Software Architectures'']. REST standardizes and documents the patterns used in the world wide web: self-documenting hypertext with data and metadata (HTML), stateless request/response communication protocol (HTTP/HTTPS), resource locators (URL), multiple representations per URL (MIME types), and downloading code on demand to process resources (JavaScript, CSS, etc.).

REST consists of three main concepts:
* (RE) Representation
* (S) State
* (T) Transfer

== Why is MarkLogic RESTful? ==

=== '''Representation and MarkLogic''' ===
* A representation is a document that represents a resource. It has three requirements: MIME media type, URI, and hypertext.

* '''MIME type:''' Each resource can be represented by one or more types of documents. The MIME media type defines the representation a client requests. A client may request the resource to be represented as JSON, XML, HTML, PNG, PDF, etc.
**'''MarkLogic''' stores any type of document and assigns the appropriate [http://developer.marklogic.com/blog/document-formats-part2 MIME type] to it. It can fully index, query, search, and process JSON, XML, and RDF documents. It meets the REST requirement of being able to transform JSON, XML, and RDF documents into any other MIME type. It knows how to execute JavaScript, XQuery, XSLT, and SPARQL documents. It knows how [http://docs.marklogic.com/guide/cpf/default#chapter to transform into XHTML] the content, formatting, and structure of Microsoft Word, PowerPoint, Excel, textual PDF, DocBook, and CSS documents. (It can also use Microsoft Office to create, edit, and manage content in MarkLogic.) It knows how to [https://docs.marklogic.com/guide/search-dev/binary-document-metadata extract metadata and text from over 138 types of binary documents], such as raster images, vector images, videos, archive files, database files, encoded emails, presentations, spreadsheets, word-processing documents, text formats.
**Few NoSQL databases use MIME types to identify the media type of each document. Most NoSQL databases support only one type of data and it is usually proprietary: columnar, BSON, binary, etc. Most cannot transform from one MIME type to another.

* '''URI:''' A resource is identified by a globally unique identifier (URI). '''MarkLogic''' identifies each document with a [http://developer.marklogic.com/try/rest/page2 unique URI]. A document in MarkLogic is like a row in a table in a schema in a relational database. A URI is liberating. It provides random access to any resource anywhere. It is like being able to retrieve any row in any table in any schema of a relational database without having to know what table and schema the row is stored in.
**Navigating the URL hierarchy is fundamental to REST. MarkLogic understands the hierarchy within a URI, which is represented by slashes "/", as in https://www.lds.org/scriptures/ot/gen/1. MarkLogic treats the items between the slashes as folders, so the URI of each document automatically places it in a folder hierarchy. In the example URI above, the document for Genesis chapter 1 is located in the Genesis folder, which is located in the Old Testament folder, which is located in the Scriptures folder on lds.org. '''MarkLogic''' indexes the documents in each folder and its subfolders. This makes it fast and easy to retrieve any or all documents in any folder and its subfolders.
**Few NoSQL databases use the URI as the primary key for their documents or data. They also don't index the URI hierarchically to filter documents by folder and subfolder.
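
The folder hierarchy described above can be sketched in a few lines of Python. This is only an illustration of how slashes in a URI imply folders; the function names <code>folder_ancestors</code> and <code>in_folder</code> are hypothetical and are not MarkLogic APIs, and MarkLogic's own directory indexing is not shown here.

```python
from urllib.parse import urlsplit


def folder_ancestors(uri: str) -> list[str]:
    """Return every folder that contains the document, outermost first."""
    path = urlsplit(uri).path  # e.g. "/scriptures/ot/gen/1"
    parts = [p for p in path.split("/") if p]
    folders = []
    # Every segment except the last names a folder; the last is the document.
    for depth in range(1, len(parts)):
        folders.append("/" + "/".join(parts[:depth]) + "/")
    return folders


def in_folder(uri: str, folder: str) -> bool:
    """True when the document lives in `folder` or any of its subfolders."""
    return urlsplit(uri).path.startswith(folder)


uri = "https://www.lds.org/scriptures/ot/gen/1"
print(folder_ancestors(uri))
# ['/scriptures/', '/scriptures/ot/', '/scriptures/ot/gen/']
print(in_folder(uri, "/scriptures/ot/"))  # True
```

Filtering by a folder prefix, as <code>in_folder</code> does, is the operation that a hierarchical URI index is meant to make fast.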

* '''Hypertext:''' A document should contain data that represents the resource. The data should be '''human readable and self-documenting''', like JSON, XML, RDF, and HTML. It should be '''linked data''' (i.e. the "hyper" in "hypertext"). A document should contain '''metadata links''' about the resource, such as RDF. It should contain '''action links''' to define what further actions can be done with the resource. It should contain '''related links''' to related resources, such as images, audio, video, related documents, etc. Each related link should define what actions can be done with the referenced resource, such as download it, display a link to it, execute a command against it, etc.
**'''Hypertext''' or '''hypermedia''' documents must have all these features. Hyperlinks are what the "hyper" in hypertext and hypermedia refers to. You can't have REST without ''metadata links'' to define what the data means, ''action links'' to know how to work with the resource, and ''related links'' to connect resources. It should all be human readable and self-documenting so a developer does not have to read documentation to know how to interact with a REST web service and its documents.
** '''MarkLogic''' meets all the requirements for hypertext representation. It is designed around MIME types, URIs, and Linked data. It stores documents with their MIME types as [http://www.w3schools.com/json/ JSON], [http://www.w3schools.com/xml/ XML], [http://www.w3schools.com/webservices/ws_rdf_intro.asp RDF], [http://www.w3schools.com/html/ HTML], etc. These documents are human-readable and self-documenting, which MarkLogic leverages to recognize and index each document's data, data structure, metadata links, action links, and related links. This makes it easy to '''search''', '''query''', '''transform''', and '''deliver''' hypertext documents. MarkLogic can also store any type of binary document and deliver it as a related resource, such as an image, video, or audio. MarkLogic is designed to process simple links and RDF links using [http://en.wikipedia.org/wiki/SPARQL SPARQL] and [https://docs.marklogic.com/xinc XInclude]. MarkLogic can represent links in many formats: [https://docs.marklogic.com/xp XPointer], [https://docs.marklogic.com/guide/semantics/loading#id_97709 RDF/XML], [https://docs.marklogic.com/guide/semantics/loading#id_79194 RDF/JSON], [https://docs.marklogic.com/guide/semantics/loading#id_73211 Turtle], [https://docs.marklogic.com/guide/semantics/loading#id_70596 N-Triples], [https://docs.marklogic.com/guide/semantics/loading#id_61596 N-Quads], and [https://docs.marklogic.com/guide/semantics/loading#id_74485 TriG].
**No other NoSQL database natively indexes and fully processes all the document types required for REST hypertext: JSON, XML, RDF, HTML, CSS, and JavaScript.

=== '''State and MarkLogic''' ===
* State in REST exists on the client and in server documents. It does not exist in the communications protocol or in the server as cached session data.

* '''Server:''' All information needed to process a request must be presented in the request and processed against documents in the database. State must only be in the request and in database documents: state cannot be anywhere else, such as in a session cache. The documents in the server define the state of the server. A REST server should explicitly create a state machine that defines the acceptable actions that can be transacted against documents in specific contexts.
**A REST transaction occurs at a '''point in time'''. The state of the data in the request is unchanging, but the state of the documents and state machines in the database are often changing. Since request state and database state are both required to process the request and since shifting state creates unpredictable results, a REST transaction should run at a point in time with unchanging state. Only an ACID-compliant database can ensure consistent state because it isolates each transaction from every other transaction. The only time REST does not need an ACID-compliant database is when database documents do not change, database state machines do not change, or when clients can live with the resultant level of unreliable and unpredictable results.
**'''MarkLogic''' meets all the requirements for REST state. Its web services are stateless: there is no session cache. It is ACID compliant. It ensures each transaction occurs at a point in time and is isolated from all other transactions. This ensures consistent processing during a transaction -- even across billions of documents. MarkLogic is an [http://en.wikipedia.org/wiki/Multiversion_concurrency_control MVCC] database which provides transaction isolation without slowing the performance of reads -- even when documents being read are being modified simultaneously by other transactions. (Also, like any other ACID database, when multiple updates and deletes compete for the same documents, they will impact each other's performance because change has to be serialized.)
** Most other NoSQL databases are not ACID compliant. They are only suitable for REST services when their documents or data do not change or when the rate of change is slow enough or dispersed enough that it creates an acceptable level of unreliability and unpredictability.
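
The idea of a request running at a fixed point in time while other transactions keep writing can be illustrated with a toy multiversion store. This is a generic MVCC sketch under simplifying assumptions (a single process, an integer clock, no deletes); the class <code>VersionedStore</code> and the document URI are hypothetical and do not describe MarkLogic's internal implementation.

```python
class VersionedStore:
    """Toy multiversion store: writes append versions, reads pick a snapshot."""

    def __init__(self):
        self._clock = 0
        self._versions = {}  # uri -> list of (commit_timestamp, content)

    def write(self, uri, content):
        """Commit a new version of the document and return its timestamp."""
        self._clock += 1
        self._versions.setdefault(uri, []).append((self._clock, content))
        return self._clock

    def snapshot(self):
        """Return the current timestamp; a request reads as of this point."""
        return self._clock

    def read(self, uri, at):
        """Return the latest version committed at or before timestamp `at`."""
        visible = [c for ts, c in self._versions.get(uri, []) if ts <= at]
        return visible[-1] if visible else None


store = VersionedStore()
store.write("/orders/1.json", {"status": "new"})
ts = store.snapshot()  # a request starts and fixes its point in time
store.write("/orders/1.json", {"status": "shipped"})  # a concurrent update

print(store.read("/orders/1.json", ts))                # {'status': 'new'}
print(store.read("/orders/1.json", store.snapshot()))  # {'status': 'shipped'}
```

Because the earlier version is kept rather than overwritten, the first reader is never blocked by the concurrent update and still sees a consistent state.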

* '''Client:''' A REST client, such as a web browser or spider, locally maintains transactional state, such as what to do in response to documents and result codes that are returned from server transactions. An application exists only in the client -- not in the server (although, a server may deliver application code to a client, such as when a web server downloads HTML, CSS, and JavaScript to a browser). Client application code decides when to execute web service calls and it ties the results together to accomplish its purpose. This allows multiple authorized applications to reuse web services for a variety of purposes.
**The server helps the client know what web service calls are available by providing action links with each response. Action links are contextual and the context is based on the application account, user account, database documents, links within and between documents, the server state machine, etc. Through action links, the server can inform a client application what web service calls are permissible in any given context.
**'''MarkLogic''' meets the needs of client applications through its built-in ability to process and send action links to clients based on context. MarkLogic supports RDF triples and SPARQL, which enables context to be defined across applications, users, document state, links between documents, server state machine, etc. MarkLogic processes triples very quickly, which enables context to scale to billions of documents, millions of users, etc.
**'''MarkLogic''' also uses application and user permissions to filter which documents are returned to clients. MarkLogic does this automatically and transparently by adding security filtering constraints to every search and query. This ensures no account can access unauthorized documents. This is fast because all security permissions are built into MarkLogic's indexes, which allows document-level security to scale across billions of documents.
** Most other NoSQL databases do not provide government-grade, document-level security and they also do not support RDF triples and SPARQL.

Because MarkLogic can provide both the web service and database in one server, it is easy to use the state of the documents and the

=== '''Transfer and MarkLogic''' ===
* '''Transfer''' in REST is a communication protocol that enables a client to send a hypertext request to a server and receive back a hypertext response. The transfer must be stateless and be a request/response protocol. It must have human-readable, self-descriptive headers. The headers must contain metadata about the request, such as the requested resource URI, the MIME type of the resource, and the action to perform on the resource (such as GET, PUT, POST, PATCH, and DELETE). [http://www.w3schools.com/tags/ref_httpmethods.asp '''HTTP'''] (Hypertext Transfer Protocol) and '''HTTPS''' (secure HTTP) are designed specifically for REST (that is why they have "hypertext" in their name).

* All '''MarkLogic''' communication is through HTTP and HTTPS REST services (except for its SQL JDBC feature). This includes all internal cluster communication. MarkLogic provides out-of-the-box REST interfaces for manipulating resources (insert, update, delete, query, search, transform, etc.) and administering MarkLogic (REST app servers, databases, indexes, clusters, etc.). MarkLogic makes it very easy to create custom REST services because everything in MarkLogic is built around REST and because it provides simple and powerful application server APIs.
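
As a rough illustration of the three pieces of request metadata named above (resource URI, MIME type, and action), the following Python sketch builds an HTTP request without sending it. The host, port, and document URI are assumptions made for the example, and the <code>/v1/documents?uri=...</code> path is modeled on MarkLogic's REST client API and should be checked against the documentation for your version.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:8000"          # assumed host and port
DOC_URI = "/scriptures/ot/gen/1.json"   # hypothetical document URI


def build_get(doc_uri: str, mime_type: str) -> Request:
    """Build (but do not send) a request for one representation of a document."""
    query = urlencode({"uri": doc_uri})  # resource URI travels in the request
    return Request(
        f"{BASE}/v1/documents?{query}",  # endpoint path is an assumption
        method="GET",                    # action to perform on the resource
        headers={"Accept": mime_type},   # representation the client wants
    )


req = build_get(DOC_URI, "application/json")
print(req.get_method(), req.full_url)
print(req.header_items())
```

Everything the server needs is carried in the request itself, which is what keeps the exchange stateless. Passing the object to <code>urllib.request.urlopen</code> would send it, given a running server and valid credentials.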

== MarkLogic Links ==
*[[MarkLogic]]
*[[MarkLogic Query and Search]]
*[[MarkLogic Training Resources]]
*[[Installing MarkLogic]]

[[Category: MarkLogic]]

What is MarkLogic?

2014-11-14T08:44:57Z

Bowersmt: Added links and removed search

== What is MarkLogic? ==
[http://www.marklogic.com/what-is-marklogic/ MarkLogic] is a [http://www.marklogic.com/what-is-marklogic/inside-marklogic/ '''REST application server''', '''document database''', and '''search engine''']. It is REST through and through. It is built specifically for hypertext documents, links, metadata, URIs, MIME types, and HTTP. It is schema-agnostic because it is automatically aware of the independent structure of each of its JSON and XML documents. It is search-centric because it can search for ''any combination'' of words, values, structures, and links within and across documents. It scales horizontally to hundreds of servers within and between data centers while maintaining ACID-compliant transactions.

MarkLogic has the following [http://www.marklogic.com/what-is-marklogic/enterprise-nosql/ enterprise features]:
* [http://www.marklogic.com/resources/java-developers-guide/ Java APIs]
* [http://en.wikipedia.org/wiki/Node.js Node.js APIs]
* [http://developer.marklogic.com/learn/2009-07-search-api-walkthrough Search] and [http://developer.marklogic.com/learn/arch/search-and-indexing Query]
* [http://www.marklogic.com/blog/can-you-pass-the-acid-test/ ACID Transactions]
* [http://www.marklogic.com/resources/marklogic-high-availability-and-disaster-recovery/resource_download/datasheets/ High Availability] and [http://docs.marklogic.com/guide/database-replication/dbrep_intro#chapter Disaster Recovery]
* [http://www.marklogic.com/resources/marklogic-flexible-replication/resource_download/datasheets/ Replication within and across data centers]
* [http://docs.marklogic.com/guide/admin/security#chapter Government-grade Security]
* [https://docs.marklogic.com/guide/cluster.pdf Scalability] and [https://docs.marklogic.com/guide/admin/database-rebalancing#chapter Elasticity]
* [https://docs.marklogic.com/guide/ec2/managing#chapter On-premise or Cloud Deployment (especially AWS)]
* [http://www.marklogic.com/resources/marklogic-and-hadoop/resource_download/datasheets/ Hadoop for Storage and Compute]
* [http://www.marklogic.com/resources/marklogic-semantics-mlw14/ Semantics]

== What is REST? ==
REST is an architectural style that uses a uniform resource identifier ('''URI''') and a web protocol ('''HTTP/HTTPS''') to request and transfer a representation ('''MIME media type''') of the state of a resource ('''document''') at a point in time from a server to a client.

REST was coined and defined by Roy Fielding in his
[http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm dissertation, ''Architectural Styles and the Design of Network-based Software Architectures'']. REST standardizes and documents the patterns used in the world wide web: self-documenting hypertext with data and metadata (HTML), stateless request/response communication protocol (HTTP/HTTPS), resource locators (URL), multiple representations per URL (MIME types), and downloading code on demand to process resources (JavaScript, CSS, etc.).

REST consists of three main concepts:
* (RE) Representation
* (S) State
* (T) Transfer

== Why is MarkLogic RESTful? ==

=== '''Representation and MarkLogic''' ===
* A representation is a document that represents a resource. It has three requirements: MIME media type, URI, and hypertext.

* '''MIME type:''' Each resource can be represented by one or more types of documents. The MIME media type defines the representation a client requests. A client may request the resource to be represented as JSON, XML, HTML, PNG, PDF, etc.
**'''MarkLogic''' stores any type of document and assigns the appropriate [http://developer.marklogic.com/blog/document-formats-part2 MIME type] to it. It can fully index, query, search, and process JSON, XML, and RDF documents. It meets the REST requirement of being able to transform JSON, XML, and RDF documents into any other MIME type. It knows how to execute JavaScript, XQuery, XSLT, and SPARQL documents. It knows how [http://docs.marklogic.com/guide/cpf/default#chapter to transform into XHTML] the content, formatting, and structure of Microsoft Word, PowerPoint, Excel, textual PDF, DocBook, and CSS documents. (It can also use Microsoft Office to create, edit, and manage content in MarkLogic.) It knows how to [https://docs.marklogic.com/guide/search-dev/binary-document-metadata extract metadata and text from over 138 types of binary documents], such as raster images, vector images, videos, archive files, database files, encoded emails, presentations, spreadsheets, word-processing documents, text formats.
**Few NoSQL databases use MIME types to identify the media type of each document. Most NoSQL databases support only one type of data and it is usually proprietary: columnar, BSON, binary, etc. Most cannot transform from one MIME type to another.

* '''URI:''' A resource is identified by a globally unique identifier (URI). '''MarkLogic''' identifies each document with a [http://developer.marklogic.com/try/rest/page2 unique URI]. A document in MarkLogic is like a row in a table in a schema in a relational database. A URI is liberating. It provides random access to any resource anywhere. It is like being able to retrieve any row in any table in any schema of a relational database without having to know what table and schema the row is stored in.
**Navigating the URL hierarchy is fundamental to REST. MarkLogic understands the hierarchy within a URI, which is represented by slashes "/", as in https://www.lds.org/scriptures/ot/gen/1. MarkLogic treats the items between the slashes as folders, so the URI of each document automatically places it in a folder hierarchy. In the example URI above, the document for Genesis chapter 1 is located in the Genesis folder, which is located in the Old Testament folder, which is located in the Scriptures folder on lds.org. '''MarkLogic''' indexes the documents in each folder and its subfolders. This makes it fast and easy to retrieve any or all documents in any folder and its subfolders.
**Few NoSQL databases use the URI as the primary key for their documents or data. They also don't index the URI hierarchically to filter documents by folder and subfolder.

* '''Hypertext:''' A document should contain data that represents the resource. The data should be '''human readable and self-documenting''', like JSON, XML, RDF, and HTML. It should be '''linked data''' (i.e. the "hyper" in "hypertext"). A document should contain '''metadata links''' about the resource, such as RDF. It should contain '''action links''' to define what further actions can be done with the resource. It should contain '''related links''' to related resources, such as images, audio, video, related documents, etc. Each related link should define what actions can be done with the referenced resource, such as download it, display a link to it, execute a command against it, etc.
**'''Hypertext''' or '''hypermedia''' documents must have all these features. Hyperlinks are what the "hyper" in hypertext and hypermedia refers to. You can't have REST without ''metadata links'' to define what the data means, ''action links'' to know how to work with the resource, and ''related links'' to connect resources. It should all be human readable and self-documenting so a developer does not have to read documentation to know how to interact with a REST web service and its documents.
** '''MarkLogic''' meets all the requirements for hypertext representation. It is designed around MIME types, URIs, and Linked data. It stores documents with their MIME types as [http://www.w3schools.com/json/ JSON], [http://www.w3schools.com/xml/ XML], [http://www.w3schools.com/webservices/ws_rdf_intro.asp RDF], [http://www.w3schools.com/html/ HTML], etc. These documents are human-readable and self-documenting, which MarkLogic leverages to recognize and index each document's data, data structure, metadata links, action links, and related links. This makes it easy to '''search''', '''query''', '''transform''', and '''deliver''' hypertext documents. MarkLogic can also store any type of binary document and deliver it as a related resource, such as an image, video, or audio. MarkLogic is designed to process simple links and RDF links using [http://en.wikipedia.org/wiki/SPARQL SPARQL] and [https://docs.marklogic.com/xinc XInclude]. MarkLogic can represent links in many formats: [https://docs.marklogic.com/xp XPointer], [https://docs.marklogic.com/guide/semantics/loading#id_97709 RDF/XML], [https://docs.marklogic.com/guide/semantics/loading#id_79194 RDF/JSON], [https://docs.marklogic.com/guide/semantics/loading#id_73211 Turtle], [https://docs.marklogic.com/guide/semantics/loading#id_70596 N-Triples], [https://docs.marklogic.com/guide/semantics/loading#id_61596 N-Quads], and [https://docs.marklogic.com/guide/semantics/loading#id_74485 TriG].
**No other NoSQL database natively indexes and fully processes all the document types required for REST hypertext: JSON, XML, RDF, HTML, CSS, and JavaScript.

=== '''State and MarkLogic''' ===
* State in REST exists on the client and in server documents. It does not exist in the communications protocol or in the server as cached session data.

* '''Server:''' All information needed to process a request must be presented in the request and processed against documents in the database. State must only be in the request and in database documents: state cannot be anywhere else, such as in a session cache. The documents in the server define the state of the server. A REST server should explicitly create a state machine that defines the acceptable actions that can be transacted against documents in specific contexts.
**A REST transaction occurs at a '''point in time'''. The state of the data in the request is unchanging, but the state of the documents and state machines in the database are often changing. Since request state and database state are both required to process the request and since shifting state creates unpredictable results, a REST transaction should run at a point in time with unchanging state. Only an ACID-compliant database can ensure consistent state because it isolates each transaction from every other transaction. The only time REST does not need an ACID-compliant database is when database documents do not change, database state machines do not change, or when clients can live with the resultant level of unreliable and unpredictable results.
**'''MarkLogic''' meets all the requirements for REST state. Its web services are stateless: there is no session cache. It is ACID compliant. It ensures each transaction occurs at a point in time and is isolated from all other transactions. This ensures consistent processing during a transaction -- even across billions of documents. MarkLogic is an [http://en.wikipedia.org/wiki/Multiversion_concurrency_control MVCC] database which provides transaction isolation without slowing the performance of reads -- even when documents being read are being modified simultaneously by other transactions. (Also, like any other ACID database, when multiple updates and deletes compete for the same documents, they will impact each other's performance because change has to be serialized.)
** Most other NoSQL databases are not ACID compliant. They are only suitable for REST services when their documents or data do not change or when the rate of change is slow enough or dispersed enough that it creates an acceptable level of unreliability and unpredictability.

* '''Client:''' A REST client, such as a web browser or spider, locally maintains transactional state, such as what to do in response to documents and result codes that are returned from server transactions. An application exists only in the client -- not in the server (although, a server may deliver application code to a client, such as when a web server downloads HTML, CSS, and JavaScript to a browser). Client application code decides when to execute web service calls and it ties the results together to accomplish its purpose. This allows multiple authorized applications to reuse web services for a variety of purposes.
**The server helps the client know what web service calls are available by providing action links with each response. Action links are contextual and the context is based on the application account, user account, database documents, links within and between documents, the server state machine, etc. Through action links, the server can inform a client application what web service calls are permissible in any given context.
**'''MarkLogic''' meets the needs of client applications through its built-in ability to process and send action links to clients based on context. MarkLogic supports RDF triples and SPARQL, which enables context to be defined across applications, users, document state, links between documents, server state machine, etc. MarkLogic processes triples very quickly, which enables context to scale to billions of documents, millions of users, etc.
**'''MarkLogic''' also uses application and user permissions to filter which documents are returned to clients. MarkLogic does this automatically and transparently by adding security filtering constraints to every search and query. This ensures no account can access unauthorized documents. This is fast because all security permissions are built into MarkLogic's indexes, which allows document-level security to scale across billions of documents.
** Most other NoSQL databases do not provide government-grade, document-level security and they also do not support RDF triples and SPARQL.

Because MarkLogic can provide both the web service and database in one server, it is easy to use the state of the documents and the

=== '''Transfer and MarkLogic''' ===
* '''Transfer''' in REST is a communication protocol that enables a client to send a hypertext request to a server and receive back a hypertext response. The transfer must be stateless and be a request/response protocol. It must have human-readable, self-descriptive headers. The headers must contain metadata about the request, such as the requested resource URI, the MIME type of the resource, and the action to perform on the resource (such as GET, PUT, POST, PATCH, and DELETE). [http://www.w3schools.com/tags/ref_httpmethods.asp '''HTTP'''] (Hypertext Transfer Protocol) and '''HTTPS''' (secure HTTP) are designed specifically for REST (that is why they have "hypertext" in their name).

* All '''MarkLogic''' communication is through HTTP and HTTPS REST services (except for its SQL JDBC feature). This includes all internal cluster communication. MarkLogic provides out-of-the-box REST interfaces for manipulating resources (insert, update, delete, query, search, transform, etc.) and administering MarkLogic (REST app servers, databases, indexes, clusters, etc.). MarkLogic makes it very easy to create custom REST services because everything in MarkLogic is built around REST and because it provides simple and powerful application server APIs.

== MarkLogic Links ==
*[[MarkLogic]]
*[[MarkLogic Query and Search]]
*[[MarkLogic Training Resources]]
*[[Installing MarkLogic]]

[[Category: MarkLogic]]

MarkLogic

2014-11-14T08:41:04Z

Bowersmt: Adding headings

== MarkLogic Landing Page ==
This is the MarkLogic home page. MarkLogic is an all-in-one application server, document database, and search engine. It is entirely based on REST. For an extensive explanation of what MarkLogic is and what it can do, see [[What is MarkLogic?]]

== Requesting a MarkLogic License ==
Employees of the Church may request a license key by emailing [mailto:DL-ICS-MARKLOGIC DL-ICS-MARKLOGIC]. Members of the community may also install a local copy of MarkLogic but must request a [http://developer.marklogic.com/free-developer developer license key from MarkLogic].

== Links to other MarkLogic Documents ==
* [[What is MarkLogic?]]
* [[MarkLogic Query and Search]]
* [[Installing MarkLogic]]
* [[MarkLogic Training Resources]]

[[Category: MarkLogic]]

MarkLogic

2014-11-14T08:39:31Z

Bowersmt: Added more information on getting a licence key

== MarkLogic ==
This is the landing page for MarkLogic. MarkLogic is an all-in-one application server, document database, and search engine. It is entirely based on REST. For an extensive explanation of what MarkLogic is and what it can do, see [[What is MarkLogic?]]

Employees of the Church may request a license key by emailing [mailto:DL-ICS-MARKLOGIC DL-ICS-MARKLOGIC]. Members of the community may also install a local copy of MarkLogic but must request a [http://developer.marklogic.com/free-developer developer license key from MarkLogic].

== Links to other MarkLogic Documents ==
* [[What is MarkLogic?]]
* [[MarkLogic Query and Search]]
* [[Installing MarkLogic]]
* [[MarkLogic Training Resources]]

[[Category: MarkLogic]]

MarkLogic

2014-11-14T08:25:09Z

Bowersmt: Updated links

MarkLogic Query and Search

2014-11-14T08:24:04Z

Bowersmt: Changed Category

== MarkLogic Query and Search ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. Both use indexes to resolve queries without having to scan through rows or documents.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single-row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows at 4 IOs per row, the database does the same amount of IO as it would by reading in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of the rows in a table. On analytical data warehouses, full table scans are always preferred.
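
The break-even arithmetic in the paragraph above can be checked with a short calculation. This is a simplified cost model that uses only the figures given in the text (4 IOs per indexed row, 1 IO per scanned row); it ignores caching and the speed difference between sequential and random IO, and the function names are hypothetical.

```python
def index_lookup_ios(rows: int, fraction: float, ios_per_row: int = 4) -> float:
    """IOs needed to fetch `fraction` of the rows through a B-Tree index."""
    return rows * fraction * ios_per_row


def full_scan_ios(rows: int) -> int:
    """IOs needed to read every row once (1 IO per row in this model)."""
    return rows


rows = 1_000_000
for fraction in (0.01, 0.25, 0.50):
    via_index = index_lookup_ios(rows, fraction)
    via_scan = full_scan_ios(rows)
    cheaper = "index" if via_index < via_scan else "scan (or tie)"
    print(f"{fraction:>4.0%}: index={via_index:>9,.0f}  scan={via_scan:>9,}  -> {cheaper}")
```

With 4 IOs per indexed row, the two costs are equal when the query touches 1/4 of the rows, which is where the 25% rule of thumb comes from.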

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is because index and table records are stored sequentially on disk. It is often faster to take advantage of the speed of sequential disk IO, load all index or table rows directly into RAM, and process them there than to walk a B-Tree structure and do multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:
#it validates the query request from the client
#it parses and optimizes the query
#it sends the query to all servers that have forests in the database
##each server sends it to its forests
##each forest sends it to each stand
##each stand uses its indexes to find matching documents
###all stands and forests query the indexes in parallel
##forests combine the matching document IDs from all their stands
##forests sort the matching document IDs (in relevance score order or value order)
##when complete, each forest returns a sorted iterator to the initiating server
#it waits until each forest has returned an iterator
#it walks the sorted iterators from all the forests and combines the results in sorted order
#it retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
##forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
#it filters returned documents to remove false positive matches
#it optionally transforms documents
##it creates new documents by extracting matching elements, changing structure, changing file format, etc.
#it returns all matching (optionally transformed) documents to the requester
#it saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests
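
The scatter/gather steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not MarkLogic's implementation: the <code>FORESTS</code> data, the scores, and the function names are all invented for the example. Each forest sorts its own matches, and the initiating server lazily merges the sorted iterators and takes one page.

```python
import heapq
from itertools import islice

# Hypothetical data: each forest holds stands, and each stand maps a term to
# (doc_id, score) postings. Names and scores are invented for illustration.
FORESTS = {
    "forest-1": [
        {"lamb": [("doc-3", 0.9), ("doc-1", 0.4)]},
        {"lamb": [("doc-7", 0.7)]},
    ],
    "forest-2": [
        {"lamb": [("doc-5", 0.8)], "wolf": [("doc-6", 0.5)]},
    ],
}


def query_forest(stands, term):
    """Forest role: combine matches from every stand, sorted by score (descending)."""
    matches = []
    for stand in stands:
        matches.extend(stand.get(term, []))
    matches.sort(key=lambda match: match[1], reverse=True)
    return iter(matches)


def run_query(forests, term, page_size):
    """Initiating-server role: merge the per-forest sorted iterators lazily."""
    iterators = [query_forest(stands, term) for stands in forests.values()]
    merged = heapq.merge(*iterators, key=lambda match: match[1], reverse=True)
    # Only the first page is pulled from the merged stream.
    return [doc_id for doc_id, _score in islice(merged, page_size)]


first_page = run_query(FORESTS, "lamb", page_size=3)
print(first_page)
```

Because the merge is lazy, the initiating server in this sketch never materializes more results than the page it returns.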

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized'' evaluation process would compete with the ''parallel'' index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache'' to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
*E-nodes and d-nodes can be scaled independently to match the load.
*You can optimize an e-node by minimizing its ''Compressed Tree Cache'' and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache'' so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache'' because it doesn't have any data.
*You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache'' and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
*You can optimize a query by making it unfiltered.
**This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
*You can optimize a query by eliminating false positives.
**False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
**False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index. It indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element. MarkLogic compromises by indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
**For example, suppose a query searches for the following path: <code>books/poetry/poem</code> The structure index will match documents containing books/poetry and poetry/poem. The resulting documents are a probable match, but could contain false positives, such as <code>books/poetry/fantasy/poetry/poem</code>.
**You can eliminate false positives in structure queries by giving elements unique names. They are unique no matter where they are in the structure, and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name "poemTextLine".
*You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
**It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
**It is expensive to process documents on the initiating server and the requester.
**It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
**The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
**The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.
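
The false-positive example above (<code>books/poetry/poem</code> versus <code>books/poetry/fantasy/poetry/poem</code>) can be reproduced with a toy model. The sketch below assumes a structure index that stores only parent/child pairs, as described above. The document set and function names are hypothetical, and the real index is certainly more involved.

```python
def parent_child_pairs(path):
    """Return the set of (parent, child) pairs along a slash-separated path."""
    steps = path.split("/")
    return set(zip(steps, steps[1:]))


def contains_path(path, query_path):
    """True if the query steps occur as one contiguous run inside the path."""
    steps, wanted = path.split("/"), query_path.split("/")
    return any(
        steps[i:i + len(wanted)] == wanted
        for i in range(len(steps) - len(wanted) + 1)
    )


def index_candidates(doc_paths, query_path):
    """Index stage: keep documents whose indexed pairs cover every query pair."""
    wanted = parent_child_pairs(query_path)
    candidates = []
    for doc_id, paths in doc_paths.items():
        indexed = set().union(*(parent_child_pairs(p) for p in paths))
        if wanted <= indexed:
            candidates.append(doc_id)
    return candidates


def filter_exact(doc_paths, candidates, query_path):
    """Filter stage: open each candidate and verify the full path."""
    return [
        doc_id for doc_id in candidates
        if any(contains_path(p, query_path) for p in doc_paths[doc_id])
    ]


# Hypothetical documents, each described by its element paths.
DOCS = {
    "doc-a": ["books/poetry/poem"],
    "doc-b": ["books/poetry/fantasy/poetry/poem"],
    "doc-c": ["books/prose/essay"],
}

candidates = index_candidates(DOCS, "books/poetry/poem")
exact = filter_exact(DOCS, candidates, "books/poetry/poem")
print(candidates, exact)
```

In this model <code>doc-b</code> survives the index stage because it contains both <code>books/poetry</code> and <code>poetry/poem</code>, and only the filter stage removes it.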

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map'' process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced'' at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses '''term''', '''range''', and '''semantic''' indexes.
*'''A term index''' associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
*'''A range index''' is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
*'''A semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory-mapped files. They run in RAM unless you run out of RAM, at which point the operating system pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to contain all indexes in memory.
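
As a rough illustration of the term and range indexes described above, the following Python sketch builds both as plain dictionaries. The sample documents, the <code>lineSequence</code> values, and the function names are assumptions made for the example. Real MarkLogic indexes are memory-mapped files on disk, not Python dictionaries.

```python
from collections import defaultdict

# Hypothetical documents keyed by document ID.
DOCS = {
    1: "Mary had a little lamb",
    2: "The lamb was little",
    3: "Mary went home",
}


def build_term_index(docs):
    """Term index: each word maps to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def build_range_index(doc_values):
    """Range index as a 'double' index: value -> doc IDs and doc ID -> value."""
    value_to_docs = defaultdict(set)
    for doc_id, value in doc_values.items():
        value_to_docs[value].add(doc_id)
    return dict(value_to_docs), dict(doc_values)


term_index = build_term_index(DOCS)
# One lookup returns every document that contains the term.
print(sorted(term_index["lamb"]))

# Hypothetical typed values, e.g. a "lineSequence" element per document.
value_to_docs, doc_to_value = build_range_index({1: 3, 2: 4, 3: 3})
print(sorted(value_to_docs[3]), doc_to_value[2])
```

The second half of the range index (document ID to value) is what makes sorting by value possible without opening the documents.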

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at sort-merging lists of numbers.

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

=== What specific indexes does MarkLogic have? ===
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents '''as if they contained only:'''
*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>
*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>
*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>
*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>
*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>
*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>
*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.
*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.
*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>
*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>
*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>
*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>
*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>
*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>
*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.
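
One way to picture composability is to treat every index lookup as a set of document IDs and every AND, OR, and NOT as a set operation. The sketch below is a simplified model under that assumption. The tuple-based query syntax, <code>TERM_INDEX</code>, and <code>ALL_DOCS</code> are invented for the example and are not MarkLogic's API.

```python
# Hypothetical index: term -> set of matching document IDs.
TERM_INDEX = {"mary": {1, 3}, "lamb": {1, 2}, "wolf": {4}}
ALL_DOCS = {1, 2, 3, 4}


def evaluate(query):
    """Recursively evaluate a nested query into a set of document IDs."""
    op, *args = query
    if op == "term":
        return set(TERM_INDEX.get(args[0], set()))
    if op == "and":
        return set.intersection(*(evaluate(q) for q in args))
    if op == "or":
        return set.union(*(evaluate(q) for q in args))
    if op == "not":
        return ALL_DOCS - evaluate(args[0])
    raise ValueError(f"unknown operator: {op}")


# "lamb" AND NOT "mary"
lamb_without_mary = evaluate(("and", ("term", "lamb"), ("not", ("term", "mary"))))
# "wolf" OR ("mary" AND "lamb")
nested = evaluate(("or", ("term", "wolf"), ("and", ("term", "mary"), ("term", "lamb"))))
print(lamb_without_mary, nested)
```

Because every sub-expression evaluates to the same kind of value (a set of IDs), expressions can be nested to any depth.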

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Category: MarkLogic]]

What is MarkLogic?

2014-11-14T08:23:30Z

Bowersmt: Changed category

== What is MarkLogic? ==
[http://www.marklogic.com/what-is-marklogic/ MarkLogic] is a [http://www.marklogic.com/what-is-marklogic/inside-marklogic/ '''REST application server''', '''document database''', and '''search engine''']. It is REST through and through. It is built specifically for hypertext documents, links, metadata, URIs, MIME types, and HTTP. It is schema-agnostic because it is automatically aware of the independent structure of each of its JSON and XML documents. It is search-centric because it can search for ''any combination'' of words, values, structures, and links within and across documents. It scales horizontally to hundreds of servers within and between data centers while maintaining ACID-compliant transactions.

MarkLogic has the following [http://www.marklogic.com/what-is-marklogic/enterprise-nosql/ enterprise features]:
* [http://www.marklogic.com/resources/java-developers-guide/ Java APIs]
* [http://en.wikipedia.org/wiki/Node.js Node.js APIs]
* [http://developer.marklogic.com/learn/2009-07-search-api-walkthrough Search] and [http://developer.marklogic.com/learn/arch/search-and-indexing Query]
* [http://www.marklogic.com/blog/can-you-pass-the-acid-test/ ACID Transactions]
* [http://www.marklogic.com/resources/marklogic-high-availability-and-disaster-recovery/resource_download/datasheets/ High Availability] and [http://docs.marklogic.com/guide/database-replication/dbrep_intro#chapter Disaster Recovery]
* [http://www.marklogic.com/resources/marklogic-flexible-replication/resource_download/datasheets/ Replication within and across data centers]
* [http://docs.marklogic.com/guide/admin/security#chapter Government-grade Security]
* [https://docs.marklogic.com/guide/cluster.pdf Scalability] and [https://docs.marklogic.com/guide/admin/database-rebalancing#chapter Elasticity]
* [https://docs.marklogic.com/guide/ec2/managing#chapter On-premise or Cloud Deployment (especially AWS)]
* [http://www.marklogic.com/resources/marklogic-and-hadoop/resource_download/datasheets/ Hadoop for Storage and Compute]
* [http://www.marklogic.com/resources/marklogic-semantics-mlw14/ Semantics]

== What is REST? ==
REST is an architectural style that uses a uniform resource identifier ('''URI''') and a web protocol ('''HTTP/HTTPS''') to request and transfer a representation ('''MIME media type''') of the state of a resource ('''document''') at a point in time from a server to a client.

REST was coined and defined by Roy Fielding in his
[http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm dissertation, ''Architectural Styles and the Design of Network-based Software Architectures'']. REST standardizes and documents the patterns used in the world wide web: self-documenting hypertext with data and metadata (HTML), stateless request/response communication protocol (HTTP/HTTPS), resource locators (URL), multiple representations per URL (MIME types), and downloading code on demand to process resources (JavaScript, CSS, etc.).

REST consists of three main concepts:
* (RE) Representation
* (S) State
* (T) Transfer

== Why is MarkLogic RESTful? ==

=== '''Representation and MarkLogic''' ===
* A representation is a document that represents a resource. It has three requirements: MIME media type, URI, and hypertext.

* '''MIME type:''' Each resource can be represented by one or more types of documents. The MIME media type defines the representation a client requests. A client may request the resource to be represented as JSON, XML, HTML, PNG, PDF, etc.
**'''MarkLogic''' stores any type of document and assigns the appropriate [http://developer.marklogic.com/blog/document-formats-part2 MIME type] to it. It can fully index, query, search, and process JSON, XML, and RDF documents. It meets the REST requirement of being able to transform JSON, XML, and RDF documents into any other MIME type. It knows how to execute JavaScript, XQuery, XSLT, and SPARQL documents. It knows how [http://docs.marklogic.com/guide/cpf/default#chapter to transform into XHTML] the content, formatting, and structure of Microsoft Word, PowerPoint, Excel, textual PDF, DocBook, and CSS documents. (It can also use Microsoft Office to create, edit, and manage content in MarkLogic.) It knows how to [https://docs.marklogic.com/guide/search-dev/binary-document-metadata extract metadata and text from over 138 types of binary documents], such as raster images, vector images, videos, archive files, database files, encoded emails, presentations, spreadsheets, word-processing documents, text formats.
**Few NoSQL databases use MIME types to identify the media type of each document. Most NoSQL databases support only one type of data and it is usually proprietary: columnar, BSON, binary, etc. Most cannot transform from one MIME type to another.

* '''URI:''' A resource is identified by a globally unique identifier (URI). '''MarkLogic''' identifies each document with a [http://developer.marklogic.com/try/rest/page2 unique URI]. A document in MarkLogic is like a row in a table in a schema in a relational database. A URI is liberating. It provides random access to any resource anywhere. It is like being able to retrieve any row in any table in any schema of a relational database without having to know what table and schema the row is stored in.
**Navigating URL hierarchy is fundamental to REST. MarkLogic understands the hierarchy within a URI, which is represented by slashes "/", such as https://www.lds.org/scriptures/ot/gen/1 MarkLogic treats the items between the slashes as folders. The URI of each document automatically places it in a folder in a folder hierarchy. A URI automatically defines the folder hierarchy. In the example URI above, the document for Genesis chapter 1, is located in the Genesis folder, which is located in the Old Testament folder, which is located in the Scriptures folder on lds.org. '''MarkLogic''' indexes the documents in each folder and its subfolders. This makes it fast and easy to retrieve any or all documents in any folder and/or its subfolders.
**Few NoSQL databases use the URI as the primary key for their documents or data. They also don't index the URI hierarchically to filter documents by folder and subfolder.
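
The folder behavior described above can be sketched as an index from each ancestor folder to the URIs beneath it. This is a minimal illustration that assumes slash-separated URIs. The sample URIs and the <code>build_directory_index</code> helper are hypothetical and say nothing about how MarkLogic stores its own URI index.

```python
from collections import defaultdict


def build_directory_index(uris):
    """Map every ancestor folder of a URI to the set of URIs stored beneath it."""
    index = defaultdict(set)
    for uri in uris:
        parts = uri.strip("/").split("/")
        # Every proper prefix of the path is a folder that contains the document.
        for depth in range(1, len(parts)):
            folder = "/" + "/".join(parts[:depth]) + "/"
            index[folder].add(uri)
    return index


# Hypothetical document URIs modeled on the example above.
URIS = [
    "/scriptures/ot/gen/1",
    "/scriptures/ot/gen/2",
    "/scriptures/ot/ex/1",
    "/scriptures/nt/matt/1",
]

directory_index = build_directory_index(URIS)
print(sorted(directory_index["/scriptures/ot/gen/"]))
print(len(directory_index["/scriptures/ot/"]))
```

A lookup on a folder returns the documents in that folder and all of its subfolders in one step.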

* '''Hypertext:''' A document should contain data that represents the resource. The data should be '''human readable and self-documenting''', like JSON, XML, RDF, and HTML. It should be '''linked data''' (i.e. the "hyper" in "hypertext"). A document should contain '''metadata links''' about the resource, such as RDF. It should contain '''action links''' to define what further actions can be done with the resource. It should contain '''related links''' to related resources, such as images, audio, video, related documents, etc. Each related link should define what actions can be done with the referenced resource, such as download it, display a link to it, execute a command against it, etc.
**'''Hypertext''' or '''hypermedia''' documents must have all these features. Hyperlinks are what the "hyper" in hypertext and hypermedia refers to. You can't have REST without ''metadata links'' to define what the data means, ''action links'' to know how to work with the resource, and ''related links'' to connect resources. It should all be human readable and self-documenting so a developer does not have to read documentation to know how to interact with a REST web service and its documents.
** '''MarkLogic''' meets all the requirements for hypertext representation. It is designed around MIME types, URIs, and Linked data. It stores documents with their MIME types as [http://www.w3schools.com/json/ JSON], [http://www.w3schools.com/xml/ XML], [http://www.w3schools.com/webservices/ws_rdf_intro.asp RDF], [http://www.w3schools.com/html/ HTML], etc. These documents are human-readable and self-documenting, which MarkLogic leverages to recognize and index each document's data, data structure, metadata links, action links, and related links. This makes it easy to '''search''', '''query''', '''transform''', and '''deliver''' hypertext documents. MarkLogic can also store any type of binary document and deliver it as a related resource, such as an image, video, or audio. MarkLogic is designed to process simple links and RDF links using [http://en.wikipedia.org/wiki/SPARQL SPARQL] and [https://docs.marklogic.com/xinc XInclude]. MarkLogic can represent links in many formats: [https://docs.marklogic.com/xp XPointer], [https://docs.marklogic.com/guide/semantics/loading#id_97709 RDF/XML], [https://docs.marklogic.com/guide/semantics/loading#id_79194 RDF/JSON], [https://docs.marklogic.com/guide/semantics/loading#id_73211 Turtle], [https://docs.marklogic.com/guide/semantics/loading#id_70596 N-Triples], [https://docs.marklogic.com/guide/semantics/loading#id_61596 N-Quads], and [https://docs.marklogic.com/guide/semantics/loading#id_74485 TriG].
**No other NoSQL database natively indexes and fully processes all the document types required for REST hypertext: JSON, XML, RDF, HTML, CSS, and JavaScript.

=== '''State and MarkLogic''' ===
* State in REST exists on the client and in server documents. It does not exist in the communications protocol or in the server as cached session data.

* '''Server:''' All information needed to process a request must be presented in the request and processed against documents in the database. State must only be in the request and in database documents: state cannot be anywhere else, such as in a session cache. The documents in the server define the state of the server. A REST server should explicitly create a state machine that defines the acceptable actions that can be transacted against documents in specific contexts.
**A REST transaction occurs at a '''point in time'''. The state of the data in the request is unchanging, but the state of the documents and state machines in the database are often changing. Since request state and database state are both required to process the request and since shifting state creates unpredictable results, a REST transaction should run at a point in time with unchanging state. Only an ACID-compliant database can ensure consistent state because it isolates each transaction from every other transaction. The only time REST does not need an ACID-compliant database is when database documents do not change, database state machines do not change, or when clients can live with the resultant level of unreliable and unpredictable results.
**'''MarkLogic''' meets all the requirements for REST state. Its web services are stateless: there is no session cache. It is ACID compliant. It ensures each transaction occurs at a point in time and is isolated from all other transactions. This ensures consistent processing during a transaction -- even across billions of documents. MarkLogic is an [http://en.wikipedia.org/wiki/Multiversion_concurrency_control MVCC] database which provides transaction isolation without slowing the performance of reads -- even when documents being read are being modified simultaneously by other transactions. (Also, like any other ACID database, when multiple updates and deletes compete for the same documents, they will impact each other's performance because change has to be serialized.)
** Most other NoSQL databases are not ACID compliant. They are only suitable for REST services when their documents or data do not change or when the rate of change is slow enough or dispersed enough that it creates an acceptable level of unreliability and unpredictability.
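
The point-in-time reads described above can be illustrated with a toy multiversion store. The sketch only shows the general MVCC idea: readers see the newest version committed at or before their snapshot timestamp. The <code>VersionedStore</code> class and its methods are hypothetical, and it ignores locking, deletes, and everything else a real database needs.

```python
class VersionedStore:
    """Toy MVCC store: every write appends a new timestamped version."""

    def __init__(self):
        self._versions = {}  # uri -> list of (commit_timestamp, content)
        self._clock = 0

    def write(self, uri, content):
        """Commit a new version and return its commit timestamp."""
        self._clock += 1
        self._versions.setdefault(uri, []).append((self._clock, content))
        return self._clock

    def snapshot(self):
        """Return the timestamp a new read transaction would run at."""
        return self._clock

    def read(self, uri, at):
        """Return the newest version committed at or before timestamp `at`."""
        visible = [c for ts, c in self._versions.get(uri, []) if ts <= at]
        return visible[-1] if visible else None


store = VersionedStore()
store.write("/poems/1", "draft")
reader_timestamp = store.snapshot()      # a read transaction starts here
store.write("/poems/1", "final")         # a later update does not disturb it

old_view = store.read("/poems/1", reader_timestamp)
new_view = store.read("/poems/1", store.snapshot())
print(old_view, new_view)
```

The reader that started before the second write keeps seeing the earlier version and is never blocked by the writer.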

* '''Client:''' A REST client, such as a web browser or spider, locally maintains transactional state, such as what to do in response to documents and result codes that are returned from server transactions. An application exists only in the client -- not in the server (although, a server may deliver application code to a client, such as when a web server downloads HTML, CSS, and JavaScript to a browser). Client application code decides when to execute web service calls and it ties the results together to accomplish its purpose. This allows multiple authorized applications to reuse web services for a variety of purposes.
**The server helps the client know what web service calls are available by providing action links with each response. Action links are contextual and the context is based on the application account, user account, database documents, links within and between documents, the server state machine, etc. Through action links, the server can inform a client application what web service calls are permissible in any given context.
**'''MarkLogic''' meets the needs of client applications through its built-in ability to process and send action links to clients based on context. MarkLogic supports RDF triples and SPARQL, which enables context to be defined across applications, users, document state, links between documents, server state machine, etc. MarkLogic processes triples very quickly, which enables context to scale to billions of documents, millions of users, etc.
**'''MarkLogic''' also uses application and user permissions to filter which documents are returned to clients. MarkLogic does this automatically and transparently by adding security filtering constraints into every search and query. This ensures no account can access unauthorized documents. This is fast because all security permissions are built-into MarkLogic's indexes -- which allows document-level security to scale across billions of documents.
** Most other NoSQL databases do not provide government-grade, document-level security and they also do not support RDF triples and SPARQL.

Because MarkLogic can provide both the web service and database in one server, it is easy to use the state of the documents and the

=== '''Transfer and MarkLogic''' ===
* '''Transfer''' in REST is a communication protocol that enables a client to send a hypertext request to a server and receive back a hypertext response. The transfer must be stateless and be a request/response protocol. It must have human readable, self-descriptive headers. The header must contain metadata about the request, such as the requested resource URI, MIME type of the resource, action to perform on the resource (such as get, put, post, patch, and delete). [http://www.w3schools.com/tags/ref_httpmethods.asp '''HTTP'''] (Hypertext transfer protocol) and '''HTTPS''' (secure HTTP) are designed specifically for REST (that is why they have "hypertext" in their name).

* All '''MarkLogic''' communication is through HTTP and HTTPS REST services (except for its SQL JDBC feature). This includes all internal cluster communication. MarkLogic provides out-of-the-box REST interfaces for manipulating resources (insert, update, delete, query, search, transform, etc.) and administering MarkLogic (REST app servers, databases, indexes, clusters, etc). MarkLogic makes it very easy to create custom REST services because everything in MarkLogic is built around REST and because it provides simple and powerful application server APIs.

== MarkLogic Query and Search ==
Relational databases (and a few NoSQL databases) use a full table scan to retrieve every row in a table unless a query accesses less than 20% of the rows, in which case they use an index. MarkLogic is different: it does not do full table scans. It needs all queries to be resolved out of the indexes. For that reason, it indexes most things and allows you to index more.

MarkLogic can query and search -- and use both in combination. MarkLogic uses indexes to execute searches and queries. The same indexes are used for both. The difference between search and query is how documents are sorted. Searches sort by relevance and queries sort by values. MarkLogic uses indexes to extract words, values, structures, and links out of documents.

This enables you to search and query for documents as if they contained only
*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>
*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>
*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>
*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>
*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>
*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>
*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.
*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.
*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>
*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>
*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>
*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>
*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>
*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>
*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.

The [[MarkLogic Query and Search]] page provides more detail about how MarkLogic's unique indexes work and how they work together.

[[Category: MarkLogic]]

MarkLogic

2014-11-14T08:22:52Z

Bowersmt: Changed category

What is MarkLogic?

2014-11-14T08:17:04Z

Bowersmt: Changed category

== What is MarkLogic? ==
[http://www.marklogic.com/what-is-marklogic/ MarkLogic] is a [http://www.marklogic.com/what-is-marklogic/inside-marklogic/ '''REST application server''', '''document database''', and '''search engine''']. It is REST through and through. It is built specifically for hypertext documents, links, metadata, URIs, MIME types, and HTTP. It is schema-agnostic because it is automatically aware of the independent structure of each of its JSON and XML documents. It is search-centric because it can search for ''any combination'' of words, values, structures, and links within and across documents. It scales horizontally to hundreds of servers within and between data centers while maintaining ACID-compliant transactions.

MarkLogic has the following [http://www.marklogic.com/what-is-marklogic/enterprise-nosql/ enterprise features]:
* [http://www.marklogic.com/resources/java-developers-guide/ Java APIs]
* [http://en.wikipedia.org/wiki/Node.js Node.js APIs]
* [http://developer.marklogic.com/learn/2009-07-search-api-walkthrough Search] and [http://developer.marklogic.com/learn/arch/search-and-indexing Query]
* [http://www.marklogic.com/blog/can-you-pass-the-acid-test/ ACID Transactions]
* [http://www.marklogic.com/resources/marklogic-high-availability-and-disaster-recovery/resource_download/datasheets/ High Availability] and [http://docs.marklogic.com/guide/database-replication/dbrep_intro#chapter Disaster Recovery]
* [http://www.marklogic.com/resources/marklogic-flexible-replication/resource_download/datasheets/ Replication within and across data centers]
* [http://docs.marklogic.com/guide/admin/security#chapter Government-grade Security]
* [https://docs.marklogic.com/guide/cluster.pdf Scalability] and [https://docs.marklogic.com/guide/admin/database-rebalancing#chapter Elasticity]
* [https://docs.marklogic.com/guide/ec2/managing#chapter On-premise or Cloud Deployment (especially AWS)]
* [http://www.marklogic.com/resources/marklogic-and-hadoop/resource_download/datasheets/ Hadoop for Storage and Compute]
* [http://www.marklogic.com/resources/marklogic-semantics-mlw14/ Semantics]

== What is REST? ==
REST is an architectural style that uses a uniform resource identifier ('''URI''') and a web protocol ('''HTTP/HTTPS''') to request and transfer a representation ('''MIME media type''') of the state of a resource ('''document''') at a point in time from a server to a client.

REST was coined and defined by Roy Fielding in his
[http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm dissertation, ''Architectural Styles and the Design of Network-based Software Architectures'']. REST standardizes and documents the patterns used in the world wide web: self-documenting hypertext with data and metadata (HTML), stateless request/response communication protocol (HTTP/HTTPS), resource locators (URL), multiple representations per URL (MIME types), and downloading code on demand to process resources (JavaScript, CSS, etc.).

REST consists of three main concepts:
* (RE) Representation
* (S) State
* (T) Transfer

== Why is MarkLogic RESTful? ==

=== '''Representation and MarkLogic''' ===
* A representation is a document that represents a resource. It has three requirements: MIME media type, URI, and hypertext.

* '''MIME type:''' Each resource can be represented by one or more types of documents. The MIME media type defines the representation a client requests. A client may request the resource to be represented as JSON, XML, HTML, PNG, PDF, etc.
**'''MarkLogic''' stores any type of document and assigns the appropriate [http://developer.marklogic.com/blog/document-formats-part2 MIME type] to it. It can fully index, query, search, and process JSON, XML, and RDF documents. It meets the REST requirement of being able to transform JSON, XML, and RDF documents into any other MIME type. It knows how to execute JavaScript, XQuery, XSLT, and SPARQL documents. It knows how [http://docs.marklogic.com/guide/cpf/default#chapter to transform into XHTML] the content, formatting, and structure of Microsoft Word, PowerPoint, Excel, textual PDF, DocBook, and CSS documents. (It can also use Microsoft Office to create, edit, and manage content in MarkLogic.) It knows how to [https://docs.marklogic.com/guide/search-dev/binary-document-metadata extract metadata and text from over 138 types of binary documents], such as raster images, vector images, videos, archive files, database files, encoded emails, presentations, spreadsheets, word-processing documents, and text formats.
**Few NoSQL databases use MIME types to identify the media type of each document. Most NoSQL databases support only one type of data and it is usually proprietary: columnar, BSON, binary, etc. Most cannot transform from one MIME type to another.

* '''URI:''' A resource is identified by a globally unique identifier (URI). '''MarkLogic''' identifies each document with a [http://developer.marklogic.com/try/rest/page2 unique URI]. A document in MarkLogic is like a row in a table in a schema in a relational database. A URI is liberating. It provides random access to any resource anywhere. It is like being able to retrieve any row in any table in any schema of a relational database without having to know what table and schema the row is stored in.
**Navigating the URL hierarchy is fundamental to REST. MarkLogic understands the hierarchy within a URI, which is represented by slashes "/", such as https://www.lds.org/scriptures/ot/gen/1. MarkLogic treats the items between the slashes as folders, so the URI of each document automatically places it in a folder hierarchy. In the example URI above, the document for Genesis chapter 1 is located in the Genesis folder, which is located in the Old Testament folder, which is located in the Scriptures folder on lds.org. '''MarkLogic''' indexes the documents in each folder and its subfolders. This makes it fast and easy to retrieve any or all documents in any folder and/or its subfolders.
**Few NoSQL databases use the URI as the primary key for their documents or data. They also don't index the URI hierarchically to filter documents by folder and subfolder.
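
The folder hierarchy described above can be illustrated with a minimal Python sketch. It only shows how the slashes in a URI imply a set of nested folders and how a folder-to-documents lookup could be built from them. The function names <code>folders_for</code> and <code>build_folder_index</code> are hypothetical, and this is not a description of how MarkLogic implements its URI index internally.

```python
from urllib.parse import urlparse

def folders_for(uri):
    """Return every ancestor folder implied by the slashes in a URI path."""
    parts = [p for p in urlparse(uri).path.split("/") if p]
    # Every prefix of the path except the last segment (the document) is a folder.
    return ["/" + "/".join(parts[:i]) + "/" for i in range(1, len(parts))]

def build_folder_index(uris):
    """Map each folder to the URIs stored in it or in any of its subfolders."""
    index = {}
    for uri in uris:
        for folder in folders_for(uri):
            index.setdefault(folder, []).append(uri)
    return index

uris = [
    "https://www.lds.org/scriptures/ot/gen/1",
    "https://www.lds.org/scriptures/ot/gen/2",
    "https://www.lds.org/scriptures/ot/ex/1",
]
folder_index = build_folder_index(uris)
# All documents in the Genesis folder:
print(folder_index["/scriptures/ot/gen/"])
# All documents in the Old Testament folder and its subfolders:
print(folder_index["/scriptures/ot/"])
```

Because each document is registered under all of its ancestor folders, retrieving a folder together with its subfolders is a single lookup in this sketch.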

* '''Hypertext:''' A document should contain data that represents the resource. The data should be '''human readable and self-documenting''', like JSON, XML, RDF, and HTML. It should be '''linked data''' (i.e. the "hyper" in "hypertext"). A document should contain '''metadata links''' about the resource, such as RDF. It should contain '''action links''' to define what further actions can be done with the resource. It should contain '''related links''' to related resources, such as images, audio, video, related documents, etc. Each related link should define what actions can be done with the referenced resource, such as download it, display a link to it, execute a command against it, etc.
**'''Hypertext''' or '''hypermedia''' documents must have all these features. Hyperlinks are what the "hyper" in hypertext and hypermedia refers to. You can't have REST without ''metadata links'' to define what the data means, ''action links'' to know how to work with the resource, and ''related links'' to connect resources. It should all be human readable and self-documenting so a developer does not have to read documentation to know how to interact with a REST web service and its documents.
** '''MarkLogic''' meets all the requirements for hypertext representation. It is designed around MIME types, URIs, and Linked data. It stores documents with their MIME types as [http://www.w3schools.com/json/ JSON], [http://www.w3schools.com/xml/ XML], [http://www.w3schools.com/webservices/ws_rdf_intro.asp RDF], [http://www.w3schools.com/html/ HTML], etc. These documents are human-readable and self-documenting, which MarkLogic leverages to recognize and index each document's data, data structure, metadata links, action links, and related links. This makes it easy to '''search''', '''query''', '''transform''', and '''deliver''' hypertext documents. MarkLogic can also store any type of binary document and deliver it as a related resource, such as an image, video, or audio. MarkLogic is designed to process simple links and RDF links using [http://en.wikipedia.org/wiki/SPARQL SPARQL] and [https://docs.marklogic.com/xinc XInclude]. MarkLogic can represent links in many formats: [https://docs.marklogic.com/xp XPointer], [https://docs.marklogic.com/guide/semantics/loading#id_97709 RDF/XML], [https://docs.marklogic.com/guide/semantics/loading#id_79194 RDF/JSON], [https://docs.marklogic.com/guide/semantics/loading#id_73211 Turtle], [https://docs.marklogic.com/guide/semantics/loading#id_70596 N-Triples], [https://docs.marklogic.com/guide/semantics/loading#id_61596 N-Quads], and [https://docs.marklogic.com/guide/semantics/loading#id_74485 TriG].
**No other NoSQL database natively indexes and fully processes all the document types required for REST hypertext: JSON, XML, RDF, HTML, CSS, and JavaScript.

=== '''State and MarkLogic''' ===
* State in REST exists on the client and in server documents. It does not exist in the communications protocol or in the server as cached session data.

* '''Server:''' All information needed to process a request must be presented in the request and processed against documents in the database. State must only be in the request and in database documents: state cannot be anywhere else, such as in a session cache. The documents in the server define the state of the server. A REST server should explicitly create a state machine that defines the acceptable actions that can be transacted against documents in specific contexts.
**A REST transaction occurs at a '''point in time'''. The state of the data in the request is unchanging, but the state of the documents and state machines in the database are often changing. Since request state and database state are both required to process the request and since shifting state creates unpredictable results, a REST transaction should run at a point in time with unchanging state. Only an ACID-compliant database can ensure consistent state because it isolates each transaction from every other transaction. The only time REST does not need an ACID-compliant database is when database documents do not change, database state machines do not change, or when clients can live with the resultant level of unreliable and unpredictable results.
**'''MarkLogic''' meets all the requirements for REST state. Its web services are stateless: there is no session cache. It is ACID compliant. It ensures each transaction occurs at a point in time and is isolated from all other transactions. This ensures consistent processing during a transaction -- even across billions of documents. MarkLogic is an [http://en.wikipedia.org/wiki/Multiversion_concurrency_control MVCC] database which provides transaction isolation without slowing the performance of reads -- even when documents being read are being modified simultaneously by other transactions. (Also, like any other ACID database, when multiple updates and deletes compete for the same documents, they will impact each other's performance because change has to be serialized.)
** Most other NoSQL databases are not ACID compliant. They are only suitable for REST services when their documents or data do not change or when the rate of change is slow enough or dispersed enough that it creates an acceptable level of unreliability and unpredictability.
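
As a rough illustration of why multiversion concurrency control lets a read run "at a point in time" while writes continue, here is a minimal Python sketch. The class <code>VersionedStore</code> and its methods are hypothetical names invented for this example. It assumes a single process and a simple integer clock, and it is not MarkLogic's actual implementation.

```python
class VersionedStore:
    """Toy multiversion store: writes append new versions, reads pick a snapshot."""

    def __init__(self):
        self.clock = 0        # monotonically increasing commit timestamp
        self.versions = {}    # uri -> list of (commit_timestamp, content)

    def write(self, uri, content):
        # A write never overwrites: it appends a new version with a new timestamp.
        self.clock += 1
        self.versions.setdefault(uri, []).append((self.clock, content))
        return self.clock

    def snapshot(self):
        # A reader remembers the timestamp at which its transaction started.
        return self.clock

    def read(self, uri, as_of):
        # Return the newest version committed at or before the reader's timestamp.
        visible = [c for ts, c in self.versions.get(uri, []) if ts <= as_of]
        return visible[-1] if visible else None

store = VersionedStore()
store.write("/docs/a.json", "version 1")
reader_time = store.snapshot()              # a read transaction starts here
store.write("/docs/a.json", "version 2")    # a concurrent update commits later

print(store.read("/docs/a.json", reader_time))       # the reader's view is unchanged
print(store.read("/docs/a.json", store.snapshot()))  # a new reader sees the update
```

The reader that started before the update keeps seeing a consistent state without blocking the writer, which is the property the paragraph above relies on.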

* '''Client:''' A REST client, such as a web browser or spider, locally maintains transactional state, such as what to do in response to documents and result codes that are returned from server transactions. An application exists only in the client -- not in the server (although, a server may deliver application code to a client, such as when a web server downloads HTML, CSS, and JavaScript to a browser). Client application code decides when to execute web service calls and it ties the results together to accomplish its purpose. This allows multiple authorized applications to reuse web services for a variety of purposes.
**The server helps the client know what web service calls are available by providing action links with each response. Action links are contextual and the context is based on the application account, user account, database documents, links within and between documents, the server state machine, etc. Through action links, the server can inform a client application what web service calls are permissible in any given context.
**'''MarkLogic''' meets the needs of client applications through its built-in ability to process and send action links to clients based on context. MarkLogic supports RDF triples and SPARQL, which enables context to be defined across applications, users, document state, links between documents, server state machine, etc. MarkLogic processes triples very quickly, which enables context to scale to billions of documents, millions of users, etc.
**'''MarkLogic''' also uses application and user permissions to filter which documents are returned to clients. MarkLogic does this automatically and transparently by adding security filtering constraints into every search and query. This ensures no account can access unauthorized documents. This is fast because all security permissions are built-into MarkLogic's indexes -- which allows document-level security to scale across billions of documents.
** Most other NoSQL databases do not provide government-grade, document-level security and they also do not support RDF triples and SPARQL.
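
The idea of adding a security constraint to every search can be sketched in a few lines of Python. The sketch assumes a hypothetical index that maps each role to the IDs of the documents it may read. The names <code>role_index</code> and <code>secure_search</code> are illustrative only and do not correspond to MarkLogic APIs.

```python
# Hypothetical permission index: role -> IDs of documents that role may read.
role_index = {
    "analyst": {1, 3},
    "admin": {2, 3},
}

def readable_by(roles):
    """Union of the document IDs readable by any of the caller's roles."""
    allowed = set()
    for role in roles:
        allowed |= role_index.get(role, set())
    return allowed

def secure_search(matching_ids, roles):
    """Intersect the query result with the permission constraint."""
    return matching_ids & readable_by(roles)

query_matches = {1, 2, 3}   # IDs matched by the rest of the query
print(secure_search(query_matches, {"analyst"}))
```

Since the constraint is just another set of document IDs, it combines with the rest of the query like any other index lookup.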

Because MarkLogic can provide both the web service and database in one server, it is easy to use the state of the documents and the

=== '''Transfer and MarkLogic''' ===
* '''Transfer''' in REST is a communication protocol that enables a client to send a hypertext request to a server and receive back a hypertext response. The transfer must be stateless and be a request/response protocol. It must have human readable, self-descriptive headers. The header must contain metadata about the request, such as the requested resource URI, the MIME type of the resource, and the action to perform on the resource (such as get, put, post, patch, and delete). [http://www.w3schools.com/tags/ref_httpmethods.asp '''HTTP'''] (Hypertext Transfer Protocol) and '''HTTPS''' (secure HTTP) are designed specifically for REST (that is why they have "hypertext" in their name).

* All '''MarkLogic''' communication is through HTTP and HTTPS REST services (except for its SQL JDBC feature). This includes all internal cluster communication. MarkLogic provides out-of-the-box REST interfaces for manipulating resources (insert, update, delete, query, search, transform, etc.) and administering MarkLogic (REST app servers, databases, indexes, clusters, etc.). MarkLogic makes it very easy to create custom REST services because everything in MarkLogic is built around REST and because it provides simple and powerful application server APIs.
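
To make the header metadata concrete, the following Python sketch builds (but does not send) an HTTP request that carries the three pieces of information named above: the resource URI, the requested MIME type, and the action. The URI is a placeholder on example.org and the helper <code>build_request</code> is a hypothetical name. It is not a MarkLogic endpoint or client API.

```python
from urllib.request import Request

def build_request(uri, method="GET", accept="application/json"):
    """Build an HTTP request whose headers describe the desired representation."""
    return Request(uri, method=method, headers={"Accept": accept})

# Ask for an XML representation of a (hypothetical) document resource.
req = build_request("http://example.org/documents/thisDoc", accept="application/xml")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Everything the server needs is in the request itself, which is what allows the exchange to stay stateless.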

== MarkLogic Query and Search ==
MarkLogic differs from relational databases (and a few NoSQL databases), which retrieve every row with a full table scan unless a query accesses less than roughly 20% of the rows, in which case they use an index. MarkLogic does not do full table scans. It needs all queries to be resolved out of the indexes. For that reason, it indexes most things and allows you to index more.

MarkLogic can query and search -- and use both in combination. MarkLogic uses indexes to execute searches and queries. The same indexes are used for both. The difference between search and query is how documents are sorted. Searches sort by relevance and queries sort by values. MarkLogic uses indexes to extract words, values, structures, and links out of documents.

This enables you to search and query for documents as if they contained only
*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>
*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>
*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>
*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>
*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>
*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>
*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.
*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.
*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>
*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>
*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>
*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>
*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>
*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>
*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.

The [[MarkLogic Query and Search]] page provides more detail about how MarkLogic's unique indexes work and how they work together.

[[Category: MarkLogic]]

MarkLogic

2014-11-14T08:15:38Z

Bowersmt: Fixed category to MarkLogic

MarkLogic Query and Search

2014-11-14T08:14:26Z

Bowersmt: Indented a line

MarkLogic Query and Search

2014-11-14T08:12:50Z

Bowersmt: Created

== MarkLogic Query and Search ==
Indexes are central to everything you do in MarkLogic. They are central to how MarkLogic stores data, scales massively, and processes data quickly. They are central to how you develop because you have to use MarkLogic's indexes to query and search documents. You have to understand how MarkLogic's indexes work to be able to create fast queries. And you have to know how to model document structure to work best with indexes.

=== How does MarkLogic compare to a relational database? ===
A document in MarkLogic is very much like a row in a relational database, and a document type ties documents together like a table ties rows together. Both use indexes to resolve queries without having to scan through rows or documents.

Unlike relational databases, MarkLogic can't do full table scans; queries must use indexes. In contrast, relational databases often use a full table scan to retrieve every row in a table. A rule of thumb in relational databases is that it is faster to directly read in ''all'' the rows when a query accesses more than roughly 25% of the rows. This is because relational databases use variations of the B-Tree index, which requires 3-4 random IOs to find one row and 1 IO to retrieve the row: this is 4-5 total IOs per row. B-Tree indexes work best for single row lookups, but they become increasingly costly the more rows they retrieve. If an index has to look up 25% of its rows with 4 IOs per row, the database does the same amount of IO as it would by reading in all of the rows (1 IO per row). Since sequential IO on disks is faster than random IO, relational databases prefer full table scans when accessing more than about 25% of rows in a table. On analytical data warehouses, full table scans are always preferred.
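
The break-even arithmetic in the paragraph above can be checked with a short Python sketch. It uses only the figures given in the text (4 IOs per row through the index, 1 IO per row for a scan). The function names and the table size of one million rows are assumptions made for illustration, and the model ignores the extra speed advantage of sequential IO.

```python
def index_ios(total_rows, fraction_read, ios_per_row=4):
    """Random IOs needed to fetch a fraction of the rows through a B-Tree index."""
    return total_rows * fraction_read * ios_per_row

def scan_ios(total_rows):
    """IOs needed to read every row in a full table scan (1 IO per row)."""
    return total_rows

rows = 1_000_000  # assumed table size, for illustration only
for fraction in (0.05, 0.25, 0.50):
    print(fraction, index_ios(rows, fraction), scan_ios(rows))
```

At a fraction of 0.25 the two costs are equal. Below it the index does less IO, and above it the full scan does less.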

In addition, a relational database [https://docs.oracle.com/database/121/TGSQL/tgsql_optop.htm#TGSQL234 often directly scans through index data] rather than stepping through its B-Tree structure. Databases use a variety of index scans: full scans, range scans, skip scans, join scans, etc. This is because index and table records are stored sequentially on disk. It is often faster to take advantage of the speed of sequential disk IO by loading all index or table rows directly into RAM and processing them there than to walk a B-Tree structure and do multiple slow random IOs to retrieve each record. For the same reason, row and index scans work well in columnar NoSQL databases.

=== Why do MarkLogic queries have to use Indexes? ===
To maximize parallelism and to spread processing throughout a cluster, MarkLogic spreads documents across the cluster. MarkLogic automatically creates multiple shards per server and spreads them across all the servers in the cluster. Documents are placed in a shard in the order they are created or modified; thus, documents of all types may be randomly intermixed in a shard. Since a shard is not organized by document type, doing a sequential scan to resolve a query would require retrieving and processing all documents in the database. Full scans are not practical in MarkLogic. All queries and searches need to use indexes. For this reason, MarkLogic automatically indexes almost everything. You can also create additional specialized indexes. MarkLogic indexes are not B-Tree indexes; they are designed to be sharded and queried in parallel across stands, forests, and a database cluster that spans many servers.

=== How does MarkLogic scale horizontally? ===
Documents in MarkLogic belong to a database. Within a database are one or more servers. Each server may contain zero or more "forests". Within each forest are one or more "stands". Within each stand are one or more "trees". A document is hierarchical so it is called a "tree". (MarkLogic uses the terms "stand" and "forest" instead of "shard", but they have the same meaning.)

A client contacts one of the servers in a MarkLogic cluster and initiates a query or search. Because MarkLogic communicates through REST web services, a load balancer can spread requests across the servers in a MarkLogic cluster. The initiating server does the following:
#it validates the query request from the client
#it parses and optimizes the query
#it sends the query to all servers that have forests in the database
##each server sends it to its forests
##each forest sends it to each stand
##each stand uses its indexes to find matching documents
###all stands and forests query the indexes in parallel
##forests combine the matching document IDs from all their stands
##forests sort the matching document IDs (in relevance score order or value order)
##when complete, each forest returns a sorted iterator to the initiating server
#when each forest has returned an iterator, it walks the sorted iterators from all the forests and combines the results in sorted order
#it retrieves matching documents from its ''Expanded Tree Cache'' or from the forests when documents are not cached
##forests retrieve documents from their ''Compressed Tree Cache'' or from disk when documents are not cached
#it filters returned documents to remove false positive matches
#it optionally transforms documents
##it creates new documents by extracting matching elements, changing structure, changing file format, etc.
#it returns all matching (optionally transformed) documents to the requester
#it saves resulting document IDs when paging query results, so it can return subsequent pages of matching documents in subsequent requests
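
The scatter-gather part of the steps above can be sketched in Python. Each forest is modeled as a list of stands, each stand as a dictionary from a term to <code>(score, doc_id)</code> pairs, and the initiating server merges the sorted iterators it gets back. All names and the data layout are hypothetical simplifications. The sketch runs sequentially and leaves out caching, filtering, and transformation.

```python
import heapq
from itertools import islice

def forest_results(stands, term):
    """One forest: collect matches from every stand, return a sorted iterator."""
    matches = []
    for stand_index in stands:
        matches.extend(stand_index.get(term, []))
    # Sort by relevance score, highest first.
    return iter(sorted(matches, key=lambda m: -m[0]))

def evaluate(forests, term, page_size):
    """Initiating server: merge the forests' sorted iterators and take one page."""
    iterators = [forest_results(stands, term) for stands in forests]
    merged = heapq.merge(*iterators, key=lambda m: -m[0])
    return [doc_id for _, doc_id in islice(merged, page_size)]

forests = [
    [  # forest 1 has two stands
        {"lamb": [(0.9, "doc1"), (0.4, "doc2")]},
        {"lamb": [(0.7, "doc3")]},
    ],
    [  # forest 2 has one stand
        {"lamb": [(0.8, "doc4")]},
    ],
]
print(evaluate(forests, "lamb", page_size=3))
```

Because every forest returns its results already sorted, the initiating server only has to walk the iterators far enough to fill the requested page.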

The initiating server's process is CPU intensive, serialized, and blocks while waiting on forests. To prevent this from being a bottleneck to the cluster, you can dedicate nodes in the cluster to do nothing but evaluate queries. These are called e-nodes (evaluation nodes). These nodes don't hold data. If they held data, the ''serialized'' evaluation process would compete with the ''parallel'' index matching process. Instead e-nodes are dedicated to validating, parsing, sorting, filtering, transforming, and paginating documents. They cache documents in the ''Expanded Tree Cache'' to minimize the number of documents they have to retrieve from data nodes (d-nodes). D-nodes only do parallel index processing and sorting. They cache documents in their ''Compressed Tree Cache'' to reduce disk IO.

=== How do you optimize a MarkLogic cluster to run queries faster? ===
*E-nodes and d-nodes can be scaled independently to match the load.
*You can optimize an e-node by minimizing its ''Compressed Tree Cache'' and maximizing its ''Expanded Tree Cache''. An e-node needs a big ''Expanded Tree Cache'' so it doesn't have to retrieve as many documents from d-nodes. An e-node doesn't need a ''Compressed Tree Cache'' because it doesn't have any data.
*You can optimize a d-node by doing the opposite of the e-node: maximize the ''Compressed Tree Cache'' and minimize the ''Expanded Tree Cache''. You can also optimize d-nodes by making sure each d-node runs on similar hardware and that each forest has a similar number of documents. This is important because all forests process queries in parallel and the slowest forest holds up each query: e-nodes have to wait until the slowest forest finishes.
*You can optimize a query by making it unfiltered.
**This means it is completely resolved by the indexes. Because no filtering is required, the initiating server doesn't have to verify that documents match.
*You can optimize a query by eliminating false positives.
**False positives occur when a query identifies a possible match for a document, but it cannot prove it is a match until the filtering process opens the document and verifies it.
**False positives happen because MarkLogic only partially indexes certain document items. MarkLogic cannot index everything because it would take too much time and space. For example, MarkLogic has a structure index. It indexes relationships between elements in a document. MarkLogic cannot afford to index each element's path from itself to every other element. MarkLogic compromises by indexing only parent/child relationships. This is effective in eliminating most non-matching documents, but it can produce false positives. To know whether a document is an ''exact'' match, MarkLogic must open the document and verify it.
**For example, suppose a query searches for the following path: <code>books/poetry/poem</code>. The structure index will match documents containing books/poetry and poetry/poem. The resulting documents are a probable match, but could contain false positives, such as <code>books/poetry/fantasy/poetry/poem</code>.
**You can eliminate false positives in structure queries by giving elements unique names. They are unique no matter where they are in the structure and they are more self-descriptive. Instead of naming an element "line", which can mean many things, you can use the precise name "poemTextLine".
*You can optimize a query by greatly limiting how many documents it matches. Design each query to return as few documents as possible.
**It is expensive to transmit documents. Matching documents have to travel from stands to forests to the initiating server, and to the requester.
**It is expensive to process documents on the initiating server and the requester.
**It takes processing time on d-nodes to compare lists of document IDs. The fewer the matching document IDs in one part of the query, the faster the entire query runs.
**The more matching document IDs, the more work MarkLogic has to do in a d-node to compare, union, intersect, and subtract them from matching document IDs returned by other parts of the query.
**The more matching document IDs, the more sorting MarkLogic has to do in the stands, forests, and initiating server.
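
The false-positive example above (<code>books/poetry/poem</code> versus <code>books/poetry/fantasy/poetry/poem</code>) can be reproduced with a small Python sketch. It models the structure index as a set of parent/child pairs and the filter step as an exact check of the path inside the document. The function names are hypothetical, and the model is a deliberate simplification of the behavior described in the text.

```python
def parent_child_pairs(path):
    """Split a path into the parent/child pairs a structure index would store."""
    steps = path.split("/")
    return set(zip(steps, steps[1:]))

def index_match(doc_paths, query):
    """Unfiltered check: every parent/child pair of the query occurs in the document."""
    indexed = set()
    for path in doc_paths:
        indexed |= parent_child_pairs(path)
    return parent_child_pairs(query) <= indexed

def filtered_match(doc_paths, query):
    """Filter step: open the document and verify the steps appear contiguously."""
    wanted = query.split("/")
    for path in doc_paths:
        steps = path.split("/")
        for i in range(len(steps) - len(wanted) + 1):
            if steps[i:i + len(wanted)] == wanted:
                return True
    return False

query = "books/poetry/poem"
true_match = ["books/poetry/poem"]
false_positive = ["books/poetry/fantasy/poetry/poem"]

print(index_match(true_match, query), filtered_match(true_match, query))
print(index_match(false_positive, query), filtered_match(false_positive, query))
```

Both documents pass the index check, but only the first survives filtering, which is why an unfiltered query is only safe when the index alone can prove the match.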

=== How does a MarkLogic query compare to map/reduce? ===
A MarkLogic query or search is similar to map/reduce. When you query or search for documents in a database, MarkLogic spreads the work across all the servers to be executed in parallel. This is like a ''map'' process because indexes filter out non-matching documents by combining term, range, and semantic indexes according to the query map. The results are then ''reduced'' at the forest level and again at the database level and again at the initiating server. This parallel division of responsibilities enables MarkLogic to scale.

== Term, Range, and Semantic Indexes ==
MarkLogic indexes are not B-Tree indexes. MarkLogic uses '''term''', '''range''', and '''semantic''' indexes.
*'''A term index''' associates a term with every document that contains the term. For example, in a word index, every word in the database is stored in the index. Associated with each word in the index is a list of each document ID that contains the term.
*'''A range index''' is a double term index: it contains each term and all document IDs that contain that term, and it contains each document ID and all the terms that are in the document.
*'''A semantic index''' is similar to a range index but it is optimized to store triples (three pieces of data: subject, predicate, and object).
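
A minimal Python sketch of the first two index types, following only the descriptions in the list above: a term index maps each term to the document IDs that contain it, and a range index keeps the mapping in both directions. The function names and sample data are hypothetical. Real MarkLogic indexes are sharded, typed, and memory mapped, none of which is modeled here.

```python
from collections import defaultdict

def build_term_index(docs):
    """Term index: word -> set of IDs of the documents containing that word."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def build_range_index(values):
    """Range index: value -> document IDs, and document ID -> values."""
    by_value = defaultdict(set)
    by_doc = defaultdict(set)
    for doc_id, value in values.items():
        by_value[value].add(doc_id)
        by_doc[doc_id].add(value)
    return by_value, by_doc

docs = {
    1: "Mary had a little lamb",
    2: "The lamb was white",
    3: "Mary went to school",
}
term_index = build_term_index(docs)
print(term_index["lamb"])   # one lookup returns all documents containing the term

line_sequence = {1: 3, 2: 7, 3: 3}   # assumed typed value per document
by_value, by_doc = build_range_index(line_sequence)
print(by_value[3], by_doc[2])
```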

Term indexes are fast. Give the index a term, and with one lookup, it returns ''all'' documents that contain the term. Term indexes are memory-mapped files. They run in RAM unless you run out of RAM, in which case the operating system pages them to and from disk. Thus, it is best to have enough RAM in a MarkLogic server to contain all indexes in memory.

Unlike B-Tree indexes, term indexes are easily sharded within and between servers. MarkLogic takes advantage of this to run queries in parallel within and across servers. This is how MarkLogic can scale to billions of documents and still return queries and searches in milliseconds.

When documents are inserted or updated, MarkLogic indexes them in RAM (as well as appends the changes to a journal on disk). When documents in RAM start to consume too much RAM, MarkLogic writes them to disk as a "stand" in a "forest" and starts creating a new "stand" in RAM.

When you query or search for documents, MarkLogic goes to the indexes in the stands (which are cached in RAM) and evaluates the query in parallel. Each stand index returns a list of document IDs (a list of numbers) that match the query. MarkLogic takes the document IDs returned by each stand and merges them into a final list of document IDs, which it uses to retrieve the documents from cache or disk. This works well because computers are very fast at sort-merging lists of numbers.

MarkLogic uses the same indexes for both queries and searches. This allows you to combine search and query expressions. The main difference between search and query is how documents are sorted: searches sort by relevance and queries sort by values. Another difference is that queries tend to use value indexes and searches tend to use word and phrase indexes.

== What specific indexes does MarkLogic have? ==
MarkLogic uses indexes to extract words, values, structures, and links out of documents. An index is an altered view of a document. It enables you to search and query for documents '''as if they contained only:'''
*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>
*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>
*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>
*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>
*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>
*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>
*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.
*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.
*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>
*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>
*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>
*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>
*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>
*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>
*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.

Search and query expressions are fully composable; i.e. they can be nested inside each other and combined using AND, OR, and NOT expressions.
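The composability described above can be illustrated with a minimal sketch. It is not MarkLogic's API: the toy index, the document URIs, and the <code>Term</code>, <code>And</code>, <code>Or</code>, and <code>Not</code> classes are all hypothetical names invented for this example. It only shows how query expressions that each resolve to a set of matching documents can be nested inside one another.

```python
from dataclasses import dataclass

# Toy inverted index (hypothetical data): index key -> set of document URIs.
INDEX = {
    "word:lamb": {"/poems/mary.json", "/poems/lamb.json"},
    "word:little": {"/poems/mary.json"},
    "word:tyger": {"/poems/tyger.json"},
    "element:line": {"/poems/mary.json", "/poems/lamb.json", "/poems/tyger.json"},
}
ALL_DOCS = set().union(*INDEX.values())


@dataclass(frozen=True)
class Term:
    key: str

    def run(self):
        # Resolve a single index lookup to the set of matching documents.
        return set(INDEX.get(self.key, set()))


@dataclass(frozen=True)
class And:
    parts: tuple

    def run(self):
        # Intersect the results of every nested expression.
        result = set(ALL_DOCS)
        for part in self.parts:
            result &= part.run()
        return result


@dataclass(frozen=True)
class Or:
    parts: tuple

    def run(self):
        # Union the results of every nested expression.
        result = set()
        for part in self.parts:
            result |= part.run()
        return result


@dataclass(frozen=True)
class Not:
    inner: object

    def run(self):
        # Everything in the database except what the nested expression matches.
        return ALL_DOCS - self.inner.run()


# Documents with a "line" element that mention "little" or "tyger" but not "lamb".
query = And((
    Term("element:line"),
    Or((Term("word:little"), Term("word:tyger"))),
    Not(Term("word:lamb")),
))
matches = query.run()
```

Because every expression returns a set of documents, any expression can be used wherever another is expected, which is what allows arbitrary nesting.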

== Training Resources for MarkLogic Search ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities.

[[Categories: MarkLogic]]

What is MarkLogic?

2014-11-14T02:06:32Z

Bowersmt: Changed link name

== What is MarkLogic? ==
[http://www.marklogic.com/what-is-marklogic/ MarkLogic] is a [http://www.marklogic.com/what-is-marklogic/inside-marklogic/ '''REST application server''', '''document database''', and '''search engine''']. It is REST through and through. It is built specifically for hypertext documents, links, metadata, URIs, MIME types, and HTTP. It is schema-agnostic because it is automatically aware of the independent structure of each of its JSON and XML documents. It is search-centric because it can search for ''any combination'' of words, values, structures, and links within and across documents. It scales horizontally to hundreds of servers within and between data centers while maintaining ACID-compliant transactions.

MarkLogic has the following [http://www.marklogic.com/what-is-marklogic/enterprise-nosql/ enterprise features]:
* [http://www.marklogic.com/resources/java-developers-guide/ Java APIs]
* [http://en.wikipedia.org/wiki/Node.js Node.js APIs]
* [http://developer.marklogic.com/learn/2009-07-search-api-walkthrough Search] and [http://developer.marklogic.com/learn/arch/search-and-indexing Query]
* [http://www.marklogic.com/blog/can-you-pass-the-acid-test/ ACID Transactions]
* [http://www.marklogic.com/resources/marklogic-high-availability-and-disaster-recovery/resource_download/datasheets/ High Availability] and [http://docs.marklogic.com/guide/database-replication/dbrep_intro#chapter Disaster Recovery]
* [http://www.marklogic.com/resources/marklogic-flexible-replication/resource_download/datasheets/ Replication within and across data centers]
* [http://docs.marklogic.com/guide/admin/security#chapter Government-grade Security]
* [https://docs.marklogic.com/guide/cluster.pdf Scalability] and [https://docs.marklogic.com/guide/admin/database-rebalancing#chapter Elasticity]
* [https://docs.marklogic.com/guide/ec2/managing#chapter On-premise or Cloud Deployment (especially AWS)]
* [http://www.marklogic.com/resources/marklogic-and-hadoop/resource_download/datasheets/ Hadoop for Storage and Compute]
* [http://www.marklogic.com/resources/marklogic-semantics-mlw14/ Semantics]

== What is REST? ==
REST is an architectural style that uses a uniform resource identifier ('''URI''') and a web protocol ('''HTTP/HTTPS''') to request and transfer a representation ('''MIME media type''') of the state of a resource ('''document''') at a point in time from a server to a client.

The term REST was coined and defined by Roy Fielding in his [http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm dissertation, ''Architectural Styles and the Design of Network-based Software Architectures'']. REST standardizes and documents the patterns used in the World Wide Web: self-documenting hypertext with data and metadata (HTML), stateless request/response communication protocol (HTTP/HTTPS), resource locators (URL), multiple representations per URL (MIME types), and downloading code on demand to process resources (JavaScript, CSS, etc.).

REST consists of three main concepts:
* (RE) Representation
* (S) State
* (T) Transfer

== Why is MarkLogic RESTful? ==

=== '''Representation and MarkLogic''' ===
* A representation is a document that represents a resource. It has three requirements: MIME media type, URI, and hypertext.

* '''MIME type:''' Each resource can be represented by one or more types of documents. The MIME media type defines the representation a client requests. A client may request the resource to be represented as JSON, XML, HTML, PNG, PDF, etc.
**'''MarkLogic''' stores any type of document and assigns the appropriate [http://developer.marklogic.com/blog/document-formats-part2 MIME type] to it. It can fully index, query, search, and process JSON, XML, and RDF documents. It meets the REST requirement of being able to transform JSON, XML, and RDF documents into any other MIME type. It knows how to execute JavaScript, XQuery, XSLT, and SPARQL documents. It knows how [http://docs.marklogic.com/guide/cpf/default#chapter to transform into XHTML] the content, formatting, and structure of Microsoft Word, PowerPoint, Excel, textual PDF, DocBook, and CSS documents. (It can also use Microsoft Office to create, edit, and manage content in MarkLogic.) It knows how to [https://docs.marklogic.com/guide/search-dev/binary-document-metadata extract metadata and text from over 138 types of binary documents], such as raster images, vector images, videos, archive files, database files, encoded emails, presentations, spreadsheets, word-processing documents, and text formats.
**Few NoSQL databases use MIME types to identify the media type of each document. Most NoSQL databases support only one type of data and it is usually proprietary: columnar, BSON, binary, etc. Most cannot transform from one MIME type to another.

* '''URI:''' A resource is identified by a globally unique identifier (URI). '''MarkLogic''' identifies each document with a [http://developer.marklogic.com/try/rest/page2 unique URI]. A document in MarkLogic is like a row in a table in a schema in a relational database. A URI is liberating. It provides random access to any resource anywhere. It is like being able to retrieve any row in any table in any schema of a relational database without having to know what table and schema the row is stored in.
**Navigating URL hierarchy is fundamental to REST. MarkLogic understands the hierarchy within a URI, which is represented by slashes "/", such as https://www.lds.org/scriptures/ot/gen/1. MarkLogic treats the items between the slashes as folders. The URI of each document automatically defines a folder hierarchy and places the document in it. In the example URI above, the document for Genesis chapter 1 is located in the Genesis folder, which is located in the Old Testament folder, which is located in the Scriptures folder on lds.org. '''MarkLogic''' indexes the documents in each folder and its subfolders. This makes it fast and easy to retrieve any or all documents in any folder and/or its subfolders.
**Few NoSQL databases use the URI as the primary key for their documents or data. They also don't index the URI hierarchically to filter documents by folder and subfolder.

* '''Hypertext:''' A document should contain data that represents the resource. The data should be '''human readable and self-documenting''', like JSON, XML, RDF, and HTML. It should be '''linked data''' (i.e. the "hyper" in "hypertext"). A document should contain '''metadata links''' about the resource, such as RDF. It should contain '''action links''' to define what further actions can be done with the resource. It should contain '''related links''' to related resources, such as images, audio, video, related documents, etc. Each related link should define what actions can be done with the referenced resource, such as download it, display a link to it, execute a command against it, etc.
**'''Hypertext''' or '''hypermedia''' documents must have all these features. Hyperlinks are what the "hyper" in hypertext and hypermedia refers to. You can't have REST without ''metadata links'' to define what the data means, ''action links'' to know how to work with the resource, and ''related links'' to connect resources. It should all be human readable and self-documenting so a developer does not have to read documentation to know how to interact with a REST web service and its documents.
** '''MarkLogic''' meets all the requirements for hypertext representation. It is designed around MIME types, URIs, and Linked data. It stores documents with their MIME types as [http://www.w3schools.com/json/ JSON], [http://www.w3schools.com/xml/ XML], [http://www.w3schools.com/webservices/ws_rdf_intro.asp RDF], [http://www.w3schools.com/html/ HTML], etc. These documents are human-readable and self-documenting, which MarkLogic leverages to recognize and index each document's data, data structure, metadata links, action links, and related links. This makes it easy to '''search''', '''query''', '''transform''', and '''deliver''' hypertext documents. MarkLogic can also store any type of binary document and deliver it as a related resource, such as an image, video, or audio. MarkLogic is designed to process simple links and RDF links using [http://en.wikipedia.org/wiki/SPARQL SPARQL] and [https://docs.marklogic.com/xinc XInclude]. MarkLogic can represent links in many formats: [https://docs.marklogic.com/xp XPointer], [https://docs.marklogic.com/guide/semantics/loading#id_97709 RDF/XML], [https://docs.marklogic.com/guide/semantics/loading#id_79194 RDF/JSON], [https://docs.marklogic.com/guide/semantics/loading#id_73211 Turtle], [https://docs.marklogic.com/guide/semantics/loading#id_70596 N-Triples], [https://docs.marklogic.com/guide/semantics/loading#id_61596 N-Quads], and [https://docs.marklogic.com/guide/semantics/loading#id_74485 TriG].
**No other NoSQL database natively indexes and fully processes all the document types required for REST hypertext: JSON, XML, RDF, HTML, CSS, and JavaScript.
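The folder hierarchy implied by a URI, as described in the URI bullet above, can be sketched as follows. This is only an illustration of the idea, not MarkLogic's implementation: <code>directories_of</code> and <code>in_directory</code> are hypothetical helper names, and the second and third URIs in the usage lines are made up for the example.

```python
from urllib.parse import urlparse


def directories_of(uri):
    """Return every ancestor folder implied by the slashes in a URI path."""
    segments = [s for s in urlparse(uri).path.split("/") if s]
    # Every prefix except the final segment, which names the document itself.
    return ["/" + "/".join(segments[:i]) + "/" for i in range(1, len(segments))]


def in_directory(uri, directory):
    """True if the document lives in the folder or in any of its subfolders."""
    return urlparse(uri).path.startswith(directory)


folders = directories_of("https://www.lds.org/scriptures/ot/gen/1")

uris = [
    "https://www.lds.org/scriptures/ot/gen/1",
    "https://www.lds.org/scriptures/ot/ex/1",   # hypothetical sibling document
    "https://www.lds.org/scriptures/nt/matt/1",  # hypothetical document elsewhere
]
old_testament = [u for u in uris if in_directory(u, "/scriptures/ot/")]
```

A real database would answer the folder question from an index rather than by scanning every URI; the linear filter here only demonstrates the prefix relationship between a folder and the documents beneath it.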

=== '''State and MarkLogic''' ===
* State in REST exists on the client and in server documents. It does not exist in the communications protocol or in the server as cached session data.

* '''Server:''' All information needed to process a request must be presented in the request and processed against documents in the database. State must only be in the request and in database documents: state cannot be anywhere else, such as in a session cache. The documents in the server define the state of the server. A REST server should explicitly create a state machine that defines the acceptable actions that can be transacted against documents in specific contexts.
**A REST transaction occurs at a '''point in time'''. The state of the data in the request is unchanging, but the state of the documents and state machines in the database is often changing. Since request state and database state are both required to process the request and since shifting state creates unpredictable results, a REST transaction should run at a point in time with unchanging state. Only an ACID-compliant database can ensure consistent state because it isolates each transaction from every other transaction. The only time REST does not need an ACID-compliant database is when database documents do not change, database state machines do not change, or when clients can live with the resultant level of unreliable and unpredictable results.
**'''MarkLogic''' meets all the requirements for REST state. Its web services are stateless: there is no session cache. It is ACID compliant. It ensures each transaction occurs at a point in time and is isolated from all other transactions. This ensures consistent processing during a transaction -- even across billions of documents. MarkLogic is an [http://en.wikipedia.org/wiki/Multiversion_concurrency_control MVCC] database which provides transaction isolation without slowing the performance of reads -- even when documents being read are being modified simultaneously by other transactions. (Also, like any other ACID database, when multiple updates and deletes compete for the same documents, they will impact each other's performance because change has to be serialized.)
** Most other NoSQL databases are not ACID compliant. They are only suitable for REST services when their documents or data do not change or when the rate of change is slow enough or dispersed enough that it creates an acceptable level of unreliability and unpredictability.

* '''Client:''' A REST client, such as a web browser or spider, locally maintains transactional state, such as what to do in response to documents and result codes that are returned from server transactions. An application exists only in the client -- not in the server (although, a server may deliver application code to a client, such as when a web server downloads HTML, CSS, and JavaScript to a browser). Client application code decides when to execute web service calls and it ties the results together to accomplish its purpose. This allows multiple authorized applications to reuse web services for a variety of purposes.
**The server helps the client know what web service calls are available by providing action links with each response. Action links are contextual and the context is based on the application account, user account, database documents, links within and between documents, the server state machine, etc. Through action links, the server can inform a client application what web service calls are permissible in any given context.
**'''MarkLogic''' meets the needs of client applications through its built-in ability to process and send action links to clients based on context. MarkLogic supports RDF triples and SPARQL, which enables context to be defined across applications, users, document state, links between documents, server state machine, etc. MarkLogic processes triples very quickly, which enables context to scale to billions of documents, millions of users, etc.
**'''MarkLogic''' also uses application and user permissions to filter which documents are returned to clients. MarkLogic does this automatically and transparently by adding security filtering constraints into every search and query. This ensures no account can access unauthorized documents. This is fast because all security permissions are built into MarkLogic's indexes -- which allows document-level security to scale across billions of documents.
** Most other NoSQL databases do not provide government-grade, document-level security and they also do not support RDF triples and SPARQL.
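The point-in-time reads described under '''Server''' above can be illustrated with a toy multiversion store. This is a generic sketch of the MVCC idea and makes no claim about how MarkLogic implements it internally; <code>VersionedStore</code> and its methods are hypothetical names.

```python
class VersionedStore:
    """Toy multiversion store: writes append versions, reads pick a timestamp."""

    def __init__(self):
        self._versions = {}  # uri -> list of (timestamp, content), oldest first
        self._clock = 0

    def write(self, uri, content):
        # Each write creates a new version instead of overwriting the old one.
        self._clock += 1
        self._versions.setdefault(uri, []).append((self._clock, content))
        return self._clock

    def snapshot(self):
        # The timestamp a reader will use for every read in its transaction.
        return self._clock

    def read(self, uri, at):
        # Return the newest version that existed at or before the timestamp.
        for timestamp, content in reversed(self._versions.get(uri, [])):
            if timestamp <= at:
                return content
        return None


store = VersionedStore()
store.write("/docs/poem.json", "first draft")
reader_view = store.snapshot()                  # the reader's point in time
store.write("/docs/poem.json", "second draft")  # a concurrent update

seen_by_reader = store.read("/docs/poem.json", reader_view)
seen_by_new_reader = store.read("/docs/poem.json", store.snapshot())
```

The reader that took its snapshot before the update keeps seeing the earlier version without waiting on the writer, which is the behavior the text attributes to an MVCC database.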

Because MarkLogic can provide both the web service and database in one server, it is easy to use the state of the documents and the

=== '''Transfer and MarkLogic''' ===
* '''Transfer''' in REST is a communication protocol that enables a client to send a hypertext request to a server and receive back a hypertext response. The transfer must be stateless and be a request/response protocol. It must have human readable, self-descriptive headers. The header must contain metadata about the request, such as the requested resource URI, the MIME type of the resource, and the action to perform on the resource (such as GET, PUT, POST, PATCH, and DELETE). [http://www.w3schools.com/tags/ref_httpmethods.asp '''HTTP'''] (Hypertext Transfer Protocol) and '''HTTPS''' (secure HTTP) are designed specifically for REST (that is why they have "hypertext" in their name).

* All '''MarkLogic''' communication is through HTTP and HTTPS REST services (except for its SQL JDBC feature). This includes all internal cluster communication. MarkLogic provides out-of-the-box REST interfaces for manipulating resources (insert, update, delete, query, search, transform, etc.) and administering MarkLogic (REST app servers, databases, indexes, clusters, etc.). MarkLogic makes it very easy to create custom REST services because everything in MarkLogic is built around REST and because it provides simple and powerful application server APIs.
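The self-descriptive request described above can be sketched with Python's standard library. The request is only built and inspected, never sent. The host, port, and path are placeholders chosen for this example and are not taken from MarkLogic's documentation; <code>build_request</code> is a hypothetical helper.

```python
from urllib.request import Request


def build_request(uri, method, media_type, body=None):
    """Build a request whose line and headers carry all the request metadata."""
    headers = {"Accept": media_type}  # the representation the client wants back
    if body is not None:
        headers["Content-Type"] = media_type  # the representation being sent
    return Request(uri, data=body, method=method, headers=headers)


# Placeholder URI, assumed only for illustration.
uri = "http://example.org/documents/poems/mary.json"

get_request = build_request(uri, "GET", "application/json")
put_request = build_request(uri, "PUT", "application/json",
                            body=b'{"line": "Mary had a little lamb"}')
```

Everything the server needs is in the request itself: the resource URI, the action (the HTTP method), and the MIME type. No session state is carried between the two requests.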

== MarkLogic Query and Search ==
Relational databases (and a few NoSQL databases) fall back to a full table scan, retrieving every row in the table, unless a query accesses less than 20% of the rows, in which case they use an index. MarkLogic does not do full table scans. It needs all queries to be resolved out of the indexes. For that reason, it indexes most things and allows you to index more.

MarkLogic can query and search -- and use both in combination. MarkLogic uses indexes to execute searches and queries. The same indexes are used for both. The difference between search and query is how documents are sorted. Searches sort by relevance and queries sort by values. MarkLogic uses indexes to extract words, values, structures, and links out of documents.

This enables you to search and query for documents as if they contained only:
*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>
*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>
*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>
*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>
*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>
*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>
*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.
*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.
*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>
*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>
*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>
*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>
*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>
*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>
*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.
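A few of the index views listed above can be reproduced with a short sketch. It assumes whitespace tokenization and two-word phrases, matching the examples in the list; the function names are hypothetical and the real indexes are considerably more sophisticated.

```python
def words(text):
    """The "words" view: a flat list of words without structure."""
    return text.split()


def phrases(text):
    """The "phrases" view: each pair of adjacent words."""
    tokens = words(text)
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def element_words(document):
    """The "element-words" view: each string-valued element with its words."""
    return {name: words(value)
            for name, value in document.items()
            if isinstance(value, str)}


# A sample document using the element names from the list above.
document = {"line": "Mary had a little lamb", "lineSequence": 3}

word_view = words(document["line"])
phrase_view = phrases(document["line"])
element_word_view = element_words(document)
```

Each function is a different "altered view" of the same document, which is why one document can match a word query, a phrase query, and an element-scoped query at the same time.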

The [[MarkLogic Query and Search]] page provides more detail about how MarkLogic's unique indexes work and how they work together.

[[Categories: MarkLogic]]

What is MarkLogic?

2014-11-14T01:29:28Z

Bowersmt: Creation

== What is MarkLogic? ==
[http://www.marklogic.com/what-is-marklogic/ MarkLogic] is a [http://www.marklogic.com/what-is-marklogic/inside-marklogic/ '''REST application server''', '''document database''', and '''search engine''']. It is REST through and through. It is built specifically for hypertext documents, links, metadata, URIs, MIME types, and HTTP. It is schema-agnostic because it is automatically aware of the independent structure of each of its JSON and XML documents. It is search-centric because it can search for ''any combination'' of words, values, structures, and links within and across documents. It scales horizontally to hundreds of servers within and between data centers while maintaining ACID-compliant transactions.

MarkLogic has the following [http://www.marklogic.com/what-is-marklogic/enterprise-nosql/ enterprise features]:
* [http://www.marklogic.com/resources/java-developers-guide/ Java APIs]
* [http://en.wikipedia.org/wiki/Node.js Node.js APIs]
* [http://developer.marklogic.com/learn/2009-07-search-api-walkthrough Search] and [http://developer.marklogic.com/learn/arch/search-and-indexing Query]
* [http://www.marklogic.com/blog/can-you-pass-the-acid-test/ ACID Transactions]
* [http://www.marklogic.com/resources/marklogic-high-availability-and-disaster-recovery/resource_download/datasheets/ High Availability] and [http://docs.marklogic.com/guide/database-replication/dbrep_intro#chapter Disaster Recovery]
* [http://www.marklogic.com/resources/marklogic-flexible-replication/resource_download/datasheets/ Replication within and across data centers]
* [http://docs.marklogic.com/guide/admin/security#chapter Government-grade Security]
* [https://docs.marklogic.com/guide/cluster.pdf Scalability] and [https://docs.marklogic.com/guide/admin/database-rebalancing#chapter Elasticity]
* [https://docs.marklogic.com/guide/ec2/managing#chapter On-premise or Cloud Deployment (especially AWS)]
* [http://www.marklogic.com/resources/marklogic-and-hadoop/resource_download/datasheets/ Hadoop for Storage and Compute]
* [http://www.marklogic.com/resources/marklogic-semantics-mlw14/ Semantics]

== What is REST? ==
REST is an architectural style that uses a uniform resource identifier ('''URI''') and a web protocol ('''HTTP/HTTPS''') to request and transfer a representation ('''MIME media type''') of the state of a resource ('''document''') at a point in time from a server to a client.

The term REST was coined and defined by Roy Fielding in his [http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm dissertation, ''Architectural Styles and the Design of Network-based Software Architectures'']. REST standardizes and documents the patterns used in the World Wide Web: self-documenting hypertext with data and metadata (HTML), stateless request/response communication protocol (HTTP/HTTPS), resource locators (URL), multiple representations per URL (MIME types), and downloading code on demand to process resources (JavaScript, CSS, etc.).

REST consists of three main concepts:
* (RE) Representation
* (S) State
* (T) Transfer

== Why is MarkLogic RESTful? ==

=== '''Representation and MarkLogic''' ===
* A representation is a document that represents a resource. It has three requirements: MIME media type, URI, and hypertext.

* '''MIME type:''' Each resource can be represented by one or more types of documents. The MIME media type defines the representation a client requests. A client may request the resource to be represented as JSON, XML, HTML, PNG, PDF, etc.
**'''MarkLogic''' stores any type of document and assigns the appropriate [http://developer.marklogic.com/blog/document-formats-part2 MIME type] to it. It can fully index, query, search, and process JSON, XML, and RDF documents. It meets the REST requirement of being able to transform JSON, XML, and RDF documents into any other MIME type. It knows how to execute JavaScript, XQuery, XSLT, and SPARQL documents. It knows how [http://docs.marklogic.com/guide/cpf/default#chapter to transform into XHTML] the content, formatting, and structure of Microsoft Word, PowerPoint, Excel, textual PDF, DocBook, and CSS documents. (It can also use Microsoft Office to create, edit, and manage content in MarkLogic.) It knows how to [https://docs.marklogic.com/guide/search-dev/binary-document-metadata extract metadata and text from over 138 types of binary documents], such as raster images, vector images, videos, archive files, database files, encoded emails, presentations, spreadsheets, word-processing documents, and text formats.
**Few NoSQL databases use MIME types to identify the media type of each document. Most NoSQL databases support only one type of data and it is usually proprietary: columnar, BSON, binary, etc. Most cannot transform from one MIME type to another.

* '''URI:''' A resource is identified by a globally unique identifier (URI). '''MarkLogic''' identifies each document with a [http://developer.marklogic.com/try/rest/page2 unique URI]. A document in MarkLogic is like a row in a table in a schema in a relational database. A URI is liberating. It provides random access to any resource anywhere. It is like being able to retrieve any row in any table in any schema of a relational database without having to know what table and schema the row is stored in.
**Navigating URL hierarchy is fundamental to REST. MarkLogic understands the hierarchy within a URI, which is represented by slashes "/", such as https://www.lds.org/scriptures/ot/gen/1. MarkLogic treats the items between the slashes as folders. The URI of each document automatically defines a folder hierarchy and places the document in it. In the example URI above, the document for Genesis chapter 1 is located in the Genesis folder, which is located in the Old Testament folder, which is located in the Scriptures folder on lds.org. '''MarkLogic''' indexes the documents in each folder and its subfolders. This makes it fast and easy to retrieve any or all documents in any folder and/or its subfolders.
**Few NoSQL databases use the URI as the primary key for their documents or data. They also don't index the URI hierarchically to filter documents by folder and subfolder.

* '''Hypertext:''' A document should contain data that represents the resource. The data should be '''human readable and self-documenting''', like JSON, XML, RDF, and HTML. It should be '''linked data''' (i.e. the "hyper" in "hypertext"). A document should contain '''metadata links''' about the resource, such as RDF. It should contain '''action links''' to define what further actions can be done with the resource. It should contain '''related links''' to related resources, such as images, audio, video, related documents, etc. Each related link should define what actions can be done with the referenced resource, such as download it, display a link to it, execute a command against it, etc.
**'''Hypertext''' or '''hypermedia''' documents must have all these features. Hyperlinks are what the "hyper" in hypertext and hypermedia refers to. You can't have REST without ''metadata links'' to define what the data means, ''action links'' to know how to work with the resource, and ''related links'' to connect resources. It should all be human readable and self-documenting so a developer does not have to read documentation to know how to interact with a REST web service and its documents.
** '''MarkLogic''' meets all the requirements for hypertext representation. It is designed around MIME types, URIs, and Linked data. It stores documents with their MIME types as [http://www.w3schools.com/json/ JSON], [http://www.w3schools.com/xml/ XML], [http://www.w3schools.com/webservices/ws_rdf_intro.asp RDF], [http://www.w3schools.com/html/ HTML], etc. These documents are human-readable and self-documenting, which MarkLogic leverages to recognize and index each document's data, data structure, metadata links, action links, and related links. This makes it easy to '''search''', '''query''', '''transform''', and '''deliver''' hypertext documents. MarkLogic can also store any type of binary document and deliver it as a related resource, such as an image, video, or audio. MarkLogic is designed to process simple links and RDF links using [http://en.wikipedia.org/wiki/SPARQL SPARQL] and [https://docs.marklogic.com/xinc XInclude]. MarkLogic can represent links in many formats: [https://docs.marklogic.com/xp XPointer], [https://docs.marklogic.com/guide/semantics/loading#id_97709 RDF/XML], [https://docs.marklogic.com/guide/semantics/loading#id_79194 RDF/JSON], [https://docs.marklogic.com/guide/semantics/loading#id_73211 Turtle], [https://docs.marklogic.com/guide/semantics/loading#id_70596 N-Triples], [https://docs.marklogic.com/guide/semantics/loading#id_61596 N-Quads], and [https://docs.marklogic.com/guide/semantics/loading#id_74485 TriG].
**No other NoSQL database natively indexes and fully processes all the document types required for REST hypertext: JSON, XML, RDF, HTML, CSS, and JavaScript.

=== '''State and MarkLogic''' ===
* State in REST exists on the client and in server documents. It does not exist in the communications protocol or in the server as cached session data.

* '''Server:''' All information needed to process a request must be presented in the request and processed against documents in the database. State must only be in the request and in database documents: state cannot be anywhere else, such as in a session cache. The documents in the server define the state of the server. A REST server should explicitly create a state machine that defines the acceptable actions that can be transacted against documents in specific contexts.
**A REST transaction occurs at a '''point in time'''. The state of the data in the request is unchanging, but the state of the documents and state machines in the database is often changing. Since request state and database state are both required to process the request and since shifting state creates unpredictable results, a REST transaction should run at a point in time with unchanging state. Only an ACID-compliant database can ensure consistent state because it isolates each transaction from every other transaction. The only time REST does not need an ACID-compliant database is when database documents do not change, database state machines do not change, or when clients can live with the resultant level of unreliable and unpredictable results.
**'''MarkLogic''' meets all the requirements for REST state. Its web services are stateless: there is no session cache. It is ACID compliant. It ensures each transaction occurs at a point in time and is isolated from all other transactions. This ensures consistent processing during a transaction -- even across billions of documents. MarkLogic is an [http://en.wikipedia.org/wiki/Multiversion_concurrency_control MVCC] database which provides transaction isolation without slowing the performance of reads -- even when documents being read are being modified simultaneously by other transactions. (Also, like any other ACID database, when multiple updates and deletes compete for the same documents, they will impact each other's performance because change has to be serialized.)
** Most other NoSQL databases are not ACID compliant. They are only suitable for REST services when their documents or data do not change or when the rate of change is slow enough or dispersed enough that it creates an acceptable level of unreliability and unpredictability.

* '''Client:''' A REST client, such as a web browser or spider, locally maintains transactional state, such as what to do in response to documents and result codes that are returned from server transactions. An application exists only in the client -- not in the server (although, a server may deliver application code to a client, such as when a web server downloads HTML, CSS, and JavaScript to a browser). Client application code decides when to execute web service calls and it ties the results together to accomplish its purpose. This allows multiple authorized applications to reuse web services for a variety of purposes.
**The server helps the client know what web service calls are available by providing action links with each response. Action links are contextual and the context is based on the application account, user account, database documents, links within and between documents, the server state machine, etc. Through action links, the server can inform a client application what web service calls are permissible in any given context.
**'''MarkLogic''' meets the needs of client applications through its built-in ability to process and send action links to clients based on context. MarkLogic supports RDF triples and SPARQL, which enables context to be defined across applications, users, document state, links between documents, server state machine, etc. MarkLogic processes triples very quickly, which enables context to scale to billions of documents, millions of users, etc.
**'''MarkLogic''' also uses application and user permissions to filter which documents are returned to clients. MarkLogic does this automatically and transparently by adding security filtering constraints to every search and query. This ensures no account can access unauthorized documents. This is fast because all security permissions are built into MarkLogic's indexes -- which allows document-level security to scale across billions of documents.
** Most other NoSQL databases do not provide government-grade, document-level security and they also do not support RDF triples and SPARQL.
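
As a rough illustration of contextual action links, the sketch below computes which calls a server might advertise for a document, given the document's state and the caller's roles. It is plain Python, not a MarkLogic API: the function <code>action_links</code>, the role names, the states, and the link format are all assumptions made for the example.

```python
def action_links(document, user_roles):
    """Return the web service calls permitted for this document and user."""
    uri = document["uri"]
    # Every caller may at least re-fetch the resource.
    links = [{"rel": "self", "method": "GET", "href": uri}]
    # Drafts can be changed or submitted, but only by editors.
    if document["state"] == "draft" and "editor" in user_roles:
        links.append({"rel": "update", "method": "PUT", "href": uri})
        links.append({"rel": "submit", "method": "POST", "href": uri + "/submit"})
    # Submitted documents can be approved, but only by approvers.
    if document["state"] == "submitted" and "approver" in user_roles:
        links.append({"rel": "approve", "method": "POST", "href": uri + "/approve"})
    return links


draft = {"uri": "/articles/42", "state": "draft"}
editor_rels = [link["rel"] for link in action_links(draft, {"editor"})]
reader_rels = [link["rel"] for link in action_links(draft, {"reader"})]
```

The client never hard-codes which calls are legal; it simply follows whichever links the response contains.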

Because MarkLogic can provide both the web service and database in one server, it is easy to use the state of the documents and the

=== '''Transfer and MarkLogic''' ===
* '''Transfer''' in REST is a communication protocol that enables a client to send a hypertext request to a server and receive a hypertext response. The transfer must be a stateless request/response protocol with human-readable, self-descriptive headers. The headers must carry metadata about the request, such as the URI of the requested resource, the MIME type of the resource, and the action to perform on the resource (such as GET, PUT, POST, PATCH, or DELETE). [http://www.w3schools.com/tags/ref_httpmethods.asp '''HTTP'''] (Hypertext Transfer Protocol) and '''HTTPS''' (secure HTTP) are designed specifically for this kind of hypertext transfer (that is why they have "hypertext" in their name).

* All '''MarkLogic''' communication is through HTTP and HTTPS REST services (except for its SQL JDBC feature). This includes all internal cluster communication. MarkLogic provides out-of-the-box REST interfaces for manipulating resources (insert, update, delete, query, search, transform, etc.) and for administering MarkLogic (REST app servers, databases, indexes, clusters, etc.). It is easy to create custom REST services because everything in MarkLogic is built around REST and because it provides simple and powerful application server APIs.
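
To make the self-descriptive request concrete, the sketch below builds (but does not send) an HTTP request that carries the resource URI, the MIME type, and the action. The host, port, and document URI are assumptions, and the <code>/v1/documents</code> path with a <code>uri</code> parameter reflects my understanding of MarkLogic's REST Client API; check it against the documentation for your version before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:8003"  # assumed host and port of a REST app server


def build_put(doc_uri, body):
    """Build a PUT request that stores a JSON document at doc_uri."""
    query = urlencode({"uri": doc_uri})          # resource URI travels in the URL
    return Request(
        url=f"{BASE}/v1/documents?{query}",      # assumed endpoint path
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # MIME type of the resource
        method="PUT",                            # action to perform
    )


req = build_put("/poems/mary.json", {"line": "Mary had a little lamb"})
# urllib normalizes header names, so the stored key is "Content-type".
summary = (req.get_method(), req.full_url, req.get_header("Content-type"))
```

Everything the server needs to process the call is in the request itself, which is what allows the service to stay stateless.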

== MarkLogic Query and Search ==
Relational databases (and a few NoSQL databases) fall back to a full table scan, retrieving every row in the table, unless a query accesses less than roughly 20% of the rows, in which case they use an index. MarkLogic does not do full table scans. It needs every query to be resolved out of its indexes. For that reason, it indexes most things and allows you to index more.

MarkLogic can query and search -- and use both in combination. MarkLogic uses indexes to execute searches and queries. The same indexes are used for both. The difference between search and query is how documents are sorted. Searches sort by relevance and queries sort by values. MarkLogic uses indexes to extract words, values, structures, and links out of documents.

This enables you to search and query for documents as if they contained only:
*'''Words:''' a flat set of words without structure, such as <code>["Mary", "had", "a", "little", "lamb"]</code>
*'''Phrases:''' a flat set of phrases without structure, such as <code>["Mary had", "had a", "a little", "little lamb"]</code>
*'''Elements:''' a flat set of elements without values, such as <code>["poem", "text", "line"]</code>
*'''Element-values:''' a flat set of elements with string values, such as <code>"line": "Mary had a little lamb"</code>
*'''Element-words:''' a flat set of elements with words, such as <code>"line": ["Mary", "had", "a", "little", "lamb"]</code>
*'''Element-phrases:''' a flat set of elements with phrases, such as <code>"line": ["Mary had", "had a", "a little", "little lamb"]</code>
*'''Element-value lexicon:''' a flat set of elements with typed values, such as <code>"lineSequence": 3</code> or a list, count, or co-occurrence of element values scoped to a database, set of documents, or one document.
*'''XPath-structure-value lexicon:''' a flat set of hierarchical elements with typed values, such as <code>poem.text.line.lineSequence: 3</code> or a list, count, or co-occurrence of hierarchical element values scoped to a database, set of documents, or one document.
*'''XPath-structure:''' hierarchical structures without values, such as <code>poem.text.line</code>
*'''XPath-structure-words:''' hierarchical structures with words, such as <code>poem.text.line: ["Mary", "had", "a", "little", "lamb"]</code>
*'''XPath-structure-phrases:''' hierarchical structures with phrases, such as <code>poem.text.line: ["Mary had", "had a", "a little", "little lamb"]</code>
*'''XPath-structure-values:''' hierarchical structures with values, such as <code>poem.text.line: "Mary had a little lamb"</code>
*'''RDF document links''' to other documents, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/thatDoc"}</code>
*'''RDF abstract links''' to other abstract concepts, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/relatesSomehowTo", "object": "http://example.org/documents/someConceptWithNoDocumentAtTheURI"}</code>
*'''RDF data links''' to data, such as <code>{"subject": "http://example.org/documents/thisDoc", "predicate": "http://example.org/predicates/ageInYears", "object": 12}</code>

MarkLogic can combine any of these indexes in any way to find matching documents. It can sort the result by relevance or by value.
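
The idea of treating documents as flat sets of words, phrases, and element values, and then combining those views, can be sketched with a few toy inverted indexes. This is an illustration of the concept only, not how MarkLogic stores its indexes; the two sample documents and all index names are made up for the example.

```python
from collections import defaultdict

docs = {
    "/poems/mary.json": {"line": "Mary had a little lamb"},
    "/poems/lamb.json": {"line": "The lamb was sure to go", "title": "Mary"},
}

word_index = defaultdict(set)           # word -> URIs
phrase_index = defaultdict(set)         # two-word phrase -> URIs
element_word_index = defaultdict(set)   # (element, word) -> URIs
element_value_index = defaultdict(set)  # (element, exact value) -> URIs

for uri, doc in docs.items():
    for element, value in doc.items():
        words = value.lower().split()
        element_value_index[(element, value)].add(uri)
        for word in words:
            word_index[word].add(uri)
            element_word_index[(element, word)].add(uri)
        for first, second in zip(words, words[1:]):
            phrase_index[f"{first} {second}"].add(uri)

# "mary" anywhere in a document versus "mary" inside a "line" element.
anywhere = word_index["mary"]
in_line = element_word_index[("line", "mary")]
# Combining indexes is a set intersection: the word AND the phrase.
combined = word_index["lamb"] & phrase_index["little lamb"]
```

Each lookup touches only an index, never the documents themselves, which is why no scan of the full document set is needed.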

The [[MarkLogic Search]] page provides more detail about how MarkLogic's unique indexes work and how they work together.

[[Category:MarkLogic]]

MarkLogic

2014-11-14T00:02:40Z

Bowersmt: Turn this page into a landing page.

Category:MarkLogic

2014-11-13T17:45:54Z

Bowersmt: Created

[[MarkLogic]] is the main landing page for MarkLogic.

MarkLogic Training Resources

2014-11-13T17:44:37Z

Bowersmt: Changed category

== MarkLogic Training Resources ==
* [http://developer.marklogic.com/blog/grokking-the-cts-api "Grokking the cts API"] is a great overview of MarkLogic's search and query capabilities, which are its most important features. Everything you do in MarkLogic should be focused around how MarkLogic indexes, searches, and queries documents.

* [http://developer.marklogic.com/try/ninja/index Try MarkLogic] is the best way to play with MarkLogic's search and query features. Without installing anything, you can run queries and searches against an existing database in the cloud.

* [http://developer.marklogic.com/learn MarkLogic Tutorials] guide you step-by-step through each major MarkLogic feature. Some tutorials take as little as five minutes, and others take up to an hour or so.

* [https://mlu.marklogic.com/registration/ Live Training] is available from MarkLogic at no cost. A live instructor will work with you to show you how to build MarkLogic applications, administer MarkLogic, use Semantics (triples), and apply MarkLogic fundamentals (like search and queries).

[[Category:MarkLogic]]

MarkLogic Training Resources

2014-11-13T17:43:46Z

Bowersmt: Created MarkLogic Training Resources

MarkLogic

2014-11-13T17:42:47Z

Bowersmt: Turned this page into the main landing page for MarkLogic

== What is MarkLogic? ==

== Links to other MarkLogic Documents ==
[[Installing MarkLogic]]
[[MarkLogic Training Resources]]

[[Category:MarkLogic]]

Installing MarkLogic

2014-11-13T17:38:19Z

Bowersmt: Created new Installing MarkLogic page

== Installing MarkLogic on Windows ==

(These instructions are for Windows XP, Vista, and 7. Installation and server start-up may differ slightly for other operating systems.)

# [http://developer.marklogic.com/products Download] the latest version of the MarkLogic Server for your operating system.
#* Upon clicking on the download link, you will need to agree to MarkLogic's Terms of Use.
# Execute the MarkLogic Server installer.
#* Choose a "Typical" setup. This will take about 5 minutes.
# After installation is complete, start the MarkLogic Server.
#* (Windows Only) Go to Start > All Programs > MarkLogic Server > Start MarkLogic Server
#** Important! Right-click on "Start MarkLogic Server" and select "Run as administrator". Otherwise, the server may not start.
#** If you get the error message "The application failed to initialize properly...", then see [http://developer.marklogic.com/products/marklogic-server/requirements MarkLogic Server 4.x System Requirements] to download and install the necessary DLL for your operating system. Then try starting the server again.
#** If start-up succeeds, you will not see any message.
# Go to the MarkLogic administration console:
#* (Windows Only) Go to Start > All Programs > MarkLogic Server > Admin MarkLogic Server.
#* A browser window will open and you will be prompted to enter a license key.
#** If you do not yet have a license, do one of the following:
#*** (Employees Only) Request a license key by emailing [mailto:DL-ICS-MARKLOGIC DL-ICS-MARKLOGIC].
#**** Enter the licensee and the license key, then click "OK". The MarkLogic Server will restart.
#*** (Non-employees Only) You may request a free license for community development by clicking the "Free" button and then entering the required information in the supplied form.
#**** '''Licensee'''. This is '''your''' name.
#**** '''Company'''. ''Do NOT put the name of the Church or any of its business entities in this field.'' You may put "Home" or "Community" in this field if you will be using the MarkLogic Server for personal or community development.
#**** '''Email'''. This is your personal email address.
#**** Choose "Select Community License".
#**** Make a copy of your license information and then click "OK". The MarkLogic Server will restart.
#* Accept the license agreement by scrolling to the bottom of the page and clicking "Accept". The MarkLogic Server will restart.
#* You will now be prompted to install the initial databases and application servers. Click "OK" to continue. When installation is complete, the server will restart.
#* When prompted to enter an Admin username and password, enter "admin" as the user and "admin" as the password. Confirm the password and click "OK" to continue.
#* When prompted, enter the user name and password you supplied. You will then be redirected to the MarkLogic Server administration console.
# Configure an XDBC server
#* Expand "Groups", then expand "Default", and then select "App Servers".
#*; [[File:marklogic-admin-step1.png]] 
#* Click the tab "Create XDBC"
#** Supply an XDBC server name.
#*** For the Stack Petstore project, enter "stack-petstore".
#** Supply a value for the module directory root.
#*** For the Stack Petstore project, enter "/".
#** Supply a value for the port the XDBC server will listen to for connections.
#*** For the Stack Petstore project, enter "8010".
#** For the modules field, choose "Modules" from the list of options.
#** Leave all other fields as they are, and click "OK" at the top of the page.
#*; [[File:marklogic-admin-step2.png]] 
# To try out your MarkLogic installation, go to [http://localhost:8000 <nowiki>http://localhost:8000</nowiki>] and log in with your admin username.
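
If you prefer to confirm from a script that the server is listening, a check along the following lines may help. It is a minimal sketch using only the Python standard library; the function name <code>server_is_up</code> and the five-second timeout are arbitrary choices, and the <code>opener</code> parameter exists only so the check can be exercised without a running server.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def server_is_up(url="http://localhost:8000", opener=urlopen):
    """Return True if something answers HTTP at `url`, False otherwise."""
    try:
        opener(url, timeout=5)
        return True
    except HTTPError:
        # An HTTP error status (for example 401 Unauthorized) still means
        # a server answered; it is simply asking for credentials.
        return True
    except (URLError, OSError):
        # Connection refused, timeout, or name resolution failure.
        return False
```

Calling <code>server_is_up()</code> with no arguments checks the default address used in the last step above; a <code>False</code> result suggests the server has not started.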

MarkLogic

2014-11-13T17:37:05Z

Bowersmt: Moving Installing MarkLogic to its own page


[[Installing MarkLogic]]


[[Category:LDS Java Stack]]

MarkLogic

2014-11-13T17:33:37Z

Bowersmt: Added MarkLogic training information

{{Under construction|This page is a work in progress.}}


[[Category:LDS Java Stack]]