Getting started with Open XML, ODF, IDML, and other zipped XML documents in MarkLogic

Microsoft Office 2007, Open Office, and Adobe’s InDesign CS4 all have something in common.  These applications all save their documents as .zip archives of XML files.  As an example, take any .docx, .odt, or .idml file and change the file extension to .zip. Extract the contents and within you’ll find multiple interrelated XML documents.  There are probably other document formats that follow this model as well, but these are the 3 I see questions for on a regular basis. When working with documents of this type, the first thing many people want to do is work with the XML within the zip packages.  Today’s post will demonstrate tools provided by MarkLogic Server for working with these document types, as well as some methods for managing and using the extracted pieces.

To play along at home, install MarkLogic Server, and setup CQ.  The following examples work with the default Documents database and default Docs HTTP Server.  I’ll be using Word 2007 documents, as that’s kinda my thing, but you can use the same code, tools, and methods with the other file formats as well. Just replace the .docx in the samples below with a .odt or .idml file, etc., and everything will just auto-magically work. Ultimately they’re all just zip files of XML and associated resources.  Let’s do this!

The Office OpenXML Extract pipeline

It doesn’t get any easier than this.  If you’re working with Office 2007 documents, you can quickly configure the Server to automatically unzip these files when you save them to your database.  The XML within will be saved in a directory named for the original file, maintaining the naming and directory structure they had within the .zip.  To make this happen, we just need to install content processing and attach the Office OpenXML Extract pipeline.

Install Content Processing

In the Admin UI, navigate to:

Databases -> Documents -> Content Processing

We see a message informing us that Content Processing is not installed. That’s ok.  Click the ‘Install’ tab.

Next we’re presented with an option to ‘enable conversion’.  If you’re running the Community Edition, leave this as false.  Click ‘install’.

Finally, we’re presented with the message ‘Content Processing will be installed for the database test without conversion.’ Click ‘ok’.

Note: There are conversion utilities available that require a separate license.  If you are running a version of the Server other than Community, feel free to install these utilities by setting the enable conversion option to true.  For those on the Community Edition, installing content processing allows us to build our own conversion utilities, as well as take advantage of other available pipelines that don’t require the separate license. w00t!

Now that we’ve installed Content Processing, we just need to attach the Office OpenXML Extract pipeline.

Attach a Content Processing Pipeline

In the Admin UI, navigate to:

Databases -> Documents -> Content Processing -> Domains -> Default Documents -> Pipelines

Make sure that the pipelines ‘Status Change Handling’ and ‘Office OpenXML Extract’ are checked.  Then click ‘ok’.

You’ll now see those pipelines are attached.


Let’s take advantage our freshly configured pipeline. Open CQ and evaluate the following to insert a document.  I’m inserting C:\foo.docx.  The default domain for CPF is the root directory “/”, so remember to prefix the name of your file with “/” in the uri option so the document will be processed.

     xquery version "1.0-ml";
                          <options xmlns="xdmp:document-load">

To validate the pipeline ran for the document, evaluate the following in CQ:

     xquery version "1.0-ml";

Notice in the results returned that A) there are CPF properties on the .docx informing us the XML has been extracted, and B) we are looking at the properties for an extracted Part.  The document.xml is in a sibling directory of the original foo.docx named /foo_docx_parts/.  We now have the original zip package and all its extracted pieces available to us in the Server.  With Content Processing installed and our pipeline configured, anytime we save an Office document to the Server in the future they will be automatically unzipped and their parts will be saved similarly for us in a _parts directory.


Note: The .docx and the parts directory are linked.  Delete the .docx, and the related _parts folder and its extracted pieces will be deleted as well.

So,  What are pipelines?  What’s Content Processing? What was all that configuration we did?

CPF in a Nutshell ( the Content Processing Framework )

MarkLogic Server includes a framework for processing content that we refer to lovingly as CPF.  CPF stands for the Content Processing Framework.  The gist is this: a document has a lifecycle that starts with creation and advances as users/applications update and modify the document.  CPF provides a way to take action on documents based on where they are in their lifecyle.

A pipeline is an XML document that describes a set of content processing steps. It defines the steps that occur during the processing of documents and defines actions that occur at each step.  These actions can be found in supporting XQuery functions and modules. CPF was built for you to create your own content processing applications, with your own content processing code, and following your own logical and business processes.

If you’d like to know more, you can check out the  Content Processing Framework guide, I’ve also provided a quick intro here as well.

Note: the quick intro linked here was written for MarkLogic Server 3.2. For 4.*.* some minor updates are required.  I’ll revisit and update that post in the future, but if you’re interested, the guide should provide enough info to help you modify the example successfully.

But what if we’re not working with Office 2007 documents?  Or we don’t want to use CPF?  Can we still unzip these documents, extract the individual XML files and insert them into our MarkLogic Database?  Yes.

xdmp:zip utilities

When working with .zip files, you’ll want to take a look at the functions xdmp:zip-create(), xdmp:zip-get(), and xdmp:zip-manifest().

I have a Word document, sampleManuscript.docx, that I’ve saved in the directory C:\test.  I can take a look at the names of the files inside the .docx by evaluating the following in CQ:

     xquery version "1.0-ml";

MarkLogic provides utilities for working with files on the filesystem, but let’s load our document into the Server.

Note: For the following examples, we want to insure CPF does not process our documents for us. So navigate to pipelines, as we did in our configuration steps above.  Uncheck the Office OpenXML Extract pipeline, and click ‘ok’.  This will detach the pipeline so it will not action the example document below on load.

     xquery version "1.0-ml";
                          <options xmlns="xdmp:document-load">

The above returns the empty sequence. You can validate that your document inserted properly by clicking ‘explore’ in CQ.  Or by evaluating the following:

     xquery version "1.0-ml";

At a minimum, you’ll see a last-modified metadata timestamp for the document.  If you’ve enabled content processing for the database as we did above, you will see other cpf:* properties.  Assuming no pipelines are attached, the .docx will be in cpf:state initial and no action has been taken to extract its XML parts.

If we want to access a file within the .docx, assuming we know the name of the piece we want  in the .zip,  we can get it using xdmp:zip-get().  The following returns the XML for the document.xml file located within our .docx package.

     xquery version "1.0-ml";

Instead of having to know the names of individual pieces within a .zip package and/or having to extract individual files each time we want to access a piece of XML, let’s just unzip and extract the pieces from our.docx and insert them into a directory, similar to how CPF did for us.

     xquery version "1.0-ml";
     declare namespace zip="xdmp:zip";

     let $doc := "/myManuscript/sampleManuscript.docx"
     let $directory-uri := "/myManuscript/sampleManuscript_docx_parts/"
     let $zipfile := fn:doc($doc)
     let $manifest := xdmp:zip-manifest($zipfile)
     for $part-name in $manifest/zip:part
         let $options := if ($part-name = "/_rels/.rels") then
                             <options xmlns="xdmp:zip-get">
                            <options xmlns="xdmp:zip-get"/>
         let $part := xdmp:zip-get($zipfile, $part-name, $options)
         let $part-uri := fn:concat($directory-uri, $part-name)
         return xdmp:document-insert($part-uri, $part)

We just loop through the manifest, extract the pieces, and insert into a directory. Here we’re explicitly telling the server to treat .rels as XML. With a little modification and refinement, we can easily take the above and make it into a re-usable module. We could even use this as a starting point for creating our own CPF pipeline action.

Zip it!


Now that we’ve extracted XML documents from zip files and saved them to the Server, lets zip ’em back up.

When we extracted the pieces for sampleManuscript.docx above, we saved them to a directory named   /sampleManuscript_docx_parts/ and we specified this as a subdirectory of the folder /myManuscript/.

When we insert a document to MarkLogic, we specify the name of the file we are saving with a uri parameter (xs:string).  If this string has “/”s in it, the strings in between the slashes will be treated as directories. So “/” marks the path for our content’s location on the Server.  By default, MarkLogic is configured for automatic directory creation. This is done as many people like using directories to manage their content, and it can be very helpful when loading documents using webDAV.  It’s also very helpful for managing unzipped XML packages, as we want to retain the original structure of the document, so when we zip the pieces back up the document will open successfully it in its respective application.

We saved our extracted documents pieces to directories, but we don’t know what pieces were inserted, or what subdirectories were created.  If we enable the uri lexicon for our database, we can evaluate a query using cts:uris that will provide us the uris for all the extracted pieces.  We can then use this to create a manifest and zip the pieces back up.

Validate Directory Creation is Automatic

In Admin UI, navigate to:

Databases -> Documents

Scroll down until you see the property ‘directory creation‘.  The dropdown selection is set to ‘automatic‘.

For more info on directories, refer to the Admin’s Guide, and Application Developer’s Guide.

Enable URI Lexicons

Scroll up, and a few properties up from ‘directory creation’ you’ll find ‘uri lexicon‘.  Set the enabled option to ‘true’ and click ‘ok’.

Now head back to CQ, and evaluate the following:

     xquery version "1.0-ml";
     let $directory := "/myManuscript/sampleManuscript_docx_parts/"
     return cts:uris("","document",cts:directory-query($directory,"infinity"))

The results are a list of all the XML documents we extracted from sampleManuscript.docx.  We can zip them back up and save locally by evaluating the following:

     xquery version "1.0-ml";
     let $directory := "/myManuscript/sampleManuscript_docx_parts/"
     let $uris := cts:uris("","document",cts:directory-query($directory,"infinity"))

     let $parts := (for $i in $uris let $x := fn:doc($i) return $x)

     let $manifest := <parts xmlns="xdmp:zip">
                       for $i in $uris
                       let $file := fn:substring-after($i,$directory)
                       let $part :=  <part>{$file}</part>
                       return $part

     let $pkg := xdmp:zip-create($manifest, ($parts))
     return xdmp:save("C:\test.docx",$pkg)

Double-click on the saved file to open it in Word.  I’ve named it test.docx to demonstrate we can name it anything we want too. More likely we would’ve used the document’s original name, sampleManuscript.docx.

Most of the time though, we’ll want users to be able to dynamically generate their documents on the fly.  Place the following code in a module named opendocx.xqy.  Place it under the /Docs directory for the Server, which is found in the directory where MarkLogic is installed. On Windows the default is C:\Program Files\MarkLogic\Docs.

     xquery version "1.0-ml";
     let $directory := "/myManuscript/sampleManuscript_docx_parts/"
     let $uris := cts:uris("","document",cts:directory-query($directory,"infinity"))

     let $parts := (for $i in $uris let $x := fn:doc($i) return $x)

     let $manifest := <parts xmlns="xdmp:zip">
                       for $i in $uris
                       let $file := fn:substring-after($i,$directory)
                       let $part :=  <part>{$file}</part>
                       return $part

     let $filename := "test.docx"
     let $pkg := xdmp:zip-create($manifest, ($parts))

     let $disposition := concat("attachment; filename=""",$filename,"""")
     let $x := xdmp:add-response-header("Content-Disposition", $disposition)
     let $x:= xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
     return $pkg

In a browser, navigate to the url http://localhost:8000/opendocx.xqy.

Assuming Office 2007 is installed on your machine, the document opens right up into its respective application.


We can just open the .docx as well. Save the following under /Docs as opendocx2.xqy.  Update the url in your browser, and the file will again open from the Server into Word.

     xquery version "1.0-ml";
     let $docname := "/myManuscript/sampleManuscript.docx"
     let $pkg := fn:doc($docname)
     let $filename := "test.docx"
     let $disposition := concat("attachment; filename=""",$filename,"""")
     let $x := xdmp:add-response-header("Content-Disposition", $disposition)
     let $x:= xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
     return $pkg

And there you have it!  To the Server and back again!  Kind of like the Hobbit, but you didn’t need Gandalf or a bunch of dwarves to help you make the journey.

That should be more than enough to get you started, and to continue with something super awesome, just ponder this: when you have a solid understanding of the document format you’re working with, you can generate  Word, Excel, PowerPoint, OpenOffice, InDesign4, and any other document type on the Server, and you don’t even need the original application to start with! We can generate these documents dynamically, on-the-fly, and serve ’em up to our users who use these applications, but to us on the Server, it’s just a set of XML and related parts.  Cheers!

Load A Custom UI for Office 2007 from MarkLogic Server

We all know that with Office 2007 the traditional Office file navigation menus and toolbars were replaced with The Ribbon.  The Ribbon a.k.a. RibbonX a.k.a. The Office Fluent Ribbon a.k.a. The Fluent UI a.k.a. …

The naming has evolved over time, but basically the Ribbon is the interface, now called the Fluent UI.  RibbonX is the original name given to the XML that  you can use to customize the Ribbon/Fluent UI to add your own tabs and controls.

Many examples demonstrate how to load a custom Ribbon UI from within a Document-Level customization or an Application-Level Add-in.  These solutions load the XML for the Ribbon from within either the Document itself or the deployed Add-in solution respectively.  But what if we want to keep the XML for our custom Ribbon interface someplace else? How about someplace where we can access and modify it so we can change the user experience in Office without requiring our users to re-install an Add-in or use a new Document template?

It is possible to load a customized Ribbon from an external resource.  Today’s post demonstrates how to load a custom Ribbon for an Application-Level Add-in from Ribbon XML that’s been saved in MarkLogic Server.

Save the Ribbon XML in MarkLogic

To keep it brief, we’ll just assume we’re using the default Documents database and the Docs HTTP Server in MarkLogic.

Open CQ in your browser and evaluate the following:

xquery version "1.0-ml";
<customUI xmlns="">
       <tab id="MyTab" label="My Tab">
         <group id="MyGroup" label="My Group">
           <button id="Button1"
                   label="My Button"
                   screentip="My Button Screentip"
                   imageMso="HappyFace" />

We can manipulate the Ribbon interface with a minimal amount of XML.  The tabs, groups, and controls are actually defined using XML elements. This example creates a new tab,  which includes a new control, a button.  The attributes of controls define callbacks.  Callbacks are the functions that will be called when a user interacts with the controls on our Ribbon. For our button @onAction defines the callback which refers to a function named “OnClick”.

We now need a module that the Add-in can call to fetch the XML for the Ribbon.  Create a file called ribbon.xqy and place it under /Docs.  The contents of the file is simply

xquery version "1.0-ml";

We use xdmp:quote() to return the XML as a string as that’s what our Add-in will parse to create the customization in Office for us.

Create an Add-in for Word

First, create an Add-in project for Word:

  1. Start Visual Studio 2008
  2. From the File menu, select New -> Project
  3. In Project Types, select Office 2007
  4. In Templates, select Word 2007 Add-in
  5. Name your Project, I’m calling mine MyRibbonAddin
  6. click OK

Now that we have our project, we need to add a Ribbon to it.

  1. In the Solution Explorer on the right, right-click the MyRibbonAddin project and select Add -> New Item
  2. In Templates, select Ribbon(XML)
  3. Click Add

This adds a  Ribbon1.cs and Ribbon1.xml to our project.

You might’ve noticed there was a Ribbon(Designer) available in the templates as well. Visual Studio 2008 comes with a Ribbon Designer that allows you to drag and drop components to your Ribbon. You can then use this Ribbon in your deployed Add-in.  You can also export the Ribbon from the Designer to XML, which you may find useful.  We selected Ribbon(XML) as we just want to use the simple Ribbon1.cs that is generated for us.

Now that we’ve started our Ribbon customization, we just need to load the Ribbon from MarkLogic, define our callback so our button will do something, and then enable the Ribbon within our Add-in.

Open Ribbon1.cs and go to the GetCustomUI function.  By default we see its loading the Ribbon1.xml it generated.  Change the function to look like the following:

 public string GetCustomUI(string ribbonID)
   string url = "http://localhost:8000/ribbon.xqy";
   System.Net.WebClient Client = new System.Net.WebClient();
   Client.Credentials = new System.Net.NetworkCredential("user", "password");
   string ribbon = Client.DownloadString(url);

   return ribbon;
   //return GetResourceText("MyRibbonAddin.Ribbon1.xml");

You’ll want to update the network credentials to be the ones you defined when you installed MarkLogic.

In the Ribbon Callbacks #region, define the OnClick function:

public void OnClick(Office.IRibbonControl ribbonControl)

Finally, there’s a TODO section that’s been commented out at the top of Ribbon1.cs.  To enable the Ribbon in the Add-in, copy the following function from the TODO section, paste and uncomment it in ThisAddin.cs, just above the ThisAddin_Startup function:

protected override Microsoft.Office.Core.IRibbonExtensibility CreateRibbonExtensibilityObject()
    return new Ribbon1();

Hit F5 on your keyboard to test the customization in debug mode.  Once Word starts, navigate to “My Tab” and click the Smiley Face “My Button”. Our MessageBox tells us we’ve clicked the button.


Boom! That’s it.  Pretty cool.  We can alter the Office interface by feeding it Ribbon XML at startup. And we can query the XML from an external resource like MarkLogic Server.

Next Steps

The example provided here is quite simple.  But you now have the basic steps required to create your own customization for the Office UI.  There’s a wealth of information on the Ribbon/Fluent UI online.  The Visual Studio 2008 Ribbon designer can help to familiarize you with the basic RibbonX elements as well.

Now that you have a custom UI, you can improve the callback functions to interact with the document a user is currently authoring.  You can also use Ribbon controls to send and receive information from MarkLogic.  In this way you can reuse content stored in MarkLogic in a document as well as save selected text back to MarkLogic for future reuse.

It is possible to dynamically generate a Ribbon customization from a query as well. You may want to do this depending on the content your author is accessing, or based on their approved level of access for that information, or both. Just modify the ribbon.xqy we created earlier in the post.

As an example, in the past I’ve found it useful to dynamically generate dropdown menus based on Word document metadata searches to provide users related reusable content chunks to insert into their documents.


Final Ribbon Notes

The Ribbon customization loads once, whenever you start the Office application.  You can’t change the layout or controls at runtime. You can however, update values within those controls.   So in the search dropdown example above, it is possible to update the contents of the dropdown based on some event, but you can’t change a dropdown to a checkbox while the Office application is running. Capiche?

The good news is that if you’ve generalized your callbacks you can update the ribbon.xqy and any other .xqy files on the server, and the next time a user restarts their Office application, they can have new functionality available to them.  This could be useful so you don’t have to constantly recompile your Add-in and/or send your users a new .msi to install.

How did we get the Smiley Face? There’s a list of icons you can download which provides you ids for several of the icons available throughout Office.  There’s also a list of  control ids available for download. This second list shows you all the built in controls available to you for use in Office.

Which brings us to: Yes, you can change the layout of the Ribbon provided in Office. You can override the functionality of existing controls, and you can remove the existing Ribbon entirely and provide your own, forcing users to only use the controls you provide.  But think of your users and what makes sense.  We want to help them, not hinder them.  You may find the  Office 2007 UI Style guide useful.

Finally, the MarkLogic Toolkit for Word and MarkLogic Toolkit for Excel are both free Application-Level Add-ins and  they are open source.  If you’re interested, these solutions may jump start your development, as well as provide you with other insights into how to make your customizations more useful.  They both provide multiple functions for interacting with documents while users are authoring them.  The Ribbon can be useful, but real estate is limited.  The Toolkit Add-ins make extensive use of a Custom Task Pane to enable authors to integrate information in MarkLogic and Office 2007.  Sometimes the Ribbon, Task Pane combo can be very useful.