Generating Excel Charts with MarkLogic

GIVE MY CREATION LIFE! And some fun with Formulas too…

This is another interesting one I see regularly: “How can I generate Excel Charts from MarkLogic Server?”

Charts are actually rendered from DrawingML found in the .xlsx package.  The DrawingML is embedded in SpreadsheetML, which is the Open XML format for Excel 2007/2010.

You don’t want to mess with DrawingML, as it’s a nasty frickin riddle, wrapped in an enigma, inside a Russian-doll-style matrix of insanity and pain.

Word, Excel and PowerPoint are producers and consumers of XML.  To some extent and to varying degrees, each of their respective XML formats can be understood and worked with in a relatively straightforward and reasonable way.  Sometimes though, the XML generated by these applications is really just a serialization of their object model and you’ll just waste a ton of time and find yourself in an extremely uncomfortable place (ed-like the back of a Volkswagen?) trying to figure the XML out when you don’t have to.  So let’s leave the DrawingML be. Capisci?

Think about it this way:  A chart in a workbook is tied to certain cell values in a worksheet.  When the cell values update, the chart dynamically updates.  At the end of the day, the DrawingML is just a snapshot of what the chart looked like based on the cell values when the Workbook was saved in Excel. (ed-Pivot tables work similarly in this way, but that’s a post for another day.)

Now let’s say we have a workbook containing a chart.  We know we can save our .xlsx to MarkLogic Server and have it automatically unzipped for us, its component XML parts made immediately available for search and re-use.   We can then update our extracted worksheets in the Server using XQuery.  Finally, we can re-zip the extracted workbook components back up and open the updated .xlsx into Excel.  Excel will automatically refresh its chart for us when it consumes the XML so we see the latest visualization of our chart based on the information we added to the worksheets.

5 Steps to Chart Freedom

Step 1

Create your chart in a workbook and drive it off of some cell values.  Note the cells and the name of the worksheet you’re driving your chart from. (example: Sheet1, cells: B2, B3, B4, etc.) I’ve provided a sample .xlsx here.

On Sheet1 we see download counts for a fictional company’s widgets for the month of September. The chart shows downloads for the widgets Foo, Bar, and Wumpus.  The chart columns correspond to cells B2, B3, and B4.

On Sheet2 we see sales counts for a fictional company’s widgets for each salesperson. The chart shows total sales for each salesperson.  The chart sections correspond to cells E2, E3, and E4.  Look closer and you’ll see that the cell values in column E driving the chart are actually the result of formulas; they are SUMs of all widgets for each salesperson row.  Note that the cells B6, C6, D6, and E6 all contain SUM formulas for their respective columns as well.

Step 2

Enable the Office OpenXML Extract and Status Change Handling CPF pipelines for your MarkLogic database so the .xlsx will automatically be unzipped when ingested into MarkLogic and its component parts made available for update.  Also ensure you have the URI Lexicon enabled for your database. An example of how to set this up can be found here.

Step 3

Save your .xlsx to MarkLogic. Once saved, the .xlsx is unzipped, and we can now manipulate its extracted XML component parts directly.  The idea is to save workbooks containing your charts as templates within MarkLogic and then update the extracted worksheet parts based on new information being saved to your database.

Step 4

Use the XQuery API that comes with the MarkLogic Toolkit for Excel to set the cell values for your chart in the extracted worksheet.  In particular, look at the function excel:set-cells() for updating worksheets.  Evaluate the following in CQ.

Note: you may need to update the code samples below to reflect your workbook and where you saved it in MarkLogic.

xquery version "1.0-ml";

import module namespace excel="http://marklogic.com/openxml/excel" at "/MarkLogic/openxml/spreadsheet-ml-support.xqy";

(: URIs of the extracted worksheet parts; adjust to match your workbook :)
let $doc1 := "/MySpreadsheet_xlsx_parts/xl/worksheets/sheet1.xml"
let $doc2 := "/MySpreadsheet_xlsx_parts/xl/worksheets/sheet2.xml"
let $sheet1 := fn:doc($doc1)/node()
let $sheet2 := fn:doc($doc2)/node()

(: Sheet1: plain values driving the first chart :)
let $cell1 := excel:cell("B2",120)
let $cell2 := excel:cell("B3",99)
let $cell3 := excel:cell("B4",456)

(: Sheet2: a new value for D3, and E3 re-created with its formula and no cached value :)
let $cell4 := excel:cell("D3",127)
let $cell5 := excel:cell("E3",(),"SUM(B3:D3)")

return (xdmp:document-insert($doc1, excel:set-cells($sheet1,($cell1, $cell2, $cell3))),
        xdmp:document-insert($doc2, excel:set-cells($sheet2,($cell4, $cell5))))

In the code above, for Sheet1, we see that we use the excel:cell() constructor to create cells for B2, B3, and B4.  We set the values for these cells to new numbers. These numbers could be coming from the results of another query.  We update the worksheet, using excel:set-cells(), passing the function the sheet we want to update, as well as a sequence of cells we’d like added and/or updated on the referenced sheet.  Finally, we xdmp:document-insert() our updated document, overwriting the existing one with our updated worksheet.  Remember, Sheet1 just held the simple chart driven directly from the cell values.
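Since those numbers could come from another query, here is a minimal sketch of building the cells from a sequence rather than one by one. The counts are placeholders and the worksheet URI is assumed to match the sample workbook. It also assumes the Toolkit's module namespace is http://marklogic.com/openxml/excel, and it uses only excel:cell() and excel:set-cells() in the same way as the listing above.

```xquery
xquery version "1.0-ml";

import module namespace excel="http://marklogic.com/openxml/excel" at "/MarkLogic/openxml/spreadsheet-ml-support.xqy";

(: Assumed URI, matching the sample workbook used in this post :)
let $uri := "/MySpreadsheet_xlsx_parts/xl/worksheets/sheet1.xml"
let $sheet := fn:doc($uri)/node()

(: Placeholder numbers; in practice these would come from your own query :)
let $counts := (120, 99, 456)

(: The first count goes to B2, the second to B3, and so on down column B :)
let $cells :=
  for $count at $row in $counts
  return excel:cell(fn:concat("B", $row + 1), $count)

return xdmp:document-insert($uri, excel:set-cells($sheet, $cells))
```

The positional variable keeps each count lined up with its row, so adding a fourth widget only means adding a fourth number to the sequence (and extending the chart's range in the template workbook).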

With Sheet2, we again use excel:cell() to create cells for D3 and E3. Sheet2 is more interesting as the chart here is driven from cells that contain formulas. For E3, we create a cell using excel:cell(), setting the value of the cell to the empty sequence, () , and passing in the formula for the cell.  Again we excel:set-cells() to update our worksheet and xdmp:document-insert() to save our updated worksheet back to the Server.

Note on excel:cell(): This function creates a new cell, so if you wish to retain an existing formula for a cell before you update it in a worksheet, you can’t use the 2 argument excel:cell() function.  If you did that, you’d lose the formula for the cell in the worksheet when you overwrite the XML.  You must create the cell with the formula, as we did above for E3.  If this doesn’t work for you, you can always roll your own XQuery to update the cell values for worksheets containing formulas in a different way.

Note on Excel formulas: Unlike charts, cells containing formulas will not calculate and refresh automatically when you open the updated worksheet in Excel if those cells already contain values. The value of the cell within the XML for the worksheet is considered the cached value by Excel and will be displayed when the workbook is opened.  This is done for performance reasons, so formula heavy worksheets don’t take forever to open as they calculate the value for every cell containing a formula when a workbook is opened.  Formula calculation is postponed to avoid wait time when opening a workbook.  As a result of this though, you can create XML for a worksheet that when consumed by Excel, will result in a cell displaying the wrong results given its formula.

To get a formula to calculate the value for a cell when you open a workbook in Excel and ensure the correct cell value is displayed, you need to set the value of the cell to nothing.  You can do this using excel:cell(), setting the value of the cell to the empty sequence: ().

For more information on the excel:* functions,  check out the XQuery API docs that come with the Toolkit for Excel.  There are a lot of functions available, all documented and with examples of usage.

Step 5

Zip up the updated .xlsx from its extracted component parts and open it in Excel.  When you do this, it doesn’t matter what the DrawingML is.  Excel reads the cell values when it consumes the XML and will update the chart automatically.  The next time you save the workbook, the DrawingML is updated to reflect what the chart looks like based on the latest cell values. Evaluate the following in CQ.

xquery version "1.0-ml";

let $directory := "/MySpreadsheet_xlsx_parts/"
(: every extracted part under the directory; this lookup is why the URI Lexicon must be enabled :)
let $uris := cts:uris("","document",cts:directory-query($directory,"infinity"))
let $parts := for $i in $uris let $x := fn:doc($i) return $x

(: the manifest lists each part's path within the package, in the same order as $parts :)
let $manifest := <parts xmlns="xdmp:zip">
                 {
                   for $i in $uris
                   let $dir := fn:substring-after($i,$directory)
                   let $part := <part>{$dir}</part>
                   return $part
                 }
                 </parts>

let $xlsx := xdmp:zip-create($manifest, $parts)
return xdmp:save("C:\MyUpdatedSheet.xlsx",$xlsx)

Open MyUpdatedSheet.xlsx into Excel.

BooYaa!  We update a few cells on Sheet1, and our chart automatically updates for us when we open the .xlsx into Excel.

Now take a look at Sheet2.  We updated D3 and set the value of E3 to (). Subsequently, the formula in E3 calculated its SUM formula when the workbook was opened.  Since the chart is driven from E2, E3, and E4, it updated properly as well.  WooHaa!

But take a closer look at cells D6 and E6.  They each contain SUM formulas for their columns, and they’re displaying the wrong values!  (ed-#fail) This is because we didn’t set their values to nothing.  Since the cells contained values in the XML for the worksheet, the cell formulas were not calculated when the workbook was opened and the cached value was displayed.  If you click on each of those cells, you’ll see the formula, click off of them, and the cells will recalculate and update with the correct values.
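One way to avoid stale values like these is to strip the cached value from every formula cell before zipping the workbook back up. What follows is only a sketch, not something the Toolkit provides. It assumes the worksheet lives at the URI shown, and it relies on the SpreadsheetML layout where a cell is a c element holding its formula in f and its cached value in v.

```xquery
xquery version "1.0-ml";

declare namespace s = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

(: Assumed URI, matching the sample workbook used in this post :)
let $uri := "/MySpreadsheet_xlsx_parts/xl/worksheets/sheet2.xml"

(: Every cached value <v> that belongs to a cell <c> carrying a formula <f> :)
for $cached in fn:doc($uri)//s:c[s:f]/s:v
return xdmp:node-delete($cached)
```

Run it as its own request, after the worksheet update from Step 4 has committed, so the two updates to the same document don’t collide. With the cached values gone, cells like D6 and E6 should calculate when the workbook is opened, the same way E3 did.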

Bring Excel Workbooks to Life!

So the title was a bit misleading, as we don’t actually generate charts so much as create the appropriate XML for Excel worksheets, so that the Excel application will update and render charts for us when it consumes the XML.  But once you understand a little bit of the SpreadsheetML format and how Excel behaves when consuming XML for charts and formulas, the doors open up to some very interesting possibilities.

The above examples are intentionally simple, but think about this…

Instead of a dead .xlsx that sits lifeless on the filesystem and is only alive when opened and manipulated directly in Excel, workbooks can now stay alive in the Server, constantly updated from complex queries being evaluated as new information is saved to the database.  These workbooks can then be dynamically zipped up when called upon to open in Excel and provide snapshot visualizations and results for a point in time.  Excel will consume the XML and can update charts and calculate formulas when opening this snapshot workbook, while the underlying, extracted workbook lives on and continues to be updated.

But this is just one way to use MarkLogic and Excel together.  There’s always another way…


MarkLogic Toolkits for Word, Excel, and PowerPoint

I thought I’d use this post to provide a brief introduction to the MarkLogic Toolkits for Office.  So here’s an overview:

What is a Toolkit?

A Toolkit is a set of tools for jumpstarting your development with MarkLogic Server and Microsoft Office 2007 / Office 2010 / Open XML.

There are currently 3 Office Toolkits:

  1. MarkLogic Toolkit for Word
  2. MarkLogic Toolkit for Excel
  3. MarkLogic Toolkit for PowerPoint

We care about Word, Excel, and PowerPoint, because with Office 2007, their respective document formats are now XML.  Take a .docx, .xlsx, or .pptx and change its file extension to .zip.  Extract the file and inside you’ll find a bunch of interrelated XML parts.
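You can take the same peek inside a package without leaving the Server. This is a minimal sketch, assuming a .docx has been saved to MarkLogic as a binary document at the hypothetical URI /MyDocument.docx. It uses xdmp:zip-manifest() to list the parts in the package.

```xquery
xquery version "1.0-ml";

declare namespace zip = "xdmp:zip";

(: Hypothetical URI; point this at any .docx, .xlsx, or .pptx stored as a binary :)
let $package := fn:doc("/MyDocument.docx")/binary()

(: The manifest holds one <part> element per file inside the package :)
for $part in xdmp:zip-manifest($package)/zip:part
return fn:string($part)
```

For a Word document you’d expect to see paths such as word/document.xml in the results, alongside the relationship and content type parts that tie the package together.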

This update to the document formats provides an interesting opportunity as people can now work with XML without learning new, specialized tools, or even really being aware of the fact that they’re working with XML.  Authors continue to use the tools they know and are familiar with in Office, and we can provide additional functionality to them by taking advantage of the XML.  The Toolkits provide ways for us to enhance the authoring experience within Office as well as on the Server where we can prepare content for Office as well as additional consumers.

Each Toolkit is composed of 3 major components:

  1. Add-in for Word | Excel | PowerPoint
  2. XQuery API
  3. Sample Applications

Add-in with supporting JavaScript API

The Add-in is just a standard Windows application you install using a .msi.  Double-click the .msi to start installation, click next, next, next, through the dialog screens as you would with any Windows app, and the next time you start Office you’ll find a Task Pane on the right hand side of the application (see image below).

NOTE: The Task Pane is just a browser!  It’s using whatever version of IE is installed on the client, and exposing that within Office.

The Add-in may just be a browser, but it also installs a supporting library for interacting with the active document (the document being authored).  Access to this library from the browser is available through the JavaScript API that comes with the Add-in.  Developers can quickly create a webapp within Word, Excel, or PowerPoint that communicates with, or is even served from, MarkLogic Server, and they can use the JavaScript APIs to get XML in and out of the document being authored.

We wanted to avoid creating a situation where users had to constantly re-install Add-ins on the client.  By making it a browser, we can update functionality by simply changing the application code on the Server.

JSDocs are provided with each Toolkit for the respective JavaScript API.

XQuery API

The XQuery APIs for Word, Excel, and PowerPoint provide functions for developers to manipulate and generate Office documents on the Server.

The goal is to simplify the use of Open XML.  Along with each XQuery API,   a CPF pipeline for MarkLogic is also provided that will automatically update Office documents on the Server as they are ingested to make all the XML more friendly for search and reuse.  These updates are done without using any custom XML and without losing any document fidelity.

NOTE: The Add-in and XQuery API can work in concert or separately.  If your authors use Office, it might make sense to use both. If you’re querying Office documents (or some other XML format) on the Server, and delivering results through a regular browser or some other consumer, you might not need Office on the client at all.  If you’re delivering an Office document as a result, you may still require the Office application on the client, but you don’t necessarily require the Add-in.  It all depends on your use cases and particular goals.

XQuery API docs are provided with each TK as well.

Sample Applications

Rather than just give developers an empty browser with a .js file and JavaScript API documentation to start development with, each TK comes with Sample Applications.  A developer can just drop the Sample right into MarkLogic, configure their Add-in to reference the URL of the Server, and quickly be up and running with applications within the Task Pane.

These Samples are VERY simple.  They provide just a sliver of the available API functionality.  Again, they’re intended to jumpstart development.  We provide these samples so a developer can quickly see some useful functionality, open the source to see how the code looks, and get in there and start hacking to create the app they actually want.  Developers can reference the API docs and add/change functionality as they require.  When you look at the docs you’ll see that there is a LOT that can be done on the client and in the Server that isn’t demonstrated in the Samples at all.

NOTE: The Sample Application is not the Toolkit.  It’s just one example of the type of application you can build using a TK.

NOTE: The Sample Application is not Office.  We skinned the samples to be the colors of Office.  But it’s just HTML, JavaScript, and CSS.  Remember, the Pane is just a browser serving up pages from MarkLogic.  The goal is to keep authors comfortable in their authoring environment, letting MarkLogic do what it does best (search, reuse, enrich, analyze, etc.) and letting Office do what it does best (author, analyze, present).  If you want to use crazy colors and the blink tag for your app, go for it!

A Toolkit Guide rounds out the documentation with details on creating, configuring, and delivering solutions that use the Toolkits.


Office is ubiquitous.  The goal of the Toolkits is to keep authors authoring and analysts analyzing in the tools they are already using and comfortable with.  Office is a publisher and consumer of XML, and MarkLogic is an XML Server.  The products complement each other very nicely and we can create a much richer Office experience for authors without requiring them to learn new, custom tools, or even be aware of the fact that behind the scenes, it’s all XML.


The Toolkits are all free and now available on CodePlex.  They are all open source, released under the Apache 2 license.

The response to the TKs has been very positive.  I’ve seen an increase in interest lately, and it’s been great to hear people are using these and finding them very useful.  I was surprised to hear at #MLUC10 how one person has deployed all 3 TKs across his organization and is very excited about the possibilities.  He also told me that multiple authors are enjoying the Sample apps in PowerPoint as-is.  Very cool!

I just have to say, I get excited too!  Each Office application has a different degree of XML friendliness and Word is by far the friendliest.  With the Toolkit for Word we can use Word as a browser into MarkLogic.  At work I send content back and forth between Word and MarkLogic and never have to save a local .docx on the client.  It’s just XML going back and forth.  Office consumes and publishes Open XML.  Using the XQuery API, I can dynamically create the XML Word requires for consumption from alternative XML formats. Also, Word publishes WordprocessingML, but my destination XML format isn’t necessarily always Office docs.  It’s pretty awesome, and that’s just Word!  Similar opportunities exist for Excel and PowerPoint as well.

Spoiler Alert: there’s more awesome coming!

So that’s it. You’re now Toolkit experts.  Go download the Toolkits and have fun creating your own MarkLogic applications for Office.  If you have any questions, comments, or suggestions for the TKs, please feel free to drop me a line in the comments. Thanks!

XQuery and MarkLogic Developer Blogs

Inspired by this list of Computational Linguistic Blogs, I thought I’d aggregate and share the blogs I follow that provide useful information on working with XQuery and/or MarkLogic Server.

The blogs noted here are alive and active.   I like these as they’re written by people who are actually writing code, building things, and solving problems.  They provide code examples, as well as practical insight.   I give a brief synopsis of the activity, and if the author also provides information on working with MarkLogic, I note that as well.  Finally, if the author is active on twitter, you can click their name to follow.

The following are all awesome and listed in no particular order.


  • alex bleasdale / Developer Notes
  • Self-described as notes on XML and Web Development, he’s been on a MarkLogic and XQuery tear lately.  Lots of useful information here.

  • norman walsh / blog
  • Norm is a shotgun blast of all things X: XML, XProc, XQuery, XSLT and more.  He’s now an Engineer at MarkLogic too.  Posts frequently, and it’s always good.

  • matt turner / Discovering XQuery
  • Irregular posting, but when published, they’re always good.  The blog archive provides many useful tutorials for working with XQuery as well as MarkLogic Server.

  • kit wallace / blog
  • A person who sees solutions written in other languages, asks himself “how would I do that in XQuery”, figures it out, and shares the code.  Useful, fun, and frequent posts on XQuery, XSLT, XProc and more.

  • jeni tennison / Musings
  • Excellent posts on XQuery, XML, RDF, and Linked Data published on a regular basis.

  • mattio valentino / Rendition Protocol
  • Posts on MarkLogic, XQuery, and more.  The blog is a collection of his development notes and is full of useful snippets and insight.

Alright, that’s it for now.  If you know of others who share XQuery code, experience and/or information on working with MarkLogic Server on a regular basis, I’d really like to know about them.  Please let me know who they are so I can check them out and maybe add them to the list.


Fun with XQuery, Images encoded as base64 Strings, and Word 2007

or: There and Back Again, A JPEG’s Tale.

This is a fun one that comes up every once in a while.  When you save a Word 2007 document as .xml, Word serializes images as base 64 strings.  It turns out that organizations regularly save Word documents as .xml and they want the ability to view these images in a browser or some other application so they can decide how they’d like to re-use them.  So the first question that comes up is: How can I transform the base 64 string back into an image?

If you want to play along at home, copy the image of Bilbo here and save it in a Word 2007 document: In the Ribbon, select the ‘Insert’ tab, from the ‘Illustrations’ group, choose ‘Picture’, then select your pic and insert it.  Next: Go to the Office Button, click ‘Save As’, select ‘Other Formats’, and for the ‘Save as format’ choose ‘Word XML Document (*.xml)’.

Don’t choose the 2003 XML, cause that’s something else.  It’s similar, but different (cause it’s not the same).

So now, open that bad boy in vi, Visual Studio, or some other editor and take a peek.  I want to take this opportunity to introduce you to the  Flat OPC format. When you save a Word doc as a .docx, you end up with a .zip file that contains all these interrelated .xml files and their associated assets (such as images).  When you save as .xml, you end up with all those same XML parts serialized in a single .xml file, with images serialized as base 64 strings.  This .xml format is known affectionately in Redmond as Flat OPC.  Once you understand just a little about this format, the amount of Word document @$$ you can kick in MarkLogic Server and/or using an Add-in in Word is awesome.

SPOILER ALERT: An upcoming post is going to dive into how we can exploit the Flat OPC format for document re-use.

The main body of content for a Word 2007 document can be found in the document.xml part.  With images, you’ll find a reference to the image in document.xml, but the image will be stored separately as its own part in the document package; as a binary in the .docx, but serialized as base 64 when saved as .xml.  Knowing this, let’s convert the string to an image.
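To see how the parts sit side by side in the single Flat OPC file, here is a small sketch that lists each part’s name and content type. It assumes the document has been saved to MarkLogic at /BilboBaggins.xml, the URI used in the queries in this post, and that pkg:package is its root element.

```xquery
xquery version "1.0-ml";

declare namespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

(: Each pkg:part is one file from the .docx package, flattened into this document :)
for $part in fn:doc("/BilboBaggins.xml")/pkg:package/pkg:part
return fn:concat($part/@pkg:name, " (", $part/@pkg:contentType, ")")
```

XML parts such as /word/document.xml carry their content in pkg:xmlData, while the image part carries its base 64 string in pkg:binaryData.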

Throw BilboBaggins.xml into MarkLogic Server. (I use WebDAV).  To view the base 64 string, evaluate the following in CQ:

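xquery version "1.0-ml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

(: the second top-level node is pkg:package here; the first is the mso-application processing instruction :)
let $doc := fn:doc("/BilboBaggins.xml")/node()[2]
let $image-string :=  $doc/pkg:part[@pkg:name="/word/media/image1.jpeg"]/pkg:binaryData/node()
return  $image-string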

Yep, it’s that ugly.  Luckily viewing the image is as simple as:

xquery version "1.0-ml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

let $doc := fn:doc("/BilboBaggins.xml")/node()[2]
let $image-string :=  $doc/pkg:part[@pkg:name="/word/media/image1.jpeg"]/pkg:binaryData/node()
(: decode the base 64 text and wrap the bytes in a binary node :)
return  binary{xs:hexBinary(xs:base64Binary($image-string))}

Now, what about the reverse?  What if we have images in the Server that we want to serialize as base 64 strings?

Take the image you copied at the beginning and save it to MarkLogic.  We can convert it to a base 64 string by evaluating the following in CQ:

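xquery version "1.0-ml";
declare namespace ooxml="http://marklogic.com/openxml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

(: base 64 string to binary node, as in the previous query :)
declare function ooxml:base64-string-to-binary(
  $string as xs:string
) as binary()
{
  binary{xs:hexBinary(xs:base64Binary($string))}
};

(: binary node to base 64 string :)
declare function ooxml:binary-to-base64-string(
  $node as binary()
) as xs:string
{
  xs:base64Binary(xs:hexBinary($node)) cast as xs:string
};

let $doc := fn:doc("/bilbo-200x200.jpg")/node()
return ooxml:binary-to-base64-string($doc)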

Now, the above will work if we just want the base 64 string,  but if we want a string we can use with Word and the Flat OPC format, certain rules apply: 1) the string must be broken into lines of 76 characters, and 2) there must not be a line break at the beginning or end of the content.  No big deal, we just do the following:

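xquery version "1.0-ml";
declare namespace ooxml="http://marklogic.com/openxml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

declare function ooxml:base64-string-to-binary(
  $string as xs:string
) as binary()
{
  binary{xs:hexBinary(xs:base64Binary($string))}
};

declare function ooxml:binary-to-base64-string(
  $node as binary()
) as xs:string
{
  xs:base64Binary(xs:hexBinary($node)) cast as xs:string
};

(: join the 76-character lines with line breaks; fn:string-join adds none before the first line or after the last :)
declare function ooxml:base64-opc-format(
  $binstring as xs:string
) as xs:string
{
  fn:string-join(ooxml:format-binary($binstring), "&#xA;")
};

(: split the string into consecutive chunks of at most 76 characters; fn:substring positions start at 1 :)
declare function ooxml:format-binary(
  $binstring as xs:string
) as xs:string*
{
  for $i in 0 to ((fn:string-length($binstring) - 1) idiv 76)
  let $start := ($i * 76) + 1
  return fn:substring($binstring,$start,76)
};

let $doc := fn:doc("/bilbo-200x200.jpg")/node()
return   ooxml:base64-opc-format(ooxml:binary-to-base64-string($doc))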

And that’s all there is to it!  You are now a Master of the image-encoded-as-base64-string Universe!  Cheers!