XQuery and MarkLogic Developer Blogs

Inspired by this list of Computational Linguistic Blogs, I thought I’d aggregate and share the blogs I follow that provide useful information on working with XQuery and/or MarkLogic Server.

The blogs noted here are alive and active.   I like these as they’re written by people who are actually writing code, building things, and solving problems.  They provide code examples, as well as practical insight.   I give a brief synopsis of the activity, and if the author also provides information on working with MarkLogic, I note that as well.  Finally, if the author is active on twitter, you can click their name to follow.

The following are all awesome and listed in no particular order.

Blogroll

  • alex bleasdale / Developer Notes
  • Self-described as notes on XML and Web Development, he’s been on a MarkLogic and XQuery tear lately.  Lots of useful information here.

  • norman walsh / blog
  • Norm is a shotgun blast of all things X: XML, XProc, XQuery, XSLT and more.  He’s now an Engineer at MarkLogic too.  Posts frequently, and its always good.

  • matt turner / Discovering XQuery
  • Irregular posting, but when published, they’re always good.  The blog archive provides many useful tutorials for working with XQuery as well as MarkLogic Server.

  • kit wallace / blog
  • A person who sees solutions written in other languages, asks himself “how would I do that in XQuery”, figures it out, and shares the code.  Useful, fun, and frequent posts on XQuery , XSLT,  XProc and more.

  • jeni tennison / Musings
  • Excellent posts on XQuery, XML, RDF, and Linked Data published on a regular basis.

  • mattio valentino /Rendition Protocol
  • Posts on MarkLogic, XQuery, and more.  The blog is a collection of his development notes and is full of useful snippets and insight.

Alright, that’s it for now.  If you know of others who share XQuery code, experience and/or information on working with MarkLogic Server on a regular basis, I’d really like to know about them.  Please let me know who they are so I can check them out and maybe add them to the list.

Thanks!

Advertisements

Fun with XQuery, Images encoded as base64 Strings, and Word 2007

or: There and Back Again, A JPEGs Tale.

This is a fun one that comes up every once in awhile.  When you save a Word 2007 document as .xml, Word serializes images as base 64 strings.  It turns out that organizations regularly save Word documents as .xml and they want the ability to view these images in a browser or some other application so they can decide how they’d like to re-use them.  So the first question that comes up is: How can I transform the base 64 string back into an image?

If you want to play along at home, copy the image of Bilbo here and save it in a Word 2007 document: In the Ribbon, select the ‘Insert’ tab, from the ‘Illustrations’ group, choose ‘Picture’, then select  your pic and insert it.  Next: Go to the Button, click ‘Save As’, select ‘Other Formats’, and for the ‘Save as format’ choose ‘Word XML Document (*.xml)’.

Don’t choose the 2003 XML, cause that’s something else.  It similar, but different (cause it’s not the same).

So now, open that bad boy in vi, Visual Studio, or some other editor and take a peek.  I want to take this opportunity to introduce you to the  Flat OPC format. When you save a Word doc as a .docx, you end up with a .zip file that contains all these interrelated .xml files and their associated assets (such as images).  When you save as .xml, you end up with all those same XML parts serialized in a single .xml file, with images serialized as base 64 strings.  This .xml format is known affectionately in Redmond as Flat OPC.  Once you understand just a little about this format, the amount of Word document @$$ you can kick in MarkLogic Server and/or using an Add-in in Word is awesome.

SPOILER ALERT: An upcoming post is going to dive into how we can exploit the Flat OPC format for document re-use.

The main body of content for a Word 2007 document can be found in the document.xml part.  With images, you’ll find a reference to the image in document.xml, but the image will be stored separately as its own part in the document package; as a binary in the .docx, but serialized as base 64 when saved as .xml.  Knowing this, let’s convert the string to an image.

Throw BilboBaggins.xml into MarkLogic Server. (I use WebDAV).  To view the base 64 string, evaluate the following in CQ:

xquery version "1.0-ml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

let $doc := fn:doc("/BilboBaggins.xml")/node()[2]
let $image-string :=  $doc/pkg:part[@pkg:name="/word/media/image1.jpeg"]/pkg:binaryData/node()
return  $image-string

Yep, it’s that ugly.  Luckily viewing the image is as simple as:

xquery version "1.0-ml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

let $doc := fn:doc("/BilboBaggins.xml")/node()[2]
let $image-string :=  $doc/pkg:part[@pkg:name="/word/media/image1.jpeg"]/pkg:binaryData/node()
return  binary{xs:hexBinary(xs:base64Binary($image-string))}

Now, what about the reverse?  What if we have images in the Server that we want to serialize as base 64 strings?

Take the image you copied at the beginning and save it to MarkLogic.  We can convert it to a base 64 string by evaluating the following in CQ:

xquery version "1.0-ml";
declare namespace ooxml= "http://marklogic.com/ooxml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

declare function ooxml:base64-string-to-binary(
  $string as xs:string
) as binary()
{
    binary{xs:hexBinary(xs:base64Binary($string))}
};

declare function ooxml:binary-to-base64-string(
 $node as binary()
) as xs:string
{
      xs:base64Binary(xs:hexBinary($node)) cast as xs:string
};

let $doc := fn:doc("/bilbo-200x200.jpg")/node()
return ooxml:binary-to-base64-string($doc)

Now, the above will work if we just want the base 64 string,  but if we want a string we can use with Word and the Flat OPC format, certain rules apply: 1) the string must be broken into lines of 76 characters, and 2) there must not be a line break at the beginning or end of the content.  No big deal, we just do the following:

xquery version "1.0-ml";
declare namespace ooxml= "http://marklogic.com/ooxml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

declare function ooxml:base64-string-to-binary(
  $string as xs:string
) as binary()
{
    binary{xs:hexBinary(xs:base64Binary($string))}
};

declare function ooxml:binary-to-base64-string(
 $node as binary()
) as xs:string
{
      xs:base64Binary(xs:hexBinary($node)) cast as xs:string
};

declare function ooxml:base64-opc-format(
$binstring as xs:string)
{
      fn:string-join(ooxml:format-binary($binstring),"
")
};

declare function ooxml:format-binary(
$binstring as xs:string
)as xs:string*
{
    for $i in 0 to (fn:string-length($binstring) idiv 76)
    let $start := ($i * 76)
    return fn:substring($binstring,$start,76)
};

let $doc := fn:doc("/bilbo-200x200.jpg")/node()
return   ooxml:base64-opc-format(ooxml:binary-to-base64-string($doc))

And that’s all there is to it!  You are now a Master of the image-encoded-as-base64-string Universe!  Cheers!

Saving macro code from an Active Excel Workbook as text in a Custom XML Part

Eric B. and RakimSo we can audit and search it later!

Interesting question came my way last week: Is it possible to save macro code as text in an Excel 2007 .xlsm package?

The answer is: Yes!

Why would someone want to do this?  Well, this particular organization wants to save Excel workbooks to MarkLogic Server.  In the Server, the workbooks are automatically unzipped and they can search and run reports on the SpreadsheetML, running analytics on their analytics.

But macros are saved as .bin files within the .xlsm (in vbaProject.bin) and they want the ability to audit the macros being used, as well as search and possibly re-use them.

So wouldn’t it be dreamy if we could create an Add-in for Excel that could capture when an Analyst saves a workbook and serialize the code for any macros in the workbook as text, storing them in Custom XML Parts within the .xlsm package?

Recipe for Awesome

We can prototype this quickly using the MarkLogic Toolkit for Excel.

I’m just going to focus on the macro to text part for this post.  I choose to handwave at events right now as that’s a topic for another time.  There are many you could possibly capture, but let’s just say we (handwave, handwave) capture the WorkbookBeforeSave event.  When that event fires, we’d call some function that uses the code below to get the text for each macro, XML-ize it, and add it to the .xlsm package.  Score-Double-Bonus-Action: the Add-in that comes with the Toolkit already has a function addCustomXMLPart(string myxml), that provides a way for us to add the Custom XML Parts to the .xlsm.

I used Excel to record a simple macro that merges a couple of cells and does some formatting. To tell you the truth, what I know about macros in Office is more meta than practical, so this was an interesting learning exercise for me.  I created my macro using this tutorial, which I found useful.

All that’s left is to create two functions: one to get the count of the macros in the workbook, and another to get each macro and serialize it as text:

//you probably need to add the reference
using VBA = Microsoft.Vbe.Interop;

public int getMacroCount()
{

  VBA.VBProject proj = Globals.ThisAddIn.Application.ActiveWorkbook.VBProject;
  return proj.VBComponents.Count;

}

public string getMacroText(int idx)
{

  VBA.VBProject proj = Globals.ThisAddIn.Application.ActiveWorkbook.VBProject;
  string projName = proj.Name;
  string componentFile = "";
  object o_idx = idx;

  try
  {
    VBA.VBComponent vbComponent = proj.VBComponents.Item(o_idx);

    if (vbComponent != null)
    {
      VBA.CodeModule componentCode = vbComponent.CodeModule;
      componentFile = "";
      if (componentCode.CountOfLines > 0)
      {
        for (int i = 0; i < componentCode.CountOfLines; i++)
        {
          componentFile += componentCode.get_Lines(i + 1, 1) + Environment.NewLine;
        }
      }
    }
  }

  catch (Exception e)
  {
    componentFile = "error: "+e.Message;
  }

  return componentFile;

}
 

So we call the count, and for each one, we get the macro text, if there is any.  If a macro exists, we can just wrap it in some well formed XML and add it to the .xlsm package using addCustomXMLPart(ourmacroinxml).

Now once we’ve saved the .xlsm toMarkLogic, we can query from CQ:

     xquery version "1.0-ml";
     declare namespace xl="http://marklogic.com/xl";
     cts:search(//xl:metadata, cts:word-query("format"))

And view our results:

Notes

MSDN documentation isn’t the greatest to begin with, but I couldn’t find anything useful in the library on Microsoft.Vbe.Interop, VBProject, or VBComponents.

I did find some older blog posts and some suggestions on the forums.  Consensus seems to be that you have to rely on intellisense and just play around with these objects to figure them out.  Of course, after monkeying around with this stuff for a couple of hours I then knew what to look for and found some useful posts on stackoverflow.  But if anyone knows where I can find decent API documentation or any useful information on using  Microsoft.Vbe.Interop, that would be much appreciated.

Finally, the example here is pretty simple and you can easily add more metadata about the .xlsm and/or any embedded macros to the Custom XML Part.   Take a closer look at componentCode (vbComponent.CodeModule) above.   You’ll find properties that return the procedure name, signature, line counts, and more, which could be useful for reporting and diffing.  Enjoy!

public int getMacroCount()

{

int count=0;

try

{

VBA.VBProject proj = Globals.ThisAddIn.Application.ActiveWorkbook.VBProject;

count = proj.VBComponents.Count;

}

catch

{

count = 0;

}

return count;

}

public string getMacroText(int idx)

{

VBA.VBProject proj = Globals.ThisAddIn.Application.ActiveWorkbook.VBProject;

string projName = proj.Name;

string componentFile = “”;

object o_idx = idx;

try

{

VBA.VBComponent vbComponent = proj.VBComponents.Item(o_idx);

if (vbComponent != null)

{

VBA.CodeModule componentCode = vbComponent.CodeModule;

componentFile = “”;

if (componentCode.CountOfLines > 0)

{

for (int i = 0; i < componentCode.CountOfLines; i++)

{

componentFile += componentCode.get_Lines(i + 1, 1) + Environment.NewLine;

}

}

}

}

catch (Exception e)

{

componentFile = “error: “+e.Message;

}

return componentFile;

}