Get the code:

The code in this blog post is now available at:

http://github.com/tealeg/FODS2XML

Background

For a while now I’ve been receiving data in Excel files that needed to be integrated into some better structured XML data. Originally I was using a little xls2csv from the catdoc package in Arch Linux to pull the data from the horrible bloated XLS files into something usable.

The trouble is the conversion to CSV was a little error prone and it’s really hard to see such errors in a 40MB CSV file. What I really needed was something a bit more structured to work with. I decided I would try processing the data directly from an XML version of the spreadsheet. In OpenOffice.org 3 I have a few XML options (Microsoft XLSX, Open Document Format, and FODS) - I wasn’t aware of FODS XML before so I took a look. What FODS seems to be is a single file equivalent of the XML content that usually spans multiple files in the Zipped up folder that is an Open Document Format file. (I may be wrong - but this is my naive view of what I saw).

Converting FODS with XSLT

With the handy FODS verison of the spreadsheet I set about writing a two stage XSLT conversion that strips out the data and results in a meaningful XML file. This is all quick and dirty XSLT 2.0 - I’ve only tested in Saxon 9.2.0.6 and I wouldn’t put it forwards as an exemplary piece of XSLT 2.0, but I figured someone might find is useful.

My code assumes the following:

  • The data is arranged in tabular format
  • There is only one table per tab in the sheet
  • There is no extraneous data around the tab
  • The first row contains column titles that are also valid XML element names (i.e. no spaces or other restricted characters).
  • Stage 1 simplified the FODS data too a simple tabular XML format. The trickiest thing here is the expansion of repeated columns. Note as well, this XSLT 2.0 takes a parameter “collimit” that tells it to ignore columns that fall after that number - this makes the XSLT run a lot faster.

fods2tbl.xsl looks like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet
  version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
  xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
  xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
  xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
  xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
  xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0"
  xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
  xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
  xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
  xmlns:math="http://www.w3.org/1998/Math/MathML"
  xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
  xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
  xmlns:config="urn:oasis:names:tc:opendocument:xmlns:config:1.0"
  xmlns:ooo="http://openoffice.org/2004/office"
  xmlns:ooow="http://openoffice.org/2004/writer"
  xmlns:oooc="http://openoffice.org/2004/calc"
  xmlns:dom="http://www.w3.org/2001/xml-events"
  xmlns:xforms="http://www.w3.org/2002/xforms"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:rpt="http://openoffice.org/2005/report"
  xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns:grddl="http://www.w3.org/2003/g/data-view#"
  xmlns:tableooo="http://openoffice.org/2009/table"
  xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0"
  xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0"  
  xmlns:css3t="http://www.w3.org/TR/css3-text/"
  xmlns:saxon="http://icl.com/saxon"
  extension-element-prefixes="saxon">

<xsl:output indent="yes"
       method="xml"
       encoding="UTF-8"
       saxon:character-representation="native;decimal"/>

<xsl:param name="collimit" select="100"/>

<xsl:template match="*">
  <!-- Do Nothing -->
</xsl:template>

<xsl:template match="office:document">
  <document>
    <xsl:apply-templates />
  </document>
</xsl:template>


<xsl:template match="office:body">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="office:spreadsheet">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="table:table">
  <table>
    <xsl:attribute name="name">
 <xsl:value-of select="@table:name"/>
    </xsl:attribute>
    <xsl:apply-templates />
  </table>
</xsl:template>

<xsl:template match="table:table-row">
  <row>
    <xsl:attribute name="rownum">
 <xsl:value-of select="count(preceding-sibling::table:table-row) + 1"/>
    </xsl:attribute>
    <xsl:apply-templates />
  </row>
</xsl:template>

<xsl:template name="repeat-cells">
  <xsl:param name="index" select="0"/>
  <xsl:param name="cellnum" select="1"/>
  <xsl:if test="$collimit >= $cellnum">
    <xsl:if test="$index > 0">
 <cell>
   <xsl:attribute name="cellnum">
     <xsl:value-of select="$cellnum"/>
   </xsl:attribute>
 </cell>
 <xsl:call-template name="repeat-cells">
   <xsl:with-param name="index" select="$index - 1"/>
   <xsl:with-param name="cellnum" select="$cellnum + 1"/>
 </xsl:call-template>
    </xsl:if>
  </xsl:if>
</xsl:template>

<xsl:template match="table:table-cell">
  <xsl:variable name="repeats"
    select="sum(preceding-sibling::table:table-cell/@table:number-columns-repeated) -
     count(preceding-sibling::table:table-cell[@table:number-columns-repeated > 0])"/>
  <xsl:variable name="cellnum" select="count(preceding-sibling::table:table-cell) + $repeats + 1"/>
  <xsl:if test="$collimit >= $cellnum">
    <cell>
 <xsl:attribute name="cellnum">
   <xsl:value-of select="$cellnum"/>
 </xsl:attribute>
 <xsl:apply-templates />
    </cell>
    <xsl:if test="@table:number-columns-repeated > 0">
 <xsl:call-template name="repeat-cells">
   <xsl:with-param name="index" select="@table:number-columns-repeated -1"/>
   <xsl:with-param name="cellnum" select="$cellnum + 1"/>
 </xsl:call-template>
    </xsl:if>
  </xsl:if>
</xsl:template>

<xsl:template match="text:p">
  <p>
    <xsl:value-of select="."/>
  </p>
</xsl:template>
  
</xsl:stylesheet>

The second stage is to take the resulting tabular XML file and convert it to meaningful XML. The output essentially wraps each cell in a XML element with the name of the column it was taking from (extracted from the first row).

tbl2data.xsl looks like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 <xsl:output indent="yes"
       method="xml"
       encoding="UTF-8"
       xmlns:saxon="http://icl.com/saxon"
       saxon:character-representation="native;decimal"/>

 <xsl:template match="*">
   <!-- Do Nothing -->
 </xsl:template>

 <xsl:template match="document">
   <subjects>
     <xsl:apply-templates select="table/row[@rownum > 1]" />
   </subjects>
 </xsl:template>

 <xsl:template match="row">
   <record>
     <xsl:apply-templates />
   </record>
 </xsl:template>

 <xsl:template match="cell">
   <xsl:variable name="cellnum">
     <xsl:value-of select="@cellnum"/>
   </xsl:variable>
   <xsl:variable name="table">
     <xsl:value-of select="ancestor::table/@name"/>
   </xsl:variable>
   <xsl:variable name="cellname">
     <xsl:value-of select="/document/table[@name=$table]/row[@rownum=1]/cell[@cellnum=$cellnum]/p[1]"/>
   </xsl:variable>
   <xsl:choose>
     <xsl:when test="string-length($cellname) > 0">
 <xsl:element name="{$cellname}">
   <xsl:apply-templates />
 </xsl:element>
     </xsl:when>
     <xsl:otherwise>
 <xsl:element name="OOPS">
   <xsl:apply-templates />
 </xsl:element>
     </xsl:otherwise>
   </xsl:choose>
 </xsl:template>

 <xsl:template match="p">
   <xsl:value-of select="."/>
 </xsl:template>

</xsl:stylesheet>

I hope someone finds that helpful.