README for spreadsheet file r/w access scripts for Octave (> 3.8.x)
Copyright (C) 2009 - 2020 Philip Nienhuis <prnienhuis at users.sf.net>
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
February 22, 2020
SPREADSHEET I/O SUPPORT FOR OCTAVE
> > > > AN IMPORTANT THING TO KEEP IN MIND < < < <
The spreadsheet I/O functions merely serve for data I/O, or in other words, just transferring data from or to spreadsheet files.
More sophisticated stuff like formatting or overhauling worksheet setup is simply out of scope.
General overview of functions and files
As of io-2.6.0, the separate functions for .ods and .xls(x) formats have been unified into the xls* functions. The ods* functions still live on as wrappers for their xls* siblings, although they are now formally deprecated. This means that from now on all spreadsheet formats that used to be divided over the ods* and xls* functions can be read / written with either ods* or xls* functions. In the overview below the ods* siblings are indicated between parentheses after their xls* counterparts.
This move was badly needed to be able to cut down on maintenance, as the various ods* functions did largely the same as their xls* siblings (think of input validation, preprocessing, input argument handling, file handling, etc.).
The choice for keeping xls* functions and deprecating the ods* functions is a logical consequence of Octave striving to be Matlab-compatible: Matlab doesn't have ods* functions, only xls* functions.
doc/Spreadsheet-IO-in-Octave.html
This file.
xlsread.m (odsread.m)
All-in-one functions for reading data from one specific worksheet in a spreadsheet file. This script has Matlab-compatible functionality.
xlswrite.m (odswrite.m)
All-in-one function for writing data to one specific worksheet in a spreadsheet file. This script has Matlab-compatible functionality.
xlsfinfo.m (odsfinfo.m)
All-in-one function for exploring basic properties of a spreadsheet file. This script has Matlab-compatible functionality.
xlsopen.m (odsopen.m)
Function for "opening" (= providing a handle to) a spreadsheet file ("workbook"). This function sorts out which interface to use for .xls(x), .ods, .gnumeric or other file format access (i.e., COM; Java & Apache POI; jOpendocument; JexcelAPI; OpenXLS; etc.), but it's choice can be overridden.
xls2oct.m (ods2oct.m)
Function for reading data from a specific worksheet pointed to in a struct created by xlsopen.m. xls2oct can be called multiple times consecutively using the same pointer struct, each time allowing to read data from different ranges and/or worksheets. Data are returned in the form of a 2D heterogeneous cell array that can be parsed by parsecell.m. xls2oct is a mere wrapper for (hidden, "private") interface-dependent scripts that do the actual low-level reading.
oct2xls.m (oct2ods.m)
Function for writing data to a specific worksheet pointed to in a struct created by xlsopen.m. octxls can be called multiple times consecutively using the same pointer struct, each time allowing to write data to different ranges and/or worksheets. oct2xls is a mere wrapper for (hidden, "private") interface-dependent scripts that do the actual low-level writing.
xlsclose.m (odsclose.m)
Function for closing (the handle to) a spreadsheet file. When data have been written to the workbook in memory, xlsclose will write the workbook to disk. Otherwise, the file pointer is simply closed and possibly used interfaces for spreadsheet access (LibreOffice, COM/ActiveX/Excel.exe) will be shut down properly.
parsecell.m
Function for separating the data in raw arrays returned by xls2oct, into numerical/logical and text (cell) arrays.
chk_spreadsheet_support.m
Internal function for (1) checking, (2) setting up, (3) debugging spreadsheet support. While not specifically meant for direct invocation from the Octave prompt (it is more useful during initialization of Octave itself) it can be very helpful when hunting down issues with spreadsheet support in Octave
calccelladdress.m
Support function called by the script functions; calculates spreadsheet type row/column address based on 1-based row/column numbers.
test_spsh.m, io_testscript.m
Undocumented scripts for testing basic features of the spreadsheet scripts. Meant for testers and developers.
Private utility functions
Most of the above functions delegate actual file and data handling to "interface"-specific functions that are "private" (= hidden from the user and not directly callable by users).
Templates
The io package installation directory contains a "template" subdirectory. This contains empty spreadsheet "frameworks" that are only used as templates for writing data to new spreadsheet files. I do not think they can be read as-is by spreadsheet programs, so don't try that.
SUPPORT SOFTWARE
For the native Octave interface (OCT)
(read/write support for OOXML (Excel 2007+), ODS 1.2 (LibreOffice/OpenOffice.org Calc), and Gnumeric)
NO external support software is required!
For the Java / Apache POI / JExcelAPI / OpenXLS / LibreOffice | OpenOffice.org interfaces (general):
.ods file format
ODF Toolkit / ODFDOM specific:
The preferred versions are odfdom-0.8.8 and xercesImpl-2.11.0. Get them here:
http://odftoolkit.org/projects/odfdom/pages/Home
http://incubator.apache.org/odftoolkit/downloads.html (preferred)
Google for xerces-<version> download
jOpendocument specific:
http://www.jopendocument.org
.xls / .xlsx / .xlsm file formats
Apache POI specific:
class .jars: poi-3.5-FINAL-<date>.jar & poi-ooxml-3.5-<date>.jar (or later versions) in javaclasspath.
Get them here:
http://poi.apache.org/download.html
poi-ooxml-schemas-<version>.jar, + either xbean.jar or xmlbeans.jar and dom4j-1.6.1.jar in javaclasspath. Get them here:
http://poi.apache.org/download.html ("xmlbeans" and "poi-ooxml-schemas")
http://sourceforge.net/projects/dom4j/files ("dom4j-<version>")
Starting with POI-3.15, another Java class lib is required: commons-collections4-
Recent releases of Apache POI provide zipped / tar.gz'd archives with all required files inside.
JExcelAPI specific (just .xls):
http://sourceforge.net/projects/jexcelapi/files/
OpenXLS specific (just .xls):
http://sourceforge.net/projects/openxls/ (Be sure to get version 10), and
http://www.gwtproject.org/download.html ("Download GWT SDK")
These class libs must be referenced with full pathnames in your javaclasspath.
When the io package gets loaded, a utility function (__init_io__.m) invokes an initiaization function that tries to automatically find the Java class libs and adds the ones it found to the javaclasspath; When the io package gets unloaded, these same class libs will be removed from the javaclasspath.
On MinGW the required Java class libs had best be put in /<libdir>/java (where <libdir> on MinGW is usually /lib); on Linux system supplied Java class libs usually reside in /usr/share/java. Alternatively, you can put them in your HOME directory in a subdirectory java (mind case!) - on *nix that would be ~./java, on Windows %USERPROFILE%/java (same level as My Documents). The PKG_ADD routine, that gets run each time the io package is loaded, expects the class libs there; if they are elsewhere, add them in ./share/octave/<version>/m/startup/octaverc using appropriate javaaddpath statements or a chk_spreadsheet_support() call.
In addition, you can specify a subdirectory using the environment variable OCTAVE_IO_JAVALIBS.
Once a particular Java class lib has been added to the javaclasspath, it won't be searched anymore nor reloaded from the next search location. The search order is:
If you do not want to automatically load the Java class libs, specify a value of "no", "false" or "0" for the OCTAVE_IO_JAVALIBS environment variable before starting Octave.
.ods, .xls, .xlsx, .xlsm, .sxc + several other file formats
UNO specific (invoking OpenOffice.org (or clones) behind the scenes):
NOTE: EXPERIMENTAL!!
Be aware that OpenOffice.org/LibreOffice arch type (32-bit or 64-bit) must match Octave's arch type.
The utility function chk_spreadsheet_support had best be used to add the needed entries to the javaclasspath. The relevant Java class libs are unoil.jar, unoloader.jar, jurt.jar, juh.jar and ridl.jar (which are scattered around the OOo installation directory), while also the <OOo>/program/ directory needs to be in the classpath.
For the Excel/COM interface:
USAGE
xlsread (odsread) and xlswrite (odswrite) are mere wrappers for xlsopen (odsopen) - xls2oct (ods2oct) - xlsclose (odsclose) - parsecell and xlsopen (odsopen) - oct2xls (oct2ods) - xlsclose (odsclose) sequences, resp. They exist for the sake of Matlab compatibility.
xlsfinfo (odsfinfo) can be used for finding out what worksheet names exist in the file.
Invoking xlsopen/..../xlsclose directly provides for much more flexibility, speed, and robustness than xlsread / xlswrite. Indeed, using the same file handle (pointer struct) you can mix reading & writing before writing the workbook out to disk using xlsclose.
And: xlsopen / xlsclose hide the gory interface details from the user.
When using xlsopen....xlsclose be sure to keep track of the file handle struct.
A possible scenario:
xlh = xlsopen (<spreadsheet_filename> , [rw], [<requested interface>])
# Set rw to 1 if you want to write to a workbook immediately.
# In that case the check for file existence is skipped and
# -if needed- a new workbook created.
# If you really want an other interface than auto-selected
# by xlsopen you can request that. But xlsopen still checks
# proper support for your choice.
# Read some data
[ rawarr1, xlh ] = xls2oct (xlh, <SomeWorksheet>, <Range>)
# Be sure to specify xlh as output argument as xls2oct keeps
# track of changes and the need to write the workbook to disk
# in the xlhstruct. And the origin range is conveyed through
# the xlh pointer struct.
# Separate data into numeric and text data
[ numarr1, txtarr1, lim1 ] = parsecell (rawarr1)
# Get more data from another worksheet in the same workbook
[ rawarr2, xlh ] = xls2oct (xlh, <SomeOtherWorksheet>, <Range>)
[ numarr2, txtarr2, lim2 ] = parsecell (rawarr2)
# <... Analysis and preparation of new data in cell array Newdata....>
# Add new data to spreadsheet
xlh = oct2xls (Newdata, xlh, <AnotherWorksheet>, <Range>)
# Close the workbook and write it to disk; then clear the handle
xlh = xlsclose (xlh)
clear xlh
SPREADSHEET FORMULA SUPPORT, STRIPPING AND ENCODING
When using the COM, POI, JXL, OXS, UNO and OCT interfaces you can:
In short, you can enter spreadsheet formulas and in a later stage read them back, change them and re-enter them in the worksheet.
MATLAB COMPATIBILITY AND SOME GOTCHAS
xlsread, xlswrite and xlsfinfo are for the most part Matlab-compatible. Some small differences are mentioned below. When using the Java interfaces Octave supplies some formula manipulation support.
xlsread
Individual spreadsheet cells containing erroneous stuff are transferred to Octave as NaNs. But not all errors can be catched. E.g., spreadsheet cells showing #Value# in LibreOffice Calc often contain invalid formulas but may have a 0 (null) value stored in the value fields. It is rarely possible to catch this as there is no run-time formula evaluator (yet) in ODF Toolkit nor jOpenDocument (like there is in Apache POI for Excel).
Octave's xlsread (and for that matter, xlsfinfo as well) returns info about the actual (rather than the requested) cell range where the data came from. Personally I find it very useful to know from what part of a worksheet the data originate so I've put quite some effort in it :-)
Note that you can get an interesting confusing with regard to dates when reading .ods files and then writing them to disk s .xls or .xlsx files, or vice versa.
Octave's xlsfinfo also shows (and returns) the range of the smallest rectangle encompassing all occupied data ranges in each sheet.
COMPARISON OF INTERFACES & USAGE
However, all this comes at a price. Parsing xml trees into rectangular arrays is not quite straightforward and the other way round can be a real nightmare.
So, after reading the above you might appreciate that the io package can also invoke pre-baked libraries that simply shield away the gory details. Most if not all open-source ones are Java based and therefore platform-independent. Read on:
Unfortunately, with odftoolkit-0.6.0-incubating and odftoolkit-0.6.1-incubating (corresponding to odfdom-0.8.9 and 0.8.10) unresolved dependencies ("jenasin") have been introduced that break their functionality for Octave.
It is slower than native JXL let alone Excel & COM but it features active formula evaluation, although still not *all* Excel functions have been implemented (a daunting task for the POI devs, as it is hard to keep up with MS here). I've made the relevant subfunction (xls2jpoi2oct) fall back to cached formula results (and yield a suitable warning) for non-implemented Excel functions while reading Excel files.
The fact that upon a switch from reading to writing the existing spreadsheet on disk is overwritten in place by a blank one and that you can only get the contents back wen writing out all of the changes is worrying - and any change after the first write() is lost as a next write() doesn't seem to work, worse yet, you may completely loose the spreadsheet in question. The first is by JExcelAPI design, the second is probably a bug (in octave-forge/Java or JExcelAPI ? I don't know). Adding data to existing spreadsheets does work, but IMO undue user confidence is needed.
JExcelAPI supports BIFF5 (only reading) and BIFF8 (Excel 95 and Excel 97-2003, respectively). Upon overwriting, BIFF5 spreadsheets are converted silently to BIFF8.
JexcelAPI, unlike ApachePOI, doesn't evaluate functions while reading but instead relies on cached results (i.e. results computed by Excel itself). Depending on Excel settings ("Automatic calculation" ON or OFF) this may or may not yield incorrect (or expected) results.
A big stumbling block is that odsclose() on a UNO xls or ods struct will kill ALL OpenOffice.org invocations, also those that were not related to Octave! This is due to UNO-Java limitations. In fact this is the reason UNO is still considered experimental
The underlying issue is that after Octave started an Libreoffice invocation, Libreoffice must be closed for Octave to be able to exit; otherwise Octave will wait for LO to shut down before it can terminate itself. So Octave must kill LO to be able to terminate.
A way out hasn't been found yet and may even not exist.
A tiny problem is that one cannot find out easily through COM what file types are supported; xls, wks, wk1, xlsx, etc.; modern Excel versions can read .ods.
Another -obvious- limitation is that COM Excel access only works on Windows systems where Excel is installed.
A NOTE ON JAVA MEMORY USAGE
Java memory pool allocation size
-Xmx512m
For further details consult the Octave manual, "Java Interface", "FAQ", "How can I handle memory limitations?"
TROUBLESHOOTING
As of April 2011 a special purpose setup file has been included in the io package (chk_spreadsheet_support.m) in which error checking and troubleshooting have been automated; it has evolved over time with core Octave itself.
When running it with the second input argument (debug level) set to 3 a lot of useful, hopefully self-explanatory diagnostic output will be printed to screen.
DEVELOPMENT
xlsopen.m merely needs two stanzas, xlsfinfo.m, xls2oct.m, oct2xls.m and getusedrange.m each need an additional elseif stanza, and xlsclose.m needs a small stanza for closing the pointer struct and writing to disk.
Separating the file access functions and the actual reading/writing from/to the workbook in memory has made developer's life (I mean: my time developing this stuff) much easier.
Using ActiveX/COM an entire array is handed as an object from Excel to Octave and parsed in the Octave world; using OCT a worksheet is scanned using a regular expression for each data type in one fell swoop, and spreadsheets usually only have a limited number of data types (just double, text, logical, and some derived types).
The again each spreadsheet interface support SW has its own date representation and epoch which makes Matlab compatibility over all interfaces an elusive goal.
Enjoy!
Philip Nienhuis, February 22, 2020