The spreadsheet labeling game: an implementation story

August 2015

The research target is ambitious: We want to explore the meaning of spreadsheet cells. What is 4.532,23 in cell C12? Is it Sales? January Sales? January Sales of Cheese sandwiches? Which cells contain these labels? We need data, lots of it, to train and validate our algorithms. We have thousands of spreadsheets (from the EUSES and the Enron corpus).  How can we enable users to help us assign meaning to arbitrary cells? We need some interface for them to match cells to their labels.

This is how the story of the spreadsheet labeling game began. It would be a webpage, showing a worksheet randomly picked from the corpus, with a randomly picked selected cell, asking users which cells describe or label it, letting them select some cells, and saving the answers. For decoration, cute chains of smiley faces (for marrier users) and usage counters (for the more competitive ones).

The toolbox for the implementation was ready: Javascript + jQuery for the client, Bootstrap for styling, a simple backend server, text files for the output. Pretty standard, right?

But how do we show the worksheets to the user? And how do we collect the users’ cell selections?

The first question was easy to answer: Google Sheets. Upload the all spreadsheets to Google Sheets and use their well-documented API to handle everything. I was one happy programmer, until I searched for an answer to the second question. Apparently, in the Google Sheets events there is Open, Change, Edit, Form submit, and that’s it. No cell click, no cell selection change, no way to collect which cell the user is selecting in the spreadsheet. We had to find another way.

We then briefly considered using the Handsontable data grid component.  Instead of embedding spreadsheets to webpages, worksheets would be transformed to data grids with Excel-like appearance. The cell click collection requirement would be covered using Handsontable’s afterSelection events. However, we preferred to show the actual worksheet to the user, without converting it to any other format. This would be our backup solution, which we did not choose in the end.

After lots of searching, it was Microsoft’s OneDrive and Excel Web App that we resorted to. Surprisingly, among the very basic functionality that the Excel Services JavaScript API offers, an activeCellChanged event handler is included. That was it! We loaded a spreadsheet in an OneDrive folder, retrieved its token (“SD72E74B1ABFC5464!1390/517479313637659748/t=0&s=0&v=!AKNivqAfTbPL7nU”) from the Javascript produced when selecting “Embed”,  and then started implementing the opening, handling user selections and saving the results.

But how do we retrieve the tokens that the Excel Web App requires for the thousands of spreadsheets in our corpus?

We would have to use the OneDrive API. To use it, one needs to create an application, register it with Microsoft for the OneDrive API and get an access token. Logical, but too much when one just needs the list of tokens for the files in a specific OneDrive folder. Luckily, we discovered that we could exploit the SDK provided for trying out the OneDrive API to get just what we needed. Problem solved!

And then we hit a different wall. The Excel Services JavaScript API is not offering something that we took for granted: It does not support setting the style of the cells. For the spreadsheet labelling game this was absolutely essential. The randomly-picked question cell should be clearly marked, and so should be the cells that the user selects as labels.

So, how do we change the style of cells when the API does not support it?

That took a bit of MacGyverism: we make Excel do it through conditional formatting rules. Before uploading the Excel files to OneDrive, a hidden worksheet is added to each of them, along with a conditional formatting rule “=NOT(ISERROR(FIND(SUBSTITUTE(TEXT(ADDRESS(ROW(),COLUMN()), “”)&”,”, “$”,””),hidden!$A$1)))”, applied to all non-hidden workbook cells, which makes Orange the cell whose address is written in cell hidden!A1, and a similar rule that makes Green the cells written in hidden!A2. This way, when a user selects a cell, we use the Excel Services JavaScript API to write its address to cell hidden!A2, and the cell automatically turns Green. Funny how Excel ended up solving this problem too!

It was a tricky implementation with many ups and downs, but in the end we were happy that the website delivered rich results –up to now, more than 40.000 cells were labeled with more than 95000 labels.

Liked the story? Check the spreadsheet labelling game out, or read the paper that used its results.