The research target is ambitious: We want to explore the meaning of spreadsheet cells. What is 4.532,23 in cell C12? Is it Sales? January Sales? January Sales of Cheese sandwiches? Which cells contain these labels? We need data, lots of it, to train and validate our algorithms. We have thousands of spreadsheets (from the EUSES and the Enron corpus). How can we enable users to help us assign meaning to arbitrary cells? We need some interface for them to match cells to their labels.
This is how the story of the spreadsheet labeling game began. It would be a webpage, showing a worksheet randomly picked from the corpus, with a randomly picked selected cell, asking users which cells describe or label it, letting them select some cells, and saving the answers. For decoration, cute chains of smiley faces (for marrier users) and usage counters (for the more competitive ones).
But how do we show the worksheets to the user? And how do we collect the users’ cell selections?
The first question was easy to answer: Google Sheets. Upload the all spreadsheets to Google Sheets and use their well-documented API to handle everything. I was one happy programmer, until I searched for an answer to the second question. Apparently, in the Google Sheets events there is Open, Change, Edit, Form submit, and that’s it. No cell click, no cell selection change, no way to collect which cell the user is selecting in the spreadsheet. We had to find another way.
We then briefly considered using the Handsontable data grid component. Instead of embedding spreadsheets to webpages, worksheets would be transformed to data grids with Excel-like appearance. The cell click collection requirement would be covered using Handsontable’s afterSelection events. However, we preferred to show the actual worksheet to the user, without converting it to any other format. This would be our backup solution, which we did not choose in the end.
But how do we retrieve the tokens that the Excel Web App requires for the thousands of spreadsheets in our corpus?
We would have to use the OneDrive API. To use it, one needs to create an application, register it with Microsoft for the OneDrive API and get an access token. Logical, but too much when one just needs the list of tokens for the files in a specific OneDrive folder. Luckily, we discovered that we could exploit the SDK provided for trying out the OneDrive API to get just what we needed. Problem solved!
So, how do we change the style of cells when the API does not support it?
It was a tricky implementation with many ups and downs, but in the end we were happy that the website delivered rich results –up to now, more than 40.000 cells were labeled with more than 95000 labels.
Liked the story? Check the spreadsheet labelling game out, or read the paper that used its results.