Microsoft Unveils a Large Language Model That Excels At Encoding Spreadsheets
The first of these modules places “structural anchors” throughout the spreadsheet to help the LLM better grasp the sheet's layout. It then removes “distant, homogeneous rows and columns” to produce a condensed “skeleton” version of the table. Index translation addresses the token cost of spreadsheets with numerous empty cells and repetitive values, which would otherwise consume far too many tokens. “To improve efficiency, we depart from traditional row-by-row and column-by-column serialization and employ a lossless inverted index translation in JSON format,” Microsoft wrote. “This method creates a dictionary that indexes non-empty cell texts and merges addresses with identical text, optimizing token usage while preserving data integrity.” […]
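The inverted-index idea can be illustrated with a minimal sketch: instead of serializing the sheet cell by cell, build a dictionary from each distinct non-empty cell text to the addresses that contain it, so empty cells cost nothing and repeated values are stored once. (The function name and sheet data here are illustrative; Microsoft's method additionally merges contiguous addresses into ranges, which this sketch omits.)

```python
def inverted_index_encode(cells):
    """Sketch of a lossless inverted-index encoding: map each
    distinct non-empty cell text to the list of addresses holding it."""
    index = {}
    for address, text in cells.items():
        if text is None or text == "":
            continue  # empty cells are simply omitted from the encoding
        index.setdefault(text, []).append(address)
    return index

# A toy sheet with an empty row and a repeated value.
sheet = {
    "A1": "Region", "B1": "Sales",
    "A2": "East",   "B2": "100",
    "A3": "East",   "B3": "100",
    "A4": "",       "B4": None,
}
print(inverted_index_encode(sheet))
# {'Region': ['A1'], 'Sales': ['B1'], 'East': ['A2', 'A3'], '100': ['B2', 'B3']}
```

The encoding is lossless in the sense that the original non-empty cells can be reconstructed exactly by inverting the dictionary, while duplicate values like "East" and "100" are each serialized only once.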
After conducting a “comprehensive evaluation of our method on a variety of LLMs,” Microsoft found that SheetCompressor reduces token usage for spreadsheet encoding by 96%. Moreover, SpreadsheetLLM shows “exceptional performance in spreadsheet table detection,” which is the “foundational task of spreadsheet understanding.” The new model builds on the Chain of Thought methodology to introduce a framework called “Chain of Spreadsheet” (CoS), which can “decompose” spreadsheet reasoning into a table detection-match-reasoning pipeline.
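The detection-match-reasoning pipeline can be sketched as two chained model calls: first ask the model which table region is relevant to the question, then answer the question using only that region. (Everything below — the function names, prompt wording, and the stub model — is a hypothetical illustration of the pipeline shape, not Microsoft's implementation.)

```python
def chain_of_spreadsheet(encoded_sheet, question, ask):
    """Hypothetical sketch of a Chain of Spreadsheet-style pipeline.

    `ask` is any callable that takes a prompt string and returns the
    model's reply as a string.
    """
    # Stage 1: table detection/match — locate the region relevant to the question.
    region = ask(f"Which table in this sheet is relevant to '{question}'?\n{encoded_sheet}")
    # Stage 2: reasoning — answer using only the matched region, not the whole sheet.
    return ask(f"Using only the table at {region}, answer: {question}")

# Demo with a stub model that returns canned replies in order.
replies = iter(["A1:B3", "42"])
answer = chain_of_spreadsheet("<compressed sheet>", "total sales?", lambda prompt: next(replies))
print(answer)
# 42
```

Narrowing the second call to the detected region keeps the reasoning prompt small, which is the point of pairing CoS with a compressed sheet encoding.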
Read more of this story at Slashdot.