Thursday, September 15, 2011
12:00 pm - 1:00 pm
PLACE: Gates 8102
SPEAKER: Ted Dunning, MapR
TITLE: Architectural Details of the MapR File System
Map-reduce systems such as Hadoop scale certain classes of computations to very large sizes by allowing only certain limited forms of functional programs to be executed. Traditionally, however, the file system associated with map-reduce frameworks has used a centralized meta-data architecture with write-once files in order to simplify the implementation.
There are two major consequences of this decision. The first is that the shuffle phase of map-reduce is forced onto local file systems. This has a number of indirect costs, particularly for computations with highly skewed spill sizes. The second is that the file system is not interoperable with legacy code. Since map-reduce systems work on a very large scale and therefore must deal with failures, relaxing the write-once restriction has commonly been assumed to be difficult to achieve without significant performance loss.
The MapR file system proves otherwise. It provides a full read-write file system with completely distributed meta-data storage and transactional semantics that are reliable, performant and durable even in the presence of node failures.
I will present the key innovations that make this advance possible, explain how this solves the local spill problem, and describe the practical day-to-day implications of the first-class nature of the MapR file system.
Ted Dunning has a background in machine learning and large systems architecture in both academic and industrial settings. As an academic, his paper on statistical methods for computational linguistics has been cited nearly 1500 times, and the methods from that paper continue to provide the underpinnings for a wide variety of systems in document classification, fraud detection and recommendations. In industry, he has worked in half a dozen startups as an early employee or founder. At these companies, Ted's work has broken new ground in music recommendations and streaming (Musicmatch), ad targeting (Aptex), identity fraud (ID Analytics), and video search and peer-to-peer networking (Veoh Networks). MapR is now providing the most advanced map-reduce platform available commercially. Ted also bought the drinks at one of the very first Hadoop User Group meetings.
VISITOR HOST: Garth Gibson
SDI / ISTC Seminar Questions?
Karen Lindenfelser, 86716, or visit www.pdl.cmu.edu/SDI/