Information Dynamics to present at the American Geophysical Union (AGU)
David Hill and Jason Werpy, Senior Software Architects at the USGS EROS Data Center, will be presenting at the 2011 fall AGU meeting (http://sites.agu.org/fallmeeting/) in San Francisco, California on Thursday, December 8th. David Hill will be giving a presentation titled "Satellite Imagery Production and Processing Using Apache Hadoop." The presentation will detail his work at EROS Data Center creating a framework for science data processing. Jason Werpy will be presenting a poster talk on "Metrics Measurement for Data Services." This presentation will outline the challenges in collecting metrics for data processing services offered by the LP DAAC.
Abstract: Satellite Imagery Production and Processing Using Apache Hadoop (D. V. Hill)
The United States Geological Survey's (USGS) Earth Resources Observation and Science (EROS) Center Land Science Research and Development (LSRD) project has devised a method to fulfill its processing needs for Essential Climate Variable (ECV) production from the Landsat archive using Apache Hadoop.
Apache Hadoop is the distributed processing technology at the heart of many large-scale processing solutions implemented at well-known companies such as Yahoo, Amazon, and Facebook. It is a proven framework and can be used to process petabytes of data on thousands of processors concurrently. It is a natural fit for producing satellite imagery and requires only a few simple modifications to serve the needs of science data processing.
This presentation is aimed at anyone doing large-scale image processing today. The session will cover a description of the problem space, evaluation of alternatives, feature set overview, configuration of Hadoop for satellite image processing, real-world performance results, tuning recommendations, and finally challenges and ongoing activities. It will also present how the LSRD project built a 102-core processing cluster with no financial hardware investment and achieved ten times the initial daily throughput requirements with a full-time staff of only one engineer.
Abstract: Metrics Measurement for Data Services (J. Werpy)
Recent advancements in Earth Science Systems data distribution include the addition of services that manipulate and process data before delivery to the user. Historically, users have had to process data locally after downloading inputs from a data center. The metrics for user downloads represent a large volume of data distribution, against which archive performance is measured. With value-added services delivering data processed into smaller, user-defined products, new metrics are required to adequately characterize the effectiveness of service-based distribution.
The first key indicator is the reduced volume of data that users download in relation to the size of the inputs. The processing effort applied to produce the final product is a second metric, needed to scope the system infrastructure that supports value-added services. Third, in certain cases processing may occur at a location other than the host archive, or data from multiple archives may be used to create a fusion product. Tracking distribution from the originating data source through a processing center to a user introduces the complexity of potentially overlapping or excluded measurements.
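As a rough illustration of the three indicators above, the following Python sketch computes a volume-reduction ratio, totals processing effort, and counts each request only once when it is reported by more than one center. It is not taken from the abstract or from any LP DAAC system: the record fields (`request_id`, `input_bytes`, `delivered_bytes`, `cpu_seconds`), the function names, and the choice to de-duplicate on a shared request identifier are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ServiceRequest:
    """Hypothetical record of one value-added service request."""
    request_id: str       # assumed to be shared across reporting centers
    input_bytes: int      # size of the archive inputs that were processed
    delivered_bytes: int  # size of the product the user downloaded
    cpu_seconds: float    # processing effort spent producing the product


def volume_reduction(input_bytes: int, delivered_bytes: int) -> float:
    """Fraction of the input volume the user did not have to download."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return 1.0 - delivered_bytes / input_bytes


def summarize(records: Iterable[ServiceRequest]) -> dict:
    """Aggregate metrics, counting each request_id only once.

    A request reported by both the originating archive and a separate
    processing center would otherwise be measured twice.
    """
    unique = {}
    for rec in records:
        unique.setdefault(rec.request_id, rec)  # keep the first report seen

    total_in = sum(r.input_bytes for r in unique.values())
    total_out = sum(r.delivered_bytes for r in unique.values())
    return {
        "requests": len(unique),
        "input_bytes": total_in,
        "delivered_bytes": total_out,
        "volume_reduction": volume_reduction(total_in, total_out),
        "cpu_seconds": sum(r.cpu_seconds for r in unique.values()),
    }


if __name__ == "__main__":
    reports = [
        ServiceRequest("a", 1000, 100, 2.0),
        ServiceRequest("b", 3000, 300, 6.0),
        ServiceRequest("a", 1000, 100, 2.0),  # same request, second reporter
    ]
    print(summarize(reports))
```

Keeping the first report for each identifier is only one possible policy. A real system would need an agreed rule for which center's measurement is authoritative, and a way to detect requests that no center reported at all.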
Providing reduced-volume products from multiple data centers is a significant advancement in science data distribution, and should be tracked accordingly. Accurately measuring a data center's capacity and its success in distributing products through value-added services is achievable with the implementation of new methods for metrics collection.