
Astronomy use case: Trillian

Open joehand opened this issue 8 years ago • 7 comments

From @demitri on August 26, 2014 23:23

Trillian Data Needs

This document describes the data needs of the Trillian project in detail. Trillian is an attempt to address the difficulty (or outright inability) of easily analyzing the hundreds of terabytes of publicly available astronomical data.

Introduction

Trillian is designed to be a computing engine for astronomical data, consisting of two basic parts. The first is the computational aspect, where users will create astrophysical models (in practice, Python code) that describe a particular object — a type of star, galaxy, etc. — which will then be applied to all available data. The result is a likelihood value assigned to each object analyzed based on how well the data matches the model. The second component is a distributed data network. No astronomy department or institution has the disk space to store the amount of data available, and even if they did, the bookkeeping and organization are well beyond the time and capabilities of most astronomers. This document will focus on describing the latter component and the nature of astronomical data.
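As a concrete illustration of the model idea above, here is a minimal sketch of what a Trillian model might look like: plain Python code that, given the observations of one object, returns a likelihood. All names here (`Observation`, `log_likelihood`, the flat-spectrum model) are illustrative assumptions, not part of any actual Trillian API.

```python
# Hypothetical sketch: a "model" scores one object's multi-wavelength
# observations with a Gaussian log-likelihood. Not a real Trillian API.
import math
from dataclasses import dataclass

@dataclass
class Observation:
    wavelength_nm: float   # wavelength of the observed band
    flux: float            # measured flux in that band
    flux_err: float        # measurement uncertainty

def log_likelihood(predicted_flux, observations):
    """Gaussian log-likelihood of a model's predicted fluxes
    against the observed fluxes for a single object."""
    total = 0.0
    for obs in observations:
        pred = predicted_flux(obs.wavelength_nm)
        resid = (obs.flux - pred) / obs.flux_err
        total += -0.5 * resid**2 - math.log(obs.flux_err * math.sqrt(2 * math.pi))
    return total

# Two fake observations (optical and infrared) of one object, scored
# against a toy flat-spectrum model.
obs = [Observation(500.0, 1.1, 0.1), Observation(2200.0, 0.9, 0.1)]
flat_model = lambda wavelength_nm: 1.0
score = log_likelihood(flat_model, obs)
```

Trillian would run code like this over every object in a Pixel and report the resulting likelihoods back to the user.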

Astronomy Data is Multi-Wavelength

People are familiar with the idea of looking at something to study it, but if you are using only your eyes you see just a very narrow part of the electromagnetic spectrum. I appear very different when someone looks at me with their eyes (visible), with a pair of night-vision glasses (infrared), or takes an x-ray photo of me. My iPhone still works if I completely cover it with my body because I am transparent to both radio and WiFi signals. This is how astronomers understand the universe: we observe things in multiple wavelengths. Dust in the galaxy blocks optical light, but longer-wavelength infrared light passes right through, allowing us to see beyond. Further, wavelength corresponds to temperature — an object emitting light at short wavelengths (e.g. gamma rays, x-rays) is far hotter than one emitting at long wavelengths (e.g. radio waves). By studying objects in different wavelengths, we are actually studying different physical processes.

Telescopes, satellites, and other astronomy detectors typically operate in a single wavelength or a very narrow range (compared to the full spectrum). Consequently, data releases from a survey cover one or a few wavelengths. To fully understand a particular object (e.g. a star, a planet, a galaxy), one wants to collect as many observations covering as many wavelengths as possible. Currently, this means going to several web sites where the data is available, each with a very different interface (if there is a web interface at all, and not just files!), each with a different structure. Manually collating these observations is tedious and time-consuming, and doing this for hundreds of thousands of objects is nearly impossible. This is the problem we want to solve.

Astronomical Data Formats

Astronomy data is typically found in one of two formats: flat (ASCII) files or the FITS format. If I give an astronomer an image taken from a telescope, on its own it's almost worthless. She would need to know the position on the sky, the exposure time, the location of the telescope, the instrument used, the wavelength, etc. Rather than keeping this metadata in a separate file from the image data, it's kept in a header associated with the image in the same file. This header is simply a collection of key-value pairs. Together, the image and the header form a header data unit (HDU). The data may take the form of an image or a table (up to 999 columns). Finally, a FITS file may contain any number of HDUs.
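In practice one would read FITS files with a library such as `astropy.io.fits`; purely as a structural illustration of the description above, the HDU layout can be sketched in plain Python. These class and field names are illustrative, not any real FITS library's API.

```python
# Structural sketch only: each HDU pairs a header (key-value metadata)
# with data (an image array or table), and a FITS file is a list of HDUs.
from dataclasses import dataclass, field

@dataclass
class HDU:
    header: dict   # key-value pairs, e.g. exposure time, sky position
    data: list     # image (2-D list) or table rows; may be empty

@dataclass
class FITSFile:
    hdus: list = field(default_factory=list)   # any number of HDUs

primary = HDU(
    header={"EXPTIME": 53.9, "RA": 180.0, "DEC": -1.2, "FILTER": "r"},
    data=[[0, 1], [2, 3]],   # a tiny fake 2x2 image
)
f = FITSFile(hdus=[primary])
exposure = f.hdus[0].header["EXPTIME"]   # metadata travels with the data
```

The point is that the metadata an astronomer needs never becomes separated from the pixels it describes.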

A data release from a survey will either be a large collection of ASCII table files or of FITS files. There is no standard or common format for the number and arrangement of HDUs in a file, although a small number of header keywords are standardized. This flexibility is necessary due to the complexity of the data, but it makes organizing the data more difficult. Data releases are typically the result of many years of observation and analysis. The larger surveys (e.g. SDSS) will create a new data release every few years, and each completely supersedes the last one (though it's useful to at least keep identifiers that people use from older releases). Some surveys' releases are incremental (e.g. Hubble), where the intervening time was spent observing completely new objects. It is uncommon for surveys to release data in small, frequent doses; astronomical data can, to first order, be treated as large, unchanging data sets.

How Trillian Will Organize Data

Data releases, as described above, primarily cover one or a few wavelengths. If one wants to combine as many observed wavelengths of a single object as possible, storing the data survey-by-survey is not efficient. There is a system called HEALPix that divides a sphere into equal-area pixels. This is preferred over something like a longitude/latitude grid, where the area covered by one degree of longitude varies with latitude. Trillian will take a single HEALPix pixel and collect all available information located at that position on the sky from each data set. Let's call this a Pixel (with a capital 'P') for now.
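The Pixel idea can be sketched as regrouping survey catalogs by sky position instead of by survey. The `pixel_index` function below is a deliberately crude placeholder (a coarse rectangular grid, not real HEALPix); real code would use a HEALPix library such as healpy. All other names are illustrative assumptions.

```python
# Sketch: collect, per sky-pixel index, every observation from every
# survey that falls in that patch of sky.
from collections import defaultdict

def pixel_index(ra_deg, dec_deg):
    # Placeholder binning only -- NOT real HEALPix. A HEALPix library
    # would return an equal-area pixel index for (ra, dec) instead.
    return (int(ra_deg // 30), int((dec_deg + 90) // 30))

def build_pixels(catalogs):
    """catalogs: {survey_name: [(ra, dec, measurement), ...]}
    Returns {pixel_index: {survey_name: [measurements]}}."""
    pixels = defaultdict(lambda: defaultdict(list))
    for survey, rows in catalogs.items():
        for ra, dec, meas in rows:
            pixels[pixel_index(ra, dec)][survey].append(meas)
    return pixels

# Two surveys observed (nearly) the same spot; their measurements end
# up grouped together in one Pixel.
pixels = build_pixels({
    "optical_survey":  [(45.0, 10.0, {"mag_r": 17.2})],
    "infrared_survey": [(45.1, 10.2, {"mag_k": 15.8})],
})
```

After this regrouping, everything known about one patch of sky lives in one place, which is the unit Trillian distributes across storage nodes.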

The data storage in Trillian will be distributed; this is how we will scale. For example, imagine we have a central server at OSU in Ohio. A server at NYU in New York has 10TB to offer. Trillian will determine how many Pixels can be stored in that space, assemble them, and place them there; this is now a storage node. A PostgreSQL database on the central server keeps track of all the Pixels. Each node will also have a PostgreSQL server to manage the data there.
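The central server's Pixel-tracking database might look like the sketch below. The document specifies PostgreSQL; sqlite3 is used here only so the example is self-contained, and the schema, table, and host names are assumptions for illustration.

```python
# Hypothetical central registry mapping each Pixel to the storage node
# that holds it. PostgreSQL in the real design; sqlite3 as a stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pixel_location (
        pixel_id   INTEGER PRIMARY KEY,  -- sky-pixel index
        node_host  TEXT NOT NULL,        -- storage node holding this Pixel
        size_bytes INTEGER               -- for capacity planning on nodes
    )
""")
# The central server assembles Pixels and assigns them to a node (say,
# the hypothetical 10TB NYU machine) as capacity allows.
conn.executemany(
    "INSERT INTO pixel_location VALUES (?, ?, ?)",
    [(101, "node1.nyu.example.org", 7_500_000_000),
     (102, "node1.nyu.example.org", 9_100_000_000)],
)
# Lookup: which node holds Pixel 101?
(host,) = conn.execute(
    "SELECT node_host FROM pixel_location WHERE pixel_id = ?", (101,)
).fetchone()
```

Each storage node would run its own database with the same kind of metadata for the Pixels it actually holds.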

Data Access

The central server will want to retrieve all of the information available for a given object in the sky. Based on the position, it will know in which Pixel, and on which node, the data is stored. We want to get this data to feed to a program to analyze it (apply the model). There are two scenarios: either the storage node is also a compute node and the model is sent to it, or the compute node is elsewhere and must retrieve the data. We would like to implement an API such that a message can be sent to the storage node and have it retrieve the data. It shouldn't matter to the system whether the data is on a remote node or local — it will just be a call to the API, where the location is just a parameter.
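That location-transparent access pattern can be sketched as a single function where the node is just a parameter. The function names, the in-memory store, and the `node="local"` convention are all illustrative assumptions, not a specified API.

```python
# Sketch: one call retrieves a Pixel's data whether it is local or remote.

# Stand-in for the data this storage node actually holds on disk.
LOCAL_STORE = {(101, "optical_survey"): {"mag_r": 17.2}}

def fetch_remote(node, pixel_id, dataset):
    # Stand-in for a network request to another storage node's API.
    raise NotImplementedError(f"would issue a request to {node}")

def get_data(pixel_id, dataset, node="local"):
    """Retrieve one dataset's contents for a Pixel, wherever it lives."""
    if node == "local":
        return LOCAL_STORE[(pixel_id, dataset)]
    return fetch_remote(node, pixel_id, dataset)

row = get_data(101, "optical_survey")   # served from local storage
# get_data(205, "infrared_survey", node="node1.nyu.example.org")
# ...would transparently go over the network instead.
```

Callers — whether the central server or a compute node — never branch on data location themselves; the API hides it.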

Some of the data will be in a tabular format, which is probably best kept in a database to allow for complex queries. However, some data will be in the FITS format. Some of that can be fully loaded into a database (the schema of course being more than a single table). However, there is no benefit to loading images into a database — they cannot be searched on. The API, though, may need to open an image, extract some array of pixels, and return it. There will then be a need for translators that know the particular details of each file and data format, in order to extract the data on demand.
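The translator idea might be implemented as format-specific readers registered by file type, each knowing how to pull out a requested pixel array on demand. The registry mechanism and all names below are illustrative assumptions.

```python
# Sketch: translators registered per format; the API dispatches to the
# one that understands the file's layout.

TRANSLATORS = {}

def translator(fmt):
    """Decorator registering an extraction function for one data format."""
    def register(fn):
        TRANSLATORS[fmt] = fn
        return fn
    return register

@translator("fits_image")
def cutout_fits(image, x0, y0, width, height):
    """Extract a sub-array (cutout) from a 2-D image stored row-major."""
    return [row[x0:x0 + width] for row in image[y0:y0 + height]]

def extract(fmt, *args):
    # The API calls this with whatever format the stored file uses.
    return TRANSLATORS[fmt](*args)

image = [[0, 1,  2,  3],
         [4, 5,  6,  7],
         [8, 9, 10, 11]]
patch = extract("fits_image", image, 1, 1, 2, 2)   # 2x2 cutout at (1, 1)
```

New formats then only require registering a new translator; the API surface seen by compute nodes stays the same.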

Additional information

Example data

Copied from original issue: maxogden/dat#172

joehand avatar Jun 17 '16 18:06 joehand