What you should know about handling big data
Written by Teng-Leong Chew and Eric Wait
The era of “big data” in imaging
With the recent advent of several microscopy technologies, especially light sheet microscopy, with which millions of voxels can be acquired in a matter of seconds, the era of “big data” in imaging has arrived. It is routine, for example, for AIC visitors to acquire 30-40 TB of data in a typical 2-week visit. The SiMView light sheet microscope can generate terabytes of data per hour. Because it is designed for long-term imaging at relatively high volumetric speed, it is not uncommon for SiMView to generate 20 TB of data per experiment. Likewise, the focused ion beam-scanning electron microscope (FIB-SEM), which is capable of tiling large volumes of tissue at a resolution as high as 2 nm x 2 nm x 2 nm, can produce data sets just as large and poses additional challenges in 3D object segmentation. The same applies to the lattice light sheet microscope with adaptive optics (LLSM-AO), which the AIC will offer in 2019.
In general, the AIC provides visitors with data that have already been preprocessed, including deskewing and deconvolution. Currently, we routinely compress the data using the lossless KLB format developed by Philipp Keller’s group (Amat et al., 2015). Despite these steps, the amount of data that needs to be transferred back to the visitor’s home institution remains considerable. AIC visitors should therefore consider the following issues. The challenges in handling AIC data fall into three categories: (i) data transfer, (ii) data storage, and (iii) data computation for image analysis.
Because the global infrastructure for high-speed data transfer lags behind, there are currently no reasonably efficient methods for transferring more than 4-5 TB of data from one institution to another in a timely manner. The fastest method remains the archaic approach of shipping external hard drives via courier service. An 8 TB external drive costs ~US$150. AIC visitors who use the above-mentioned instruments (SiMView, LLSM-AO and FIB-SEM) should budget at least US$1,200 for the purchase of external drives. One way to overcome this problem is to set up a physical connection between Janelia and a local cloud server, for example one operated by Amazon. Data can then be transferred straight to cloud storage quickly. An additional advantage of cloud storage is that it also facilitates cloud computing: algorithms developed by the AIC can be easily deployed to our visitors. However, the cost of keeping the data in cloud storage will quickly add up for the visitors, as discussed under data storage below.
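The drive budget and the case for shipping can be checked with simple arithmetic. The following is a minimal sketch, not an AIC tool: the function names are hypothetical, the drive price and capacity are the figures quoted above, and the link speed and efficiency are assumptions to replace with your own institution's numbers. It uses decimal units (1 TB = 10^12 bytes) and ignores the slightly lower usable capacity of a formatted drive.

```python
import math


def drives_needed(dataset_tb, drive_capacity_tb=8.0, copies=1):
    """Number of external drives needed to hold `copies` full copies of the data."""
    return math.ceil(dataset_tb * copies / drive_capacity_tb)


def drive_cost_usd(dataset_tb, drive_capacity_tb=8.0,
                   price_per_drive_usd=150.0, copies=1):
    """Purchase cost of those drives at the quoted ~US$150 per 8 TB drive."""
    return drives_needed(dataset_tb, drive_capacity_tb, copies) * price_per_drive_usd


def network_transfer_hours(dataset_tb, link_gbps, efficiency=1.0):
    """Hours to send the data over a network link.

    `efficiency` is the assumed fraction of the nominal link speed that is
    actually sustained end to end (1.0 is an idealized best case).
    """
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    size_tb = 40  # upper end of a typical 2-week visit, per the text
    print("drives:", drives_needed(size_tb))
    print("drive cost (USD):", drive_cost_usd(size_tb))
    print("hours at 1 Gbit/s:", round(network_transfer_hours(size_tb, 1.0), 1))
```

Under these assumptions, a 40 TB data set fits on five 8 TB drives, and the suggested US$1,200 budget buys eight drives (64 TB), which leaves room for a larger data set or a partial second copy. Even at a fully sustained 1 Gbit/s, the same 40 TB would take roughly 89 hours to send over a network, and real institutional links often sustain much less.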
A majority of AIC visitors decide to take their data home on multiple external drives. The immediate challenge they face on returning to their home institution is uploading and storing the data. Not every institution offers (i) a high-speed connection for uploading, (ii) a backed-up storage system, and (iii) cluster-accessible storage for downstream computing. It is not uncommon for AIC visitors to leave the collected data on the external drives as the only copy. The AIC, however, will only store the data for our visitors for up to two months. It is therefore the visitor’s responsibility to store and back up the data within this period. It is not advisable to store data on external drives for extended periods of time. While cloud storage is readily available, it can become cost-prohibitive quickly. It is important to note that there is no cost for uploading data to commercial cloud storage services such as that offered by Amazon. However, storing and downloading such large amounts of data would incur significant cost (~US$1,700 per month for 50 TB of storage; bandwidth above 50 TB per month incurs an additional charge).
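To see how quickly cloud storage costs accumulate, the quoted figure can be turned into a rough per-terabyte rate. This is only a sketch under a loud assumption: it treats the ~US$1,700 per month for 50 TB as a flat rate that scales linearly with data size, and it ignores pricing tiers, download (egress) fees, and price changes. The function names are hypothetical, and actual provider pricing should be checked before budgeting.

```python
def rate_per_tb_month(quoted_usd=1700.0, quoted_tb=50.0):
    """Per-TB monthly rate implied by the quoted ~US$1,700 for 50 TB."""
    return quoted_usd / quoted_tb


def storage_cost_usd(dataset_tb, months, usd_per_tb_month=None):
    """Cumulative storage cost, assuming a flat rate that scales linearly.

    This linear model is an assumption for illustration; it excludes
    download and bandwidth charges.
    """
    if usd_per_tb_month is None:
        usd_per_tb_month = rate_per_tb_month()
    return dataset_tb * months * usd_per_tb_month


if __name__ == "__main__":
    for months in (1, 12, 36):
        print(months, "months, 50 TB:", storage_cost_usd(50, months), "USD")
```

With this simplification the quoted price works out to US$34 per TB per month, so keeping 50 TB in the cloud for one year would cost about US$20,400 before any download charges.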
The AIC offers assistance with image analysis, which quite often involves extensive scripting to create new algorithms that our visitors can take home for further analysis. Yet one problem faced by many investigators is that the data sets are so big that their local computers simply cannot handle the computation in a reasonable amount of time. This negates the benefit of having algorithms scripted specifically for their experiments. The most effective way to overcome this problem would be cloud computing, which in turn brings back the cloud storage costs described above.
How will data handling affect your AIC proposal?
In light of these problems, the applicants’ ability to handle big imaging data becomes a factor that the review panel will evaluate when the requested instrument is the SiMView light sheet microscope, the lattice light sheet microscope with adaptive optics, or the focused ion beam-scanning electron microscope. Applicants should describe their institutional information technology infrastructure to demonstrate that the data generated at the AIC will not go unused.
Amat, F., Höckendorf, B., Wan, Y., Lemon, W. C., McDole, K., & Keller, P. J. (2015). Efficient processing and analysis of large-scale light-sheet microscopy data. Nature Protocols, 10(11), 1679–1696.