[ Article ]
Journal of the Korean Astronomical Society - Vol. 51, No. 3, pp.65-71
ISSN: 1225-4614 (Print) 2288-890X (Online)
Print publication date 30 Jun 2018
Received 05 Apr 2018 Accepted 11 Jun 2018

# GPU-ACCELERATED SPECKLE MASKING RECONSTRUCTION ALGORITHM FOR HIGH-RESOLUTION SOLAR IMAGES

Yanfang Zheng ; Xuebao Li ; Huifeng Tian ; Qiliang Zhang ; Chong Su ; Lingyi Shi ; Ta Zhou
College of Electrical and Information Engineering, Jiangsu University of Science and Technology, Zhangjiagang 215600 zyf062856@163.com

Correspondence to: X. Li

JKAS is published under Creative Commons license CC BY-SA 4.0.

## Abstract

The near real-time speckle masking reconstruction technique has been developed to accelerate the processing of solar images to achieve high resolutions for ground-based solar telescopes. However, the reconstruction of solar subimages in such a speckle reconstruction is very time-consuming. We design and implement a new parallel speckle masking reconstruction algorithm based on the Compute Unified Device Architecture (CUDA) on General Purpose Graphics Processing Units (GPGPU). Tests are performed to validate the correctness of our program on NVIDIA GPGPU. Details of several parallel reconstruction steps are presented, and the parallel implementation between various modules shows a significant speed increase compared to the previous serial implementations. In addition, we present a comparison of runtimes across serial programs, the OpenMP-based method, and the new parallel method. The new parallel method shows a clear advantage for large scale data processing, and a speedup of around 9 to 10 is achieved in reconstructing one solar subimage of 256×256 pixels. The speedup performance of the new parallel method exceeds that of OpenMP-based method overall. We conclude that the new parallel method would be of value, and contribute to real-time reconstruction of an entire solar image.

## Keywords:

Sun: general, instrumentation: high angular resolution, methods: numerical

## 1. INTRODUCTION

solar ground-based high resolution imaging, speckle image reconstruction is used to yield diffraction-limited resolutions for partially corrected images with an adaptive optics system, such as speckle masking (Lohmann et al. 1983; von der Lühe 1993), phase diversity methods (Gonsalves 1982; Paxman et al. 1992), and shift-and-add (SAA) (Li et al. 2014). Due to its high accuracy of image reconstruction, the speckle masking algorithm is commonly used in post-processing reconstruction. Because of the large amount of data and calculations involved in this algorithm, the long reconstruction time has limited universal application of the algorithm. Near real-time image reconstruction performance was obtained for the New Vacuum Solar Telescope (NVST) on a high performance cluster using the Message Passing Interface (MPI). The program we developed requires 48 seconds to reconstruct one 2368×1920 pixels image from one burst with at least 100 original images on the multi-core CPU cluster (Li et al. 2015). However, much time (one-third of the computation time) was used to reconstruct subimages from the many burst subimages in such a speckle reconstruction. Real-time image reconstruction can not only shorten the time between the observations and data analysis, but also reduce the data volume at least by a factor of 100. Thus, it is essential to parallelize and accelerate subimage reconstruction to further enhance the speedup performance.

Some aspects of using speckle masking algorithms and parallel computing techniques to reconstruct solar observation images have been already studied by Li et al. (2015), Wöger & von der Lühe (2008), and Li & Zheng (2016). They used traditional MPI and Open Multiple Processing (OpenMP) to accelerate solar images reconstruction in near real-time based on multi-core CPU. General Purpose Graphics Processing Units (GPGPU) is a novel parallel computing technique and has been widely applied to real-time astronomical data processing. The Daniel K. Inouye Solar Telescope (DKIST) is about to apply GPUs to accelerate data processing of visible broadband imager (Beard et al. 2014). NVST applied GPUs to accelerate frames-selection method of the Level1 data processing (Shi et al. 2015). However, only Wöger & Ferayorni (2012) have attempted to accelerate speckle masking in one solar subimage reconstruction on GPUs. However, they considered harnessing parallelization within the average bispectrum calculation algorithm itself, and no significant speedups were gained. By analyzing the speckle masking algorithm in subimage reconstruction, we find that the average bispectrum calculation is most time-consuming.

In this paper, we propose a new parallel method for speckle masking algorithm for solar subimage reconstruction using GPGPU. Our method differs from Wöger & Ferayorni (2012) in that the method accelerates the average bispectrum computation algorithm. Furthermore, we compare the GPU-based solutions to the OpenMP multi-core CPU execution described in Li & Zheng (2016), the computation speed of solar subimage reconstruction is increased significantly. The rest of the paper is organized as follows. We describe the new parallel method for speckle masking algorithm in Section 2. We present results and analysis in Section 3. Finally, we present our conclusions and future work in Section 4.

## 2. CUDA METHOD FOR THE SPECKLE MASKING ALGORITHM

In general, one burst with at least 100 short exposure images is converted into a high resolution image. One burst with at least 100 preprocessed images is segmented into a mosaic of partially overlapping subimage bursts with 100×2562 pixels (Li et al. 2015). The processing of each subimage burst is relatively independent of other subimage regions. Each subimage burst provides the modulus and phases of the object’s Fourier transform, using the method of Labeyrie (1970) and speckle masking respectively (Lohmann et al. 1983). All the reconstructed subimages are merged to form a full high resolution solar image (Li et al. 2015). The speckle masking algorithm uses the bispectrum of the Fourier transforms which is defined by

 $B\left(u,v\right)\begin{array}{c}={〈{I}_{i}\left(u\right){I}_{i}\left(v\right){I}_{i}^{*}\left(u+v\right)〉}_{i}\\ ={O}_{i}\left(u\right){O}_{i}\left(v\right){O}_{i}^{*}\left(u+v\right)\\ \cdot {〈{H}_{i}\left(u\right){H}_{i}\left(v\right){H}_{i}^{*}\left(u+v\right)〉}_{i},\end{array}$ (1)

where u and v are the two-dimensional spatial frequencies. Ii(u)Ii(v)I*i (u+v) represents the ith speckle masking bispectrum; <Hi(u)Hi(v)H*i (u + v)>i denotes the average speckle masking transfer function. After obtaining the average bispectrum, the object’s phase can be recovered from it. A detailed description of the technical parts of the phase reconstruction method is given in Pehlemann & von der Lühe (1989).

The Compute Unified Device Architecture (CUDA) is a general purpose parallel computing architecture with a new parallel programming model and instruction set architecture developed by the NVIDIA Corporation (CUDA Toolkit 2018). CUDA leverages the parallel engine in GPUs to solve many complex computational problems in a more efficient way than on a single CPU. It allows developers to use C or C++ as programming languages and to create high-level CPU+GPU applications without the need to explicitly manage the parallel architecture configuration and its communication with the CPU host (Flores et al. 2014). In our work, we propose a novel parallel method for speckle masking algorithm for solar subimage reconstruction based on CUDA. In the CUDA program model, CPU and GPU work together and execute their respective tasks. The GPU accelerates the computationally intense and time-consuming code, and the CPU takes advantage of its optimized ability to execute serialization code. The code to be executed on the GPU is grouped into data parallel functions called kernels. By analyzing the speckle masking algorithm on the CPU, we find that the most time-consuming process is the average four-dimensional (4D) bispectrum computation. Therefore, in our algorithm, main functions such as the batched two-dimensional (2D) Fast Fourier Transform (FFT), batched 2D spectrum shift, batched 2D spectrum conjugate, average 4D bispectrum initiation and computation are implemented in C language for the GPU. Other processes that do not take much time, such as Fourier phase and amplitude recovery, are implemented on the CPU, and the work flow is controlled by CPU. The flow chart of the new parallel method for speckle masking algorithm is shown in Figure 1.

Flow chart of the new parallel speckle masking algorithm based on CUDA on GPGPU.

One subimage burst was made up of a batch of 100 2D 256×256 pixels subimages. The implementation of the CUDA-based parallel method for one subimage reconstruction covers the following steps, and the significant part of the pseudo code is shown in Figure 2. Firstly, one subimage burst is preprocessed on the CPU, and then the data is transferred to the global memory on the GPU from the CPU host memory. Secondly, the batched 2D FFT operation is executed using the cuFFT library distributed with CUDA and multi-threads kernel on GPU (cuFFT Library 2018), and then a batch of 100 2D spectra is acquired. Thirdly, three different kernels are implemented to execute parallel computing tasks with a large of number of threads on the GPU. These kernels correspond to 2D spectrum shift and conjugation, and initiation of the averaging of the 4D bispectra, respectively, for the batch of 100 images. Fourthly, the computation of the average 4D bispectrum in serial mode requires five loops, which is most time-consuming. Therefore, for the module in parallel mode, the major part of computing is reduced to three loops by unrolling two inner loops and calling the kernel bispectrum kernel multiple times on CPU+GPU. Fifthly, the data of the average 4D bispectrum is transferred to CPU host memory from GPU device memory, and then is used to reconstruct the object’s phase on the CPU. Finally, the object’s amplitude is recovered using the method of Labeyrie (1970), and then one final subimage is reconstructed through inverse Fourier transformation of the phase and amplitude on the CPU.

The significant part of the pseudo code for subimage reconstruction using speckle masking based on CUDA.

The new parallel method avoids high data transfer rates between host memory and device memory. In the subimage parallel reconstruction, only one copy of subimage burst and average 4D bispectrum needs to be transferred between host memory and GPU device memory, and the data volume is so small that the time of data transfer can be neglected. The time cost of several individual parallel reconstruction steps is presented in the next section.

## 3. RESULTS AND ANALYSIS

For experimental purposes, we used actual solar image data acquired from the NVST. One solar subimage was reconstructed from a batch of 100 256×256 pixels subimages in the photosphere TiO channel. Figure 3(a) shows one observational original subimage of one subimage burst before reconstruction. Figure 3(b) shows the parallel reconstruction subimage using speckle masking based on CUDA on GPGPU, and its field of view is slightly smaller than that of original subimage. The spatial resolution of the reconstructed subimage is significantly better than that of original subimage. Similar images constructed using other methods are presented elsewhere (Li & Zheng 2016). There is no difference among the quality of the reconstruction, because the different methods differ mostly in the speed. All experiments were performed on a Linux operating system on a computer equipped with an Intel Xeon E5-2620 CPU at 2.00 GHz (total 16 CPU cores), 32 GB host memory, and an NVIDIA Tesla C2050 GPU with 448 cores and 3 GB device memory. The hardware specification is illustrated in Table 1 and Table 2.

Specifications of the host PC

Specifications of GPU device

a) One subimage of one subimage burst of 100×2562 pixels before reconstruction. (b) The parallel reconstruction subimage using speckle masking based on CUDA on GPGPU. (c) and (d) are the fields from which (a) and (b) are taken.

In order to demonstrate the performance advantages of our proposed method, we present a runtime and speedup performance comparison based on reconstructing one subimage from one subimage burst of 100×2562 pixels with various modules with a CPU (single core) serial program, an OpenMP-based (16 CPU cores) method, and the CUDA-based method, in Table 3. The CPU parallel processing algorithm is implemented using OpenMP. From Table 3, it is seen that the runtime of the CUDA-based method is much smaller than that of the single core CPU serial program and the OpenMP-based method. The runtime of reconstruction of one entire subimage is reduced to around 0.7 seconds based on CUDA, outperforming the OpenMP-based methodnby a factor of around 10 in speed. The most time-consuming module, the average 4D bispectrum calculation, shows a substantial speed increase with a speedup factor of around 17. In the module for the batched 2D FFT, there is a speedup by a factor of about 30 when cuFFT library and multi-thread kernel are used to perform the FFT operation for one subimage burst on GPU. The processing speed of the other two modules is about 3-5 times faster than that of the serial program. From the above comparison results, we can see that the processing speed of several reconstruction steps is increased significantly, and the accelerating performance of the CUDA-based method is superior to that of OpenMP-based method overall.

Runtime and speedup performance measurements for one subimage burst of 100 × 2562 pixels between various modules.

Figure 4 shows a computing time comparison of one subimage reconstruction between serial program, OpenMP-based method, and CUDA-based method using different sizes of data sets. It can be seen that the computation time of the new CUDA-based parallel method invariably decreases compared to that of the serial program and OpenMP-based method, when the number of data sets becomes larger. The speedup ratio of CUDA-based method maintains around 9-10, while that of OpenMP-based method maintains around 2-3, as the size of data sets increases from set1 to set5. Therefore, the new parallel method based on CUDA in our work becomes more efficient than serial implementation and OpenMP-based parallelmethod for large scale data processing.

Subimage reconstruction time for a single core CPU serial program, the OpenMP-based method, and the CUDA-based method for different data sets: set1=100×2562 pixels, set2=125×2562 pixels, set3=150×2562 pixels, set4=175×2562 pixels, set5=200 × 2562 pixels.

## 4. CONCLUSIONS

In this study, we design and implement a new parallel method for speckle masking reconstruction of solar subimages based on CUDA on GPGPU. Tests are performed to verify the correctness of our program in C language on NVIDIA Tesla C2050. The details of several parallel reconstruction steps are presented, and the parallel implementation among various modules shows a great speed increase as compared to the previous serial implementation. We also compare the runtime of these modules between the new parallel method and OpenMP-based method, and the speedup performance of the new parallel method exceeds that of OpenMP-based method overall. Furthermore, we present a comparison in the consumed time among serial program, OpenMP-based method, and the new parallel method. The new parallel method shows a clear advantage for large scale data processing, and a speedup of around 9-10 is achieved in reconstructing one solar subimage of 256×256 pixels.

The current parallel method is applied to accelerating the reconstruction of one subimage based on CUDA on one GPGPU. In our future work, we will port the code of the CUDA-based implementation to a high performance cluster with more GPUs, and accelerate the reconstruction of all subimages. In summary, our new parallel method would be valuable, and contribute to real-time reconstruction of a full high resolution solar image.

## Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 11703009), and the Natural Science Foundation of Jiangsu Province, China (No. BK20170566). The authors gratefully acknowledge the helpful comments and suggestions of the reviewers.

## References

• Beard, A., Cowan, B., & Ferayorni, A., (2014), DKIST Visible Broadband Imager Data Processing Pipeline, SPIE, 9152, p91521J.
• CUDA Toolkit, (2018), https://developer.nvidia.com/cuda-toolkit.
• cuFFT Library, (2018), http://docs.nvidia.com/cuda/cufft.
• Flores, L. A., Vidal, V., Mayo, P., Rodenas, F., & Verdu, G., (2014), Parallel CT Image Reconstruction Based on GPUs, Radiation Physics and Chemistry, 247, p250.
• Gonsalves, R. A., (1982), Phase Retrieval and Diversity in Adaptive Optics, Optical Engineering, 21, p829. [https://doi.org/10.1117/12.7972989]
• Labeyrie, A., (1970), Attainment of Diffraction Limited Resolution in Large Telescopes by Fourier Analysing Speckle Patterns in Star Images, A&A, 85, p87.
• Li, X. B., Wang, F., Xiang, Y. Y., et al , (2014), Parallel Image Reconstruction for New Vacuum Solar Telescope, JKAS, 47, p43. [https://doi.org/10.5303/jkas.2014.47.2.43]
• Li, X. B., Liu, Z., Wang, F., et al , (2015), High-Performance Parallel Image Reconstruction for the New Vacuum Solar Telescope, PASJ, 67, p47. [https://doi.org/10.1093/pasj/psv018]
• Li, X. B., & Zheng, Y. F., (2016), A Novel Parallel Method for Speckle Masking Reconstruction Using the OpenMP, JKAS, 157, p162.
• Lohmann, A.W., Weigelt, G., & Wirnitzer, B., (1983), Speckle Masking in Astronomy: Triple Correlation Theory and Applications, Applied Optics, 4028, p4037.
• Paxman, R. G., Schulz, T. J., & Fienup, J. R., (1992), Joint Estimation of Object and Aberrations by Using Phase Diversity, JOSAA, 9, p1072. [https://doi.org/10.1364/josaa.9.001072]
• Pehlemann, E., & von der Lühe, O., (1989), Technical Aspects of the Speckle Masking Phase Reconstruction Algorithm, A&A, 216, p337.
• Shi, Z., Xiang, Y. Y., Deng, H., Ji, K. F., & Wei, S. L., (2015), A Method of Level 1 Frames-Selection Based on GPU for New Vacuum Solar Telescope, Chinese Science Bulletin, 1408, p1413.
• von der Lühe, O., (1993), Speckle Imaging of Solar Small Scale Structure. I – Methods, A&A, 268, p374.
• Wöger, F., & von der Lühe, O., (2008), KISIP: A Software Package for Speckle Interferometry of Adaptive Optics Corrected Solar Data, SPIE, 7019, p70191E.
• Wöger, F., (2012), Accelerated Speckle Imaging with the ATST Visible Broadband Imager, SPIE, 8451, p84511C-1.

### Figure 1.

Flow chart of the new parallel speckle masking algorithm based on CUDA on GPGPU.

### Figure 2.

The significant part of the pseudo code for subimage reconstruction using speckle masking based on CUDA.

### Figure 3.

a) One subimage of one subimage burst of 100×2562 pixels before reconstruction. (b) The parallel reconstruction subimage using speckle masking based on CUDA on GPGPU. (c) and (d) are the fields from which (a) and (b) are taken.

### Figure 4.

Subimage reconstruction time for a single core CPU serial program, the OpenMP-based method, and the CUDA-based method for different data sets: set1=100×2562 pixels, set2=125×2562 pixels, set3=150×2562 pixels, set4=175×2562 pixels, set5=200 × 2562 pixels.

### Table 1

Specifications of the host PC

Host hardware specification
Processor Intel Xeon E5-2620
Clock speed 2.00 GHz
Memory 32 GB DDR3
Operating system Red Hat 6.2

### Table 2

Specifications of GPU device

GPU specification
Model NVIDIA Tesla C2050
Number of processor cores 448
Processor core clock 1.15 GHz
Device memory 3 GB
Memory clock 1.50 GHz
CUDA compute capability 7.0

### Table 3

Runtime and speedup performance measurements for one subimage burst of 100 × 2562 pixels between various modules.

Module Serial program (s) OpenMP (s) CUDA (s) Speedup
OpenMP CUDA
Batched 2D FFT 0.6 0.6 0.02 1 30
Batched 2D spectrum shift 0.35 0.04 0.1 8.75 3.5
Batched 2D spectrum conjugate 0.1 0.02 5
Average 4D bispectrum calculation 5.1 1.1 0.3 4.63 17
Subimage reconstruction 6.78 2.72 0.72 2.49 9.42