
SPIDER: Random Info

Occasional Thoughts about SPIDER, etc.



15 April 2014     ArDean Leith

The Future of EM Software.

Both Science and C&E News have acknowledged the current 'revolutionary advance' in cryo-electron microscopy single-particle reconstruction.

These advances in reconstruction resolution rely on the new direct electron capture cameras, and the publications I have seen use Relion software for the reconstruction.

I am still uncertain how much of the improved resolution arises from the improved software. At issue is not only the reconstruction methodology but also the resolution metric.

If Relion is a significant source of the improvement, then a question arises about the future role of other software packages in reconstruction. Currently Relion is able to handle most of the reconstruction pathway except for particle selection (windowing) and initial reference model construction.

These other packages include SPIDER, EMAN2, SPARX, Xmipp, IMAGIC, Bsoft, SIMPLE, and several others. They still contain capabilities not found in Relion, such as the particle windowing and initial model construction mentioned above.

Apart from those capabilities, what is the future function of these packages? Will they survive Relion's ascent? How much future development should be done on them? What will be the impact on funding for software other than Relion?

EM software development funding by NIH in the US is currently in a rather bad state. Both SPIDER and IMOD (and its associated software) have lost major portions, or all, of their funding. At NIH almost all software development grants, for widely different purposes, compete directly with one another and also with funding for various biological databases. This lack of targeting leads to poor quality reviewing.

For example, in the case of SPIDER, one of the three reviewers of our most recent grant application stated:

"the number of investigators employing SPR is limited and not expected to grow substantially".
It is difficult for me to see how a knowledgeable reviewer could come to such a conclusion in the midst of a 'revolutionary advance'.

There does not appear to be any viable non-grant mechanism for the continued maintenance of scientific "Free Open Source Software". Is it reasonable to hope that researchers will direct voluntary monetary donations to software developers, as some have suggested? Can researchers even get such a contribution approved by their local grant administrators? Would their auditors OK such an unobligated contribution? There are additional problems with currency conversion. Certainly the red tape involved in both donating and accepting a donation conspires against this idea. Up until now most software development has existed as a sort of side-operation of previously fairly well funded EM labs, in our case an NIH research resource. Such funding is increasingly at risk, and long-term development and maintenance of software is disappearing.

This uncertainty in funding confounds any discussion of the future of EM software. Where do we go from here? Do you see continued use of SPIDER and the other packages?

 

29 Nov. 2012     ArDean Leith

Why no one should use MRC image stacks (IMO).

A single particle reconstruction from cryo-EM images of non-symmetrical objects often requires 100,000 to 1,000,000 images. If such a large number of images is stored as individual files in most common Linux filesystems, accessing or adding images causes thrashing of the filesystem and extremely slow access. This affects not just the processes accessing the images but all access to that file system.

To overcome this thrashing one can purchase an expensive parallel file storage system (e.g. from Panasas) or, more commonly, aggregate the images into 'stacks', or, less commonly, into a database. Most EM packages support some sort of file-based stack, and several single particle reconstruction packages support both MRC and SPIDER format files to various extents.

The MRC stack file format is an especially poor choice for your stacks: there is a single 1024-byte header for the whole stack, and the individual images are concatenated into the stack without any image-specific header.
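To make the limitation concrete, here is a minimal Python/numpy sketch (the file name and helper are hypothetical, not SPIDER code) of locating one image inside an MRC stack: every per-image property must be computed from the one global header, because there is nowhere in the file to store per-image metadata.

import numpy as np

MODE_DTYPES = {0: np.int8, 1: np.int16, 2: np.float32, 6: np.uint16}   # common MRC data modes

def read_stack_image(path, i):
    # The single 1024-byte header at the start of the file is the only
    # metadata available; nothing at all is stored per image
    # (little-endian byte order assumed).
    with open(path, 'rb') as f:
        header = np.frombuffer(f.read(1024), dtype='<i4')
        nx, ny, mode = int(header[0]), int(header[1]), int(header[3])
        nsymbt = int(header[23])                  # size of optional extended header, in bytes
        image_bytes = nx * ny * np.dtype(MODE_DTYPES[mode]).itemsize
        f.seek(1024 + nsymbt + i * image_bytes)   # offset is computed, never stored
        data = np.frombuffer(f.read(image_bytes), dtype=MODE_DTYPES[mode])
        return data.reshape(ny, nx)

# img = read_stack_image('stack.mrc', 42)   # any per-image statistics or flags must live elsewhere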

Problems


4 Sept. 2012     ArDean Leith

Interpolation and Improved Reconstruction Resolution

We recently introduced improved interpolation using FBS (Fourier-based spline) interpolation inside several SPIDER operations. We have shown that FBS gives significant improvements over the linear and quadratic interpolation used previously in SPIDER and is as good as the much slower gridded interpolation available in SPARX.

During refinement of a reference-based reconstruction, interpolation is used at four steps: creation of reference images from an existing reference volume, application of existing alignment parameters to the experimental images, conversion of image rings to polar coordinates, and alignment of images prior to back projection into a volume.

When we modified our recommended refinement procedure, grploop.pam, to use the FBS interpolation alternatives in SPIDER and tested the refinement step on actual cryo-EM data, we were perplexed to find a small but repeatable decline in reconstruction resolution over an overall refinement step.

We investigated this decline using a ribosome data set of over 6000 noisy experimental images divided into four groups taken at different defocus levels. The decrease in resolution is caused by the application of the existing alignment rotations and translations to the experimental images before these images are compared to the reference projections to determine the best matching pairs. The 'RT SQ' operation uses quadratic interpolation, which adds an asymmetric filter effect to the results. This filtration ended up cutting noise in the aligned experimental images, so that they gave a better choice of matching reference images. Poorer interpolation gave a better outcome! But this observation pointed to a method of improving the refinement step. We have added an option to denoise the experimental images prior to the reference comparison in the 'AP SHC' operation. We evaluated Fourier lowpass filtration, averaged box convolution, median box convolution, mean shift denoising, and anisotropic diffusion denoising before settling on the Fourier lowpass filter as giving the best resolution results.
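To illustrate the idea (this is a sketch of the concept, not the actual code inside 'AP SHC', and the cutoff value is an arbitrary placeholder), a Fourier low-pass denoising of an experimental image before the reference comparison amounts to something like:

import numpy as np

def fourier_lowpass(img, cutoff=0.12):
    # Zero all spatial frequencies above 'cutoff' (in units of the sampling
    # frequency) before the image is matched against the reference projections.
    # A hard cutoff is used here for simplicity.
    F = np.fft.fftshift(np.fft.fft2(img))
    fy = np.fft.fftshift(np.fft.fftfreq(img.shape[0]))
    fx = np.fft.fftshift(np.fft.fftfreq(img.shape[1]))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    F[radius > cutoff] = 0.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))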

We have modified our recommended refinement procedure to use FBS interpolation in 'PJ 3F' for the creation of the reference projections, in 'AP SHC' during application of existing alignment parameters to the experimental images, and in 'RT SF' for creating the view used for backprojection. We also used FBS interpolation during conversion of image rings to polar coordinates. These improvements, which are present in grploop.pam, gave a significant improvement in resolution over the course of a complete refinement series compared to our previous procedure.


29 Aug. 2012     ArDean Leith

Fourier-based Spline Interpolation

We have developed a 2D and 3D Fourier-based Spline interpolation algorithm (FBS) in order to improve the performance of rescaling, rotation, and conversion from Cartesian to polar coordinates. To interpolate on a two- or three-dimensional grid we use a particular sequential combination of, correspondingly, two or three 1D cubic interpolations with Fourier-derived coefficients. A 1D cubic interpolation is a third degree polynomial:

Y(X) = A0 + A1*X + A2*X^2 + A3*X^3

where polynomial coefficients A0, A1, A2, and A3 are calculated from the Fourier transform of the image:

A0 = Y(0)
A1 = Y'(0)
A2 = 3(Y(1) - Y(0)) - 2Y'(0) - Y'(1)
A3 = 2(Y(0) - Y(1)) + Y'(0) + Y'(1)

The derivatives at the grid nodes are obtained using the well-known relation between the Fourier transform of a derivative and the Fourier transform of the function itself:

F( ∂f(x,y)/∂x ) = i * 2*pi * k * F(k,l)

where F(k,l) is a coefficient of the discrete Fourier transform F(f(x,y)).

This allows us to calculate the derivative at any point without a finite difference approximation involving data from neighboring points.
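A minimal 1D numpy sketch of the scheme follows (the production code in SPIDER is Fortran; the 2D and 3D versions apply this sequentially along each axis, and periodic boundaries are assumed here):

import numpy as np

def fourier_derivative_1d(y):
    # F(dy/dx) = i*2*pi*k * F(y), with unit sample spacing and periodic data.
    n = len(y)
    k = np.fft.fftfreq(n)                       # frequencies in cycles per sample
    return np.real(np.fft.ifft(1j * 2.0 * np.pi * k * np.fft.fft(y)))

def fbs_interp_1d(y, x):
    # Cubic interpolation at fractional positions x, with the node derivatives
    # taken from the Fourier transform instead of finite differences.
    dy = fourier_derivative_1d(y)
    i0 = np.floor(x).astype(int) % len(y)
    i1 = (i0 + 1) % len(y)                      # wrap-around neighbor
    t = x - np.floor(x)                         # fractional offset in [0,1)
    a0 = y[i0]
    a1 = dy[i0]
    a2 = 3.0 * (y[i1] - y[i0]) - 2.0 * dy[i0] - dy[i1]
    a3 = 2.0 * (y[i0] - y[i1]) + dy[i0] + dy[i1]
    return a0 + a1 * t + a2 * t**2 + a3 * t**3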

We compared FBS to other commonly used interpolation techniques: quadratic interpolation and convolution-based reverse gridding (RG). Rotation of images by FBS interpolation takes roughly 1.1-1.5 times as long as quadratic interpolation, but achieves dramatically better accuracy. The accuracy of FBS interpolation is similar to that of RG interpolation; however, FBS rotation is approximately 1.4-1.8 times faster than RG. The FBS algorithm combines the simplicity of polynomial interpolation with the ability to preserve high spatial frequencies. It has currently been incorporated into several operations in the open source single-particle reconstruction package SPIDER.


9 Mar. 2011     ArDean Leith

Optimization

Since hardware clock speeds are stagnant or decreasing, there is increased interest in optimizing SPIDER's processing speed. Since SPIDER is a general purpose EM imaging package this means different things to different users. Locally, the biggest time demand for our single particle reconstructions is alignment of images with reference projections (SPIDER operations 'AP SH' and 'AP REF'). In order to assess the effect of changes in compiler options I used the operation 'AP SHC', which is the latest highly 'tweaked' version of 'AP SH'. The usual test data was a set of 375x375 pixel images, comparing 50 experimental images against 550 references.

Compiler choice
We have access to both PGI and Intel Fortran compilers. I chose to use the PGI compilers because the Intel compiler produces poorly optimized executables for AMD Opteron hardware, while the PGI-compiled executables work well on both Intel and AMD hardware. The results reported here used the current PGI compiler release, 11.1.

Optimization Level
Aggressive optimization with PGI -O3 gives a 3-4% speedup on the benchmark code. However this optimization level can only be used with great care: some SPIDER operations give erroneous results with this compilation. This is probably due to differences in the execution order of statements and is a problem with floating point data whose absolute values can vary widely; changing the order of arithmetic operations such as subtraction and division can affect the accuracy of the output. Thus use of -O3 can only be justified after careful testing. Code for the operation 'AP SH' is now mostly compiled at level -O3 following such extensive testing. Most non-alignment operations are compiled with level -O2.
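A trivial illustration, not taken from SPIDER itself, of why reordering floating point arithmetic can change results:

import numpy as np

# With 32-bit floats, addition is not associative: an aggressive optimizer that
# regroups these operations can legitimately produce a different answer.
a, b, c = np.float32(1.0e8), np.float32(-1.0e8), np.float32(1.0)

print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- the small term is lost when added to the large one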

-Kieee Flag
Since SPIDER was ported to Linux from SGI I have always used the PGI flag -Kieee, which says to strictly follow IEEE conventions in mathematical operations. Originally I used this in order to get the same results from code compiled with PGI as from the SGI code. PGI says this flag may slow execution, but I was surprised to find that it increases the speed of my benchmark by as much as 8%. Since it is also presumed to be more accurate, including this flag is a no-brainer.

Inlining Subroutines
Inlining subroutines/functions is expected to increase speed, since there is less overhead stacking the current subroutine data when invoking a called function. However in my benchmark it has a negative effect on speed, slowing operation by as much as 10%. Since inlining also makes it tricky to ensure that the inlined code is kept in sync with the latest source, inlining is not helpful.

Compiling for Large data
PGI compilers have the flags -mcmodel=medium and -Mlarge_arrays, which affect the ability of the executable to handle large static data and large dynamically allocated data (typical of some operations that import large files of data). Depending on how SPIDER is used (particularly if inline/incore files are defined) some sites require the ability to handle these large files. The executables distributed with SPIDER have usually been compiled with -mcmodel=medium for handling large static arrays. Benchmarking shows that this has an insignificant impact on executable speed.

Compiling Static vs Dynamic Executables
Statically compiled executables do not require the presence of certain PGI or system libraries at execution time; in return the executable is larger than a dynamically linked one. SPIDER has usually been distributed with static executables. My benchmark shows no difference in speed between these two types of executables. Since static executables have far fewer installation problems across varied Linux distributions and ages, I have always preferred this option.

Compiling for use with OpenMP
PGI compilers have the flag -mp for creation of code that utilizes OpenMP parallelization on suitable hardware. The executables distributed with SPIDER have been compiled with this flag for 20 years. Using all 12 cores of a dual-hexcore AMD Opteron gives a 905% speedup over a single process on my benchmark.

Compiling For NUMA
AMD Opterons should support NUMA (non-uniform memory architecture) execution when used on multi-processor hardware. PGI compilers have the flag -mp=numa that would utilize this capability within OpenMP. My benchmark shows no difference in speed for executables compiled with or without this flag on a dual-hexcore AMD Opteron compute node. Since use of this flag also requires dynamic executables it is not used in our distributed executables.

Compiling for use with SSE SIMD Vectorization
PGI and Intel compilers have flags (e.g. -fastsse) that enable optimization for SSE SIMD vectorization. This vectorization increases speed on suitable hardware. The executables distributed with SPIDER have been compiled with this flag for several years.

Compiling with Interprocedural analysis
PGI compilers have the flag -ipa, which allows optimization across procedural boundaries and may increase speed. My benchmark shows no difference in speed for executables compiled with or without this flag. However I am not certain that the compiler applies this analysis when source code is in different files, so this may not have been a complete test of the option.

Compiling with Older Compiler Releases
The executables distributed with SPIDER have been created with PGI release 8.6 for several years. This was done because that release had good support for creation of static executables. Release 11.1 now supports quality static executable creation and will be used in future distributions. I see no significant speed increase in executables built with the newer release, but it allows use of newer Fortran 2003 conventions which are valuable in coding.


30 Sep. 2010     ArDean Leith

CUDA SnakeOil

Question:
Alignment is the major time-consuming step in creating an EM single particle reconstruction and is easily parallelized with many different schemes. Why aren't GPUs more useful in alignment of images during EM single particle reconstruction? What is the hold-up? These techniques have been available for five years now and are common in other fields.
Answer:
News reports and anecdotes about the tremendous speed increases coming from applying graphics processing units (GPUs), usually involving Nvidia and CUDA, to computing tasks have created unrealistic expectations. For some problems GPUs offer great improvement. However, for some easily parallelizable problems such as alignment they lack utility. Some of the claims about the use of GPUs can even be characterized as 'snake oil'.

Nvidia GPUs vary in their compute capability and in the amounts of three different types of memory, both of which have a critical influence on how a problem can be approached. In addition, alignment tasks usually take more than 5 minutes of GPU time, which means that the GPU cannot currently be shared with graphics. Thus there must be a dedicated GPU (often a Tesla/Fermi board).

Computer science publications and anecdotes commonly report speed-ups as the increase in speed of the parallelized portion of the application over its speed on a single processor. In usual reconstructions (e.g. realistic ribosome reconstructions) significant time is required to read images from disk; such input typically occupies 3-10% of the time during an alignment. If only 4% of the time is spent loading images, the largest possible overall speed-up is 25X; 100X is impossible overall. Another trick is to report speed-ups from a cluster of GPU-enabled compute nodes, sometimes with multiple GPUs per processor.
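That ceiling is just Amdahl's law; a quick sketch of the arithmetic, using the 4% disk-read fraction from above:

def overall_speedup(serial_fraction, parallel_speedup=float('inf')):
    # Amdahl's law: overall speedup when only the parallelizable part is accelerated.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_speedup)

print(overall_speedup(0.04))        # 25.0  -- ceiling even with an infinitely fast GPU kernel
print(overall_speedup(0.04, 20.0))  # ~11.4 -- with a realistic 20X kernel speed-up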

SPIDER and other single particle reconstruction packages usually have highly optimized alignment operations, commonly using OpenMP or MPI. Alignment speed as tested on our dual-hexcore computer scales very well with an increased number of cores (11X). Few computers today have a single core, and a useful speed-up should be defined in comparison to a reasonable computer setup, not versus speed on a single core.

In EM single particle reconstruction from reference projections using programs such as SPIDER, there is a vast range of practical applications. The number of experimental images (x), the number of reference images (y), and the size of the images (z) can vary over orders of magnitude, e.g. x = 200-10,000 experimental images; y = 80-5000 reference images; z = 50x50 to 480x480 pixels.

The gold standard for alignment is still an exhaustive search within the translation/rotation space, and the alignment is usually implemented with Fourier space cross-correlation of polar images. The common algorithm offers many ways in which the processing can be parallelized. A naive implementation on a GPU seldom results in more than a 2X speed-up. Only by tediously tuning the transfer of data within the GPU among the different memories can a speed-up of 12-20X be achieved. However a small change in the x, y, z variables mentioned above, or a change of compute capability in the GPU, can completely negate the speed-up, resulting in even poorer performance than without a GPU. Such a change requires a new implementation.
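For reference, here is a stripped-down numpy/scipy sketch of the rotational part of that algorithm (exhaustive search over in-plane rotation via Fourier cross-correlation of polar rings); it omits the translational search, ring weighting, and all of the tuning discussed above, and the sampling parameters are arbitrary.

import numpy as np
from scipy.ndimage import map_coordinates

def to_polar(img, n_rad=64, n_ang=256):
    # Resample a square image onto (radius, angle) rings about its center.
    cy, cx = (np.array(img.shape) - 1) / 2.0
    radii = np.linspace(1.0, min(img.shape) / 2.0 - 1.0, n_rad)
    angles = np.linspace(0.0, 2.0 * np.pi, n_ang, endpoint=False)
    r, a = np.meshgrid(radii, angles, indexing='ij')
    coords = np.array([cy + r * np.sin(a), cx + r * np.cos(a)])
    return map_coordinates(img, coords, order=1)

def rotational_align(img_a, img_b):
    # Cross-correlate the polar representations along the angular axis with 1D
    # FFTs; the peak of the summed correlation gives the relative in-plane rotation.
    pa, pb = to_polar(img_a), to_polar(img_b)
    Fa = np.fft.fft(pa, axis=1)
    Fb = np.fft.fft(pb, axis=1)
    cc = np.fft.ifft(Fa * np.conj(Fb), axis=1).real.sum(axis=0)
    shift = int(np.argmax(cc))
    return 360.0 * shift / pa.shape[1]          # rotation angle in degrees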

It is probably possible to create implementations that will give 12-20X speed-ups for any specific set of x, y, z and hardware. However a general implementation giving such speed-ups is currently impossible. Multiple (10-20?) implementations would be needed for each hardware configuration, and the logic to select among the implementations is complex. Each implementation requires substantial programming effort.

Currently reported alignment implementations admit that there have been unreported changes (degradations) in search algorithms or severe restrictions on various parameters. One report gives a rotational alignment resolution of only 6 degrees. Such a restriction makes the implementation useless on images larger than 100 pixels.

We can provide a single implementation in SPIDER that gives a 16X speed-up for a specific small range of parameters. However the overhead required to do so, including instructions on how to interact with 9 different run-time libraries for FFT, BLAS, and NVIDIA, makes even this minimally useful implementation painful. When compared to a run on a dual-hexcore computer this is really only an effective speed-up of about 1.5X!

Currently my advice is to carefully evaluate multi-core computers versus GPU-enabled computers. Only if you have an extremely heavy compute load involving a single set of x, y, z parameters would it be worthwhile to go to a GPU solution; then you will need software that is capable of handling your specific problem parameters. Otherwise, split the problem among standard multi-core compute nodes; it probably will not be much more expensive to do so. If you still need increased speed, invest in a parallel filesystem for enhanced disk access (e.g. a Panasas disk array).

This recommendation may change in the future and I will revisit this subject when I get access to the new Tesla GPU and the newly announced CUDA 4.0.


6 Mar. 2009     ArDean Leith

While getting ready to retire a bunch of old SGI MIPS based servers and workstations, I wondered how much faster our current AMD Opteron 64-bit Linux boxes are than our trusted old machines of 5-10 years ago. See the benchmark table.


11 Feb. 2009     ArDean Leith

If you are using a Beowulf-type cluster for parallel execution of time consuming operations during single particle reconstruction, there are three common methods of parallelizing discussed on our website. Since the iterative alignment and defocus-group backprojection steps typically consume more than 98% of the compute time and are trivially parallelizable by defocus group, we commonly use a simple PubSub script for distributing jobs to different compute nodes. Other sites have their own scripts to handle the distribution. However, if you have an inexpensive cluster with simple Ethernet networking, this method has a large inefficiency when many nodes access a single storage disk or simple RAID array on a file server using NFS mounts from the compute nodes.

When many compute nodes attempt to access a single disk (or RAID array) using NFS there is a significant slowdown in overall throughput. There is currently a lot of effort to overcome this problem with various methods, e.g. Parallel NFS. However, if your compute nodes all include adequate local storage, there is a simple solution that may improve throughput: at the beginning of a compute node's computation, copy all the files that will be accessed to the local disk with a system call, then carry out the computations; at the end of the compute node's processing, copy any altered files back to the file server (see the sketch below).
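The staging pattern itself is simple; here is a hedged Python sketch of what it amounts to (all paths and the compute command are placeholders, and the real implementation lives in the SPIDER procedures mentioned below):

import shutil
import subprocess
from pathlib import Path

SERVER_DIR = Path('/nfs/project/refine')   # NFS-mounted storage on the file server (placeholder)
LOCAL_DIR = Path('/scratch/refine')        # local disk on the compute node (placeholder)

def run_on_node(input_files, output_files, command):
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    for name in input_files:                       # stage in: copy inputs once, sequentially
        shutil.copy2(SERVER_DIR / name, LOCAL_DIR / name)
    subprocess.run(command, cwd=LOCAL_DIR, check=True)   # all I/O now hits the local disk
    for name in output_files:                      # stage out: return only the altered files
        shutil.copy2(LOCAL_DIR / name, SERVER_DIR / name)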

We have recently altered the scripts that we use during the projection matching step of 3D reconstruction so that pub_refine.pam and its associated procedures (especially pub_refine_start.pam) handle the cloning of the necessary files onto the local compute nodes and the transmission back to the server at the end of the processing on the compute nodes.

On our compute cluster this modification is very productive. The speed increase will of course depend on the number of simultaneous processes, and the pattern of disk access.


Source: random.html     Page updated: 1 Aug. 2014     ArDean Leith


© Copyright Notice     Enquiries: spider@wadsworth.org