Tutorials in Protein Crystallography

The purpose of this module is to provide an introduction to the specific requirements of protein crystallography and to give the students their first exposure to crystallographic computing.

At the end of the module, you will be able to prepare crystals, process data and solve structures by molecular replacement.

Co-ordinator: Muhammed Sayed

Week 1:

Lecture: Protein purification for crystallization purposes:

  • Aims and requirements
  • Background to methods: Affinity, Ion-exchange, HIC and gel-filtration
  • Linking chromatography techniques
  • Towards the optimal purification protocol

Lecture: Protein crystallization

Basic theory, physical methods and strategy involved in the crystallisation of biological macromolecules including factors affecting crystallization

Lecture: X-ray diffraction, data collection and processing (2 Lectures)

Broad overview of topics covered by Susan and Trevor plus:

  • Crystal architecture, symmetry and space group
  • Basic theory covering diffraction of X-rays by protein crystals
  • X-rays and their sources, including synchrotron radiation
  • Crystal treatment and cryo-crystallography
  • X-ray data collection and detectors
  • X-ray data processing and analysis: Introduction to DENZO, analysing log files from DENZO and Scalepack
  • Assessing quality of data – what to look for

Week 2:

Tutorial/practicals: Process a single native data set and determine space group

  • Auto-indexing and Integration using DENZO – Mike’s native data set
  • Scaling and merging data with Scalepack
  • Analyse output as outlined in Mike’s Tutorial
  • Spacegroup determination

Lecture: Molecular Replacement

  • Basic concepts of MR method
  • Intro to Patterson functions
  • Definition of Rotation and Translation Functions
  • Factors affecting MR solution
  • Examples using the automated program AMoRe

Lecture: Protein Crystallographic Refinement

  • General aspects of Refinement
  • The crystallographic R-factor
  • Over-determination of the refinement problem
  • Constrained and Restrained Refinement
  • Over-fitting (R-free)
  • Very basic principles of least-squares refinement and Maximum Likelihood

Week 3:

Tutorials/practicals: Molecular Replacement using CK2 data

  • Solved structure with 4 dimers in the asymmetric unit using Automated Molecular Replacement program MOLREP in CCP4
  • Refinement of structure using Refmac5 in CCP4
  • Manual rebuilding/fitting in Xtalview


Assessment of this module will take place at the end of the Protein Crystallography module in October.

Tutorial: X-ray data processing

Mike Lawrence

PsaA (pneumococcal surface antigen A) is an extracellular ABC-type transporter protein involved in transporting zinc and/or manganese into the Gram-positive bacterium Streptococcus pneumoniae (Lawrence et al., 1998; Pilling et al., 1998). The molecular weight of the protein is about 35 kDa. Oscillation data have been collected from needle-like crystals of the protein grown via phosphate precipitation. The aim of this tutorial is to use this data set to illustrate the basics of oscillation data processing.

In order to do this tutorial you will need access to a computer containing the data and the HKL program suite (Otwinowski & Minor, 1997), as well as to a manual for this software.

Part 1. Inspection of the oscillation images


The data set consists of 98 one-degree oscillation images collected on a Rigaku R-axis IV image plate detector mounted on a laboratory rotating anode X-ray source. The data set is stored in the directory $PSAA_DATA/native and the respective file names are of the form psa13###.osc, where ### is a three-digit number ranging from 001 to 098.


Individual images can be displayed on your computer using the HKL program xdisp as follows:

xdisp raxis4 100 psa13###.osc

Try this with one of the images. Then acquaint yourself with the various modes of display, in particular colour / mono, altering the contrast (colour is most useful) and zooming in. Note that you can zoom right down to the pixel level and see the actual counts recorded for each pixel.

These images have rather high background and you may need to alter the contrast range to see the diffraction spots at different resolution ranges.


  1. Estimate the likely resolution of the data, given that the source is a copper anode, the crystal to image plate distance is 120 mm, and the pixel size of the R-Axis IV detector is 100 microns.
  2. What is the likely source of the three shadows that appear in the image?
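
As a rough guide for the first question, Bragg's law relates the distance of a spot from the beam centre to its resolution. The sketch below assumes a flat detector perpendicular to the beam and a mean Cu K-alpha wavelength of 1.5418 Å. The function name and the radius used in the example are illustrative placeholders, not values taken from the data set.

```python
import math

CU_K_ALPHA = 1.5418  # Å, mean Cu K-alpha wavelength (assumed for a copper anode)

def resolution_at_radius(radius_mm, distance_mm, wavelength=CU_K_ALPHA):
    """Return the d-spacing (Å) of a spot recorded radius_mm from the beam centre."""
    if radius_mm <= 0 or distance_mm <= 0:
        raise ValueError("radius and distance must be positive")
    # The scattering angle 2*theta follows from the detector geometry.
    two_theta = math.atan(radius_mm / distance_mm)
    # Bragg's law: wavelength = 2 * d * sin(theta)
    return wavelength / (2.0 * math.sin(two_theta / 2.0))

# Illustrative only: a spot 1000 pixels from the beam centre at 100 microns per pixel
radius = 1000 * 0.1  # mm
print(f"d = {resolution_at_radius(radius, 120.0):.2f} Å")
```

To answer the question, measure in xdisp how far from the beam centre the outermost visible spots lie and substitute that radius.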

Part 2. Determination of the unit cell (auto-indexing)


The aim of "indexing" is to assign a Miller index (hkl value) to every spot in the diffraction pattern. This can be done in a semi-automatic fashion (as opposed to earlier manual methods), hence the term "auto-indexing". HKL auto-indexing proceeds via a Fourier-space type analysis of the layout of the diffraction spots, effectively looking for the most common spacings between the spots and seeing if these can be assigned in a consistent way to the same Bravais lattice.

Auto-indexing is done by the HKL program denzo.

denzo needs some critical information about the diffractometer set-up used to collect the diffraction data. This information is provided in the control file auto.dat, which is located in the directory $PSAA_DATA/native. The file has the following components:

Parameters relating to the X-ray system
  1. the type of detector used to collect the image
  2. the direction of the camera oscillation axis with respect to the X-ray beam
  3. the wavelength of the X-rays used
  4. the distance between the crystal and the image plate detector
  5. any rotation of the image plate detector with respect to the direction of the X-ray beam
  6. the likely coordinates of the direct X-ray beam
Parameters relating to the data collection itself
  1. the number of degrees encompassed by the oscillation
  2. the file name of the image
  3. the likely mosaic spread of the crystal
  4. information about the approximate size of the spots in pixels
Parameters relating to the autoindexing procedure
  1. the maximum resolution to which auto-indexing should work (this should be the maximum data resolution unless the pattern is very complicated)
  2. a rough estimate of the maximum unit cell dimension
  3. the extent to which weak spots should be included in the initial cell refinement
  4. to start the auto-indexing and to put denzo into auto-indexing mode, denzo must also be supplied with the name of a file (typically peaks.file) containing the x-y coordinates of the strongest spots in the image.

Auto-indexing will produce the following information:

  1. the likely Bravais lattice of the crystal
  2. the dimensions of the crystal unit cell axes (a,b,c) and the size of the angles between these axes (alpha, beta, gamma)
  3. the orientation of the unit cell axes with respect to the camera axial system

This information can then be used to predict (via Bragg's law) which reflections will occur on the image and where they will be located on the image. The information can also be used to predict the partiality of each reflection, i.e. the extent to which the reflection is fully recorded on the image – effectively the extent to which it has totally crossed the Ewald sphere during the oscillation sweep. (Remember that the spots have a finite, non-zero angular width as a result of the mosaic spread of the crystal).
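
The position part of this prediction can be sketched as follows. The code computes the spacing d of a reflection from the cell and then uses Bragg's law to find how far from the beam centre it would be recorded on a flat detector. It ignores the crystal orientation, so it does not say on which image (or where around the ring) the reflection appears, and the cell in the example is a made-up one, not the PsaA cell.

```python
import math

def d_spacing(h, k, l, a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """d-spacing (Å) of reflection hkl for a general cell; angles in degrees."""
    if (h, k, l) == (0, 0, 0):
        raise ValueError("hkl = 000 has no d-spacing")
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    # Squared cell volume
    vol2 = (a * b * c) ** 2 * (1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)
    # Numerator of the general expression for 1/d^2
    s = (h * h * (b * c) ** 2 * (1 - ca * ca)
         + k * k * (a * c) ** 2 * (1 - cb * cb)
         + l * l * (a * b) ** 2 * (1 - cg * cg)
         + 2 * h * k * a * b * c * c * (ca * cb - cg)
         + 2 * k * l * a * a * b * c * (cb * cg - ca)
         + 2 * h * l * a * b * b * c * (ca * cg - cb))
    return math.sqrt(vol2 / s)

def spot_radius_mm(d, distance_mm, wavelength=1.5418):
    """Distance from the beam centre (mm) at which spacing d is recorded.

    Valid for scattering angles 2*theta below 90 degrees on a flat detector.
    """
    if d <= wavelength / 2.0:
        raise ValueError("d-spacing is beyond the diffraction limit")
    theta = math.asin(wavelength / (2.0 * d))  # Bragg's law
    return distance_mm * math.tan(2.0 * theta)

# Hypothetical cell, for illustration only
cell = (40.0, 50.0, 60.0, 85.0, 95.0, 100.0)
d = d_spacing(3, 4, 5, *cell)
print(f"d = {d:.2f} Å, radius = {spot_radius_mm(d, 120.0):.1f} mm")
```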

Once the lattice is assigned, the entire geometry of the system (cell dimensions, crystal orientation etc.) can be refined by a process of minimizing the overall discrepancy between the position of spots predicted by the Ewald sphere construction and their actual location on the image plate. In particular, this procedure (termed "refinement") leads to very accurate values for:

  1. the cell dimensions
  2. the crystal orientation with respect to the X-ray axial system
  3. the camera positioning and rotation with respect to the X-ray beam
  4. certain parameters relating to the beam divergence (crossfire)
  5. the crystal to detector distance (if the resolution is high)
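
The idea of refinement can be illustrated with a deliberately tiny example: a single parameter (a constant shift of the beam centre along x) is fitted by least squares, and a normalised chi-squared is compared before and after. This is only a sketch. denzo refines many parameters at once, its exact chi-squared normalisation is not reproduced here, and the spot coordinates and positional uncertainty below are invented.

```python
def chi2(observed, predicted, sigma):
    """Mean squared residual in units of sigma (a normalised chi-squared)."""
    total = sum(((o - p) / sigma) ** 2 for o, p in zip(observed, predicted))
    return total / len(observed)

def refine_offset(observed, predicted):
    """Least-squares estimate of a constant shift, which is the mean residual."""
    return sum(o - p for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical x coordinates (pixels) of four spots
obs_x = [101.2, 250.9, 399.8, 560.1]
pred_x = [100.0, 250.0, 399.0, 559.0]
sigma = 0.5  # assumed positional uncertainty in pixels

shift = refine_offset(obs_x, pred_x)
refined_x = [p + shift for p in pred_x]
print(f"shift = {shift:.2f} pixels")
print(f"chi2 before = {chi2(obs_x, pred_x, sigma):.2f}")
print(f"chi2 after  = {chi2(obs_x, refined_x, sigma):.2f}")
```

A chi-squared well above 1 means the residuals are larger than the assumed errors can explain; after a successful refinement it should fall towards 1 or below.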

Denzo does not refine the mosaic spread of the crystal or select the correct Bravais lattice to describe the data; both have to be done manually. The mosaic spread is correctly set when there is an optimal balance between the number of reflections predicted and the number of reflections observed.


  1. Display the first image in the data set and use xdisp to pick some of the stronger spots (aim to get at least 50).
  2. Check the auto.dat file - make certain that it reflects the data collection environment.
  3. Run denzo on the command line by typing denzo
  4. Type in @auto.dat

    This will execute the auto.dat script and should display a list of the fit between each Bravais lattice and the observed set of peaks. The correct lattice will usually be that of highest symmetry that still gives a reasonable fit (say less than 2-3 % error).

  5. At this stage kill the program (Ctrl-C) and enter any space group of the correct lattice into the auto.dat file by adding the following line before the "peak file" command:
    spacegroup P2
    (for example if the Bravais lattice is determined to be monoclinic). Re-start denzo and feed in the modified auto.dat file via the @ command as above. Note that at this stage auto-indexing is only providing the lattice. Determination of the correct space group will take place later. At this stage it is useful to enter a spacegroup that does not contain screw operators.
  6. The selected cell should now appear beneath the list of possible lattices, together with the cell dimensions, the coordinates of the beam centre and the crystal orientation angles.
  7. Then do some refinement of the geometry. Type in
    fit cell crystal rotx roty rotz x beam y beam cassette rotx roty
    write predictions

    This set of commands will then do six cycles of refinement of the cell dimensions, the crystal orientation settings, the beam coordinates and the camera rotation.
    Examine the chi2 values for the spot positions and for the partiality estimates (these should be less than about 2.0 if the auto-indexing is correct and the diffraction pattern is well-defined). These numbers are found in lines in the output of the form

    position 373 chi**2 x 1.90 y 1.70 pred. decrease 0.000 x 373 = 0.000
    partiality766 chi**2  1.02 pred.decrease 0.000 x 766 = 0.000 
  8. Go back to the display and do update predictions. You should see the predicted pattern displayed. If all is well it should be a very close match to the observed pattern. Check also carefully to see whether there are systematically too many or too few spots – this would also indicate that the selected lattice was incorrect.
  9. Zoom in and examine the spot size around a spot of medium intensity. See if this needs adjusting. If so, the spot dimensions and the background dimensions can be changed in the auto.dat file. Note that you can interrogate individual spots in the zoom window by using the mouse key to obtain their hkl value. Scan along the lattice lines interrogating spots and see how the indices vary.
  10. Check the mosaic spread. You may need to increase or decrease this parameter, depending on whether or not the width of the diffraction lunes matches that predicted. Simply enter the command mosaicity 0.5 (say) to change the mosaicity to 0.5 degrees, and re-do some fitting via the go command.


  1. What is the correct Bravais lattice to describe the PsaA crystal lattice?
  2. What are the refined cell dimensions of the crystal?
  3. How many space groups are compatible with this lattice? List them.
  4. What are the refined crystal orientation parameters?
  5. What are the refined coordinates of the beam centre?
  6. What is your estimate of the mosaic spread of the crystal?
  7. What is the effect on the fit of the predicted pattern to the observed pattern of incorrectly selecting a lattice of (a) lower symmetry (e.g. P1) and (b) higher symmetry (e.g. I432)?
  8. What are the optimal spot size parameters? Do these hold at all resolution ranges?
  9. Perform auto-indexing on the second image as well. For consistency it is best to describe this image as having started at an oscillation angle 1 degree beyond that of the first image. What cell dimensions and crystal rotation parameters are obtained for this image? Comment on any differences from those obtained for the first image.

Part 3. Integration of the diffraction intensities


Once auto-indexing is complete, the refined geometry values can then be used to predict the reflections that occur, image by image, through the entire data set. The intensity of each reflection is then measured using the so-called profile fitting method. Profile fitting involves determining the average shape of the stronger reflections and then fitting this shape to each reflection in turn; the observed intensity of each reflection will then be proportional to the scale factor that has to be applied to this shape in order to scale it to the particular reflection intensity. Profile fitting is a superior technique to simple integration (pixel-by-pixel summation) of the spot intensity, particularly in the case of weak reflections.
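
The difference between the two measurements can be sketched in a few lines. The code below fits a fixed spot shape to background-subtracted pixel counts by unweighted least squares and compares the result with plain summation. It is a one-dimensional toy under stated assumptions: real programs work on two-dimensional spots and weight each pixel by its expected variance, and the profile and counts used here are invented.

```python
def profile_fit_intensity(pixels, profile):
    """Intensity from the least-squares scale k minimising sum((pixels - k*profile)^2)."""
    k = sum(p * o for p, o in zip(profile, pixels)) / sum(p * p for p in profile)
    return k * sum(profile)

def summation_intensity(pixels):
    """Simple integration: add up the background-subtracted counts."""
    return sum(pixels)

# Hypothetical normalised spot shape learned from strong reflections
profile = [0.05, 0.20, 0.50, 0.20, 0.05]
# Hypothetical background-subtracted counts for a weak, noisy spot
weak_spot = [1.0, 0.0, 6.0, 3.0, -1.0]

print(f"profile fitted: {profile_fit_intensity(weak_spot, profile):.2f}")
print(f"summation:      {summation_intensity(weak_spot):.2f}")
```

Because the fit is dominated by the pixels where the profile is large, noise in the outer pixels of a weak spot contributes less than it does in a straight sum.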

The result of this procedure will be a set of files of the form psa13###.x. These contain the profile-fitted intensities of each reflection occurring on each particular image, as well as further information relating both to the position of the spot and its degree of partiality. At the end of each *.x file is a summary of the refined parameters for the detector and crystal as determined from that particular image.


  1. Edit the refine.dat file to include the results of the auto-indexing. It should now contain the best estimates of the cell parameters, the crystal orientation parameters, the beam centre, the mosaic spread and the camera orientation parameters. The refine.dat file is very similar to the auto.dat file and also contains instructions to perform refinement of parameters on every image. However, it does not have the peaks file command in it.
  2. Perform the progressive refinement and integration using the command
    denzo < refine.dat | tee refine.log

    The results are conveniently written to a log file as well as to the standard output via the tee pipe fitting. If xdisp is left running at the same time as denzo, then the image display will be updated each time denzo progresses to the next refinement cycle or to the next image.

    If nothing unforeseen has happened to the crystal during data collection and the pattern remains strong from image to image, then the refinement should proceed without problem. A quick scan of the log file should confirm that there is no major increase in the difference between the predicted and actual spot positions as the refinement progresses from image to image. A scan of the "*.x" files will reveal whether or not there is any drift in cell dimensions, crystal orientation or beam position, any of which would be indicative of a problem in the refinement. Such checks can readily be constructed with a grep command.
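
As an alternative to grep, the same kind of check can be sketched in Python. The code assumes that the parameter summary in each *.x file contains a line beginning with the words unit cell followed by six numbers. That layout is an assumption made for illustration and should be checked against your own files; the function names are likewise hypothetical.

```python
import glob
import re

# Assumed layout: "unit cell a b c alpha beta gamma" on one line
CELL_LINE = re.compile(r"^\s*unit cell((?:\s+[-+]?\d+(?:\.\d*)?){6})", re.IGNORECASE)

def read_cell(path):
    """Return the last 'unit cell' line in a file as six floats, or None if absent."""
    cell = None
    with open(path, errors="replace") as handle:
        for line in handle:
            match = CELL_LINE.match(line)
            if match:
                cell = [float(value) for value in match.group(1).split()]
    return cell

def cell_ranges(pattern):
    """Minimum and maximum of each cell parameter over all files matching pattern."""
    cells = [read_cell(path) for path in sorted(glob.glob(pattern))]
    cells = [cell for cell in cells if cell is not None]
    if not cells:
        return []
    return [(min(column), max(column)) for column in zip(*cells)]

names = ["a", "b", "c", "alpha", "beta", "gamma"]
for name, (low, high) in zip(names, cell_ranges("psa13*.x")):
    print(f"{name:6s} {low:9.3f} {high:9.3f}")
```

Run from the directory holding the *.x files, this prints the spread of each cell parameter across the images; a steady drift rather than small random scatter would suggest a problem.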


  1. What is the range of cell dimensions obtained across all images? Comment on the variation.
  2. What is the maximum deviation between the fitted and predicted spot positions across the range of images? Is this acceptable?

Part 4. Scaling and merging all the data into a single data set


This is the most critical part of the process and involves combining all the data to give a set of average intensities for each unique reflection in the asymmetric unit of the point group. Within HKL, the process also provides information about the space group itself, as well as highly accurate cell dimensions and valuable statistics relating to the quality of the entire data set. The key steps undertaken during the scale and merge process are as follows:

  1. For each reflection add the partially-recorded intensity values on sequential images to obtain the complete intensity (fully-recorded value) for that reflection in that location.
  2. Cross-compare symmetry-related reflections on different plates in order to establish the relative scale of each image with respect to a reference image. This scale has two components – a multiplicative constant scale factor S and an exponential factor dependent on resolution (a B-value).
  3. Obtain a mean value for each unique reflection based on all observed intensities for that reflection and its symmetry mates across all images.
  4. Reject all measurements that are deemed to be outliers compared to other measurements.
  5. Re-refine the cell dimensions and the crystal setting parameters in order to give the best possible prediction of the observed partiality of each reflection on each image (this is termed post-refinement). If the data set is a high-quality one, it is also possible to refine the mosaic spread.
  6. Obtain estimates of the error (σ(I)) of the mean intensity (I) of each reflection, as well as statistics to describe the overall quality of the data set (R-factor, <I/σ>) as a function of resolution.
  7. Assess the completeness of the data set as a function of resolution.
  8. Examine the intensity of axial reflections in order to discern the presence of screw axes, and hence potentially define the space group. This can be done by selecting a space group with screw axes: scalepack will then list the intensities of all forbidden axial reflections (provided that these reflections were included in the integration process and hence in the *.x files). Note that the hand of the space group cannot be discerned by this process.
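
Steps 3 and 6 can be made concrete with a small sketch that averages repeated measurements and computes a merging R-factor and the mean multiplicity. It assumes the intensities have already been scaled and their indices reduced to the asymmetric unit, and it uses a plain unweighted mean, which is a simplification of what a scaling program does. The data are invented.

```python
from collections import defaultdict

def merge(observations):
    """Merge (hkl, intensity) pairs.

    Returns ({hkl: mean intensity}, merging R-factor, mean multiplicity).
    R = sum |I_i - <I>| / sum I_i over reflections measured more than once.
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)
    if not groups:
        raise ValueError("no observations supplied")
    means = {hkl: sum(vals) / len(vals) for hkl, vals in groups.items()}
    repeated = {hkl: vals for hkl, vals in groups.items() if len(vals) > 1}
    numerator = sum(abs(i - means[hkl]) for hkl, vals in repeated.items() for i in vals)
    denominator = sum(i for vals in repeated.values() for i in vals)
    r_merge = numerator / denominator if denominator else float("nan")
    multiplicity = sum(len(vals) for vals in groups.values()) / len(groups)
    return means, r_merge, multiplicity

# Hypothetical scaled measurements
observations = [
    ((1, 0, 0), 100.0), ((1, 0, 0), 110.0), ((1, 0, 0), 90.0),
    ((0, 2, 0), 50.0), ((0, 2, 0), 54.0),
    ((1, 1, 1), 20.0),
]
means, r_merge, multiplicity = merge(observations)
print(f"R-merge = {r_merge:.3f}, multiplicity = {multiplicity:.1f}")
```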

The scaling, merging and post-refinement are carried out by the HKL program scalepack. scalepack is supplied with the set of *.x files and processes these files via the above procedures to produce a single output file with individual intensities for each unique reflection in the Laue group.


Examine the file scale1.in and see how it reflects the scaling process and the data.

Edit it to include the best estimate of the available resolution as well as the selected space group.

(In this case it is best to provide scalepack with a space group that contains as many screw axes as possible as it will then provide information about the intensities of forbidden reflections. Once the correct space group is determined from this information, select it and go back and run scalepack with the correct space group)
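
The test for a screw axis can be sketched as a comparison of I/σ for axial reflections that the screw axis allows and those it forbids. The example assumes a 2(1) axis along b, for which 0k0 reflections are expected only when k is even; other screw axes have other periods, which can be passed as the period argument. The function name and the data are hypothetical.

```python
def axial_summary(reflections, axis, period=2):
    """Mean I/sigma of axial reflections allowed and forbidden by a screw axis.

    reflections: iterable of (hkl, intensity, sigma).
    axis: 0, 1 or 2 for reflections along h, k or l.
    A reflection is treated as forbidden when its axial index is not a multiple of period.
    """
    allowed, forbidden = [], []
    for hkl, intensity, sigma in reflections:
        others = [index for i, index in enumerate(hkl) if i != axis]
        if any(others) or hkl[axis] == 0:
            continue  # not an axial reflection
        ratio = intensity / sigma
        if hkl[axis] % period == 0:
            allowed.append(ratio)
        else:
            forbidden.append(ratio)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(allowed), mean(forbidden)

# Hypothetical merged data: (hkl, I, sigma)
data = [
    ((0, 1, 0), 2.0, 4.0),
    ((0, 2, 0), 400.0, 10.0),
    ((0, 3, 0), -1.0, 4.0),
    ((0, 4, 0), 900.0, 15.0),
    ((1, 2, 0), 50.0, 5.0),  # not axial, ignored
]
mean_allowed, mean_forbidden = axial_summary(data, axis=1)
print(f"allowed <I/sigma> = {mean_allowed:.2f}, forbidden <I/sigma> = {mean_forbidden:.2f}")
```

If the forbidden reflections are consistently at noise level while the allowed ones are strong, the screw axis is probably present; a handful of axial reflections is weak evidence, so treat the result with care.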

To run scalepack: scalepack < scale1.in > scale1.log

Examine the log file carefully and note where each of the above processes is undertaken and the statistics it produces. Note the rejection of outlier measurements: these are written to a file called reject. You need to completely understand the output of this program.

Then re-run scalepack as follows: scalepack < scale2.in > scale2.log

scale2.in should be identical to scale1.in except for the first line, which instructs scalepack to read in the rejection file produced by the first run. The idea here is that more accurate scale factors can be obtained once the outliers are rejected before scaling takes place. This process can then be repeated as desired until the scale factors become stable and no further outliers are rejected. If large variation is seen in the cell or crystal setting parameters or mosaicity, then it may be necessary to re-run the integration step, supplying denzo with the post-refined values for these parameters.
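
Why repeating the rejection step helps can be seen with repeated measurements of a single reflection. In the sketch below the worst measurement is dropped only if it lies more than a cutoff number of σ from the current mean, and the mean is then recomputed before testing again, so that one bad value does not drag good ones over the limit. The cutoff and the rule itself are assumptions for illustration and are not scalepack's actual rejection criteria.

```python
def reject_outliers(values, sigmas, cutoff=3.0):
    """Iteratively reject the single worst measurement until none exceeds the cutoff.

    Returns (indices kept, indices rejected).
    """
    kept = list(range(len(values)))
    rejected = []
    while len(kept) > 2:
        mean = sum(values[i] for i in kept) / len(kept)
        worst = max(kept, key=lambda i: abs(values[i] - mean) / sigmas[i])
        if abs(values[worst] - mean) / sigmas[worst] <= cutoff:
            break  # everything remaining is consistent
        kept.remove(worst)
        rejected.append(worst)
    return kept, rejected

# Hypothetical measurements of one reflection; the last is affected by, say, a zinger
intensities = [100.0, 102.0, 98.0, 101.0, 200.0]
errors = [5.0] * 5
kept, rejected = reject_outliers(intensities, errors)
print("kept:", kept, "rejected:", rejected)
```

On the first pass every measurement in this example lies more than 3σ from the contaminated mean; once the worst one is removed, the remaining four agree well and no further rejections occur.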


  1. What is the likely cause of the variation in scale factor from image to image?
  2. Would one expect the same set of cell dimensions and crystal rotation parameters from image to image?
  3. What is the overall redundancy (multiplicity) of this data set, i.e. on average how many measurements are averaged per reflection?
  4. Explain what is meant by chi-squared in this context.
  5. Explain what is meant by the merging R-factor for a particular frame and the merging R-factor for the entire data set.
  6. What is the mean merging R-factor for this data set?
  7. How many measurements are rejected after three runs of scalepack?
  8. What is the post-refined mosaic spread of the data?
  9. Why does the merging R-factor increase as the frame number increases?
  10. What is the correct space group for this data set?