Cart  |  Log In

Focus: Sequence Analysis

Using Simple Macros to Perform Routine Sequence Analysis and Manipulation in Microsoft® Word

Here we provide some simple tips and guidelines for creating your own macros to perform routine sequence analysis tasks using Microsoft® Word.

By Ethan Straus, Ph.D.
Promega Corporation

Published in September 2006

Introduction

Bioinformatics is a relatively young rapidly growing discipline with many new papers being published and training programs starting. New algorithms, programs and techniques are being introduced at a rapid pace to allow more complex, faster and novel analysis of very large data sets, prediction of novel properties of nucleic acids and proteins, and mining of vast quantities of text for scientific research. These rapid advances can be overwhelming and give the impression that a person must have powerful tools and significant training to do any sequence manipulation and analysis. While that is probably true if you wish to do cutting edge bioinformatics research, much simple sequence analysis and manipulation is actually easy to do. If fact, you can use standard word processing programs for a significant amount of this type of work.

This article will focus on the use of Microsoft® Word (MS Word), but the same principles apply to other Microsoft® Office products such as Excel, and they probably apply to most word processors. What follows has been tested and works with Microsoft® Word 2003 (with Service Pack 2) and in general works with other versions of MS Word, but may need to be adjusted to suit your specific environment.

General Guidelines for Use of Word Processors for Sequence Analysis

There are a few things to keep in mind when using MS Word for sequence analysis. First, be sure to save a copy of any important data prior to making any changes or running any macro for the first time. If your manipulations do not work as intended, you can always go back to the original data. Also, keep in mind that word processing programs generally save files in specialized formats that include a lot of information in addition to the actual text of the document. If you want to transfer the results of your work to sequence analysis programs you will probably need to "Save As..." a .txt file before trying to import the data into another program. Alternately, most sequence analysis programs will accept data copied directly from a word processing program and pasted into the sequence analysis program. Today, word processors give you access to a huge number of fonts. Most of these fonts are proportional, which means that different characters get different amounts of space. For most sequence display, a monospaced font, which makes each character (including spaces) exactly the same width, is preferable. I use Courier New, but any monospaced font will work (Table 1).

Table 1. Proportional and Monospaced Fonts
DNA and its complement in a proportional font (Verdana)
GCATACATTTTTTTTTTTTTTTCGACGACTAATC
CGTATGTAAAAAAAAAAAAAAAGCTGCTGATTAG
DNA and its complement in a monospaced font (Courier New)
GCATACATTTTTTTTTTTTTTTCGACGACTAATC
CGTATGTAAAAAAAAAAAAAAAGCTGCTGATTAG

Useful Inherent Functions of MS Word

The "Find" (Edit menu/Find or ctrl-f) and "Replace" (Edit menu/Replace or ctrl-h) functions of MS Word are easy to use, readily available, and quite powerful for a number of different analyses. In a simple example, if you have a DNA sequence and you wish to find an EcoRI site, just use “Find” and search for "GAATTC". This will work well as long as the sequence contains only DNA sequence (e.g., no spaces, numbers, tabs, returns that interrupt the sequence). If you have a site containing ambiguties, such as HinfI site (GANTC), you have to deal with the ambiguous "N". The “Find” dialog box contains a “More” button. If you click that and then look under "Special", you will see an option for "Any Letter", which puts "^$" into the Find box. So, the following search pattern "GA^$TC" will find an HinfI site. There are a number of other special characters that can be very help helpful (Table 2). Unfortunately, there is no way to search a site with more limited ambiguity (for instance AvaII recognizes the site “GGWCC”, where W is the IUPAC ambiguity code for either A or T). You would have to use sequence analysis software, write a macro or just search for "GG^$CC" and then check out the specific sites manually. You can also check "Use wildcards" to have access to a number of other options, but this turns off the special character recognition. Please see Microsoft Office Online for more details.

Table 2. Useful special codes for MS Word "Find and Replace" Function.
Code Finds

^$
^#
^p
^t
^?
 
Any letter
Any numeral
paragraph mark
Tab (can be very useful in reformatting text from Excel)
Any single character

Finding things in text is helpful, but unless you have an easy way to get rid of any extraneous spaces and numbers, it does not work very well. That is where "Replace" comes in. You can replace anything you can find and, importantly, you can replace it with nothing. So, if you open the Find and Replace window and replace all spaces, paragraphs ("^p") and numerals ("^#") with a blank replace box you will generally end up with a clean sequence ready for searching. See below for instructions on combining all of these replacements into a single macro, allowing you to click only one button.

There are other neat things you can do with “Replace”. Let's say you want to see all of the C's which would be methylated by dam methylation (any C in "GATC"). You could do the straightforward Find and Replace (i.e., search for "GATC" replace with "GATC*"), or you could take advantage of more advanced Find and Replace options. For instance, you can search for "GATC" and color the found text. First, make sure that the "More" button shows all the Find and Replace options. With the cursor in the Replace box, click on "Format", choose "Font" and choose a color. If there is a format specified in the Replace box, but no text, any text found has that format applied to it. If you click “Replace All”, all instances of "GATC" will be colored red (Figure 1).

thumbnail-Figure 1. The Find and Replace dialog box from MS Word 2003.
Figure 1. The Find and Replace dialog box from MS Word 2003.

Figure 1. The Find and Replace dialog box from MS Word 2003.

Notice also that when you perform a Find and Replace task, the program tells you how many items it found (Figure 2).

This can be a quick way to count some feature of interest. I have calculated %GC by searching and replacing first Gs then Cs then As then Ts and noting the number of replacements along the way. This might seem like a pain, but it can be easier than finding your sequence analysis program, importing the sequence, remembering where the %GC function is and then using it.

thumbnail-Figure 2. The results window from a Find and Replace task.
Figure 2. The results window from a Find and Replace task.

Figure 2. The results window from a Find and Replace task.

Macros

Recording Macros for Repetitive Activities

A macro is a new command created by combining a number of existing ones. Macros generally simplify repetitive tasks. You can make macros in MS Word in two ways, either by recording a set of standard commands or by direct scripting in Visual Basic. The simpler of these two is recording commands. It can be very useful if you have a number of commands that you frequently do at the same time. A very good example is clean up of sequence data that have been pasted into MS Word from GenBank® or a similar source. To record this as a macro, follow these steps:

  1. Have an MS Word document open. Make a back up of anything valuable before you start.
  2. Select Tools> Macro>Macros>Record New Macro.
  3. In the popup window that appears, give your macro a name (e.g., "SequenceCleanup"). Note that spaces are not allowed in macro names. Click "OK".
  4. "Select All" (Either ctrl-a or from the Edit menu). Note: All of the Replace steps will give you a message saying how many were replaced. Simply click "OK"; these messages do not interfere with your macro.
  5. Replace (ctrl-h or from the Edit menu) spaces with nothing.
  6. Replace returns ("^p") with nothing.
  7. Replace numerals ("^#") with nothing. Stop Recording (Tools> Macro>Macros> Stop Recording).

You have now recorded a very useful macro. Of course, you can modify it to suit your needs. Perhaps replace "U" with "T" to change RNA to DNA or change font to Courier, whatever you like.

You can use this macro by opening Tools>Macro>Macros, selecting "SequenceCleanUp" then "Run". It will clean up the text in the open document. Later I will discuss how to make this macro available in all documents and how to add it to a toolbar or a menu command.

Writing Your Own Macros

In addition to recording macros, you can write them yourself. While this method requires more knowledge of how macros work, it also allows you to increase functionality and carry out conversions such as translation of a DNA sequence to a protein sequence or reverse complementation of a DNA sequence.

Important information about macros:

Let's write a simple Macro to start. We will make one that reverses the order of the characters in the selected text. This can be quite useful when dealing with DNA.

First a few commands:

For example, consider the following code:

For X =1 to 19
Do Something With X
Next

This code will begin a loop in which the value of X is 1 the first time the loop is executed. X increases by one each time through the loop, and the loop does not continue past X = 19.

You can control how much the variable changes by adding "Step" to specify the size of the increase. The end of the loop is designated with the word "Next". Table 3 contains code to reverse the order of text in a selection. This code can be pasted directly into the Visual Basic Editor main window to create a macro.
 

Table 3. Example Code for Reversing the Order of Text in a Selection
Macro Text Comment
Sub Reverse () Start the macro. If you type this you will automatically get a line with "End Sub".
Dim StartingText
Dim EndingText
Lets the program know that these will be variables.
StartingText = Selection.Text Assigns the text that was selected when the macro was started to the variable "StartingText".
For CurrentLocation = Len(StartingText) To 1 Step -1 This begins a "For...Next" loop. The initial value of the variable "CurrentLocation" is the length of the "StartingText", and its value will decrease by 1 every time the loop is run.
EndingText = EndingText + Mid(StartingText, CurrentLocation, 1) Adds a single character to EndingText. The character added depends on the value of "CurrentLocation".
Next Ends the loop. The program will go back to the start until "CurrentLocation" = 1.
Selection.Text = EndingText Replaces the selected text with the value in "EndingText", which is the reversed string.

End Sub
 
Ends the macro.

Using essentially the same structure you can expand and begin to do more useful things. If you replace each character in the reversing macro we just wrote with its complement, you have generated the reverse complement of the sequence rather than just reversing it. You can do that with a series of "if...then" statements or with a "Select Case" statement. Both allow you to have the computer do different things depending on the value of a variable.

Format for an “if ....then” statement:

"if" Statement to evaluate "then"
Code to execute if the statement is true
"End if"

You may add an "else" clause to occur if the statement is not true.
For example, consider the following code:

if X > 12 Then
Y = X * X
else
Y = X* X * X
End If

This code will check to see if X is greater than 12. If it is, it will assign a value of X2 to Y. If not, Y is assigned a value of X3.

“Select Case” is similar to “if...then”, but allows you to examine a number of different possible values in a single statement. The format is as follows:

Select Case Value to be checked
Case "possible value1"
Code to execute in this case
Case "possible value2"
Code to execute in this case
Case "possible value3"
Code to execute in this case
Case "possible value4"
Code to execute in this case
Case Else
Code to execute if the value to be checked didn't fit any of the possibilities listed
End Select

For example, consider the code:

Select Case ThisBase
Case "A"
EndString = EndString & "T"
Case "T"
EndString = EndString & "A"
Case "C"
EndString = EndString & "G"
Case "G"
EndString = EndString & "C"
Case Else
EndString = EndString & LCase(ThisBase)
End Select

This code looks at the value of the variable "ThisBase" and adds the complement of that value to a variable called "EndString". If "ThisBase" is not a base (e.g., not G, A, T or C), this program doesn't know what to do with it. I don't want to just get rid of it because it may be important. So, in order to make it easily visible, the macro adds the lowercase version of that character to the string. This code will correctly get the complement of a DNA sequence as long as the sequence is upper case.

*Note: In the "SelectCase" example above, the ampersand "&" joins two pieces of text together. "ThisBase" is the name of a variable which contains the current DNA base value. "EndString" is a variable which starts empty, but receives the growing DNA sequence." Lcase(text)" changes the text to lower case.
 

So, putting this together with what we have already done we can write a macro to generate the reverse complement of the selection:

Sub RevComp()
Dim StartString
Dim EndString
Dim ThisBase
StartString = UCase(Selection.Text) ' Ucase changes all characters to upper case.
For BasePosition = Len(StartString) To 1 Step -1
ThisBase = Mid(StartString, BasePosition, 1)
Select Case ThisBase
Case "A"
EndString = EndString & "T"
Case "T"
EndString = EndString & "A"
Case "C"
EndString = EndString & "G"
Case "G"
EndString = EndString & "C"
Case Else
EndString = EndString & LCase(ThisBase) 'Lcase changes characters back to lower case.
End Select
Next
Selection.Text = EndString
End Sub

*Note:"Selection.Text" contains the text value of whatever is selected when the macro starts running.
Remember, if there is a command you don't know, the "Help" available within Visual Basic Editor works pretty well.

There is one more function that I find to be extremely useful for these types of macros. That is "InStr". It is the "In String" function that is designed to find the position of one string within another. I frequently use this function just to check for the presence of one string within another.

Its format is:

InStr [position to start from within the searched string (optional), string to be searched, string to search for, type of comparison (I almost always use "vbTextCompare")]


Note that "InStr" gives you the location where your searched for string begins or zero if it is not found.

The following code checks if “ThisChar” is a legitimate DNA base and adds one to the value of the variable "Bases" if it is.

If InStr(1, "GATC", UCase(ThisChar), vbTextCompare) > 0 Then
Bases = Bases + 1
End If

Macro Administration

Once you have written and recorded a set of useful macros, you probably want to be able to access them easily, and you might want to be able to give other people access to them. Some parts of this are quite easy and some are harder.

In order to have macros available from all of your MS Word documents and to make them available for others:

  1. Write or record macros and get them all working in a single MS Word document
  2. Save that document as a Document Template (.dot).
  3. Move that .dot document into your Word Startup Folder. (If you don't know where that is, check Tools>Options>File Locations )

To make the Macros available for others, just have them put this .dot file into their Word Startup Folders. To get the macros to run from buttons on your toolbar, perform the following actions:

  1. Right click on a toolbar.
  2. Choose "Customize".
  3. Make sure the "Commands" tab is selected.
  4. Under "Categories" scroll down to “Macros”.
  5. Find the macro that you want to use. (Note that macros have long names here that are hard to read. You may need to poke around a bit).
  6. Drag and drop it where you want it. You will get a button which has the full name of the macro.
  7. To change the name or give it an icon, right click a toolbar and choose "Customize".
  8. Then right click the button you wish to change. You can change the text, assign icons and so forth.
Note that all macros in Visual Basic must have paired parentheses in their definition line. This is to allow input parameters, but in many cases there are none, so the parentheses are empty.

You can write a Macro to build a new toolbar and assign it to things automatically. You can also get that macro to run automatically by naming it "AutoExec()". Getting a toolbar to work is tricky. You first need to create the toolbar then create each button. Here is an example:

Sub AutoExec()
'Create a toolbar which is represented by the variable "MyBar"
Set myBar = CommandBars _
.Add(Name:="BioInformatics", Position:=msoBarFloating, _
Temporary:=True) 'Make the toolbar
myBar.Visible = True 'Make sure it is visible
myBar.Left = 800 'Position it horizontally
myBar.Top = 50 'Position it vertically
'Done making the toolbar

'Make a button to go on the toolbar
Set RevCompButton = myBar.Controls _
.Add(Type:=msoControlButton)
'Make the button run the macro "RevComp" when it is clicked
RevCompButton.OnAction = "RevComp"
'Set the tooltip text for the button
RevCompButton.TooltipText = "Reverse Complement Selection"
RevCompButton.FaceId = 1977 'Set the appearance of the button.
'Done making the button
End Sub

Summary

MS Word has the ability to do a lot of simple but useful DNA sequence analysis. This article has provided a brief overview of some of the things that are possible. It should be possible to create a macro that can perform essentially any sequence analysis, but for most complex analyses a program specialized for that purpose is likely to do the job more quickly and efficiently. However, for much of the analysis that needs to be done on an everyday basis, a macro in MS Word can be a simple, fast and intelligent option.

Appendix

Table 4. Visual Basic Reference Sheet
Command/Symbol What it does Example
& Concatenates two strings. That is, it will join two pieces of text together. "Hello" & "World" = "Hello World"
If the variable greeting = "Hello" then
greeting & "World" = "Hello World"
chr(number) Returns the character represented by the number. Chr(13) is a return
Chr(11) is a line break
See VB-Helper.com or the Visual Basic Editor help for a more comlete list.
' Makes everything after it on the line a comment. X = 3.2 '3.2 has been found imperically to work well for X
Selection.text Represents selected text in the document allowing the text to be manipulated by functions. Selection.text = Ucase(Selection.text)
This would just change the selected text to upper case.
Dim Defines a variable. Dim Molarity creates a variable named Molarity.
For...Next A loop which runs with a counter. Please see the Visual Basic Editor help for details.
InStr Finds the location of one string within another. Please see the Visual Basic Editor help for details.
vbTextCompare A comparison used with InStr. Please see te Visual Basic Editor help for details. There is probably no need to understand this command for these types of macros. Just use it....
Mid Finds a substring with a specific starting and ending location. Mid("GATC", 3, 2) finds the third position in "GATC" and returns the next 2 characters "TC"
Ucase (variable) Change all characters in a variable to upper case. Ucase ("variable") = VARIABLE
Lcase (variable) Change all the characters in a variable to lower case. Lcase ("VARIABLE") = variable
nesting functions You can nest functions inside each other. Lcase(Mid(Ucase("variable"),3,2)) = ri