Focus: Sequence Analysis
Using Simple Macros to Perform Routine Sequence Analysis and Manipulation in Microsoft® Word
Here we provide some simple tips and guidelines for creating your own macros to perform routine sequence analysis tasks using Microsoft® Word.
By Ethan Straus, Ph.D.
Promega Corporation
Published in September 2006
Introduction
Bioinformatics is a relatively young rapidly growing discipline with many new papers being published and training programs starting. New algorithms, programs and techniques are being introduced at a rapid pace to allow more complex, faster and novel analysis of very large data sets, prediction of novel properties of nucleic acids and proteins, and mining of vast quantities of text for scientific research. These rapid advances can be overwhelming and give the impression that a person must have powerful tools and significant training to do any sequence manipulation and analysis. While that is probably true if you wish to do cutting edge bioinformatics research, much simple sequence analysis and manipulation is actually easy to do. If fact, you can use standard word processing programs for a significant amount of this type of work.
This article will focus on the use of Microsoft® Word (MS Word), but the same principles apply to other Microsoft® Office products such as Excel, and they probably apply to most word processors. What follows has been tested and works with Microsoft® Word 2003 (with Service Pack 2) and in general works with other versions of MS Word, but may need to be adjusted to suit your specific environment.
General Guidelines for Use of Word Processors for Sequence Analysis
There are a few things to keep in mind when using MS Word for sequence analysis. First, be sure to save a copy of any important data prior to making any changes or running any macro for the first time. If your manipulations do not work as intended, you can always go back to the original data. Also, keep in mind that word processing programs generally save files in specialized formats that include a lot of information in addition to the actual text of the document. If you want to transfer the results of your work to sequence analysis programs you will probably need to "Save As..." a .txt file before trying to import the data into another program. Alternately, most sequence analysis programs will accept data copied directly from a word processing program and pasted into the sequence analysis program. Today, word processors give you access to a huge number of fonts. Most of these fonts are proportional, which means that different characters get different amounts of space. For most sequence display, a monospaced font, which makes each character (including spaces) exactly the same width, is preferable. I use Courier New, but any monospaced font will work (Table 1).
| DNA and its complement in a proportional font
(Verdana) GCATACATTTTTTTTTTTTTTTCGACGACTAATC CGTATGTAAAAAAAAAAAAAAAGCTGCTGATTAG |
| DNA and its complement in a monospaced font
(Courier New) GCATACATTTTTTTTTTTTTTTCGACGACTAATC CGTATGTAAAAAAAAAAAAAAAGCTGCTGATTAG |
Useful Inherent Functions of MS Word
The "Find" (Edit menu/Find or ctrl-f) and "Replace" (Edit menu/Replace or ctrl-h) functions of MS Word are easy to use, readily available, and quite powerful for a number of different analyses. In a simple example, if you have a DNA sequence and you wish to find an EcoRI site, just use “Find” and search for "GAATTC". This will work well as long as the sequence contains only DNA sequence (e.g., no spaces, numbers, tabs, returns that interrupt the sequence). If you have a site containing ambiguties, such as HinfI site (GANTC), you have to deal with the ambiguous "N". The “Find” dialog box contains a “More” button. If you click that and then look under "Special", you will see an option for "Any Letter", which puts "^$" into the Find box. So, the following search pattern "GA^$TC" will find an HinfI site. There are a number of other special characters that can be very help helpful (Table 2). Unfortunately, there is no way to search a site with more limited ambiguity (for instance AvaII recognizes the site “GGWCC”, where W is the IUPAC ambiguity code for either A or T). You would have to use sequence analysis software, write a macro or just search for "GG^$CC" and then check out the specific sites manually. You can also check "Use wildcards" to have access to a number of other options, but this turns off the special character recognition. Please see Microsoft Office Online for more details.
| Code | Finds |
|---|---|
^$ ^# ^p ^t ^? |
Any letter Any numeral paragraph mark Tab (can be very useful in reformatting text from Excel) Any single character |
Finding things in text is helpful, but unless you have an easy way to get rid of any extraneous spaces and numbers, it does not work very well. That is where "Replace" comes in. You can replace anything you can find and, importantly, you can replace it with nothing. So, if you open the Find and Replace window and replace all spaces, paragraphs ("^p") and numerals ("^#") with a blank replace box you will generally end up with a clean sequence ready for searching. See below for instructions on combining all of these replacements into a single macro, allowing you to click only one button.
There are other neat things you can do with “Replace”. Let's say you want to see all of the C's which would be methylated by dam methylation (any C in "GATC"). You could do the straightforward Find and Replace (i.e., search for "GATC" replace with "GATC*"), or you could take advantage of more advanced Find and Replace options. For instance, you can search for "GATC" and color the found text. First, make sure that the "More" button shows all the Find and Replace options. With the cursor in the Replace box, click on "Format", choose "Font" and choose a color. If there is a format specified in the Replace box, but no text, any text found has that format applied to it. If you click “Replace All”, all instances of "GATC" will be colored red (Figure 1).
Notice also that when you perform a Find and Replace task, the program tells you how many items it found (Figure 2).
This can be a quick way to count some feature of interest. I have calculated %GC by searching and replacing first Gs then Cs then As then Ts and noting the number of replacements along the way. This might seem like a pain, but it can be easier than finding your sequence analysis program, importing the sequence, remembering where the %GC function is and then using it.
Macros
Recording Macros for Repetitive Activities
A macro is a new command created by combining a number of existing ones. Macros generally simplify repetitive tasks. You can make macros in MS Word in two ways, either by recording a set of standard commands or by direct scripting in Visual Basic. The simpler of these two is recording commands. It can be very useful if you have a number of commands that you frequently do at the same time. A very good example is clean up of sequence data that have been pasted into MS Word from GenBank® or a similar source. To record this as a macro, follow these steps:
- Have an MS Word document open. Make a back up of anything valuable before you start.
- Select Tools> Macro>Macros>Record New Macro.
- In the popup window that appears, give your macro a name (e.g., "SequenceCleanup"). Note that spaces are not allowed in macro names. Click "OK".
- "Select All" (Either ctrl-a or from the Edit menu). Note: All of the Replace steps will give you a message saying how many were replaced. Simply click "OK"; these messages do not interfere with your macro.
- Replace (ctrl-h or from the Edit menu) spaces with nothing.
- Replace returns ("^p") with nothing.
- Replace numerals ("^#") with nothing. Stop Recording (Tools> Macro>Macros> Stop Recording).
You have now recorded a very useful macro. Of course, you can modify it to suit your needs. Perhaps replace "U" with "T" to change RNA to DNA or change font to Courier, whatever you like.
You can use this macro by opening Tools>Macro>Macros, selecting "SequenceCleanUp" then "Run". It will clean up the text in the open document. Later I will discuss how to make this macro available in all documents and how to add it to a toolbar or a menu command.
Writing Your Own Macros
In addition to recording macros, you can write them yourself. While this method requires more knowledge of how macros work, it also allows you to increase functionality and carry out conversions such as translation of a DNA sequence to a protein sequence or reverse complementation of a DNA sequence.
Important information about macros:
- The Visual Basic Editor can be opened from Tools>Macro>Visual
Basic Editor.
When the editor is opened you will see a variety of sections on the screen depending on how it is configured. If it is not obvious where the macro goes, choose "Code" from the View menu and use that section of the screen for code. With Visual Basic Editor you can write macros from scratch or you can first record one (as above) and then modify it to suit your needs. - The Help available within the Visual Basic Editor is pretty good. If you know the name of a command, it will generally give sufficient detail to allow you to use the command.
- All macros start with the word "Sub", followed by the macro name and matched parentheses. The matched parentheses are required and allow you to pass information into the macro.
- All Macros end with the words "End Sub".
- Any line which starts with a single quote "'" is a comment and is ignored by the computer.
- The text of whatever is selected, highlighted using the cursor, when you start running the macro is available to the computer as "selection.text". You can see what is in "selection.text" and you can replace what is in "selection.text".
- A collection of Visual Basic script hints are available in the Appendix
Let's write a simple Macro to start. We will make one that reverses the order of the characters in the selected text. This can be quite useful when dealing with DNA.
First a few commands:
- Dim
Stands for "Dimension". This command informs the computer that you will be using a variable, and it can also specify variable type. - Mid
This command specifies characters at a given location within a piece of text. - For...Next
This is a type of loop that has a value that changes every time the loop is executed.
The format is "For" variable name "=" initial value "to" final value.
For example, consider the following code:
For X =1 to 19
Do Something With X
Next
This code will begin a loop in which the value of X is 1 the first time the loop is executed. X increases by one each time through the loop, and the loop does not continue past X = 19.
You can control how much the variable changes by adding "Step" to
specify the size of the increase. The end of the loop is designated with the
word "Next". Table 3 contains code to reverse the order of text in a
selection. This code can be pasted directly into the Visual Basic Editor
main window to create a macro.
| Macro Text | Comment |
|---|---|
| Sub Reverse () | Start the macro. If you type this you will automatically get a line with "End Sub". |
| Dim StartingText Dim EndingText |
Lets the program know that these will be variables. |
| StartingText = Selection.Text | Assigns the text that was selected when the macro was started to the variable "StartingText". |
| For CurrentLocation = Len(StartingText) To 1 Step -1 | This begins a "For...Next" loop. The initial value of the variable "CurrentLocation" is the length of the "StartingText", and its value will decrease by 1 every time the loop is run. |
| EndingText = EndingText + Mid(StartingText, CurrentLocation, 1) | Adds a single character to EndingText. The character added depends on the value of "CurrentLocation". |
| Next | Ends the loop. The program will go back to the start until "CurrentLocation" = 1. |
| Selection.Text = EndingText | Replaces the selected text with the value in "EndingText", which is the reversed string. |
End Sub |
Ends the macro. |
Using essentially the same structure you can expand and begin to do more useful things. If you replace each character in the reversing macro we just wrote with its complement, you have generated the reverse complement of the sequence rather than just reversing it. You can do that with a series of "if...then" statements or with a "Select Case" statement. Both allow you to have the computer do different things depending on the value of a variable.
Format for an “if ....then” statement:
"if" Statement to evaluate "then"
Code to execute if the statement is true
"End if"
You may add an "else" clause to occur if the statement is not true.
For example, consider the following code:
if X > 12 Then
Y = X * X
else
Y = X* X * X
End If
This code will check to see if X is greater than 12. If it is, it will assign a
value of X2 to Y. If not, Y is assigned a value of X3.
“Select Case” is similar to “if...then”, but allows you to examine a
number of different possible values in a single statement. The format is
as follows:
Select Case Value to be checked
Case "possible value1"
Code to execute in this case
Case "possible value2"
Code to execute in this case
Case "possible value3"
Code to execute in this case
Case "possible value4"
Code to execute in this case
Case Else
Code to execute if the value to be checked didn't fit any of the possibilities listed
End Select
For example, consider the code:
Select Case ThisBase
Case "A"
EndString = EndString & "T"
Case "T"
EndString = EndString & "A"
Case "C"
EndString = EndString & "G"
Case "G"
EndString = EndString & "C"
Case Else
EndString = EndString & LCase(ThisBase)
End Select
This code looks at the value of the variable "ThisBase" and adds the complement of that value to a variable called "EndString". If "ThisBase" is not a base (e.g., not G, A, T or C), this program doesn't know what to do with it. I don't want to just get rid of it because it may be important. So, in order to make it easily visible, the macro adds the lowercase version of that character to the string. This code will correctly get the complement of a DNA sequence as long as the sequence is upper case.
- *Note: In the "SelectCase" example above, the ampersand "&" joins two pieces of text together.
"ThisBase" is the name of a variable which contains the current DNA base
value.
"EndString" is a variable which starts empty, but receives the growing
DNA sequence." Lcase(text)" changes the text to lower case.
So, putting this together with what we have already done we can write a macro to generate the reverse complement of the selection:
Sub RevComp()
Dim StartString
Dim EndString
Dim ThisBase
StartString = UCase(Selection.Text) ' Ucase changes all characters to upper case.
For BasePosition = Len(StartString) To 1 Step -1
ThisBase = Mid(StartString, BasePosition, 1)
Select Case ThisBase
Case "A"
EndString = EndString & "T"
Case "T"
EndString = EndString & "A"
Case "C"
EndString = EndString & "G"
Case "G"
EndString = EndString & "C"
Case Else
EndString = EndString & LCase(ThisBase) 'Lcase changes characters back to lower case.
End Select
Next
Selection.Text = EndString
End Sub
- *Note:"Selection.Text" contains the text value of whatever is selected when
the macro starts running.
Remember, if there is a command you don't know, the "Help" available within Visual Basic Editor works pretty well.
There is one more function that I find to be extremely useful for these types of macros. That is "InStr". It is the "In String" function that is designed to find the position of one string within another. I frequently use this function just to check for the presence of one string within another.
Its format is:InStr [position to start from within the searched string (optional), string to be searched, string to search for, type of comparison (I almost always use "vbTextCompare")]
Note that "InStr" gives you the location where your searched for string begins or zero if it is not found.
The following code checks if “ThisChar” is a legitimate DNA base and adds one to the value of the variable "Bases" if it is.
If InStr(1, "GATC", UCase(ThisChar), vbTextCompare) > 0 Then
Bases = Bases + 1
End If
Macro Administration
Once you have written and recorded a set of useful macros, you probably want to be able to access them easily, and you might want to be able to give other people access to them. Some parts of this are quite easy and some are harder.
In order to have macros available from all of your MS Word documents and to make them available for others:
- Write or record macros and get them all working in a single MS Word document
- Save that document as a Document Template (.dot).
- Move that .dot document into your Word Startup Folder. (If you don't know where that is, check Tools>Options>File Locations )
To make the Macros available for others, just have them put this .dot file into their Word Startup Folders. To get the macros to run from buttons on your toolbar, perform the following actions:
- Right click on a toolbar.
- Choose "Customize".
- Make sure the "Commands" tab is selected.
- Under "Categories" scroll down to “Macros”.
- Find the macro that you want to use. (Note that macros have long names here that are hard to read. You may need to poke around a bit).
- Drag and drop it where you want it. You will get a button which has the full name of the macro.
- To change the name or give it an icon, right click a toolbar and choose "Customize".
- Then right click the button you wish to change. You can change the text, assign icons and so forth.
- Note that all macros in Visual Basic must have paired parentheses in their definition line. This is to allow input parameters, but in many cases there are none, so the parentheses are empty.
You can write a Macro to build a new toolbar and assign it to things automatically. You can also get that macro to run automatically by naming it "AutoExec()". Getting a toolbar to work is tricky. You first need to create the toolbar then create each button. Here is an example:
Sub AutoExec() 'Create a toolbar which is represented by the variable "MyBar" Set myBar = CommandBars _ .Add(Name:="BioInformatics", Position:=msoBarFloating, _ Temporary:=True) 'Make the toolbar myBar.Visible = True 'Make sure it is visible myBar.Left = 800 'Position it horizontally myBar.Top = 50 'Position it vertically 'Done making the toolbar 'Make a button to go on the toolbar Set RevCompButton = myBar.Controls _ .Add(Type:=msoControlButton) 'Make the button run the macro "RevComp" when it is clicked RevCompButton.OnAction = "RevComp" 'Set the tooltip text for the button RevCompButton.TooltipText = "Reverse Complement Selection" RevCompButton.FaceId = 1977 'Set the appearance of the button. 'Done making the button End Sub
Summary
MS Word has the ability to do a lot of simple but useful DNA sequence analysis. This article has provided a brief overview of some of the things that are possible. It should be possible to create a macro that can perform essentially any sequence analysis, but for most complex analyses a program specialized for that purpose is likely to do the job more quickly and efficiently. However, for much of the analysis that needs to be done on an everyday basis, a macro in MS Word can be a simple, fast and intelligent option.
Appendix
| Command/Symbol | What it does | Example |
|---|---|---|
| & | Concatenates two strings. That is, it will join two pieces of text together. | "Hello" & "World" = "Hello World" If the variable greeting = "Hello" then greeting & "World" = "Hello World" |
| chr(number) | Returns the character represented by the number. | Chr(13) is a return Chr(11) is a line break See VB-Helper.com or the Visual Basic Editor help for a more comlete list. |
| ' | Makes everything after it on the line a comment. | X = 3.2 '3.2 has been found imperically to work well for X |
| Selection.text | Represents selected text in the document allowing the text to be manipulated by functions. | Selection.text = Ucase(Selection.text) This would just change the selected text to upper case. |
| Dim | Defines a variable. | Dim Molarity creates a variable named Molarity. |
| For...Next | A loop which runs with a counter. | Please see the Visual Basic Editor help for details. |
| InStr | Finds the location of one string within another. | Please see the Visual Basic Editor help for details. |
| vbTextCompare | A comparison used with InStr. | Please see te Visual Basic Editor help for details. There is probably no need to understand this command for these types of macros. Just use it.... |
| Mid | Finds a substring with a specific starting and ending location. | Mid("GATC", 3, 2) finds the third position in "GATC" and returns the next 2 characters "TC" |
| Ucase (variable) | Change all characters in a variable to upper case. | Ucase ("variable") = VARIABLE |
| Lcase (variable) | Change all the characters in a variable to lower case. | Lcase ("VARIABLE") = variable |
| nesting functions | You can nest functions inside each other. | Lcase(Mid(Ucase("variable"),3,2)) = ri |
Products may be covered by pending or issued patents or may have certain limitations on use. Please visit our patent and trademark web page for more information.
GenBank is a registered trademark of the U.S. Deptartment of Health and Human Services. Microsoft is a registered trademark of Microsoft Corporation.


