Given an 'initial' coding sequence, this script generates a series of
derived alleles. The program is not intended to generate a null hypothesis
or to test models of molecular evolution. It is intended to generate
relatively simple mock data sets for use in laboratory exercises for
undergraduate students. The model assumes that the mutation rate is
constant across the entire coding sequence; however, the ratio of
nonsynonymous to synonymous substitutions can be specified separately for
different regions of the coding sequence.

The input ('initial') sequence must be in FASTA format with no comment
(';') lines and can represent DNA or RNA. The header line (designated with
'>') should contain only alphanumeric characters, underscores, periods, or
dashes. Avoid spaces (use underscores) and characters that your OS may
interpret as having special meaning (e.g. '$', '%', '#', '!', '@', '(',
')', ':', ';', "'", '"', etc.).

The entire sequence must be continuous coding sequence (no introns or
untranslated regions), so try to use assembled CDS, cDNA or mRNA sequences.
Such processed sequence is usually available for your favorite gene from
NCBI, EMBL or species-specific data resources. Nucleic acid sequences must
be in direct orientation and in-frame, and may not contain any 'STOP'
codons (not even the terminal STOP).

For your 'initial' sequence to be identified by MolSelectGen.py, the file
name must end with the extension '.fas' or '.FAS' (e.g. my_sequence.fas).

To run this script, you must have the Python 2.3+ core distribution
(http://www.python.org) installed on your machine.

To run the script under Unix(R) and Unix(R)-like OSs (including MacOS X,
but this is untested):

  1. Put the script in a directory in your PATH;
  2. Create a project directory. Place your ancestral FASTA file and a copy
     of 'codon_tables.txt' in said directory, and launch the script by
     migrating to said directory and typing 'MolSelectGen.py' at the
     command prompt.

Alternatively:

  1. Create a project directory. Place a copy of this script, your
     ancestral FASTA file, and a copy of 'codon_tables.txt' in said
     directory. Launch the script by migrating to said directory and typing
     './MolSelectGen.py' at the command prompt.

To run the script under Windows-2000/XP/2003:

  1. Create a project directory. Place a copy of this script, your
     ancestral FASTA file, and a copy of 'codon_tables.txt' in said
     directory. Launch the script by double-clicking the script's icon.

Alternatively (if you're feeling lucky):

  1. Put the script in a directory in your PATH;
  2. Create a project directory. Place your ancestral FASTA file and a copy
     of 'codon_tables.txt' in said directory, and launch the script by
     migrating to said directory and typing 'MolSelectGen.py' at the
     command prompt.

The script outputs two (2) files in your working directory:

  1. 'alleles.fas' contains the alleles you have generated. This is a
     multiFASTA DNA file.
  2. 'log.txt' charts how each allele was derived and how many new
     mutations have been introduced into each new allele.

RUNNING MolSelectGen.py:

1. The directory in which you will run your project should initially
   contain three (3) files:

     codon_tables.txt
     initial.fas
     MolSelectGen.py

   Note that 'initial.fas' can be named anything you like, provided the
   name ends in '.fas' or '.FAS'.

2. Launch MolSelectGen.py as described above. If you run the program by
   'clicking' in Windows (and possibly MacOS), a command-line (terminal)
   window will open automatically.

3. MolSelectGen.py should deliver the following message:

     Identified 'initial' sequence, initial.fas. Opening and processing...
     Initial sequence recognized, read and processed without errors.

4. The program will then ask the user to define parameters as follows:

>MolSelectGen.py Displays<

  I. SELECT NUMBER OF ALLELES TO GENERATE:
  Enter integer for the number of alleles you wish to generate (e.g. 25):
  >cursor<

>You type an integer and press 'return':

  Enter integer for the number of alleles you wish to generate (e.g. 25):
  25 >rtn<

>WHAT THIS INTERACTION MEANS:

  You have just specified the number of alleles you want to create.

>MolSelectGen.py Displays<

  II. DEFINE NUMBER OF GENERATIONS:
  Enter integer for the number of generations required to generate a 'new'
  allele (e.g. 20): >cursor<

>You type an integer and press 'return':

  Enter integer for the number of generations required to generate a 'new'
  allele (e.g. 20): 20 >rtn<

>WHAT THIS INTERACTION MEANS:

  Every time a new allele is to be generated, a sequence is selected from
  the existing pool of alleles. If '20' generations are selected, the new
  allele (which is initially identical to the selected allele) is read by
  the program 20 times; each time, the user-specified mutation rate is
  applied to every codon, and codons are mutated accordingly. Over the
  course of the 'generations', mutations accumulate.

>MolSelectGen.py Displays<

  III. SET THE PROPORTION OF NONSYN to SYN SUBSTITUTIONS:
  Now, you will determine the rate(s) of nonsynonymous to synonymous
  substitutions in your 'evolved' alleles. By starting at codon '1' and
  ending at codon '241', you will select a constant rate across the
  sequence. Regions that are unspecified will default to a rate of '1.0'
  (e.g. neutral evolution). Remember, 1 codon equals 3 nucleotides.

  Your sequence contains 241 codons. Please indicate the START POSITION of
  your region (enter an integer or type 'done' if finished entering
  regions): >cursor<

>You type an integer and press 'return':

  Your sequence contains 241 codons. Please indicate the START POSITION of
  your region (enter an integer or type 'done' if finished entering
  regions): 33 >rtn<

>MolSelectGen.py Displays<

  Your sequence contains 241 codons. Please indicate the END POSITION of
  your region (enter an integer): >cursor<

>You type an integer and press 'return':

  Your sequence contains 241 codons. Please indicate the END POSITION of
  your region (enter an integer): 77 >rtn<

>MolSelectGen.py Displays<

  Enter a floating point value for target nonsynonymous to synonymous ratio
  (e.g. 0.001 = constraint, 1.000 = neutrality, >> 1.000 positive
  selection). Please limit to 3 significant digits: >cursor<

>You type a floating point value and press 'return':

  Enter a floating point value for target nonsynonymous to synonymous ratio
  (e.g. 0.001 = constraint, 1.000 = neutrality, >> 1.000 positive
  selection). Please limit to 3 significant digits: 1.25 >rtn<

>MolSelectGen.py Displays<

  Your sequence contains 241 codons. Please indicate the START POSITION of
  your region (enter an integer or type 'done' if finished entering
  regions): >cursor<

>If you are finished defining how different regions will 'evolve', type
 'done' and >rtn<, or else begin the process again:

  Your sequence contains 241 codons. Please indicate the START POSITION of
  your region (enter an integer or type 'done' if finished entering
  regions): done >rtn<

>WHAT THIS INTERACTION MEANS:

  You have defined the nonsynonymous to synonymous substitution ratio for a
  given region or regions. If you want a constant target ratio across the
  entire coding region, start with codon 1 and end with the final codon.
  Otherwise, when specifying different ratios for different regions, make
  certain that the regions DO NOT OVERLAP (MolSelectGen.py is smart enough
  to catch such an error, but it's best to avoid it). The ratio you specify
  (as a proportion) corresponds to your target Ka/Ks or dN/dS for the
  exercise in question.

>MolSelectGen.py Displays<

  IV. SPECIFY MUTATION RATE:
  Enter an integer for the target mutation rate (codon changes per 10,000
  codons per generation; e.g. 5): >cursor<

>You type an integer and press 'return':

  Enter an integer for the target mutation rate (codon changes per 10,000
  codons per generation; e.g. 5): 2 >rtn<

>WHAT THIS INTERACTION MEANS:

  Mutation rate: the program expects a somewhat non-traditional mutation
  rate, as its purpose is not to provide truly random (neutral) mutations
  but rather to allow the instructor to stack the odds for a particular
  outcome. Here the mutation rate is defined as the number of codon changes
  per 10,000 codons per generation. This is roughly equivalent to the
  number of nucleotide substitutions per 30,000 nucleotides per generation.
  Thus, if your starting sequence is 241 codons long (723 nucleotides), you
  specify a mutation rate of 2 codons per 10,000 codons per generation, and
  you specify 20 generations, you expect 241 * 20 * (2/10000) = 0.964, or
  roughly 1 new mutation per new allele.

>MolSelectGen.py Displays<

  V. SPECIFY EXTINCTION RATE:
  Enter a probability that one of your existing alleles will become extinct
  at the start of each mutation cycle (0.0 means everyone survives, 1.0
  means everyone will become extinct; suggested value 0.05 or lower):
  >cursor<

>You type a floating point value and press 'return':

  Enter a probability that one of your existing alleles will become extinct
  at the start of each mutation cycle (0.0 means everyone survives, 1.0
  means everyone will become extinct; suggested value 0.05 or lower):
  0.05 >rtn<

>WHAT THIS INTERACTION MEANS:

  This is just a way to allow some alleles to go 'extinct', so that they
  contribute nothing further to your resulting phylogeny. This may result
  in more interesting trees and datasets.

>MolSelectGen.py Displays<

  A whole bunch of stuff related to the project, then ...

  Writing multiFASTA file (alleles.fas) with all alleles...
  Writing log file.
  Job Complete!

Remember, the file 'log.txt' contains all the information from the project,
including the parameters that you have specified. The authors encourage
users to 'play' with the parameters to find the ideal settings for their
lab/course.

Questions? Comments? Suggestions? Contact Aaron at
ampoo_twibbit@hotmail.com, or Alice at ashumate@fdu.edu. Include
'MolSelectGen' in the subject line.

Thanks for trying this program. Tootles.

Support OpenSource software. It's socially responsible.
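
The input rules and the mutation-rate arithmetic described above can be
checked before a lab session. The following is a minimal sketch and is not
part of MolSelectGen.py. The function names check_initial_fasta and
expected_new_mutations are hypothetical, and the stop-codon set assumes the
standard genetic code (the script itself reads 'codon_tables.txt', which may
define other tables).

```python
import re

# Assumption: stop codons of the standard genetic code only.
STANDARD_STOPS = {"TAA", "TAG", "TGA"}
# Header may contain only letters, digits, underscores, periods and dashes.
HEADER_OK = re.compile(r"^[A-Za-z0-9_.\-]+$")


def check_initial_fasta(text):
    """Return a list of problems found in a single-record FASTA string."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return ["first line must be a '>' header"]
    if any(ln.startswith(";") for ln in lines):
        problems.append("comment (';') lines are not allowed")
    if any(ln.startswith(">") for ln in lines[1:]):
        problems.append("more than one '>' header found")
    if not HEADER_OK.match(lines[0][1:]):
        problems.append("header contains disallowed characters")
    # Join sequence lines; treat RNA as DNA so one stop set covers both.
    body = [ln for ln in lines[1:] if ln[0] not in ">;"]
    seq = "".join(body).upper().replace("U", "T")
    if len(seq) % 3 != 0:
        problems.append("sequence length is not a multiple of 3")
    # Only complete codons are inspected for stops.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    stops = [i + 1 for i, c in enumerate(codons) if c in STANDARD_STOPS]
    if stops:
        problems.append("stop codon(s) at codon position(s): %s" % stops)
    return problems


def expected_new_mutations(n_codons, generations, rate_per_10000):
    """Expected codon changes per new allele (rate is per 10,000 codons)."""
    return n_codons * generations * (rate_per_10000 / 10000.0)


if __name__ == "__main__":
    print(check_initial_fasta(">my_gene_1\nATGGCCAAA\n"))
    print(round(expected_new_mutations(241, 20, 2), 3))  # 0.964
```

An empty list from check_initial_fasta means none of the listed problems
were found; it does not guarantee that MolSelectGen.py will accept the file.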