#!/usr/bin/env python """Given an 'initial' coding sequence, this script generates a series of derived alleles. The program is not intended to generate a null hypothesis or to test models of molecular evolution. It is intended to generate relatively simple mock-data sets for use in laboratory exercises for undergraduate students. The model assumes that the mutation rate is constant across an entire coding sequence; however, the ratio of nonsynonymous to synonymous substitutions can be differentially specified for different regions of the coding sequence. The input sequence (initial) must be in FASTA format with no comment (';') lines and can represent DNA or RNA. The header line (designated with '>') should only contain alpha-numeric characters, underscores, periods, or dashes...avoid spaces (use underscores) and characters that your OS may interpret as having special meaning (e.g. '$', '%', '#', '!', '@', '(', ')', ':', ';', "'", '"', etc.). The entire sequence must be continuous coding sequence (no introns or untranslated regions) so try to use assembled CDSs, cDNAs or mRNA sequences. Such processed sequence is usually available for your favorite gene from NCBI, EMBL or species specific data resources. Nucleic acid sequences must be in direct orientation, in-frame and may not contain any 'STOP' codons (even the terminal STOP). In order for your 'initial' sequence to be identified by MolSelectGen.py, the file should end with either the extensions '.fas' or '.FAS' (e.g. my_sequence.fas). To run this script, you must have the Python 2.3+ core-distribution (http://www.python.org) installed on your machine. To run the script Unix(R) and Unix(R)-like OSs (including MacOS X, but this is untested): 1. Put the script in a directory in your PYTHONPATH; 2. Create a project directory. Place your ancestral FASTA file, and a copy of 'codon_tables.txt' in said directory and launch the script by migrating to said directory and typing 'MolSelectGen.py' at the command prompt. Alternatively: 1. Create a project directory. Place a copy of this script, your ancestral FASTA file, and a copy of 'codon_tables.txt' in said directory. Launch the script by migrating to said directory and typing './MolSelectGen.py' at the command prompt. To run the script under Windows-2000/XP/2003: 1. Create a project directory. Place a copy of this script, your ancestral FASTA file, and a copy of 'codon_tables.txt' in said directory. Launch the script by double clicking the script's icon. Alternatively (if you're feeling lucky): 1. Put the script in a directory in your PYTHONPATH; 2. Create a project directory. Place your ancestral FASTA file, and a copy of 'codon_tables.txt' in said directory and launch the script by migrating to said directory and typing 'MolSelectGen.py' at the command prompt. The script outputs two (2) files in your working directory: 1. 'alleles.fas' contains the alleles you have generated. This is a multiFASTA DNA file. 2. 'log.txt' which charts how each allele was derived and how many new mutations have been introduced into each new allele. RUNNING MolSelectGen.py: 1. The directory that you will run your project in should initially contain three (3) files: codon_tables.txt; initial.fas; MolSelectGen.py. Note that 'initial.fas' can be named anything that you desire provided it ends in '.fas' or '.FAS'. 2. Launch MolSelectGen.py as discussed previously. If running the program by 'clicking' in Windows and, possibly MacOS, a commandline (terminal) window will open automatically. 3. MolSelectGen.py should deliver the following message: Identified 'initial' sequence, initial.fas. Opening and processing... Initial sequence recognized, read and processed without errors. 4. The program will then ask the user to define parameters as follows: >MolSelectGen.py Displays< I. SELECT NUMBER OF ALLELES TO GENERATE: Enter integer for the number of alleles you wish to generate (e.g. 25): >cursor< >You type an integer and press 'return': Enter integer for the number of alleles you wish to generate (e.g. 25): 25 >rtn< >WHAT THIS INTERACTION MEANS: You have just specified the number of alleles you want to create. >MolSelectGen.py Displays< II. DEFINE NUMBER OF GENERATIONS: Enter integer for the number of generations required to generate a 'new' allele (e.g. 20): >cursor< >You type an integer and press 'return': Enter integer for the number of generations required to generate a 'new' allele (e.g. 20): 20 >rtn< >WHAT THIS INTERACTION MEANS: Every time a new allele is to be generated an sequence is selected from the existing pool of alleles. If '20' generations are selected, the new allele (which is initially identical to the selected allele) is read by the progam 20 times; each time the user specified mutation rate is applied to every codon, and mutated accordingly. Over the course of the 'generations', mutations accumulate. >MolSelectGen.py Displays< III. SET THE PROPORTION OF NONSYN to SYN SUBSTITUTIONS: Now, you will determine the rate(s) of nonsynonymous to synonymous substitutions in your 'evolved' alleles. By starting at codon '1' and ending at codon '241', you will select a constant rate across the sequence. Regions that are unspecified will default to a rate of '1.0' (e.g. neutral evolution). Remember, 1 codon equals 3 nucleotides. Your sequence contains 241 codons. Please indicate the START POSITION of your region (enter an integer or type 'done' if finished entering regions): >cursor< >You type an integer and press 'return': Your sequence contains 241 codons. Please indicate the START POSITION of your region (enter an integer or type 'done' if finished entering regions): 33 >rtn< >MolSelectGen.py Displays< Your sequence contains 241 codons. Please indicate the END POSITION of your region (enter an integer): >cursor< >You type an integer and press 'return': Your sequence contains 241 codons. Please indicate the END POSITION of your region (enter an integer): 77 >rtn< >MolSelectGen.py Displays< Enter a floating point value for target nonsynonymous to synonymous ratio (e.g. 0.001 = constraint, 1.000 = neutrality, >> 1.000 positive selection). Please limit to 3 significant digits: >cursor< >You type an floating point value and press 'return': Enter a floating point value for target nonsynonymous to synonymous ratio (e.g. 0.001 = constraint, 1.000 = neutrality, >> 1.000 positive selection). Please limit to 3 significant digits: 1.25 >return< >MolSelectGen.py Displays< Your sequence contains 241 codons. Please indicate the START POSITION of your region (enter an integer or type 'done' if finished entering regions): >cursor< >If you are finished defining how different regions will 'evolve', type 'done' and >rtn< or else begin the process again: Your sequence contains 241 codons. Please indicate the START POSITION of your region (enter an integer or type 'done' if finished entering regions): done >rtn< >WHAT THIS INTERACTION MEANS: You have defined the nonsynonymous to synonymous substitution ratio for a given region or regions. If you want a constant target ratio across the entire coding region start with codon 1 and end with the final codon. Otherwise, when specifing different ratios for different regions, make certain that the regions DO NOT OVERLAP (MolSelectGen.py is smart enough to catch such an error, but it's best to avoid it). The ratio you specify (as a proportion) corresponds to your target Ka/Ks or dN/dS for the excerise in question. >MolSelectGen.py Displays< IV. SPECIFY MUTATION RATE: Enter an integer for the target mutation rate (codon changes per 10,000 codons per generation; e.g. 5): >cursor< >You type an integer and press 'return': Enter an integer for the target mutation rate (codon changes per 10,000 codons per generation; e.g. 5): 2 >rtn< >WHAT THIS INTERACTION MEANS: Mutation Rate- the program expects a somewhat non- traditional mutation rate as its purpose is not to provide truly random (neutral) mutations, but rather to allow the instructor to stack the odds for a particular outcome. Here the mutation rate is defined as the number of codon changes per 10,000 codons per generation. This is roughly equivalent to the number of nucleotide substitutions per 30,000 nucleotides per generation. Thus if your starting sequence is 241 codons long (723 nucleotides), you specify a mutation rate of 2 codons per 10,000 codons per generation, and you specify 20 generations you expect roughly 241*20*(2/10000) = ~1 new mutation per new allele. >MolSelectGen.py Displays< V. SPECIFY EXTINCTION RATE: Enter a probability that one of your existing alleles will become extinct at the start of each mutation cycle (0.0 means everyone survives, 1.0 means everyone will become extinct; suggested value 0.05 or lower): >cursor< >You type a floating point value and press 'return': Enter a probability that one of your existing alleles will become extinct at the start of each mutation cycle (0.0 means everyone survives, 1.0 means everyone will become extinct; suggested value 0.05 or lower): 0.05 >rtn< >WHAT THIS INTERACTION MEANS: This is just away to allow some alleles to go 'extinct', thereby not contributing further to your resulting phylogeny and possibly resulting in more interesting trees and datasets. >MolSelectGen.py Displays< A whole bunch of stuff related to the project, then ... Writing multiFASTA file (alleles.fas) with all alleles... Writing log file. Job Complete! Remember, the file 'log.txt' contains all the information from the project including the parameters that you have specified. The authors encourage users to 'play' with the parameters to find the ideal settings for their lab/course. Questions? Comments? Suggestions? Contact Aaron at ampoo_twibbit@hotmail.com, or Alice at ashumate@fdu.edu. Include 'MolSelectGen' in the subject line. Thanks for trying this program. Tootles. Support OpenSource software. It's socially responsible. """ __author__ = """Aaron J. Windsor & Alice M. Shumate""" __copyright__="""(c) 2006, AJW & AMS""" __license__ = """Distributed under the terms of the General GNU Public License (GPL).""" __warranty__ = """NO WARRANTY is explicitly stated or implied. The author/ distributor makes no statements or claims, explicit or implied regarding the fitness of this software for any use or application. By using this software, the user is agreeing that the author/distributor is in no manner liable for any data/ software/hardware loss, damage, modification or any other undesirable outcome of this software's use. Further, the author/distributor makes no proprietory claims to any favorable results obtained through the use of this software.""" __readme__ = """Please read the author, license, and warranty information prior to using this software. By using this software you are accepting the terms laid out in the aforementioned documentation fields.""" __version__= """0.0.2, 20-June-2006""" ###IMPORTS### import os, string, sys, random ############# ###CLASSES### class CodonTable: """Reads codon tables specified in 'codon_tables.txt' file.""" def __init__(self): self.cache = [] self.tables = [] self.table = '' self.aa3 = {} #amino acid 3-letter to codons. self.aa1 = {} #amino acid 1-letter to codons. self.codon2aa = {} #codons to amino acid. self.start_codons = [] self.stop_codons = [] self.aa1Toaa3 = {} self.aa3Toaa1 = {} self.myfile = open('codon_tables.txt', 'rb') self.reader() self.myfile.close() self.chooser() self.grabber() self.summarizer = [self.table, self.aa3, self.aa1, self.start_codons, self.stop_codons, self.codon2aa, self.aa1Toaa3, self.aa3Toaa1] def reader(self): for line in self.myfile: a_line = line.strip() if a_line: if a_line[0] != '#': self.cache.append(a_line) if a_line[0] == '+': self.tables.append(a_line[1:]) else: pass else: pass else: pass def chooser(self): if len(self.tables) == 1: self.table = self.tables[0] else: test = 0 counter = 1 while not test: print 'Please select your preferred translation table:' for x in self.tables: print '\t%s - %s' % (counter, x) counter += 1 my_table = raw_input('Indicate numerical selection: ',) try: self.table = self.tables[int(my_table) - 1] test = 1 except: print 'Invalid selection.' test = 0 def grabber(self): test = 0 for x in self.cache: if not test: if x[1:] == self.table: test = 1 else: pass elif x == '//': test = 0 else: an_entry = x.split(' ') if an_entry[0] not in ('START', 'STOP'): self.aa1Toaa3[an_entry[1]] = an_entry[0] self.aa3Toaa1[an_entry[0]] = an_entry[1] self.aa3[an_entry[0]] = [] self.aa1[an_entry[1]] = [] for y in an_entry[2:]: self.aa3[an_entry[0]].append(y) self.aa1[an_entry[1]].append(y) self.codon2aa[y] = (an_entry[0], an_entry[1]) else: if an_entry[0] == 'START': for y in an_entry[2:]: self.start_codons.append(y) else: for y in an_entry[2:]: self.stop_codons.append(y) def __call__(self): return self.summarizer class FastaExtractor: #code taken from FastaTools.py v0.0.4 """Pulls FASTA file into a dictionary object where keys = header and sequences are lists of nucleotides or characters.""" def __init__(self, file_object, no_gaps = 0): self.fasta_in = file_object self.no_gaps = no_gaps #if true, gap characters will be stripped. self.entry_order = [] self.fasta = {} self.extract() def extract(self): current_header = '' for line in self.fasta_in: aline = line.strip() if aline: if aline[0] == '>': current_header = aline self.entry_order.append(current_header) self.fasta[current_header] = [] elif aline[0] == ';': #Currently, extractor doesn't grab pass #FASTA comments, fix this later. elif self.no_gaps: for x in aline: #this is inefficient, fix later. if x in string.letters: self.fasta[current_header].append(x) else: pass else: self.fasta[current_header] += list(aline) else: pass def __call__(self): return self.fasta class FastaConcat: #code taken from FastaTools.py v0.0.4 """Takes a Fasta dictionary object and combines it into a concatenated FASTA file.""" def __init__(self, fasta_dict, out_path = None, line_len = 60, entry_order = None): self.fasta = fasta_dict self.line_len = line_len self.out_path = out_path if not entry_order: self.entry_order = self.fasta.keys() else: self.entry_order = entry_order self.fasta_out = [] if not self.out_path: raise('NoOutputPath', 'No path to an output file was specified') sys.exit() self.concater() def concater(self): my_order = self.entry_order for x in my_order: counter = 0 temp_catch = [] self.fasta_out.append(x + '\n') for y in self.fasta[x]: temp_catch.append(y) counter += 1 if counter == self.line_len: self.fasta_out.append(''.join(temp_catch) + '\n') temp_catch = [] counter = 0 if temp_catch: self.fasta_out.append(''.join(temp_catch) + '\n') out_phile = open(self.out_path, 'wb') out_phile.writelines(self.fasta_out) out_phile.close() ###FUNCTIONS### def translator(seq_name, nt_list, codon2aa): """Reads CDS 3 nt at a time. Compiles a list of tuples: (aa, codon from original sequence). Returns a dictionary object seq_name:above list.""" my_trans = {} my_trans[seq_name] = [] counter = 0 seq_len = len(nt_list) while counter + 3 <= seq_len: my_codon = ''.join(nt_list[counter:counter + 3]) try: my_trans[seq_name].append((codon2aa[my_codon][1], my_codon)) #print '%s %s' % (codon2aa[my_codon][1], my_codon) counter += 3 except: print 'Sequence is not translatable.' print '\n' + __doc__ return my_trans def mutator(a_tuple, ns, s, codon_usage, table, log_info = None): new_tuple = None nts = {1:'A', 2:'G', 3:'C', 4:'U'} mut_type = random.randrange(1, ns + s + 1, 1) complete = 0 while not complete: if mut_type <= ns: winner = 0 while not winner: start_codon = list(a_tuple[1]) start_codon[random.randrange(0, 3, 1)] = nts[random.randrange(1, 5, 1)] end_codon = ''.join(start_codon) if end_codon in codon_usage: if codon_usage[end_codon][1] != a_tuple[0]: new_tuple = (codon_usage[end_codon][1], end_codon) winner = 1 complete = 1 else: pass else: pass else: candidates = [] for x in table[a_tuple[0]]: if x[0:2] == a_tuple[1][0:2] and x != a_tuple[1]: candidates.append(x) else: pass if not candidates: print '\t\t\tNo synonymous substitution possible for %s.\n\tWill create a nonsynonymous substitution.' % (a_tuple[0]) if log_info: log_info.append('\t\t\tNo synonymous substitution possible for %s.\n\tWill create a nonsynonymous substitution.\n' % (a_tuple[0])) mut_type = 1 else: new_tuple = (a_tuple[0], candidates[random.randrange(0, len(candidates), 1)]) complete = 1 return new_tuple ############### ###PSEUDO-DECLARATIONS### #Yeah, I know declarations are not necessary in Python, #but it makes me more comfortable, keeps me organized. working = os.getcwd() #current working directory. my_dir = [] #cwd structure. ancestor_file = '' #name of file containing ancestral sequence. log_data = [] #collect data for logfile. alleles = {} #a FASTA dictionary object. Catch ancestor and new alleles nucleotide sequences. translations = {} #collect translations new_entry_order = [] num_alleles = 0 gens = 0 #number of generation to generate 1 new allele. region = {} #stores regions for varios ns to s ratios. #nonsyn_syn = 0.0 #target ratio of nonsynonymous to synonymous changes. nonsyn = 0 #relative number of nonsynonymous substitutions. syn = 1000 #relative number of synonymous substitutions. mut_rate = 0 #target number of nucleotide changes per 10000 nucleotides per generation. purge = 0.0 #probability that a generated allele will go extinct. ######################### if __name__ == '__main__': if len(sys.argv) != 1: #call for help print __doc__ sys.exit() else: pass trans = CodonTable() for x in os.walk(working): my_dir.append(x) for x in my_dir[0][2]: if x[-4:] in ('.fas', '.FAS'): if not ancestor_file: ancestor_file = x else: print '\n\nCurrent working directory contains multiple FASTA files.\nEXITING.\n' + __doc__ sys.exit() else: pass print "\n\nIdentified 'initial' sequence, %s.\nOpening and processing...\n" % (ancestor_file) log_data.append('Initial sequence: %s\n' % (ancestor_file)) phile = open(ancestor_file, 'rb') ancest = FastaExtractor(phile, 1) phile.close() if len(ancest()) != 1: print 'Intitial FASTA file contains multiple sequences.\nEXITING.\n' + __doc__ sys.exit() for x in ancest(): raster = len(ancest()[x])%3 if raster != 0: #oops, incomplete codon somewhere. print '\nThe length of the initial sequence is not an increment of 3. Please check the translation (extra nucleotides at start?) or try trimming the terminal %s nucleotides.' % (raster) print __doc__ sys.exit() else: pass new_name = x[0] + '000_' + x[1:] new_entry_order.append(new_name) alleles[new_name] = [] for y in ancest()[x]: if y not in ('t', 'T'): alleles[new_name].append(y.upper()) else: alleles[new_name].append('U') #switch to RNA encoding translations.update(translator(new_name, alleles[new_name], trans()[5])) print 'Initial sequence recognized, read and processed\nwithout errors.\n\nI. SELECT NUMBER OF ALLELES TO GENERATE:' ok = 0 while not ok: try: num_alleles = int(raw_input("\tEnter integer for the number of alleles you wish\n\tto generate (e.g. 25): ",).strip()) ok = 1 except: print '!!!Enter an INTEGER!!!' ok = 0 print '\n\nII. DEFINE NUMBER OF GENERATIONS:' ok = 0 while not ok: try: gens = int(raw_input("\tEnter integer for the number of generations\n\trequired to generate a 'new' allele\n\t(e.g. 20): ",).strip()) log_data.append('Number of Generations: %s\n' % (gens)) ok = 1 except: print '!!!Enter an INTEGER!!!' ok = 0 print '\n\nIII. SET THE PROPORTION OF NONSYN to SYN SUBSTITUTIONS:' anc_length = 0 for x in translations: anc_length = len(translations[x]) print "\tNow, you will determine the rate(s) of nonsynonymous\n\tto synonymous substitutions in your 'evolved'\n\talleles. By starting at codon '1' and ending at\n\tcodon '%s', you will select a constant rate across\n\tthe sequence. Regions that are unspecified will\n\tdefault to a rate of '1.0' (e.g. neutral evolution).\n\tRemember, 1 codon equals 3 nucleotides." % (anc_length) done = 0 while not done: reg_start = 0 reg_end = 0 reg_nonsyn_syn = 1.0 doodad = raw_input("\n\tYour sequence contains %s codons. Please indicate\n\tthe START POSITION of your region (enter an integer\n\tor type 'done' if finished entering regions): " % (anc_length),) if doodad.upper() != 'DONE': try: reg_start = int(doodad) except: print "!!!Please enter an integer or type 'done'!!!" continue ok = 0 while not ok: try: reg_end = int(raw_input("\n\tYour sequence contains %s codons. Please indicate\n\tthe END POSITION of your region (enter an integer): " % (anc_length),)) ok = 1 except: print "!!!Please enter an integer!!!" ok = 0 ok = 0 while not ok: try: reg_nonsyn_syn = float(raw_input("\n\tEnter a floating point value for target nonsynonymous\n\tto synonymous ratio (e.g. 0.001 = constraint,\n\t1.000 = neutrality, >> 1.000 positive selection).\n\tPlease limit to 3 significant digits: ",).strip()) #nonsyn = int(nonsyn_syn*1000) ok = 1 except: print '!!!Enter a FLOATING POINT VALUE!!!' ok = 0 if not region: region[(reg_start, reg_end)] = reg_nonsyn_syn else: AOK = 1 for x in region: exist_range = range(x[0], x[1]) if reg_start in exist_range: print '\n!!!Your new region overlaps an existing entry.\n\tPlease re-enter the previous region.!!!\n' AOK = 0 elif reg_end in exist_range: print '\n!!!Your new region overlaps an existing entry.\n\tPlease re-enter the previous region.!!!\n' AOK = 0 else: pass if AOK: region[(reg_start, reg_end)] = reg_nonsyn_syn else: if not region: log_data.append('The entire sequence has been assigned a nonsyn to syn of 1.0.\n') else: for reg in region: log_data.append('Region spanning codon %s to codon %s was\n\tassigned a nonsyn to syn of %s.\n' % (reg[0], reg[1], region[reg])) log_data.append('Any additional region have been assigned a\n\tnonsyn to syn of 1.0.\n') done = 1 print '\n\nIV. SPECIFY MUTATION RATE:' ok = 0 while not ok: try: mut_rate = int(raw_input("\tEnter an integer for the target mutation rate\n\t(codon changes per 10,000 codons per generation;\n\te.g. 5): ",).strip()) log_data.append('Mutation rate: %s\n' % (mut_rate)) ok = 1 except: print '!!!Enter an INTEGER!!!' ok = 0 print '\n\nV. SPECIFY EXTINCTION RATE:' ok = 0 while not ok: try: purge = float(raw_input("\tEnter a probability that one of your existing\n\talleles will become extinct at the start of each\n\tmutation cycle (0.0 means everyone survives,\n\t1.0 means everyone will become extinct;\n\tsuggested value 0.05 or lower): ",).strip()) log_data.append('Extinction rate: %s\n' % (purge)) ok = 1 except: print 'Enter a FLOATING POINT VALUE!' ok = 0 print '\n\nSimilating dataset ...' mutate = range(1, mut_rate + 1) #value that if randomly selected result in mutation. up_count = 1 while len(new_entry_order) != num_alleles + 1: if random.random() <= purge and len(new_entry_order) != 1: extinct = random.choice(new_entry_order) del(new_entry_order[new_entry_order.index(extinct)]) del(translations[extinct]) print "\t\tAllele, %s, rendered extinct." % (extinct) log_data.append("\t\tAllele, %s, rendered extinct.\n" % (extinct)) else: pass my_choices = translations.keys() immediate_ancestor = random.choice(my_choices) new_allele = immediate_ancestor[0] + '%.3d_' % (up_count) + immediate_ancestor[5:] new_entry_order.append(new_allele) log_out = 'Generating %s from %s...' % (new_allele, immediate_ancestor) print log_out log_data.append(log_out + '\n') translations[new_allele] = translations[immediate_ancestor][:] this_gen = gens num_muts = 0 while this_gen != 0: my_offset = 0 #keeps track of where we are in a translation. for x in translations[new_allele]: do_we_mut = random.randrange(1, 10001, 1) if do_we_mut in mutate: for y in region: if my_offset + 1 in range(y[0], y[1]): nonsyn = int(region[y]*1000) break else: pass else: nonsyn = 1000 whosi = '\t\tMutating codon %s, assuming NS to S of %s.' % (my_offset + 1, float(nonsyn)/float(syn)) print whosi log_data.append(whosi + '\n') translations[new_allele][my_offset] = mutator(x, nonsyn, syn, trans()[5], trans()[2], log_data) num_muts += 1 else: pass my_offset += 1 this_gen -= 1 mut_gen = '\t\tMutations generated: %s.' % (num_muts) print mut_gen log_data.append(mut_gen + '\n') up_count += 1 for x in translations: realcodons = [] for y in translations[x]: realcodons.append(y[1]) realcodons = ''.join(realcodons) alleles[x] = [] for y in realcodons: if y != 'U': alleles[x].append(y) else: alleles[x].append('T') print 'Writing multiFASTA file (alleles.fas) with all alleles...' FastaConcat(alleles, 'alleles.fas', 60, new_entry_order) print 'Writing log file.' phile = open('log.txt', 'wb') phile.writelines(log_data) phile.close() print 'Job Complete!' #print trans()[0] #for x in trans()[1]: #print x, #print trans()[1][x] #for x in trans()[2]: #print x, #print trans()[2][x] #print 'START CODONS ', #print trans()[3] #print 'STOP CODONS ', #print trans()[4] #for x in trans()[5]: #print x, #print trans()[5][x] #for x in trans()[6]: #print x, #print trans()[6][x] #for x in trans()[7]: #print x, #print trans()[7][x]