Python pattern matching algorithm tuple error

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • DossarLX ODI
    Batch Manager
    Game Manager
    FFR Simfile Author
    • Mar 2008
    • 14998

    #16
    Re: Python pattern matching algorithm tuple error

    You can insert gaps in both sequences. This picture is straight out of the assignment; the table is annoying as fuck to read so I'll basically just say what he was getting at.

    Original sequences:
    seq1: 'ATG'
    seq2: 'GGAAT'

    The two alignments listed

    G G A A T
    - - A T G

    G G A A T -
    - - - A T G

    give the same score.

    For the first one: (0) - 5 - 5 + 20 + 5 + 5 = 20
    For the second one: (0) - 5 - 5 - 5 + 20 + 20 - 5 = 20



    It pisses me off how ambiguous my professor was on this assignment. Apparently the score has to add up everything, even mismatches, and only penalize for gaps. This is stupid.

    Originally posted by Reincarnate
    (I also don't want to waste your time if I am going off-track with all this -- feel free to ignore if you're on a tight deadline)
    The current code I have doesn't work since I can't even import numpy or scipy. There's something about them being 32-bit when I have a 64-bit machine. Even then I don't even know if the code from the textbook even works because well, the variable names used were absolute garbage and it had errors when I tried running it.
    Last edited by DossarLX ODI; 02-9-2014, 10:55 AM.
    Originally posted by hi19hi19
    oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

    Comment

    • Reincarnate
      x'); DROP TABLE FFR;--
      • Nov 2010
      • 6332

      #17
      Re: Python pattern matching algorithm tuple error

      So it's always +matrix[base1][base2] and then -5 if we have "*-" or "-*" where * is A,C,T, or G?

      Comment

      • Reincarnate
        x'); DROP TABLE FFR;--
        • Nov 2010
        • 6332

        #18
        Re: Python pattern matching algorithm tuple error

        You can use numpy if you're using a 64-bit machine (that's what I'm on) -- you just have to make sure you're using the right versions of Python/Numpy/etc.

        Comment

        • DossarLX ODI
          Batch Manager
          Game Manager
          FFR Simfile Author
          • Mar 2008
          • 14998

          #19
          Re: Python pattern matching algorithm tuple error

          Yes, if there is a dash it is a gap penalty. If the letters don't match up it still adds positively to the score if they're not gaps (which I think is very counterintuitive).

          I am using Python 2.7.6 by the way, which are you using?
          Originally posted by hi19hi19
          oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

          Comment

          • Reincarnate
            x'); DROP TABLE FFR;--
            • Nov 2010
            • 6332

            #20
            Re: Python pattern matching algorithm tuple error

            Is it possibly something like this?

            Code:
            class memoize:
                def __init__(self, fn):
                    self.fn = fn
                    self.memo = {}
                def __call__(self, *args, **kwds):
                    import cPickle
                    str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
                    if not self.memo.has_key(str):
                        self.memo[str] = self.fn(*args, **kwds)
                    return self.memo[str]
            
            @memoize
            def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
                if len(seq1)*len(seq2)==0: #one or the other is 0
                    return g*abs(len(seq1)-len(seq2)) #the rest shall be gaps, damn it
            
                possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
                possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
                possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
                
                return max([possibleScore1, possibleScore2, possibleScore3])
            
            
            
            mat = [[20, 10, 5, 5],
                   [10, 20, 5, 5],
                   [5, 5, 20, 10],
                   [5, 5, 10, 20]]
            
            alphabet = "TCAG"
            baseMap = {base:index for index,base in enumerate(alphabet)}
            print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT')
            I am using 2.7.3 (I should probably upgrade but this version has always done everything I've needed so I haven't changed it yet)
            Last edited by Reincarnate; 02-9-2014, 11:05 AM.

            Comment

            • DossarLX ODI
              Batch Manager
              Game Manager
              FFR Simfile Author
              • Mar 2008
              • 14998

              #21
              Re: Python pattern matching algorithm tuple error

              Originally posted by Reincarnate
              Is it possibly something like this?

              Code:
              class memoize:
                  def __init__(self, fn):
                      self.fn = fn
                      self.memo = {}
                  def __call__(self, *args, **kwds):
                      import cPickle
                      str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
                      if not self.memo.has_key(str):
                          self.memo[str] = self.fn(*args, **kwds)
                      return self.memo[str]
              
              @memoize
              def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
                  if len(seq1)*len(seq2)==0: #one or the other is 0
                      return g*abs(len(seq1)-len(seq2)) #the rest shall be gaps, damn it
              
                  possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
                  possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
                  possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
                  
                  return max([possibleScore1, possibleScore2, possibleScore3])
              
              
              
              mat = [[20, 10, 5, 5],
                     [10, 20, 5, 5],
                     [5, 5, 20, 10],
                     [5, 5, 10, 20]]
              
              alphabet = "TCAG"
              baseMap = {base:index for index,base in enumerate(alphabet)}
              print dpScore(mat, baseMap, 'ATG', 'GGAAT')
              I am using 2.7.3 (I should probably upgrade but this version has always done everything I've needed so I haven't changed it yet)
              I tested this out with ATG and GGAAT and it gave me 20 (like in the document).

              I'm lost in that block of code you wrote though. Like I mentioned I'm new to python so seeing all that code is like drinking out of a fire hose; mind giving some comments please?
              Last edited by DossarLX ODI; 02-9-2014, 11:09 AM.
              Originally posted by hi19hi19
              oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

              Comment

              • Reincarnate
                x'); DROP TABLE FFR;--
                • Nov 2010
                • 6332

                #22
                Re: Python pattern matching algorithm tuple error

                I mean right now I'm using a memoization class which is the "lazy man's" way to use dynamic programming in Python (and in this case it's memoizing a lot more than it needs to, but I digress -- this was just to ensure the score's right and recursion is easy to code).

                Basically, at each step of the DP, it takes the max of three possibilities:

                1. Inserting a gap in the second sequence, incurring the g-cost, and then continuing onward
                2. Inserting a gap in the first sequence, incurring the g-cost, and then continuing onward
                3. Not inserting a gap, so taking the score of the base-pair, and continuing onward

                If we run out of bases in a particular string, then we just incur gap-costs for the rest (however long the remaining string is)

                Since it's constantly taking the max of all three possibilities at each step, and recursing at each step, you get the maximum possible score from all possible ways of inserting gaps/not inserting gaps every which way.

                For really long strings, this could take forever, but that's what DP is for:

                Memoization (in this case) is just caching results at intermediate steps so they don't have to get re-calculated over and over again (which is the heart of DP).

                For instance, if I asked you to add 1+2+3+4+5, you'd tell me "15!"
                Now if I asked, "Okay, now what is 1+2+3+4+5+10", you'd very quickly tell me "25!"

                In this case, you knew it was just 10+your previous answer, so you didn't re-calculate "1+2+3+4+5+10" from scratch. Memoization is basically the same thing.

                Disclaimer: not a CS expert, so I'm sure someone will come along and point out all the flaws in what I've said
                Last edited by Reincarnate; 02-9-2014, 11:25 AM.

                Comment

                • Reincarnate
                  x'); DROP TABLE FFR;--
                  • Nov 2010
                  • 6332

                  #23
                  Re: Python pattern matching algorithm tuple error

                  Let's say we had the two sequences:

                  s1 = "ATG"
                  s2 = "TG"

                  so there are three options:



                  1. insert a gap in the second sequence, continue with the rest:

                  "A" + "TG"
                  "-" + "TG"

                  so here we take the gap cost of -5 and continue with the two substrings "TG" and "TG"



                  2. insert a gap in the first sequence, continue with the rest:

                  "-" + "ATG"
                  "T" + "G"

                  so here we take the gap cost of -5 and continue with the two substrings "ATG" and "G"



                  3. Take the current base pair, continue with the rest

                  "A" + "TG"
                  "T" + "G"

                  so here we take whatever matrix["A"]["T"] is and continue with the two substrings "TG" and "G"

                  And so each time we "continue with substrings" we are performing the same DP calculation on smaller sub-problems. Every time we encounter a new subproblem, we're always taking the max of these three options. As there are many ways to arrive at the same sub-problems, we cache the results as we calculate stuff so we can re-use them again later if we encounter them.

                  Comment

                  • DossarLX ODI
                    Batch Manager
                    Game Manager
                    FFR Simfile Author
                    • Mar 2008
                    • 14998

                    #24
                    Re: Python pattern matching algorithm tuple error

                    So essentially you are remembering which position of the matrix you are in so when you consider the maximum of three possibilities you can just start from that index again. That makes sense.

                    Also, that means you're splitting the string into a smaller string each time you do that correct?

                    Comments in that code would be very useful because I'm getting the syntax for Python down to see what your code is actually saying.
                    matrix[i,1:] # Take ith row, go from column 1 to the last column for each row
                    matrix[0] = g # entire first row all gap penalties?
                    matrix[:,0] = g # make first column all gap penalties?
                    string[::-1] # Reverse string
                    alphabet.index( char ) # Turn character into an integer based on alphabet values
                    # [x,y] is interpreted as (x,y)
                    # [x][y] is the index in the matrix
                    Last edited by DossarLX ODI; 02-9-2014, 11:35 AM.
                    Originally posted by hi19hi19
                    oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

                    Comment

                    • Reincarnate
                      x'); DROP TABLE FFR;--
                      • Nov 2010
                      • 6332

                      #25
                      Re: Python pattern matching algorithm tuple error

                      does this help

                      Code:
                      class memoize:
                          def __init__(self, fn):
                              self.fn = fn
                              self.memo = {}
                          def __call__(self, *args, **kwds):
                              import cPickle
                              str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
                              if not self.memo.has_key(str):
                                  self.memo[str] = self.fn(*args, **kwds)
                              return self.memo[str]
                      
                      #memoization decorator, so any time dpScore(args) is called with arguments we've seen before,
                      #we don't re-calculate it -- we return the answer that we cached for it
                      @memoize
                      def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
                          if len(seq1)*len(seq2)==0: #if one or the other is length 0, i.e. an empty string
                              return g*abs(len(seq1)-len(seq2)) #then we incur gaps for the length of the difference
                      
                          #note that if both strings are length 0, no gap costs are incurred
                      
                          #seq1[1:] means "everything but the first character of seq1"
                          #This is (iirc) called a string-slice operation. 
                          #edit: here http://techearth.net/python/index.php5?title=Python:Basics:Slices
                      
                          possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
                          possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
                          #these two options reflect using a gap at the front of seq2 and seq1, respectively.
                          
                          possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
                          #we aren't using a gap here, but we're looking at the first char of each sequence,
                          #using the baseMap to get the correct indices for the matrix, and then using the matrix
                          #to get the value we need to add to this score.
                      
                          return max([possibleScore1, possibleScore2, possibleScore3])
                          #return the max score from the three possibilities
                      
                      
                      
                      mat = [[20, 10, 5, 5],
                             [10, 20, 5, 5],
                             [5, 5, 20, 10],
                             [5, 5, 10, 20]]
                      
                      alphabet = "TCAG"
                      baseMap = {base:index for index,base in enumerate(alphabet)}
                      #creates a lookup for the matrix. So baseMap["T"] = 0, baseMap["C"] = 1, etc.
                      
                      print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT')
                      And yeah when it comes to list-type objects in Python, like L = [1,4,7,3], then doing something like L[2] gives you the third element of the list (due to 0-indexing, L[0] is the first element).

                      A matrix, here, is just a list of lists, so mat[i][j] is telling you which sub-list and which element of that sub-list to look at.
                      Last edited by Reincarnate; 02-9-2014, 11:44 AM.

                      Comment

                      • Reincarnate
                        x'); DROP TABLE FFR;--
                        • Nov 2010
                        • 6332

                        #26
                        Re: Python pattern matching algorithm tuple error

                        Using a dictionary object to cache results instead of a memoization decorator:

                        Code:
                        cache = {} 
                        '''
                        cache here is a dictionary object, which means you do cache[key] and you get a corresponding value back. 
                        The key can be (mostly) anything you want (IIRC, as long as it's an immutable object like a tuple).
                        A tuple is something like (1,2,3) versus a list [1,2,3]. Tuples are basically "immutable" lists, 
                        which means they can't be modified directly.
                        
                        By the way, these triple-apostrophe things are how you do comment-blocks. Pound/hashtag sign is for a line-comment.
                        '''
                        
                        def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
                            if len(seq1)*len(seq2)==0: #if one or the other is 0, i.e. an empty string
                                return g*abs(len(seq1)-len(seq2)) #then we incur gaps for the length of the difference
                        
                            #note that if both strings are length 0, no gap costs are incurred
                        
                            if (seq1,seq2) in cache: return cache[(seq1,seq2)] 
                            #if we've already solved the sub-problem for our two strings, return the answer we cached
                        
                            #seq1[1:] means "everything but the first character of seq1"
                        
                            possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
                            possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
                            #these two options reflect using a gap at the front of seq2 and seq1, respectively.
                            
                            possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
                            #we aren't using a gap here, but we're looking at the first char of each sequence,
                            #using the baseMap to get the correct indices for the matrix, and then using the matrix
                            #to get the value we need to add to this score.
                        
                            cache[(seq1,seq2)] = max([possibleScore1, possibleScore2, possibleScore3]) #cache the result
                            return cache[(seq1,seq2)]
                            #return the max score from the three possibilities
                        
                        
                        
                        mat = [[20, 10, 5, 5],
                               [10, 20, 5, 5],
                               [5, 5, 20, 10],
                               [5, 5, 10, 20]]
                        
                        alphabet = "TCAG"
                        baseMap = {base:index for index,base in enumerate(alphabet)}
                        #creates a lookup for the matrix. So baseMap["T"] = 0, baseMap["C"] = 1, etc.
                        
                        print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT')
                        Last edited by Reincarnate; 02-9-2014, 12:04 PM. Reason: added comments

                        Comment

                        • DossarLX ODI
                          Batch Manager
                          Game Manager
                          FFR Simfile Author
                          • Mar 2008
                          • 14998

                          #27
                          Re: Python pattern matching algorithm tuple error

                          Currently looking at your comments to see what you're getting at. From the memoize comments,

                          Code:
                          def dpScore():
                              # function stuff
                          dpScore = memoize(dpScore)
                          
                          @memoize
                          def dpScore():
                              # function stuff
                          Do the same exact thing then.
                          Originally posted by hi19hi19
                          oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

                          Comment

                          • Reincarnate
                            x'); DROP TABLE FFR;--
                            • Nov 2010
                            • 6332

                            #28
                            Re: Python pattern matching algorithm tuple error

                            If that code makes sense then you can try an iterative-based approach to the DP using arrays (in Python, technically lists).

                            Note that all possible subproblems are basically of form:
                            (some substring of the first main sequence, some substring of the second main sequence)

                            where for us, in our example the main sequences are 'ATCGTAGTA' and 'ATGTTAT'

                            So if sequence 1 is length I and sequence 2 is length J, we can have a matrix of size (I+1)*(J+1) and each element of this matrix will correspond to the maximal score you can get from the substrings. (the +1 on each side is because it is possible to compare a substring against a 0-length string)

                            So for example, if we call our matrix "dp," then we might have dp[3][1] or something, which means we're looking at the maximal score that comes from the subproblem ("GTAGTA", "TGTTAT"). In other words, the index-3 character onward for the first main sequence, and the index-1 character onward for the second main sequence.

                            This means, at the very end, dp[0][0] will be our answer.


                            I'll explain this later when I have time (or someone else can explain it) -- I have to run out for the next few hours:
                            (I also haven't debugged this yet so there may be a silly error somewhere, I'll fix later)

                            Code:
                            mat = [[20, 10, 5, 5],
                                   [10, 20, 5, 5],
                                   [5, 5, 20, 10],
                                   [5, 5, 10, 20]]
                            
                            alphabet = "TCAG"
                            baseMap = {base:index for index,base in enumerate(alphabet)}
                            
                            
                            seq1='ATCGTAGTA'
                            seq2='ATGTTAT'
                            g=-5
                            
                            
                            dp = [[0 for j in xrange(len(seq2)+1)] for i in xrange(len(seq1)+1)] #creates a list of lists, which we use as a matrix
                            
                            for i in xrange(len(seq1)+1):
                                dp[i][len(seq2)] = g*abs(len(seq1)-i)  #initialize the cases where we have second substring empty
                            for j in xrange(len(seq2)+1):
                                dp[len(seq1)][j] = g*abs(len(seq2)-j) #initialize the cases where we have first substring empty
                            
                            
                            #http://www.velocityreviews.com/forums/t636275-how-to-make-a-reverse-for-loop-in-python.html
                            
                            for i in xrange(len(seq1)-1,-1,-1): #len(seq1)-1 <= i <= 0, iterating down by -1
                                for j in xrange(len(seq2)-1,-1,-1): #len(seq2)-1 <= j <= 0, iterating down by -1.
                                    possibleScore1 = g + dp[i+1][j]
                                    possibleScore2 = g + dp[i][j+1]
                                    possibleScore3 = mat[baseMap[seq1[i]]][baseMap[seq2[j]]] + dp[i+1][j+1]
                                    dp[i][j] = max([possibleScore1, possibleScore2, possibleScore3])
                                    
                            print dp[0][0]
                            Last edited by Reincarnate; 02-9-2014, 12:39 PM. Reason: made a quick fix, running out now

                            Comment

                            • DossarLX ODI
                              Batch Manager
                              Game Manager
                              FFR Simfile Author
                              • Mar 2008
                              • 14998

                              #29
                              Re: Python pattern matching algorithm tuple error

                              The memoize class constructor stuff in the beginning of the recursive algorithm code will need to be discussed more later, I sent a PM about possibly discussing over some kind of instant messanging program. I'll stick with the iterative solution since it doesn't require constructing a class.

                              Also your iterative code works.
                              Last edited by DossarLX ODI; 02-9-2014, 12:50 PM.
                              Originally posted by hi19hi19
                              oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

                              Comment

                              • Reincarnate
                                x'); DROP TABLE FFR;--
                                • Nov 2010
                                • 6332

                                #30
                                Re: Python pattern matching algorithm tuple error

                                Post any questions you have and I'll address them when I can (I am not at comp atm)

                                (It's probably best to ignore the solution with the memo class for easiest understanding -- use the one with the cache dict instead, or the iterative translation)

                                Comment

                                Working...