Python pattern matching algorithm tuple error

**DossarLX ODI** · 02-9-2014, 10:50 AM

Re: Python pattern matching algorithm tuple error

You can insert gaps in both sequences. This picture is straight out of the assignment; the table is annoying as fuck to read so I'll basically just say what he was getting at.

Original sequences:
seq1: 'ATG'
seq2: 'GGAAT'

The two alignments listed

G G A A T
- - A T G

G G A A T -
- - - A T G

give the same score.

For the first one: (0) - 5 - 5 + 20 + 5 + 5 = 20
For the second one: (0) - 5 - 5 - 5 + 20 + 20 - 5 = 20

It pisses me off how ambiguous my professor was on this assignment. Apparently the score has to add up everything, even mismatches, and only penalize for gaps. This is stupid.

Originally posted by Reincarnate

(I also don't want to waste your time if I am going off-track with all this -- feel free to ignore if you're on a tight deadline)

The current code I have doesn't work since I can't even import numpy or scipy. There's something about them being 32-bit when I have a 64-bit machine. Even then I don't even know if the code from the textbook even works because well, the variable names used were absolute garbage and it had errors when I tried running it.

**Reincarnate** · 02-9-2014, 10:54 AM

Re: Python pattern matching algorithm tuple error

So it's always +matrix[base1][base2] and then -5 if we have "*-" or "-*" where * is A,C,T, or G?

**Reincarnate** · 02-9-2014, 10:57 AM

Re: Python pattern matching algorithm tuple error

You can use numpy if you're using a 64-bit machine (that's what I'm on) -- you just have to make sure you're using the right versions of Python/Numpy/etc.

**DossarLX ODI** · 02-9-2014, 10:57 AM

Re: Python pattern matching algorithm tuple error

Yes, if there is a dash it is a gap penalty. If the letters don't match up it still adds positively to the score if they're not gaps (which I think is very counterintuitive).

I am using Python 2.7.6 by the way, which are you using?

**Reincarnate** · 02-9-2014, 10:59 AM

Re: Python pattern matching algorithm tuple error

Is it possibly something like this?

Code:

class memoize:
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args, **kwds):
        import cPickle
        str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
        if not self.memo.has_key(str):
            self.memo[str] = self.fn(*args, **kwds)
        return self.memo[str]

@memoize
def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
    if len(seq1)*len(seq2)==0: #one or the other is 0
        return g*abs(len(seq1)-len(seq2)) #the rest shall be gaps, damn it

    possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
    possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
    possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
    
    return max([possibleScore1, possibleScore2, possibleScore3])



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}
print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT')

I am using 2.7.3 (I should probably upgrade but this version has always done everything I've needed so I haven't changed it yet)

**DossarLX ODI** · 02-9-2014, 11:07 AM

Re: Python pattern matching algorithm tuple error

Originally posted by Reincarnate

Is it possibly something like this?

Code:

class memoize:
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args, **kwds):
        import cPickle
        str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
        if not self.memo.has_key(str):
            self.memo[str] = self.fn(*args, **kwds)
        return self.memo[str]

@memoize
def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
    if len(seq1)*len(seq2)==0: #one or the other is 0
        return g*abs(len(seq1)-len(seq2)) #the rest shall be gaps, damn it

    possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
    possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
    possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
    
    return max([possibleScore1, possibleScore2, possibleScore3])



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}
print dpScore(mat, baseMap, 'ATG', 'GGAAT')

I am using 2.7.3 (I should probably upgrade but this version has always done everything I've needed so I haven't changed it yet)

I tested this out with ATG and GGAAT and it gave me 20 (like in the document).

I'm lost in that block of code you wrote though. Like I mentioned I'm new to python so seeing all that code is like drinking out of a fire hose; mind giving some comments please?

**Reincarnate** · 02-9-2014, 11:15 AM

Re: Python pattern matching algorithm tuple error

I mean right now I'm using a memoization class which is the "lazy man's" way to use dynamic programming in Python (and in this case it's memoizing a lot more than it needs to, but I digress -- this was just to ensure the score's right and recursion is easy to code).

Basically, at each step of the DP, it takes the max of three possibilities:

1. Inserting a gap in the second sequence, incurring the g-cost, and then continuing onward
2. Inserting a gap in the first sequence, incurring the g-cost, and then continuing onward
3. Not inserting a gap, so taking the score of the base-pair, and continuing onward

If we run out of bases in a particular string, then we just incur gap-costs for the rest (however long the remaining string is)

Since it's constantly taking the max of all three possibilities at each step, and recursing at each step, you get the maximum possible score from all possible ways of inserting gaps/not inserting gaps every which way.

For really long strings, this could take forever, but that's what DP is for:

Memoization (in this case) is just caching results at intermediate steps so they don't have to get re-calculated over and over again (which is the heart of DP).

For instance, if I asked you to add 1+2+3+4+5, you'd tell me "15!"
Now if I asked, "Okay, now what is 1+2+3+4+5+10", you'd very quickly tell me "25!"

In this case, you knew it was just 10+your previous answer, so you didn't re-calculate "1+2+3+4+5+10" from scratch. Memoization is basically the same thing.

Disclaimer: not a CS expert, so I'm sure someone will come along and point out all the flaws in what I've said

**Reincarnate** · 02-9-2014, 11:28 AM

Re: Python pattern matching algorithm tuple error

Let's say we had the two sequences:

s1 = "ATG"
s2 = "TG"

so there are three options:

1. insert a gap in the second sequence, continue with the rest:

"A" + "TG"
"-" + "TG"

so here we take the gap cost of -5 and continue with the two substrings "TG" and "TG"

2. insert a gap in the first sequence, continue with the rest:

"-" + "ATG"
"T" + "G"

so here we take the gap cost of -5 and continue with the two substrings "ATG" and "G"

3. Take the current base pair, continue with the rest

"A" + "TG"
"T" + "G"

so here we take whatever matrix["A"]["T"] is and continue with the two substrings "TG" and "G"

And so each time we "continue with substrings" we are performing the same DP calculation on smaller sub-problems. Every time we encounter a new subproblem, we're always taking the max of these three options. As there are many ways to arrive at the same sub-problems, we cache the results as we calculate stuff so we can re-use them again later if we encounter them.

**DossarLX ODI** · 02-9-2014, 11:29 AM

Re: Python pattern matching algorithm tuple error

So essentially you are remembering which position of the matrix you are in so when you consider the maximum of three possibilities you can just start from that index again. That makes sense.

Also, that means you're splitting the string into a smaller string each time you do that correct?

Comments in that code would be very useful because I'm getting the syntax for Python down to see what your code is actually saying.
matrix[i,1:] # Take ith row, go from column 1 to the last column for each row
matrix[0] = g # entire first row all gap penalties?
matrix[:,0] = g # make first column all gap penalties?
string[::-1] # Reverse string
alphabet.index( char ) # Turn character into an integer based on alphabet values
# [x,y] is interpreted as (x,y)
# [x][y] is the index in the matrix

**Reincarnate** · 02-9-2014, 11:39 AM

Re: Python pattern matching algorithm tuple error

does this help

Code:

class memoize:
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args, **kwds):
        import cPickle
        str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
        if not self.memo.has_key(str):
            self.memo[str] = self.fn(*args, **kwds)
        return self.memo[str]

#memoization decorator, so any time dpScore(args) is called with arguments we've seen before,
#we don't re-calculate it -- we return the answer that we cached for it
@memoize
def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
    if len(seq1)*len(seq2)==0: #if one or the other is length 0, i.e. an empty string
        return g*abs(len(seq1)-len(seq2)) #then we incur gaps for the length of the difference

    #note that if both strings are length 0, no gap costs are incurred

    #seq1[1:] means "everything but the first character of seq1"
    #This is (iirc) called a string-slice operation. 
    #edit: here http://techearth.net/python/index.php5?title=Python:Basics:Slices

    possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
    possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
    #these two options reflect using a gap at the front of seq2 and seq1, respectively.
    
    possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
    #we aren't using a gap here, but we're looking at the first char of each sequence,
    #using the baseMap to get the correct indices for the matrix, and then using the matrix
    #to get the value we need to add to this score.

    return max([possibleScore1, possibleScore2, possibleScore3])
    #return the max score from the three possibilities



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}
#creates a lookup for the matrix. So baseMap["T"] = 0, baseMap["C"] = 1, etc.

print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT')

And yeah when it comes to list-type objects in Python, like L = [1,4,7,3], then doing something like L[2] gives you the third element of the list (due to 0-indexing, L[0] is the first element).

A matrix, here, is just a list of lists, so mat[i][j] is telling you which sub-list and which element of that sub-list to look at.

**Reincarnate** · 02-9-2014, 11:49 AM

Re: Python pattern matching algorithm tuple error

Using a dictionary object to cache results instead of a memoization decorator:

Code:

cache = {} 
'''
cache here is a dictionary object, which means you do cache[key] and you get a corresponding value back. 
The key can be (mostly) anything you want (IIRC, as long as it's an immutable object like a tuple).
A tuple is something like (1,2,3) versus a list [1,2,3]. Tuples are basically "immutable" lists, 
which means they can't be modified directly.

By the way, these triple-apostrophe things are how you do comment-blocks. Pound/hashtag sign is for a line-comment.
'''

def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
    if len(seq1)*len(seq2)==0: #if one or the other is 0, i.e. an empty string
        return g*abs(len(seq1)-len(seq2)) #then we incur gaps for the length of the difference

    #note that if both strings are length 0, no gap costs are incurred

    if (seq1,seq2) in cache: return cache[(seq1,seq2)] 
    #if we've already solved the sub-problem for our two strings, return the answer we cached

    #seq1[1:] means "everything but the first character of seq1"

    possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
    possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
    #these two options reflect using a gap at the front of seq2 and seq1, respectively.
    
    possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
    #we aren't using a gap here, but we're looking at the first char of each sequence,
    #using the baseMap to get the correct indices for the matrix, and then using the matrix
    #to get the value we need to add to this score.

    cache[(seq1,seq2)] = max([possibleScore1, possibleScore2, possibleScore3]) #cache the result
    return cache[(seq1,seq2)]
    #return the max score from the three possibilities



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}
#creates a lookup for the matrix. So baseMap["T"] = 0, baseMap["C"] = 1, etc.

print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT')

**DossarLX ODI** · 02-9-2014, 11:51 AM

Re: Python pattern matching algorithm tuple error

Currently looking at your comments to see what you're getting at. From the memoize comments,

Code:

def dpScore():
    # function stuff
dpScore = memoize(dpScore)

@memoize
def dpScore():
    # function stuff

Do the same exact thing then.

**Reincarnate** · 02-9-2014, 12:27 PM

Re: Python pattern matching algorithm tuple error

If that code makes sense then you can try an iterative-based approach to the DP using arrays (in Python, technically lists).

Note that all possible subproblems are basically of form:
(some substring of the first main sequence, some substring of the second main sequence)

where for us, in our example the main sequences are 'ATCGTAGTA' and 'ATGTTAT'

So if sequence 1 is length I and sequence 2 is length J, we can have a matrix of size (I+1)*(J+1) and each element of this matrix will correspond to the maximal score you can get from the substrings. (the +1 on each side is because it is possible to compare a substring against a 0-length string)

So for example, if we call our matrix "dp," then we might have dp[3][1] or something, which means we're looking at the maximal score that comes from the subproblem ("GTAGTA", "TGTTAT"). In other words, the index-3 character onward for the first main sequence, and the index-1 character onward for the second main sequence.

This means, at the very end, dp[0][0] will be our answer.

I'll explain this later when I have time (or someone else can explain it) -- I have to run out for the next few hours:
(I also haven't debugged this yet so there may be a silly error somewhere, I'll fix later)

Code:

mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}


seq1='ATCGTAGTA'
seq2='ATGTTAT'
g=-5


dp = [[0 for j in xrange(len(seq2)+1)] for i in xrange(len(seq1)+1)] #creates a list of lists, which we use as a matrix

for i in xrange(len(seq1)+1):
    dp[i][len(seq2)] = g*abs(len(seq1)-i)  #initialize the cases where we have second substring empty
for j in xrange(len(seq2)+1):
    dp[len(seq1)][j] = g*abs(len(seq2)-j) #initialize the cases where we have first substring empty


#http://www.velocityreviews.com/forums/t636275-how-to-make-a-reverse-for-loop-in-python.html

for i in xrange(len(seq1)-1,-1,-1): #len(seq1)-1 <= i <= 0, iterating down by -1
    for j in xrange(len(seq2)-1,-1,-1): #len(seq2)-1 <= j <= 0, iterating down by -1.
        possibleScore1 = g + dp[i+1][j]
        possibleScore2 = g + dp[i][j+1]
        possibleScore3 = mat[baseMap[seq1[i]]][baseMap[seq2[j]]] + dp[i+1][j+1]
        dp[i][j] = max([possibleScore1, possibleScore2, possibleScore3])
        
print dp[0][0]

**DossarLX ODI** · 02-9-2014, 12:44 PM

Re: Python pattern matching algorithm tuple error

The memoize class constructor stuff in the beginning of the recursive algorithm code will need to be discussed more later, I sent a PM about possibly discussing over some kind of instant messanging program. I'll stick with the iterative solution since it doesn't require constructing a class.

Also your iterative code works.

**Reincarnate** · 02-9-2014, 01:29 PM

Re: Python pattern matching algorithm tuple error

Post any questions you have and I'll address them when I can (I am not at comp atm)

(It's probably best to ignore the solution with the memo class for easiest understanding -- use the one with the cache dict instead, or the iterative translation)