Python pattern matching algorithm tuple error

**beary605** · 02-8-2014, 11:22 PM

Re: Python pattern matching algorithm tuple error

A really simple error: you need to use map[x][y] instead of map[x,y]. "x,y" gets interpreted as the tuple (x, y), so you end up accessing map[ (x, y) ], and a plain list only accepts integers or slices as indices, so Python raises a TypeError.

I'll look over the rest of the code to see if there are any other errors

edit: wow, I did not know you could put commas like that in subscripts... see if the above works; if not, then... hm
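A minimal sketch of the difference, assuming the matrix is a plain list of lists (the names `grid`, `x`, and `y` are made up for illustration):

```python
# Hypothetical 2x3 matrix stored as a list of lists.
grid = [[1, 2, 3],
        [4, 5, 6]]
x, y = 1, 2

# Chained indexing: first pick row x, then column y within that row.
print(grid[x][y])  # 6

# grid[x, y] means grid[(x, y)]; lists reject tuple indices.
try:
    grid[x, y]
    tuple_index_failed = False
except TypeError as err:
    tuple_index_failed = True
    print("TypeError: %s" % err)
```

A NumPy array, by contrast, does accept `a[x, y]`, which is probably where the comma form comes from.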

**DossarLX ODI** · 02-9-2014, 08:29 AM

Re: Python pattern matching algorithm tuple error

Originally posted by beary605

A really simple error: you need to use map[x][y] instead of map[x,y] ("x,y" gets interpreted as (x, y), so you access map[ (x, y) ] which Python doesn't like)

I changed it to what you said and got this, thanks

>>> t1 = stringScore( mat, alphabet, seq1 , seq2 , g = -5 )
>>> t1
65
>>>

Which looks correct, because the two strings would result in:
20 + 20 + (-5) + (-5) + 20 + 20 + (-5) = 80 - 15 = 65
Those negatives being the mismatches.
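The arithmetic above can be checked column by column. This is only a sketch: it assumes the substitution matrix and the "TCAG" row/column order that appear in the code later in this thread, and the rule that a match adds the matrix value while a mismatch subtracts it. `column_scores` is a made-up helper name.

```python
# Assumed substitution matrix, rows/columns in the order T, C, A, G.
MAT = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]
BASE_MAP = {base: index for index, base in enumerate("TCAG")}


def column_scores(seq1, seq2, g=-5):
    """Return the score contributed by each column of two aligned strings."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned strings must have equal length")
    scores = []
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            scores.append(g)  # gap penalty
        else:
            value = MAT[BASE_MAP[a]][BASE_MAP[b]]
            scores.append(value if a == b else -value)
    return scores


columns = column_scores("ATGTTAT", "ATCGTAG")
print(columns)       # [20, 20, -5, -5, 20, 20, -5]
print(sum(columns))  # 65
```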

Now I'll see what goes wrong with the other functions. I know that the scoring function works when the strings are aligned, so that's a good start.

Edit: Ok, another problem

Code:

>>> from numpy import *

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    from numpy import *
  File "C:\Python27\lib\site-packages\numpy\__init__.py", line 153, in <module>
    from . import add_newdocs
  File "C:\Python27\lib\site-packages\numpy\add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "C:\Python27\lib\site-packages\numpy\lib\__init__.py", line 8, in <module>
    from .type_check import *
  File "C:\Python27\lib\site-packages\numpy\lib\type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "C:\Python27\lib\site-packages\numpy\core\__init__.py", line 6, in <module>
    from . import multiarray
ImportError: DLL load failed: %1 is not a valid Win32 application.

I'm trying to use numpy, but I get that ImportError. I downloaded a 64-bit version of opencv, but that didn't solve it. I am running a 64-bit machine (Windows 7) and I know numpy is 32-bit, but there doesn't seem to be much online about handling this. Maybe someone else can help?
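"DLL load failed: %1 is not a valid Win32 application" usually points to a 32-bit/64-bit mismatch between the Python interpreter and a compiled extension module. What matters is the bitness of the Python build, not of Windows. A quick way to check the interpreter, using only the standard library:

```python
import platform
import struct
import sys

# The size of a C pointer in bytes, times 8, gives the interpreter's word size.
bits = struct.calcsize("P") * 8
print("Python %s is a %d-bit build" % (sys.version.split()[0], bits))

# platform.architecture() reports the same thing for the running executable.
print(platform.architecture()[0])
```

If this reports a 32-bit build, the NumPy package has to be a 32-bit build as well (and likewise for 64-bit), even on a 64-bit operating system.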

**Reincarnate** · 02-9-2014, 08:34 AM

Re: Python pattern matching algorithm tuple error

I am way too lazy to read through all this, can you just explain what you need to do in like 2 sentences -- I'll whip up the solution in Python real quick and you can look at it or something

you have two sequences
seq1 = 'ATGTTAT'
seq2 = 'ATCGTAG'

what's the score and why

(or are you trying to do something else?)

**Reincarnate** · 02-9-2014, 08:57 AM

Re: Python pattern matching algorithm tuple error

something like this maybe

Code:

def stringScore( mat, baseMap, seq1, seq2, g = -5 ):

    if len(seq1) != len(seq2):
        raise ValueError("both strings must be of equal length")

    score = 0

    for base1,base2 in zip(seq1,seq2):
        # check for gaps first: "-" is not a key in baseMap
        if base1 == "-" or base2 == "-":
            score += g
        elif base1 != base2:
            # mismatch: subtract the substitution value
            score -= mat[baseMap[base1]][baseMap[base2]]
        else:
            # match: add the substitution value
            score += mat[baseMap[base1]][baseMap[base2]]

    return score



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}

seq1 = 'ATGTTAT'
seq2 = 'ATCGTAG'

print(stringScore(mat, baseMap, seq1, seq2))

**DossarLX ODI** · 02-9-2014, 08:58 AM

Re: Python pattern matching algorithm tuple error

Originally posted by Reincarnate

I am way too lazy to read through all this, can you just explain what you need to do in like 2 sentences -- I'll whip up the solution in Python real quick and you can look at it or something

you have two sequences
seq1 = 'ATGTTAT'
seq2 = 'ATCGTAG'

what's the score and why

(or are you trying to do something else?)

Yeah, I can make a tl;dr version. My post above also shows an error when I try importing numpy. I still have to give lots of details here, though, because this assignment involves creating a path.

The program is supposed to work for any two sequences e.g.
ATGAATGCGATTTCGGGTGGCC
TTGGCAGGACATGAAGTTCGATACGGAA

Obviously the first string is shorter, so we have to insert gaps to make the two strings the same length, and each gap incurs a gap penalty. The gap penalty is just some integer, say -5.

In a nutshell we're doing a 5-way comparison:
- Gaps.
- A, G, C, or T (those are the four letters).

This is the score table (substitution matrix).

In dynamic programming you find an optimal sequence alignment by considering three options for each location in the sequence: down, right, or down-right (diagonal). Visual example:

So basically, we have our substitution matrix (4x4), which holds the value for each pair of letters, as listed in the score table. The scoring matrix holds the best alignment score reached so far at each position. The arrow matrix records which move produced each of those scores, so it determines the optimal path (it is created at the same time as the scoring matrix). Then we backtrace that path to get the strings aligned so they're the same length, and finally use the score function.
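The steps described above (scoring matrix, arrow matrix, backtrace) can be sketched roughly as follows. This is a Needleman-Wunsch style global alignment, not the assignment's actual code. It assumes the substitution matrix and "TCAG" order from the code earlier in the thread, that a mismatch subtracts the matrix value, and that the first row and column are filled with cumulative gap penalties. The names `align`, `pair_score`, `MAT`, and `BASE_MAP` are made up for illustration.

```python
MAT = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]
BASE_MAP = {base: index for index, base in enumerate("TCAG")}


def pair_score(a, b):
    """Match adds the matrix value, mismatch subtracts it (assumed rule)."""
    value = MAT[BASE_MAP[a]][BASE_MAP[b]]
    return value if a == b else -value


def align(seq1, seq2, g=-5):
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]     # scoring matrix
    arrow = [[None] * cols for _ in range(rows)]  # arrow matrix

    # Borders: a prefix aligned against nothing is all gaps.
    for i in range(1, rows):
        score[i][0], arrow[i][0] = i * g, "down"
    for j in range(1, cols):
        score[0][j], arrow[0][j] = j * g, "right"

    # Fill both matrices together: keep the best of the three moves.
    for i in range(1, rows):
        for j in range(1, cols):
            options = [
                (score[i - 1][j - 1] + pair_score(seq1[i - 1], seq2[j - 1]), "diag"),
                (score[i - 1][j] + g, "down"),   # gap in seq2
                (score[i][j - 1] + g, "right"),  # gap in seq1
            ]
            score[i][j], arrow[i][j] = max(options, key=lambda opt: opt[0])

    # Backtrace from the bottom-right corner, building the strings reversed.
    out1, out2 = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        move = arrow[i][j]
        if move == "diag":
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif move == "down":
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(out1)), "".join(reversed(out2)), score[-1][-1]


aligned1, aligned2, total = align("ATGTTAT", "ATCGTAGTA")
print(aligned1)
print(aligned2)
print(total)
```

Several alignments can tie for the best score, so which one the backtrace returns depends on how ties are broken in `max`.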

**Reincarnate** · 02-9-2014, 09:04 AM

Re: Python pattern matching algorithm tuple error

so like this?

Code:

def stringScore( mat, baseMap, seq1, seq2, g = -5 ):
    score = g*abs(len(seq1)-len(seq2)) #if one string is longer than the other, gap penalty = g*(difference in lengths)
    
    for base1,base2 in zip(seq1,seq2):
        if base1 != base2:
            score -= mat[baseMap[base1]][baseMap[base2]]
        else:
            score += mat[baseMap[base1]][baseMap[base2]]
    
    return score

sorry still in tl;dr mode

are you trying to basically say "we don't know where the real gaps are, so let's try all possible ways of inserting gaps to determine the maximal possible score?"

ATGAATGCGATTTCGGGTGGCC-------
TTGGCAGGACATGAAGTTCGATACGGAA

probably a shitty score due to all mismatches, but if we had

-ATG-AATG-CGATT-TC-GG--GTGGCC
TTGGCAGGACATGAAGTTCGATACGGAA

or something (just making up a shitty example), then the gaps could possibly be matches and we'd have a better score

**DossarLX ODI** · 02-9-2014, 09:06 AM

Re: Python pattern matching algorithm tuple error

Originally posted by Reincarnate

are you trying to basically say "we don't know where the real gaps are, so let's try all possible ways of inserting gaps to determine the maximal possible score?"

ATGAATGCGATTTCGGGTGGCC-------
TTGGCAGGACATGAAGTTCGATACGGAA

probably a shitty score due to all mismatches, but if we had

-ATG-AATG-CGATT-TC-GG--GTGGCC
TTGGCAGGACATGAAGTTCGATACGGAA

or something (just making up a shitty example), then the gaps could possibly be matches and we'd have a better score

Yes, the gaps are supposed to align the sequences better. From a practical standpoint we also don't readily know where the gaps are so they have to be inserted based on how much the score will improve.

The score is calculated at the end. The scoring matrix is just made so the arrow matrix can be made to determine the best path for alignment. Then after those paths are backtraced (and then the strings reversed), the alignments are sent to the score function (essentially we're making a new string with dashes in it to indicate gaps since it was too short).

Gaps here are a pain in the ass to work with. The arrow matrix is what accounts for them: it records the best possible path by checking how the score changes at each step, and like I mentioned, gaps come in when the strings aren't of equal length.

I guess another way of tl;dr this is that it's getting two strings with four possible letters, and if they are not the same length then gaps have to be inserted to make the letters line up better for an optimal score.

**Reincarnate** · 02-9-2014, 09:26 AM

Re: Python pattern matching algorithm tuple error

So if two bases match, they basically add the corresponding matrix value (typically 20), but if they mismatch, they subtract the matrix value?

And using a gap incurs a -5 penalty?

**DossarLX ODI** · 02-9-2014, 09:29 AM

Re: Python pattern matching algorithm tuple error

Originally posted by Reincarnate

So if two bases match, they basically add the corresponding matrix value (typically 20), but if they mismatch, they subtract the matrix value?

And using a gap incurs a -5 penalty?

From my understanding, yes. It makes no sense that the score would increase when there is a mismatch, so it should subtract the matrix value. Subtracting 1 from the score for a mismatch sounds incredibly stupid and I'll be mad if that's actually the case here, but just go by the matrix values in that table.

Edit: Back. That was faster than I planned

**Reincarnate** · 02-9-2014, 09:48 AM

Re: Python pattern matching algorithm tuple error

Untested code, let me know if this is returning the right values for you -- if so then I'll clean up the code and make it a little more readable / with better coding practices / whatever -- just want to ensure it's returning the right vals first

Code:

import cPickle

class memoize:
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args, **kwds):
        # pickle the arguments so unhashable ones (lists, dicts) can serve as a cache key
        key = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
        if key not in self.memo:
            self.memo[key] = self.fn(*args, **kwds)
        return self.memo[key]



@memoize
def dpScore( mat, baseMap, seq1, seq2, gapsLeft, g ):
    if (len(seq1)==0 and gapsLeft>0) or gapsLeft<0: return -9999999999999999999999 #lazy right now, using a sentinel
    if len(seq1)==0 or len(seq2)==0: return 0

    possibleScore1 = g + mat[baseMap[seq1[0]]][baseMap[seq1[0]]] + dpScore( mat, baseMap, seq1[1:], seq2, gapsLeft-1, g )

    if seq1[0] == seq2[0]: matchScore = mat[baseMap[seq1[0]]][baseMap[seq2[0]]]
    else: matchScore = -mat[baseMap[seq1[0]]][baseMap[seq2[0]]]

    possibleScore2 = matchScore + dpScore( mat, baseMap, seq1[1:], seq2[1:], gapsLeft, g )
    
    return max(possibleScore1, possibleScore2)



def bestStringScore( mat, baseMap, seq1, seq2, g = -5 ):
    gapsLeft = abs(len(seq1)-len(seq2))
    if len(seq1) > len(seq2):
        return dpScore(mat, baseMap, seq1, seq2, gapsLeft, g)
    else:
        return dpScore(mat, baseMap, seq2, seq1, gapsLeft, g)



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}

seq1 = 'ATGTTAT'
seq2 = 'ATCGTAG'

print(bestStringScore(mat, baseMap, seq1, seq2))

**DossarLX ODI** · 02-9-2014, 09:53 AM

Re: Python pattern matching algorithm tuple error

Yes, I copied that code snippet and got 65 for those sequences (same as what I got).

**Reincarnate** · 02-9-2014, 10:06 AM

Re: Python pattern matching algorithm tuple error

I mean ideally you should also test with sequences that are of different lengths and then check to see if the resulting maximum score is what you get by hand (inserting gaps manually etc)

**DossarLX ODI** · 02-9-2014, 10:11 AM

Re: Python pattern matching algorithm tuple error

I'm testing it with

seq1 = 'ATGTTAT'
seq2 = 'ATCGTAGTA'

seq1 = 'ATGTTA-T-'
seq2 = 'ATCGTAGTA'

20 + 20 - 5 - 5 + 20 + 20 - 5 + 20 - 5 = 80

seq1 = 'AT-GTTAT-'
seq2 = 'ATCGTAGTA'

20 + 20 - 5 + 20 + 20 - 5 - 10 + 20 -5 = 75

ATGTTA-T- would be the optimal sequence here. The second solution was slightly lower since AG is worth 10, not 5.

When I input these into the program, it said 105.

Edit: Ok so what I consider a possibility is that it did this:

AT-GTTAT-
ATCGTAGTA

20 + 20 - 5 + 20 + 20 + 5 + 10 + 20 - 5 = 105

Edit 2: Optimal score looks like 100 actually.

seq1 = 'AT-GTTA-T-'
seq2 = 'ATCGT-AGTA'

20 + 20 - 5 + 20 + 20 - 5 + 20 - 5 + 20 - 5 = 100
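Tracing the posted `dpScore` by hand suggests the 105 comes from two details rather than from mismatches being added. The gap branch adds `mat[b][b]` (+20) on top of `g`, and the base case returns 0 as soon as the shorter string runs out, so a trailing gap costs nothing. For ATGTTA-T- that gives 20 + 20 - 5 - 5 + 20 + 20 + 15 + 20 + 0 = 105. Below is a sketch of the same gaps-only-in-the-shorter-string recursion with both details changed. It is written for Python 3, with `functools.lru_cache` standing in for the hand-rolled `memoize`, and the names `best_score_gaps_in_shorter`, `pair_score`, `MAT`, and `BASE_MAP` are made up here.

```python
from functools import lru_cache

MAT = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]
BASE_MAP = {base: index for index, base in enumerate("TCAG")}


def pair_score(a, b):
    """Match adds the matrix value, mismatch subtracts it (assumed rule)."""
    value = MAT[BASE_MAP[a]][BASE_MAP[b]]
    return value if a == b else -value


def best_score_gaps_in_shorter(seq1, seq2, g=-5):
    """Best score when gaps may only be inserted into the shorter string."""
    if len(seq1) >= len(seq2):
        long_seq, short_seq = seq1, seq2
    else:
        long_seq, short_seq = seq2, seq1

    @lru_cache(maxsize=None)
    def best(i, j):
        # i indexes long_seq, j indexes short_seq.
        left_long = len(long_seq) - i
        left_short = len(short_seq) - j
        if left_short == 0:
            # Every leftover base faces a trailing gap, and each one costs g.
            return g * left_long
        # Option 1: align the two current bases.
        result = pair_score(long_seq[i], short_seq[j]) + best(i + 1, j + 1)
        # Option 2: put a gap in the shorter string, if any gaps remain.
        if left_long > left_short:
            result = max(result, g + best(i + 1, j))
        return result

    return best(0, 0)


print(best_score_gaps_in_shorter("ATGTTAT", "ATCGTAGTA"))  # 80
print(best_score_gaps_in_shorter("ATGTTAT", "ATCGTAG"))    # 65
```

This still cannot reach 100, because that alignment needs gaps in both strings.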

**Reincarnate** · 02-9-2014, 10:43 AM

Re: Python pattern matching algorithm tuple error

I think maybe I am not clear how you are supposed to be scoring stuff

is this right?

Code:

seq1: --ATGTTAT
seq2: ATCGTAGTA
score: -45

seq1: -A-TGTTAT
seq2: ATCGTAGTA
score: -45

seq1: -AT-GTTAT
seq2: ATCGTAGTA
score: -50

seq1: -ATG-TTAT
seq2: ATCGTAGTA
score: -25

seq1: -ATGT-TAT
seq2: ATCGTAGTA
score: 0

seq1: -ATGTT-AT
seq2: ATCGTAGTA
score: 0

seq1: -ATGTTA-T
seq2: ATCGTAGTA
score: -5

seq1: -ATGTTAT-
seq2: ATCGTAGTA
score: 20

seq1: A--TGTTAT
seq2: ATCGTAGTA
score: -20

seq1: A-T-GTTAT
seq2: ATCGTAGTA
score: -25

seq1: A-TG-TTAT
seq2: ATCGTAGTA
score: 0

seq1: A-TGT-TAT
seq2: ATCGTAGTA
score: 25

seq1: A-TGTT-AT
seq2: ATCGTAGTA
score: 25

seq1: A-TGTTA-T
seq2: ATCGTAGTA
score: 20

seq1: A-TGTTAT-
seq2: ATCGTAGTA
score: 45

seq1: AT--GTTAT
seq2: ATCGTAGTA
score: 5

seq1: AT-G-TTAT
seq2: ATCGTAGTA
score: 30

seq1: AT-GT-TAT
seq2: ATCGTAGTA
score: 55

seq1: AT-GTT-AT
seq2: ATCGTAGTA
score: 55

seq1: AT-GTTA-T
seq2: ATCGTAGTA
score: 50

seq1: AT-GTTAT-
seq2: ATCGTAGTA
score: 75

seq1: ATG--TTAT
seq2: ATCGTAGTA
score: 5

seq1: ATG-T-TAT
seq2: ATCGTAGTA
score: 30

seq1: ATG-TT-AT
seq2: ATCGTAGTA
score: 30

seq1: ATG-TTA-T
seq2: ATCGTAGTA
score: 25

seq1: ATG-TTAT-
seq2: ATCGTAGTA
score: 50

seq1: ATGT--TAT
seq2: ATCGTAGTA
score: 5

seq1: ATGT-T-AT
seq2: ATCGTAGTA
score: 5

seq1: ATGT-TA-T
seq2: ATCGTAGTA
score: 0

seq1: ATGT-TAT-
seq2: ATCGTAGTA
score: 25

seq1: ATGTT--AT
seq2: ATCGTAGTA
score: 30

seq1: ATGTT-A-T
seq2: ATCGTAGTA
score: 25

seq1: ATGTT-AT-
seq2: ATCGTAGTA
score: 50

seq1: ATGTTA--T
seq2: ATCGTAGTA
score: 55

seq1: ATGTTA-T-
seq2: ATCGTAGTA
score: 80

seq1: ATGTTAT--
seq2: ATCGTAGTA
score: 55

max score 80?

score: +matrix[base1][base2] if base1 and base2 match (e.g. "AA"), -matrix[base1][base2] if they don't match (e.g. "GC"), and -5 if we have a gap (e.g. "T-")
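The table above can be reproduced with a brute-force sketch that tries every way of padding the shorter string with gaps and scores each result using the rule just stated. The matrix and "TCAG" order are taken from the earlier code. `score_alignment` and `gap_placements` are made-up helper names.

```python
from itertools import combinations

MAT = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]
BASE_MAP = {base: index for index, base in enumerate("TCAG")}


def score_alignment(aligned1, aligned2, g=-5):
    """+matrix value for a match, -matrix value for a mismatch, g for a gap."""
    total = 0
    for a, b in zip(aligned1, aligned2):
        if a == "-" or b == "-":
            total += g
        else:
            value = MAT[BASE_MAP[a]][BASE_MAP[b]]
            total += value if a == b else -value
    return total


def gap_placements(short_seq, target_len):
    """Yield every way to pad short_seq with '-' up to target_len."""
    n_gaps = target_len - len(short_seq)
    for gap_positions in combinations(range(target_len), n_gaps):
        letters = iter(short_seq)
        yield "".join("-" if i in gap_positions else next(letters)
                      for i in range(target_len))


seq1, seq2 = "ATGTTAT", "ATCGTAGTA"
scored = [(score_alignment(padded, seq2), padded)
          for padded in gap_placements(seq1, len(seq2))]
best_score, best_padded = max(scored)
print("%d placements, best %d with %s" % (len(scored), best_score, best_padded))
# 36 placements, best 80 with ATGTTA-T-
```

This only places gaps in the shorter string, so it is exponential in the number of gaps and misses alignments that need gaps in both strings.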

Wait, can you insert gaps in both sequences, or just the shorter of the two?

(I also don't want to waste your time if I am going off-track with all this -- feel free to ignore if you're on a tight deadline)
