*Important note: WordPress was mangling code that I typed in. For a temporary remedy, I included the complete source file here. However, I believe I have fixed the code display on this page now, so you should be able to use the code as it is displayed here (I tried it myself and it worked).*

Today we’ll see a Python 3 implementation of the MD5 hashing algorithm, which was originally designed by Ron Rivest. I wrote it for my own education and perhaps it can help you as well. Even though the MD5 algorithm has been compromised and is not longer secure, it is still in use today. For example, CS ISOs of Linux distributions often provide MD5 sums in addition to SHA256 sums.

There are a lot of C implementations of MD5 out there. However, they are not easy to read, especially for someone who is not used to C. Python 3, being closer to “math” rather than the computer, reads more like a definition of the algorithm. Of course, the Python implementation is slower than the C one.

This Python implementation is ideal for experimenting with MD5 and learning about it. The code I give below is not the most compact possible, but I believe it is easy to understand.

The complete source code will be at the end of the post, after we go through the code little by little. I assume that you have read or are willing to read the complete description of the MD5 algorithm in RFC 1321 or some similar source, so I’ll be brief.

We’ll start with some basic utilities:

```
import binascii
import sys
import os.path
```

The **binascii** module will help us convert bytes to integers and vice-versa. The others are just so we can use our program on files of our choosing on the command line. To start the actual program, let’s define a bunch of constants.

```
SV = [0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee, 0xf57c0faf,
0x4787c62a, 0xa8304613, 0xfd469501, 0x698098d8, 0x8b44f7af,
0xffff5bb1, 0x895cd7be, 0x6b901122, 0xfd987193, 0xa679438e,
0x49b40821, 0xf61e2562, 0xc040b340, 0x265e5a51, 0xe9b6c7aa,
0xd62f105d, 0x2441453, 0xd8a1e681, 0xe7d3fbc8, 0x21e1cde6,
0xc33707d6, 0xf4d50d87, 0x455a14ed, 0xa9e3e905, 0xfcefa3f8,
0x676f02d9, 0x8d2a4c8a, 0xfffa3942, 0x8771f681, 0x6d9d6122,
0xfde5380c, 0xa4beea44, 0x4bdecfa9, 0xf6bb4b60, 0xbebfbc70,
0x289b7ec6, 0xeaa127fa, 0xd4ef3085, 0x4881d05, 0xd9d4d039,
0xe6db99e5, 0x1fa27cf8, 0xc4ac5665, 0xf4292244, 0x432aff97,
0xab9423a7, 0xfc93a039, 0x655b59c3, 0x8f0ccc92, 0xffeff47d,
0x85845dd1, 0x6fa87e4f, 0xfe2ce6e0, 0xa3014314, 0x4e0811a1,
0xf7537e82, 0xbd3af235, 0x2ad7d2bb, 0xeb86d391]
```

What are these constants? Element **i** of this array is given by **2**32*abs(math.sin(i+1))**, in other words, the first 32 binary digits of $\sin(i)$ for $i=1,2\dots,65$ after the zero. These values are sort of like pseudo-random numbers that will be mashed with the message in various byzantine ways in the compression function. It’s pretty typical for hash functions to use the bits of irrational numbers like the square roots of primes or the digits of pi for these constants. The reasoning is that these numbers have some desirable pseudo-random like properties and have a compact description.

One aspect of MD5 is that it has some rudimentary permuting in terms of circular shifts. So let’s define a “left-circular shift” function that cuts off a bunch of the most significant bytes and places them in the least significant part:

```
def leftCircularShift(k,bits):
bits = bits%32
k = k%(2**32)
upper = (k<<bits)%(2**32)
result = upper | (k>>(32-(bits)))
return(result)
```

What’s with all the 32s? The MD5 algorithm is designed to operate on 32-bit integers. The left circular shift is as a 32-bit integer. That is important, because the binary string 1000 circularly shifted to the left as a four-bit integer is 0001 but as a 32 bit integer it is 10000.

The next thing is that MD5 processes a padded message in 512-bit blocks. But in the actual compression, we divide this 512-bit block further into 16 32-bit blocks. So, let’s suppose we have a message block in binary, and we want to divide it. Let’s write a function for that:

```
def blockDivide(block, chunks):
result = []
size = len(block)//chunks
for i in range(0, chunks):
result.append( int.from_bytes( block[i*size:(i+1)*size],byteorder="little" ))
return(result)
```

Here, we’re using the **int.from_bytes** function to take a sequence of bytes and convert it to an integer. The important thing here is that we specify **byteorder=”little”**. That’s because this is in the MD5 RFC 1321 specification:

[…] a sequence of bytes can be interpreted as a sequence of 32-bit words, where each consecutive group of four bytes is interpreted as a word with the low-order (least significant) byte given first.

If we used **byteorder=”big”**, then we would end up with a different hash function that is very similar to MD5. Now we have to define four bitwise functions that will be used in the compression function of MD5:

```
def F(X,Y,Z):
return( (X&Y)|((~X)&Z) )
def G(X,Y,Z):
return( (X&Z) | (Y&(~Z)) )
def H(X,Y,Z):
return( X^Y^Z )
def I(X,Y,Z):
return( Y^(X|(~Z)) )
```

These will mix up the data of the message with the previous output of the hash in iterations over blocks. To aid in doing so we define more functions:

```
def FF(a,b,c,d,M,s,t):
result = b + leftCircularShift( (a+F(b,c,d)+M+t), s)
return(result)
def GG(a,b,c,d,M,s,t):
result = b + leftCircularShift( (a+G(b,c,d)+M+t), s)
return(result)
def HH(a,b,c,d,M,s,t):
result = b + leftCircularShift( (a+H(b,c,d)+M+t), s)
return(result)
def II(a,b,c,d,M,s,t):
result = b + leftCircularShift( (a+I(b,c,d)+M+t), s)
return(result)
```

Here, **a,b,c,d** will be the previous outputs of the compression function and **M** will be one 32-bit piece of a 512-bit block. The value **t** will be one of the 64 constants we defined earlier in the array **SV** and the **s** will be some circular shift value.

Let’s define two more helper functions:

```
def fmt8(num):
bighex = "{0:08x}".format(num)
binver = binascii.unhexlify(bighex)
result = "{0:08x}".format(int.from_bytes(binver,byteorder='little'))
return(result)
def bitlen( bitstring ):
return(len(bitstring)*8)
```

The function **bitlen** is self-explanatory. The function **fmt8** formats a number as a hexadecimal value with the smallest bytes first. That’s also in the MD5 specification. We will pass four numbers to this function, each representing a 32-bit integer. The final four 32-bit integers will be concatenated to produce the 128-bit hash value.

Let’s define the main hashing algorithm now:

```
def md5sum(msg):
#First, we pad the message
msgLen = bitlen(msg)%(2**64)
msg = msg + b'\x80'
zeroPad = (448 - (msgLen+8)%512)%512
zeroPad //= 8
msg = msg + b'\x00'*zeroPad + msgLen.to_bytes(8,byteorder='little')
msgLen = bitlen(msg)
iterations = msgLen//512
#chaining variables
A = 0x67452301
B = 0xefcdab89
C = 0x98badcfe
D = 0x10325476
#main loop
for i in range(0,iterations):
a = A
b = B
c = C
d = D
block = msg[i*64:(i+1)*64]
M = blockDivide(block,16)
#Rounds
a = FF( a,b,c,d, M[0], 7, SV[0] )
d = FF( d,a,b,c, M[1], 12, SV[1] )
c = FF( c,d,a,b, M[2], 17, SV[2] )
b = FF( b,c,d,a, M[3], 22, SV[3] )
a = FF( a,b,c,d, M[4], 7, SV[4] )
d = FF( d,a,b,c, M[5], 12, SV[5] )
c = FF( c,d,a,b, M[6], 17, SV[6] )
b = FF( b,c,d,a, M[7], 22, SV[7] )
a = FF( a,b,c,d, M[8], 7, SV[8] )
d = FF( d,a,b,c, M[9], 12, SV[9] )
c = FF( c,d,a,b, M[10], 17, SV[10] )
b = FF( b,c,d,a, M[11], 22, SV[11] )
a = FF( a,b,c,d, M[12], 7, SV[12] )
d = FF( d,a,b,c, M[13], 12, SV[13] )
c = FF( c,d,a,b, M[14], 17, SV[14] )
b = FF( b,c,d,a, M[15], 22, SV[15] )
a = GG( a,b,c,d, M[1], 5, SV[16] )
d = GG( d,a,b,c, M[6], 9, SV[17] )
c = GG( c,d,a,b, M[11], 14, SV[18] )
b = GG( b,c,d,a, M[0], 20, SV[19] )
a = GG( a,b,c,d, M[5], 5, SV[20] )
d = GG( d,a,b,c, M[10], 9, SV[21] )
c = GG( c,d,a,b, M[15], 14, SV[22] )
b = GG( b,c,d,a, M[4], 20, SV[23] )
a = GG( a,b,c,d, M[9], 5, SV[24] )
d = GG( d,a,b,c, M[14], 9, SV[25] )
c = GG( c,d,a,b, M[3], 14, SV[26] )
b = GG( b,c,d,a, M[8], 20, SV[27] )
a = GG( a,b,c,d, M[13], 5, SV[28] )
d = GG( d,a,b,c, M[2], 9, SV[29] )
c = GG( c,d,a,b, M[7], 14, SV[30] )
b = GG( b,c,d,a, M[12], 20, SV[31] )
a = HH( a,b,c,d, M[5], 4, SV[32] )
d = HH( d,a,b,c, M[8], 11, SV[33] )
c = HH( c,d,a,b, M[11], 16, SV[34] )
b = HH( b,c,d,a, M[14], 23, SV[35] )
a = HH( a,b,c,d, M[1], 4, SV[36] )
d = HH( d,a,b,c, M[4], 11, SV[37] )
c = HH( c,d,a,b, M[7], 16, SV[38] )
b = HH( b,c,d,a, M[10], 23, SV[39] )
a = HH( a,b,c,d, M[13], 4, SV[40] )
d = HH( d,a,b,c, M[0], 11, SV[41] )
c = HH( c,d,a,b, M[3], 16, SV[42] )
b = HH( b,c,d,a, M[6], 23, SV[43] )
a = HH( a,b,c,d, M[9], 4, SV[44] )
d = HH( d,a,b,c, M[12], 11, SV[45] )
c = HH( c,d,a,b, M[15], 16, SV[46] )
b = HH( b,c,d,a, M[2], 23, SV[47] )
a = II( a,b,c,d, M[0], 6, SV[48] )
d = II( d,a,b,c, M[7], 10, SV[49] )
c = II( c,d,a,b, M[14], 15, SV[50] )
b = II( b,c,d,a, M[5], 21, SV[51] )
a = II( a,b,c,d, M[12], 6, SV[52] )
d = II( d,a,b,c, M[3], 10, SV[53] )
c = II( c,d,a,b, M[10], 15, SV[54] )
b = II( b,c,d,a, M[1], 21, SV[55] )
a = II( a,b,c,d, M[8], 6, SV[56] )
d = II( d,a,b,c, M[15], 10, SV[57] )
c = II( c,d,a,b, M[6], 15, SV[58] )
b = II( b,c,d,a, M[13], 21, SV[59] )
a = II( a,b,c,d, M[4], 6, SV[60] )
d = II( d,a,b,c, M[11], 10, SV[61] )
c = II( c,d,a,b, M[2], 15, SV[62] )
b = II( b,c,d,a, M[9], 21, SV[63] )
A = (A + a)%(2**32)
B = (B + b)%(2**32)
C = (C + c)%(2**32)
D = (D + d)%(2**32)
result = fmt8(A)+fmt8(B)+fmt8(C)+fmt8(D)
return(result)
```

The algorithm first pads the message with a ‘1’ bit, then a bunch of zeros until the padded message length in bits is congruent to 448 modulo 512. Finally, a 64 bit representation of the length of the original message modulo $2^{64}$ is appended to the message in **byteorder=”little”** convention.

The message is then mushed in various intricate ways, one 512-bit block at a time. Each transformation FF, GG, HH, or II is called a round, and there are 64 of them. The results of these 64 rounds are added to a tally modulo $2^{32}$, and the final result of the four tallies makes up the hash. To get this to work as a script, we can add:

```
if __name__ == "__main__":
if len(sys.argv) <= 1:
print("Please input filename.")
else:
fname = sys.argv[1]
if not os.path.exists(fname):
print("File does not exist.")
exit()
to_hash = open(fname,"rb")
data = to_hash.read()
print(md5sum(data))
to_hash.close()
```

Here is a test of this algorithm, assuming this code is in a file called **md5sum.py**:

> python3 md5sum.py md5sum.py 4e465ef7dc634f7cac6f3a7fffdd3be8 > md5sum md5sum.py 4e465ef7dc634f7cac6f3a7fffdd3be8 md5sum.py

The second output is of the coreutils md5sum program written in C that is present on my computer. Here is the entire code; feel free to use it for your own education:

```
#!/usr/bin/python3
import binascii
import sys
import os.path
SV = [0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee, 0xf57c0faf,
0x4787c62a, 0xa8304613, 0xfd469501, 0x698098d8, 0x8b44f7af,
0xffff5bb1, 0x895cd7be, 0x6b901122, 0xfd987193, 0xa679438e,
0x49b40821, 0xf61e2562, 0xc040b340, 0x265e5a51, 0xe9b6c7aa,
0xd62f105d, 0x2441453, 0xd8a1e681, 0xe7d3fbc8, 0x21e1cde6,
0xc33707d6, 0xf4d50d87, 0x455a14ed, 0xa9e3e905, 0xfcefa3f8,
0x676f02d9, 0x8d2a4c8a, 0xfffa3942, 0x8771f681, 0x6d9d6122,
0xfde5380c, 0xa4beea44, 0x4bdecfa9, 0xf6bb4b60, 0xbebfbc70,
0x289b7ec6, 0xeaa127fa, 0xd4ef3085, 0x4881d05, 0xd9d4d039,
0xe6db99e5, 0x1fa27cf8, 0xc4ac5665, 0xf4292244, 0x432aff97,
0xab9423a7, 0xfc93a039, 0x655b59c3, 0x8f0ccc92, 0xffeff47d,
0x85845dd1, 0x6fa87e4f, 0xfe2ce6e0, 0xa3014314, 0x4e0811a1,
0xf7537e82, 0xbd3af235, 0x2ad7d2bb, 0xeb86d391]
def leftCircularShift(k,bits):
bits = bits%32
k = k%(2**32)
upper = (k<<bits)%(2**32)
result = upper | (k>>(32-(bits)))
return(result)
def blockDivide(block, chunks):
result = []
size = len(block)//chunks
for i in range(0, chunks):
result.append( int.from_bytes( block[i*size:(i+1)*size],byteorder="little" ))
return(result)
def F(X,Y,Z):
return( (X&Y)|((~X)&Z) )
def G(X,Y,Z):
return( (X&Z) | (Y&(~Z)) )
def H(X,Y,Z):
return( X^Y^Z )
def I(X,Y,Z):
return( Y^(X|(~Z)) )
def FF(a,b,c,d,M,s,t):
result = b + leftCircularShift( (a+F(b,c,d)+M+t), s)
return(result)
def GG(a,b,c,d,M,s,t):
result = b + leftCircularShift( (a+G(b,c,d)+M+t), s)
return(result)
def HH(a,b,c,d,M,s,t):
result = b + leftCircularShift( (a+H(b,c,d)+M+t), s)
return(result)
def II(a,b,c,d,M,s,t):
result = b + leftCircularShift( (a+I(b,c,d)+M+t), s)
return(result)
def fmt8(num):
bighex = "{0:08x}".format(num)
binver = binascii.unhexlify(bighex)
result = "{0:08x}".format(int.from_bytes(binver,byteorder='little'))
return(result)
def bitlen( bitstring ):
return(len(bitstring)*8)
def md5sum(msg):
#First, we pad the message
msgLen = bitlen(msg)%(2**64)
msg = msg + b'\x80'
zeroPad = (448 - (msgLen+8)%512)%512
zeroPad //= 8
msg = msg + b'\x00'*zeroPad + msgLen.to_bytes(8,byteorder='little')
msgLen = bitlen(msg)
iterations = msgLen//512
#chaining variables
A = 0x67452301
B = 0xefcdab89
C = 0x98badcfe
D = 0x10325476
#main loop
for i in range(0,iterations):
a = A
b = B
c = C
d = D
block = msg[i*64:(i+1)*64]
M = blockDivide(block,16)
#Rounds
a = FF( a,b,c,d, M[0], 7, SV[0] )
d = FF( d,a,b,c, M[1], 12, SV[1] )
c = FF( c,d,a,b, M[2], 17, SV[2] )
b = FF( b,c,d,a, M[3], 22, SV[3] )
a = FF( a,b,c,d, M[4], 7, SV[4] )
d = FF( d,a,b,c, M[5], 12, SV[5] )
c = FF( c,d,a,b, M[6], 17, SV[6] )
b = FF( b,c,d,a, M[7], 22, SV[7] )
a = FF( a,b,c,d, M[8], 7, SV[8] )
d = FF( d,a,b,c, M[9], 12, SV[9] )
c = FF( c,d,a,b, M[10], 17, SV[10] )
b = FF( b,c,d,a, M[11], 22, SV[11] )
a = FF( a,b,c,d, M[12], 7, SV[12] )
d = FF( d,a,b,c, M[13], 12, SV[13] )
c = FF( c,d,a,b, M[14], 17, SV[14] )
b = FF( b,c,d,a, M[15], 22, SV[15] )
a = GG( a,b,c,d, M[1], 5, SV[16] )
d = GG( d,a,b,c, M[6], 9, SV[17] )
c = GG( c,d,a,b, M[11], 14, SV[18] )
b = GG( b,c,d,a, M[0], 20, SV[19] )
a = GG( a,b,c,d, M[5], 5, SV[20] )
d = GG( d,a,b,c, M[10], 9, SV[21] )
c = GG( c,d,a,b, M[15], 14, SV[22] )
b = GG( b,c,d,a, M[4], 20, SV[23] )
a = GG( a,b,c,d, M[9], 5, SV[24] )
d = GG( d,a,b,c, M[14], 9, SV[25] )
c = GG( c,d,a,b, M[3], 14, SV[26] )
b = GG( b,c,d,a, M[8], 20, SV[27] )
a = GG( a,b,c,d, M[13], 5, SV[28] )
d = GG( d,a,b,c, M[2], 9, SV[29] )
c = GG( c,d,a,b, M[7], 14, SV[30] )
b = GG( b,c,d,a, M[12], 20, SV[31] )
a = HH( a,b,c,d, M[5], 4, SV[32] )
d = HH( d,a,b,c, M[8], 11, SV[33] )
c = HH( c,d,a,b, M[11], 16, SV[34] )
b = HH( b,c,d,a, M[14], 23, SV[35] )
a = HH( a,b,c,d, M[1], 4, SV[36] )
d = HH( d,a,b,c, M[4], 11, SV[37] )
c = HH( c,d,a,b, M[7], 16, SV[38] )
b = HH( b,c,d,a, M[10], 23, SV[39] )
a = HH( a,b,c,d, M[13], 4, SV[40] )
d = HH( d,a,b,c, M[0], 11, SV[41] )
c = HH( c,d,a,b, M[3], 16, SV[42] )
b = HH( b,c,d,a, M[6], 23, SV[43] )
a = HH( a,b,c,d, M[9], 4, SV[44] )
d = HH( d,a,b,c, M[12], 11, SV[45] )
c = HH( c,d,a,b, M[15], 16, SV[46] )
b = HH( b,c,d,a, M[2], 23, SV[47] )
a = II( a,b,c,d, M[0], 6, SV[48] )
d = II( d,a,b,c, M[7], 10, SV[49] )
c = II( c,d,a,b, M[14], 15, SV[50] )
b = II( b,c,d,a, M[5], 21, SV[51] )
a = II( a,b,c,d, M[12], 6, SV[52] )
d = II( d,a,b,c, M[3], 10, SV[53] )
c = II( c,d,a,b, M[10], 15, SV[54] )
b = II( b,c,d,a, M[1], 21, SV[55] )
a = II( a,b,c,d, M[8], 6, SV[56] )
d = II( d,a,b,c, M[15], 10, SV[57] )
c = II( c,d,a,b, M[6], 15, SV[58] )
b = II( b,c,d,a, M[13], 21, SV[59] )
a = II( a,b,c,d, M[4], 6, SV[60] )
d = II( d,a,b,c, M[11], 10, SV[61] )
c = II( c,d,a,b, M[2], 15, SV[62] )
b = II( b,c,d,a, M[9], 21, SV[63] )
A = (A + a)%(2**32)
B = (B + b)%(2**32)
C = (C + c)%(2**32)
D = (D + d)%(2**32)
result = fmt8(A)+fmt8(B)+fmt8(C)+fmt8(D)
return(result)
if __name__ == "__main__":
if len(sys.argv) <= 1:
print("Please input filename.")
else:
fname = sys.argv[1]
if not os.path.exists(fname):
print("File does not exist.")
exit()
to_hash = open(fname,"rb")
data = to_hash.read()
print(md5sum(data))
to_hash.close()
```

## 3 Comments

Sir in python 3 this line are throwing errors how can we handle it

msg = msg + b’\x80′

msg = msg + b’\x00’*zeroPad + msgLen.to_bytes(8,byteorder=’little’)

Make sure you are using the file I linked to at the beginning of the post. WordPress is mangling some of the code due to the way it tries to convert to HTML so that is why I linked to a version that works. Make sure you use Python 3. Someone else and myself have already tested the code linked at the top and it has worked for us so please check that. I hope it works for you.

Thank you so much. It has been useful to me