About RSSIL
Two weeks ago, the HzV team was in Maubeuge (France) to attend RSSIL[0]. The event was really nice: we saw some interesting talks, workshops and a great Capture The Flag.
Among the workshops, you could find tables for lock-picking techniques, basic web PoCs, Live-Box security, OsmocomBB with Renaud Lifchitz and so on.
The Hacknowledge challenge started at 9pm, and from then on every team was fighting against the others. Web/application vulnerabilities, reversing, hardware and network security, social engineering… I have never seen a challenge as varied as the Hacknowledge one in France! If you’re interested in learning more about this event, read Emilien Girault’s post[1] (in French: “an obscure language used only by some 3% of the population of the planet…” – Joanna Rutkowska).
In this article, I will focus only on the captcha part, which was a little harder to break in a short time.
Des chiffres et des lettres
Translation: “numbers and letters”
Indeed, seeing the title, you are surely thinking about the strange and crazy TV game[jff_1] (we have the same one in France too![jff_2]).
To begin this challenge, we go to the registration page:
As we can see, to validate the registration we have to submit 5 forms, each with a correct captcha, within 2 seconds; after that, the counter resets. If we could type everything and send 5 forms in 2 seconds, we would feel like my friend Chuck Norris, but for simple humans like us it will be a little trickier…
First and foremost, we looked for a way to bypass this captcha, or for any weakness. So we go to the page “code.php”, which displays the captcha, and what do we see? The captcha is generated with random numbers and letters (remember the ESET crackme registration form…). If you look at the cookies, there is a classic session ID, and if you try to send a form using your super fingers, the website throws this error: “Trop tard !” (Too late!).
The session ID stays the same, so we can assume that the image code and the timestamp are stored in our session. To break this captcha, one possible solution is to download the current captcha image and apply an OCR technique.
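The constraint described above (5 correct forms inside a 2-second window, all tied to one session) can be sketched as a small driver loop. This is only an illustration written for Python 3: fetch_captcha, recognise and submit are hypothetical callables standing in for the image download, the OCR step and the form POST, and the exact server behaviour is assumed from the description above.

```python
import time


def submit_forms(fetch_captcha, recognise, submit, rounds=5, budget=2.0,
                 clock=time.monotonic):
    """Try to get `rounds` forms accepted within `budget` seconds.

    fetch_captcha, recognise and submit are hypothetical callables supplied
    by the caller; all of them are assumed to reuse the same session cookie.
    Returns True if every form was accepted in time, False otherwise.
    """
    start = clock()
    for _ in range(rounds):
        image = fetch_captcha()        # download the current captcha image
        text = recognise(image)        # run the OCR on it
        if not submit(text):           # rejected answer: give up this attempt
            return False
        if clock() - start > budget:   # too slow: "Trop tard !"
            return False
    return True
```

Passing the three steps in as functions keeps the timing logic separate from the network and OCR code, so each part can be swapped or tested on its own.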
The OCR
OCR stands for Optical Character Recognition. This technique is used to convert scanned books, documents and letters into electronic data.
An example with Mathematica:
With a free tool like “tesseract”, we are able to reproduce the same result:
Would he poop on my
kneeeee?
fluxius@wwitb:~/captcha$
Image processing
This is the interesting part. If we try to recognize our downloaded captchas, we observe that tesseract and even Mathematica are only able to read a few letters, and these are for the most part the ones that are not rotated… So if we want to break this type of “captcha”, we have to separate the letters from the white background and rotate them until the OCR can read each one.
First and foremost, we need a sample, so we download one:
As I said, we have to separate the letters from the background. The yellow color is difficult to read, and OCR works much better with black letters on a white background. Thanks to Renaud Lifchitz’s sample, I could do the separation with Mathematica:
If the background is noisy, it is a little more complicated, because we have to pick out each letter by its color.
Once we have a list of letters in black, we have to rotate them (ImageRotate) before running the recognition with the OCR function:
TextRecognize is designed to convert a scanned page of a book (we can also specify a language for the recognition), so it does not matter if the rotation is not perfect.
Finally, TextRecognize does not work on a single letter, which means we have to re-assemble our letters and then apply the OCR function, as shown below:
The letter “i” has been turned into “|]”. This is strange, but if we apply white padding, TextRecognize only displays the number “3”. So there are workarounds, like replacing the string “|]” with “i” (but if you know how to solve this properly, tell me!).
Breaking captchas using Python
After this analysis, we will do the same processing using only Python, which has a lot of tricks up its sleeve.
We need to separate each letter from the background by filtering out the white color. To do this, we open the image file and analyze its histogram, which is a list of pixel counts per color.
from PIL import Image

im = Image.open("code.php.png")
im = im.convert("P")  # palette mode (at most 256 colors), like a GIF
print(im.histogram())
The result:
There are 9285 pixels with color ID “4”. We assume this color ID corresponds to white, because the background is essentially white (duh…).
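The histogram reasoning above boils down to "the most frequent color is the background". Here is a minimal sketch of that idea, written for Python 3 on plain lists instead of PIL images so that it runs without any dependency; the tiny grid and the function names are made up for the illustration.

```python
from collections import Counter


def background_colour(pixels):
    """Return the most frequent colour ID in a flat sequence of pixels."""
    return Counter(pixels).most_common(1)[0][0]


def letters_mask(grid, background):
    """Map background pixels to 255 (white) and everything else to 0 (black)."""
    return [[255 if pix == background else 0 for pix in row] for row in grid]


# Tiny made-up 3x4 image: colour 4 is the background, 7 and 9 are "ink".
grid = [
    [4, 4, 7, 4],
    [4, 9, 7, 4],
    [4, 4, 4, 4],
]
flat = [pix for row in grid for pix in row]
bg = background_colour(flat)
mask = letters_mask(grid, bg)
```

Picking the background from the histogram rather than hard-coding the color ID should make the same code usable on a captcha whose palette is ordered differently, as long as the background really is the dominant color.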
To be sure, we will separate each letter from the background:
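# white canvas (assumed, as the later "pix != 255" test implies)
im2 = Image.new("P", im.size, 255)
temp = {}
for y in range(im.size[1]):      # rows
    for x in range(im.size[0]):  # columns
        pix = im.getpixel((x, y))
        temp[pix] = pix          # remember which color IDs occur
        if pix != 4:             # not the background: draw it in black
            im2.putpixel((x, y), 0)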
Using the method “show()” on “im2”, we display the result:
After that, we try to read it with tesseract:
Z%_;i%L ← DO YOU SPEAK ENGLISH? Oo
The second, third and fifth characters are not in [a-zA-Z0-9], so if we want to automate the processing, we can use a simple regex to check whether each character is a letter or a digit.
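The regex check can be sketched like this (Python 3, standard library only; the function name is made up for the example):

```python
import re

ALPHABET = re.compile(r"[a-zA-Z0-9]")


def unreadable_positions(text):
    """Return the indices of the characters that are not in [a-zA-Z0-9]."""
    return [i for i, ch in enumerate(text) if not ALPHABET.match(ch)]


positions = unreadable_positions("Z%_;i%L")
```

Note that the indices refer to the raw OCR output, which can contain more characters than the captcha has letters, so mapping a position back to a letter image may need some care.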
As in the previous example with Mathematica, we rotate the letters that do not match our predefined alphabet until they are recognized:
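import re
from PIL import Image
from pytesser import image_to_string

def rotc(image, rotations):
    '''
    Rotate a letter with a list of angles until the OCR reads it
    image - Image to rotate
    rotations - list of angles
    Returns (image, character), or (image, None) if nothing is recognized
    '''
    for rotation in rotations:
        torot = image.convert('RGBA')
        rot = torot.rotate(rotation, expand=1)
        # paste the rotated letter on a white canvas to fill the new corners
        fff = Image.new('RGBA', rot.size, (255,)*4)
        out = Image.composite(rot, fff, rot)
        # note: the next angle is applied to this already rotated image
        image = out.convert(image.mode)
        alpha = re.match(r"[a-zA-Z0-9]", image_to_string(image))
        if alpha is not None:
            return (image, alpha.group(0))
    return (image, None)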
inletter = False
foundletter = False
start = 0
end = 0
imList = []
for y in range(im2.size[0]):      # scan the columns from left to right
    for x in range(im2.size[1]):  # scan down the current column
        pix = im2.getpixel((y, x))
        if pix != 255:
            inletter = True
    if foundletter == False and inletter == True:
        # first column of a letter
        foundletter = True
        start = y
    if foundletter == True and inletter == False:
        # first blank column after a letter: cut the letter out
        foundletter = False
        end = y
        im3 = im2.crop((start, 0, end, im2.size[1]))
        imList.append(im3)
    inletter = False
imList[1] = rotc(imList[1], [-30, 30])[0]
imList[2] = rotc(imList[2], [30, -30])[0]
imList[4] = rotc(imList[4], [-30, 30])[0]
The result:
Text: ZB3iyL
#Win!
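The column scan used above to cut out the letters can also be sketched without PIL. This is a minimal Python 3 version working on a list of pixel rows; unlike the loop above, it also closes a letter that touches the right edge of the image, which is an addition of this sketch.

```python
def split_letters(grid, white=255):
    """Return (start, end) column ranges, end exclusive, one per letter.

    grid is a list of rows; a column belongs to a letter as soon as it
    contains at least one non-white pixel.
    """
    if not grid:
        return []
    width = len(grid[0])
    spans = []
    start = None
    for x in range(width):
        has_ink = any(row[x] != white for row in grid)
        if has_ink and start is None:
            start = x                  # first inked column of a letter
        elif not has_ink and start is not None:
            spans.append((start, x))   # first blank column after the letter
            start = None
    if start is not None:              # letter touching the right edge
        spans.append((start, width))
    return spans
```

Each (start, end) pair corresponds to the box passed to crop() above, with 0 and the image height as the vertical bounds. This approach only works when the letters do not overlap horizontally, which seems to be the case for this captcha.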
False positive (paradox of OCRs)
Yes… I cheated a little by using a “manual algorithm” to rotate the letters before recognizing them with tesseract. But the letters are rotated randomly, so let’s write a kind of (dirty) intelligence:
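# for each character index n of the OCR output "string":
alpha = re.match(r"[a-zA-Z0-9]", string[n])
if alpha is None:
    # unreadable character: rotate the matching letter and read again
    imList[n] = rotc(imList[n], [-30, 30])[0]
    # buildimg() is not shown here; presumably it re-assembles imList
    new_image = buildimg()
    string2 = image_to_string(new_image)
    print(string2)
    string = string2.replace(" ", "")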
Result:
First reading: Z%_;i%L
Tesseract Open Source OCR Engine
Tesseract Open Source OCR Engine
Z B `;i€L
Tesseract Open Source OCR Engine
Tesseract Open Source OCR Engine
ZB,,,i%L
Tesseract Open Source OCR Engine
Tesseract Open Source OCR Engine
ZB,,I'%L
ZB,,I'%L ← The final string
#Fail!
But if we add a static stub:
if alpha is None:
    if n == 2:
        imList[n] = rotc(imList[n], [30, -30])[0]
    else:
        imList[n] = rotc(imList[n], [-30, 30])[0]
…
We get “ZB3iyL”, as expected. #win?
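The rotate-and-retry idea can be sketched independently of PIL and tesseract. In this Python 3 illustration, rotate and ocr are hypothetical callables supplied by the caller, and every angle is applied to the original image, which is a design choice of the sketch and not necessarily what the code above does.

```python
import re


def first_readable(image, angles, rotate, ocr):
    """Rotate `image` by each angle in turn and return (rotated, char) for
    the first result whose OCR output starts with a character in
    [a-zA-Z0-9].

    rotate(image, angle) and ocr(image) are hypothetical callables supplied
    by the caller. Returns (image, None) when no angle helps.
    """
    for angle in angles:
        candidate = rotate(image, angle)   # always rotate the original
        match = re.match(r"[a-zA-Z0-9]", ocr(candidate))
        if match:
            return candidate, match.group(0)
    return image, None
```

Returning (image, None) instead of raising lets the caller decide what to do with a letter that stays unreadable, for example trying a wider range of angles.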
Vector comparisons
Calling the “getdata()” method on any image gives you something like this:
Looking at the image data, our solution is to divide each letter, once separated from the background, into 4 cells and compare the two cells on the top (good idea, isn’t it?). I mean:
We assume that if the top-left cell has fewer black pixels than the top-right cell, we have to rotate to the left; otherwise, to the right:
black1 = 0  # black pixels in the top-left cell
black2 = 0  # black pixels in the top-right cell
width, height = imList[1].size
for y in range(height // 2):  # top half only
    for x in range(width):
        pix = imList[1].getpixel((x, y))
        if pix == 0:
            if x > width // 2:
                black2 += 1
            else:
                black1 += 1
print('Colors (left right) : %d %d' % (black1, black2))
And normally we should get more pixels on the right than on the left:
To finish, we change our previous stub:
# the_balance is assumed to hold the (left, right) counts computed above
if the_balance[0] < the_balance[1]:
    imList[n] = rotc(imList[n], [30, -30])[0]
else:
    imList[n] = rotc(imList[n], [-30, 30])[0]
And we get a fully automated “des chiffres et des lettres” captcha breaker:
ZB 3iyL
ZB3iyL
#EpicWin!
This technique works for a large number of tries, but it fails occasionally when the OCR confuses one letter with another.
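The left/right balance heuristic can be sketched in isolation. This Python 3 version works on a list of pixel rows instead of a PIL image, and the function names are made up for the illustration; the choice of angles follows the stub above.

```python
def top_balance(letter, black=0):
    """Count black pixels in the top-left and top-right cells of a letter.

    letter is a list of rows of pixel values.
    """
    height = len(letter)
    width = len(letter[0]) if height else 0
    left = right = 0
    for row in letter[: height // 2]:    # top half only
        for x, pix in enumerate(row):
            if pix == black:
                if x < width // 2:
                    left += 1
                else:
                    right += 1
    return left, right


def rotation_order(letter, angle=30):
    """Pick which rotation to try first, following the rule in the text."""
    left, right = top_balance(letter)
    return [angle, -angle] if left < right else [-angle, angle]
```

The returned list can be passed as the second argument of rotc. The heuristic only looks at the top half of the letter, so it is cheap, but symmetric glyphs may give a balance that says little about the actual rotation.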
Resources
Sources: CaptchaBreaker.tar.gz (Warning! It’s very dirty ;))
Mathematica Notebook: captcha.nb
pytesser: http://code.google.com/p/pytesser/
References
[0] RSSIL Website – http://www.rssil.org/
[1] RSSIL 2011 Write-Ups – http://www.segmentationfault.fr/securite-informatique/rssil-2011-write-ups/
[2] The Incredible Convenience of Mathematica Image Processing – http://blog.wolfram.com/2008/12/01/the-incredible-convenience-of-mathematica-image-processing/
[3] Mathematica, Image Processing & Analysis – http://reference.wolfram.com/mathematica/guide/ImageProcessing.html
[4] Decoding Captcha – http://www.wausita.com/captcha/
[5] Tesseract – http://code.google.com/p/tesseract-ocr/
Just for Fun
[jff_1] Numbers and letters – http://www.youtube.com/watch?v=ViFd9fyjCZk
[jff_2] Des Chiffres et des lettres – http://www.youtube.com/watch?v=v96Hovtz7DM
Hi fluxius 🙂
I also code OCR for some popular CMSs, and your solution for letter derotation (dividing the picture into 4…) looks nice.
Thank you pierz =)
There are a lot of ways to solve this kind of captcha, including the heavy generation of samples to compare with the current image. This one is simple and effective, but the algorithm can be improved much further.
If you have some interesting OCR for difficult captchas, let me know, because I’m studying some cases for better recognition of letters, elements in a picture and so on.
By the way, might you be interested in image processing projects?