RSSIL [captcha]: Des chiffres et des lettres

About RSSIL

Two weeks ago, the HzV team was in Maubeuge (France) to attend the RSSIL[0]. The event was really nice: we saw some interesting talks, workshops and a great Capture The Flag.

Among the workshops, you could find tables on lock-picking techniques, basic web PoCs, Live-Box security, OsmocomBB with Renaud Lifchitz, and so on.

The Hacknowledge challenge started at 9pm, and from then on every team was fighting against the others. Web/application vulnerabilities, reversing, hardware and network security, social engineering… I have never seen a challenge as varied as Hacknowledge in France! If you’re interested in learning more about this event, read Emilien Girault’s post[1] (in French: “an obscure language used only by some 3% of the population of the planet…” – Joanna Rutkowska).

In this article, I will focus only on the captcha part, which was a little harder to break in a short time.

Des chiffres et des lettres

Translation: “numbers and letters”

Indeed, seeing the title, you are surely thinking of the strange and crazy TV game[jff_1] (we have the same in France too![jff_2])

To begin this challenge, we go to the subscription page:

As we can see, to validate the challenge, we have to submit 5 forms with a correct captcha within 2 seconds; after that, the counter resets. If we could type everything and send 5 forms in 2 seconds, we would feel like my friend Chuck Norris, but for simple humans like us, it will be a little trickier…

First and foremost, we looked for a way to bypass this captcha, or for any weakness in it. So we go to the page “code.php”, which displays the captcha, and what do we see? The captcha is generated with random numbers and letters (remember the ESET crackme subscription form…). If you look at the cookies, there is a classic session ID, and if you try to submit a form using your super fingers, the website throws this error: “Trop tard !” (Too late!).

The session ID stays the same, so we can imagine that the image code and the timestamp are stored in our session. To break this captcha, one possible solution is to download the current captcha image and apply an OCR technique.
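
The constraint described above (5 correct submissions within 2 seconds, all in the same session) can be sketched as a small driver loop. This is only an illustration: `fetch_captcha`, `solve` and `submit` are hypothetical callables that the reader would have to provide (download of code.php, OCR, form POST), none of them comes from the challenge itself.

```python
import time

def run_rounds(fetch_captcha, solve, submit, rounds=5, deadline=2.0):
    """Fetch, solve and submit `rounds` captchas in a row.

    fetch_captcha, solve and submit are hypothetical callables supplied by
    the caller; they are assumed to share one session (same cookie).
    Returns True only if every answer was accepted within `deadline` seconds.
    """
    started = time.monotonic()
    for _ in range(rounds):
        image_bytes = fetch_captcha()   # e.g. download code.php
        text = solve(image_bytes)       # OCR step
        if not submit(text):            # the server rejected the answer
            return False
    return (time.monotonic() - started) <= deadline
```

Passing the three steps as functions keeps the timing logic separate from the HTTP and OCR details, which makes it easy to try with fake functions first.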

The OCR

OCR is an acronym for Optical Character Recognition. This technique is used to convert scanned books, documents and letters into electronic data.

An example with Mathematica:

With a free tool like “tesseract”, we are able to reproduce the same result:

Tesseract Open Source OCR Engine
Would he poop on my
kneeeee?

fluxius@wwitb:~/captcha$
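
For scripting, the same command-line call can be wrapped in Python. This is a minimal sketch, assuming a `tesseract` binary on the PATH that accepts `stdout` as its output base (recent releases do; older ones write to a `.txt` file instead). The function names are mine.

```python
import subprocess

def tesseract_command(image_path, lang="eng"):
    """Build the command line; 'stdout' asks tesseract to print the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]

def ocr_file(image_path, lang="eng"):
    """Run tesseract on an image file and return the recognized text."""
    output = subprocess.check_output(tesseract_command(image_path, lang))
    return output.decode("utf-8", "replace").strip()
```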

Image processing

This is the interesting part. Indeed, if we try to recognize our downloaded captchas, we observe that tesseract and even Mathematica are only able to read a few letters. These letters are, for the most part, the ones that are not rotated… So if we want to break this type of “captcha”, we have to separate the letters from the white background and rotate them until we can read every letter.

First and foremost, we need a sample, so we download one:

As I said before, we have to separate the letters from the background. Indeed, the yellow color is difficult to read, and the OCR works much better with black letters on a white background. Thanks to Renaud Lifchitz’s sample, I could do the separation with Mathematica:

If the background is noisy, it is a little more complicated, because we have to pick out each letter by its color.
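
Picking a letter by its color can be sketched in Python with PIL as follows. The function name, the distance measure (sum of absolute RGB differences) and the tolerance value are assumptions made for the illustration, not something taken from the Mathematica notebook.

```python
from PIL import Image

def mask_colour(im, target, tolerance=40):
    """Return a black-on-white "L" image keeping only pixels close to `target`.

    target is an RGB tuple; tolerance is an arbitrary threshold on the sum of
    absolute channel differences (an assumption, to be tuned per captcha).
    """
    rgb = im.convert("RGB")
    out = Image.new("L", rgb.size, 255)          # white canvas
    for y in range(rgb.size[1]):                 # rows
        for x in range(rgb.size[0]):             # columns
            r, g, b = rgb.getpixel((x, y))
            distance = abs(r - target[0]) + abs(g - target[1]) + abs(b - target[2])
            if distance <= tolerance:
                out.putpixel((x, y), 0)          # keep this pixel as black ink
    return out
```

Calling it once per letter color would give one clean black-on-white image per letter, even when the background contains other colors.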

Having a list of letters in black, we have to rotate them (ImageRotate) to perform the recognition with the OCR function:

TextRecognize is designed to convert a scanned page of a book (we can also specify a language to perform the recognition), so it does not matter if the rotation is not perfect.

Finally, as we should know, the TextRecognize function does not work on a single letter, which means we have to re-assemble our letters and then use this OCR function, as shown below:

The letter “i” has been turned into “|]”. This is strange, but if we apply white padding, TextRecognize only displays the number “3”. So there are workarounds, like replacing the string “|]” with “i” (but if you know how to solve this properly, tell me!).

Breaking captchas using Python

After this analysis, we will do the same processing using only Python, which has a lot of tricks up its sleeve.

We need to separate each letter from the background by filtering out the white color. To do this, we open the image file and analyze its histogram, which is a list of pixel counts per color.

from PIL import Image

im = Image.open("code.php.png")
im = im.convert("P")  # palette mode, as in a GIF (up to 256 colors)
print(im.histogram())  # number of pixels for each color ID

The result:

[0, 87, 118, 89, 9285, 0, 21, 24, 15, 18, 21, 18, 16, 30, 24, 28, 30, 21, 19, 15, 24, 15, 25, 17, 9, 17, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

There are 9285 pixels for the color ID “4”. We suppose this color ID corresponds to white, because the background is essentially white (duh…).
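
Instead of reading the histogram by eye, the most frequent color ID can be picked automatically. A minimal sketch, assuming the background is always the dominant color; the helper name and the synthetic test image are mine.

```python
from PIL import Image

def background_index(im):
    """Return the palette index covering the most pixels of a mode "P" image."""
    hist = im.histogram()
    return max(range(len(hist)), key=lambda i: hist[i])

# Synthetic stand-in for the captcha: color ID 4 everywhere, a few "letter" pixels
demo = Image.new("P", (20, 10), 4)
for x in range(5, 9):
    demo.putpixel((x, 3), 12)
print(background_index(demo))
```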

To be sure, we will separate each letter from the background:

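# Assumption: im2 starts as an all-white canvas with the same size as the captcha
im2 = Image.new("P", im.size, 255)
for y in range(im.size[1]):      # rows
  for x in range(im.size[0]):    # columns
    pix = im.getpixel((x, y))
    if pix != 4:                 # 4 is the color ID of the white background
      im2.putpixel((x, y), 0)    # everything else is drawn in black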

Using the method “show()” on “im2”, we display the result:

After that, we try to read it with tesseract:

Tesseract Open Source OCR Engine
Z%_;i%L ← DO YOU SPEAK ENGLISH? Oo

The second, third and fifth characters are not in [a-zA-Z0-9], so if we want to automate the processing, we can use a simple regex to detect whether each character is a letter or a number.
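
The regex check can be sketched like this. The helper name is mine, and the sample string is the tesseract reading shown above.

```python
import re

ALLOWED = re.compile(r"[a-zA-Z0-9]")

def suspicious_positions(text):
    """Return the indexes of the characters that are outside [a-zA-Z0-9]."""
    return [i for i, ch in enumerate(text) if not ALLOWED.match(ch)]

print(suspicious_positions("Z%_;i%L"))
```

Note that the OCR may return more characters than there are letters in the image (here 7 characters for a 6-letter captcha), so a position in the string does not always map directly onto a letter index.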

As in the previous example with Mathematica, we rotate the letters that do not match our predefined alphabet until they are recognized:

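import re
from pytesser import image_to_string

def rotc(image, rotations):
  '''
     Rotate a letter with a list of angles until the OCR reads it
     image - Image to rotate
     rotations - list of angles (in degrees), tried in order
     Returns (rotated image, recognized character)
  '''
  for rotation in rotations:
    # Rotate on a transparent layer, then flatten it onto a white background
    torot = image.convert('RGBA')
    rot = torot.rotate(rotation, expand=1)
    fff = Image.new('RGBA', rot.size, (255,)*4)
    out = Image.composite(rot, fff, rot).convert(image.mode)
    alpha = re.match(r"[a-zA-Z0-9]", image_to_string(out))
    if alpha is not None:
      return (out, alpha.group(0))
  # No angle gave a readable character: hand the letter back unchanged
  return (image, None)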
 
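inletter = False
foundletter = False
start = 0
end = 0
imList = []
# Scan the image column by column; a letter is a run of columns containing ink
for y in range(im2.size[0]):      # y is the column index here
  for x in range(im2.size[1]):    # x is the row index
    pix = im2.getpixel((y, x))
    if pix != 255:                # any non-white pixel belongs to a letter
      inletter = True
  if not foundletter and inletter:
    foundletter = True
    start = y                     # first column of the letter
  if foundletter and not inletter:
    foundletter = False
    end = y                       # first empty column after the letter
    im3 = im2.crop((start, 0, end, im2.size[1]))
    imList.append(im3)
  inletter = False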

imList[1] = rotc(imList[1], [-30, 30])[0]
imList[2] = rotc(imList[2], [30, -30])[0]
imList[4] = rotc(imList[4], [-30, 30])[0]

The result:

Captcha:
Text: ZB3iyL

#Win!
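
The column scan used above to cut out the letters can also be packaged as a reusable function. This is a sketch rather than the code from the sources: it assumes a black-on-white image whose background value is 255, and unlike the loop above it also keeps a letter that touches the right edge.

```python
from PIL import Image

def split_letters(im, background=255):
    """Cut an image into one crop per run of columns that contain ink."""
    width, height = im.size
    letters = []
    start = None
    for x in range(width):
        has_ink = any(im.getpixel((x, y)) != background for y in range(height))
        if has_ink and start is None:
            start = x                                   # first column of a letter
        elif not has_ink and start is not None:
            letters.append(im.crop((start, 0, x, height)))
            start = None
    if start is not None:                               # letter touching the right edge
        letters.append(im.crop((start, 0, width, height)))
    return letters
```

Letters that overlap horizontally would come out as a single crop with this approach, so it only suits captchas where the characters are well separated.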

False positive (paradox of OCRs)

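Yes… I cheated a little by using a “manual algorithm” to rotate the letters before recognizing them with tesseract. But the letters are rotated randomly, so let’s write a kind of (dirty) intelligence:

# string is assumed to hold the first OCR reading of the captcha
# buildimg() is not shown in this post: it is assumed to re-assemble
# the letters of imList into one image
for n in range(5):
  alpha = re.match(r"[a-zA-Z0-9]", string[n])
  if alpha is None:
    # Unreadable character: rotate the letter, then read the whole captcha again
    imList[n] = rotc(imList[n], [-30, 30])[0]
    new_image = buildimg()
    string2 = image_to_string(new_image)
    print(string2)
    string = string2.replace(" ", "")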

Result:

Tesseract Open Source OCR Engine
First reading: Z%_;i%L
Tesseract Open Source OCR Engine
Tesseract Open Source OCR Engine
Z B `;i€L

Tesseract Open Source OCR Engine
Tesseract Open Source OCR Engine
ZB,,,i%L

Tesseract Open Source OCR Engine
Tesseract Open Source OCR Engine
ZB,,I'%L

ZB,,I'%L ← The final string

#Fail!
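
The loop above relies on a helper that glues the letters back into one image before each OCR pass. Its code is not shown in the post, so here is a hypothetical stand-in, assuming the letters are black-on-white "L" mode crops. The name `build_image` and the padding value are mine.

```python
from PIL import Image

def build_image(letters, padding=4):
    """Paste letter images side by side on a white "L" canvas."""
    width = sum(l.size[0] for l in letters) + padding * (len(letters) + 1)
    height = max(l.size[1] for l in letters) + 2 * padding
    canvas = Image.new("L", (width, height), 255)
    x = padding
    for letter in letters:
        canvas.paste(letter, (x, padding))    # top-aligned, with a white margin
        x += letter.size[0] + padding
    return canvas
```

The white margin around and between the letters tends to matter for OCR engines, which were built for pages of text rather than tightly cropped glyphs.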

But if we add a static stub:

...
  if alpha is None:
    if n == 2:
      imList[n] = rotc(imList[n], [30, -30])[0]
    else:
      imList[n] = rotc(imList[n], [-30, 30])[0]

We get “ZB3iyL”, as expected. #win?

Vector comparisons

Using the method “getdata()” on any image, you get something like this:

..255, 255, 255, 255, 255, 0, 0, 0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 0,..

Looking at the image data, our solution is to divide each letter, cut out from the background, into 4 cells and compare the two cells at the top (good idea, isn’t it?). I mean:

We suppose that if the left cell has fewer black pixels than the right cell, then we have to rotate to the left; otherwise, to the right:

black1 = 0
black2 = 0
for y in range(imList[1].size[0] // 2):
  for x in range(imList[1].size[1]):
    pix = imList[1].getpixel((y, x))
    if pix == 0:  # black pixel
      if x > imList[1].size[1] // 2:
        black2 += 1
      else:
        black1 += 1

print('Colors (left right) : %d %d' % (black1, black2))

And normally we should get more black pixels on the right than on the left:

Colors (left right) : 0 26 # Win!
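
The same comparison can be written as a function, with PIL’s (column, row) coordinate order made explicit. This is a sketch under assumptions: ink pixels have value 0, and the rotation rule is the left/right rule stated above, with positive angles rotating counter-clockwise as in PIL. Both function names are mine.

```python
from PIL import Image

def top_balance(letter, ink=0):
    """Count ink pixels in the top-left and top-right cells of a letter image."""
    width, height = letter.size
    left = right = 0
    for y in range(height // 2):            # top half only
        for x in range(width):
            if letter.getpixel((x, y)) == ink:
                if x < width // 2:
                    left += 1
                else:
                    right += 1
    return left, right

def rotation_order(letter, angle=30):
    """Angles to try, starting with the direction suggested by the balance."""
    left, right = top_balance(letter)
    return [angle, -angle] if left < right else [-angle, angle]
```

The result of `rotation_order` can be passed directly as the second argument of a function like `rotc`, so each letter gets its most likely rotation tried first.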

To finish, we change our previous stub:

  if alpha is None:
    if the_balance[0] < the_balance[1]:
      imList[n] = rotc(imList[n], [30, -30])[0]
    else:
      imList[n] = rotc(imList[n], [-30, 30])[0]

And we get a fully automated “des chiffres et des lettres” captcha breaker:

Tesseract Open Source OCR Engine
ZB 3iyL

ZB3iyL

#EpicWin!

This technique works for a large number of tries, but it fails when the OCR confuses one letter with another.

Resources

Sources: CaptchaBreaker.tar.gz (Warning! It’s very dirty ;))
Mathematica Notebook: captcha.nb
pytesser: http://code.google.com/p/pytesser/

References

[0] RSSIL Website – http://www.rssil.org/
[1] RSSIL 2011 Write-Ups – http://www.segmentationfault.fr/securite-informatique/rssil-2011-write-ups/
[2] The Incredible Convenience of Mathematica Image Processing – http://blog.wolfram.com/2008/12/01/the-incredible-convenience-of-mathematica-image-processing/
[3] Mathematica, Image Processing & Analysis – http://reference.wolfram.com/mathematica/guide/ImageProcessing.html
[4] Decoding Captcha – http://www.wausita.com/captcha/
[5] Tesseract – http://code.google.com/p/tesseract-ocr/

Just for Fun

[jff_1] Numbers and letters – http://www.youtube.com/watch?v=ViFd9fyjCZk
[jff_2] Des Chiffres et des lettres – http://www.youtube.com/watch?v=v96Hovtz7DM


2 Responses to RSSIL [captcha]: Des chiffres et des lettres

  1. pierz says:

    Hi fluxius 🙂

    I also code OCR for some popular CMS, and your solution for letter derotation (divide the picture by 4…) looks nice.

    • FlUxIuS says:

      Thank you pierz =)
      There are a lot of ways to solve this kind of captcha, including generating a large set of samples to compare with the current image. This one is simple and effective, but we can improve this algorithm a lot more.
      If you have some interesting OCR for difficult captchas, let me know, because I’m studying some cases for better recognition of letters, elements in a picture and so on.
      By the way, might you be interested in image processing projects?