Home Google CTF | Mixed
Ctfpost
Cancel

Google CTF | Mixed

A company I interview for gave me a test to solve. I'd decompile it and get the answer that way, but they seem to use some custom Python version... At least I know it's based on commit 64113a4ba801126028505c50a7383f3e9df29573.

I really enjoyed this challenge - it was a fun puzzle to figure out! f0xtr0t and I worked on it together, and we were able to contribute different skillsets to solve it. You can see his write-up here, which covers a lot of the tooling side of things I skim over.

On to the challenge! We’re given x.pyc and patch. But it seems like we can’t run this .pyc file with the given Python version for some reason…

A Bit About Python

What is a .pyc file? If you’ve worked with Python before, you may have seen these files in a __pycache__ directory in your project’s folder. These files are generated by the Python interpreter, which amongst other things, compiles your script into Python bytecode, and stores it in the .pyc files.

Why does it do this? When Python runs, it normally compiles scripts into bytecode and runs it in memory. To save a little bit on time, it can do the compilation ahead of time, so that during runtime it just runs the existing bytecode.

"How Python works" Diagram

Approach

We can put together that x.pyc is generated by the cpython interpreter at commit 64113a4ba801126028505c50a7383f3e9df29573 with patch applied.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-def_op('CACHE', 0)
-def_op('POP_TOP', 1)
-def_op('PUSH_NULL', 2)
...
+perm = list(range(90))
+perm2 = list(range(200-90))
+
+import sys
+if "generate_opcode_h" in sys.argv[0]:
+  random = __import__("random")
+  random.shuffle(perm)
+  random.shuffle(perm2)
+
+def_op('CACHE', perm[0])
+def_op('POP_TOP', perm[1])
+def_op('PUSH_NULL', perm[2])
...

Reading patch, we can see that they split the opcodes into two sets - the first 90 and next 86- and randomly permutate their values. So the .pyc file we’re given is generated with these randomized opcodes and we need to figure out the permutation. Once we’ve done this, we have two options to grab the flag - patch cpython ourselves with those permutations and run x.pyc or write a decompiled version of x.pyc that we can run with a standard version of Python.

Some key points that f0xtr0t and I took advantage to unravel the new opcode-value mappings:

  • Only the opcode values are modified. The opcode argument and other fields in the code object do not change. This means we can make educated guesses of what the Python code may look like, compile it, and compare it against x.pyc, looking for patterns.
  • Strings are not modified. They can give us information that may help us determine the structure of the original script.
  • Tooling makes it much easier.

I looked for structural similarities between test.pyc and x.pyc and made guesses on a large number of the bytecode values — something that I found hard to put into words. Yay for pattern recognition?

You just gotta feel the bytecode. - zaratec, 2022

f0xtr0t dealt primarily with disassembling x.pyc with our guessed bytecode values, which later involved writing a whole custom disassembler from scratch 😥.

Honestly, to each other, the other’s work was ✨magic✨. But it got the job done.

Teamwork makes the dream work

Set Up & Tooling

We want to make sure any Python .pyc we generate is as close as possible to x.pyc. Python bytecode and its structure can change pretty drastically between versions, so we need to make sure we’re using the same version.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
$ git clone -n https://github.com/python/cpython.git
Cloning into 'cpython'...
remote: Enumerating objects: 901673, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 901673 (delta 7), reused 23 (delta 6), pack-reused 901636
Receiving objects: 100% (901673/901673), 480.36 MiB | 4.99 MiB/s, done.
Resolving deltas: 100% (714840/714840), done.
$ cd cpython
$ git checkout 64113a4ba801126028505c50a7383f3e9df29573
$ ./configure
$ make
# This creates a python.exe binary in this cpython directory
# I created an alias "mpython" pointing to this binary
$ mpython --version
Python 3.11.0a7+

This should generate the exact same Python bytecode and structure as that in x.pyc, except of course without the opcode values randomly shuffled.

1
2
$ # Compile test.py into test.pyc 
$ mpython -m compileall test.py

Dumping Test Python Bytecode

I ended up combining the code from this blogpost and pycdump into dump.py to dump both Python bytecode and additional information** - number of arguments, constants, names, variable names, etc. Note that we have to run it under the 64113a4ba801126028505c50a7383f3e9df29573 commit.

1
2
$ # Dump disassembled bytecode and additional information
$ mpython dump.py __pycache__/test.cpython-311.pyc

Disassembling x.pyc with Bytecode Guesses

On the other end, we needed a way to start disassembling the x.pyc bytecode using our guesses for the opcode values. f0xtr0t worked on this, and while he initially started out by tweaking dis and opcode to use our guessed values, we quickly ran into some issues. The disassembler would disassemble a few opcodes correctly, but then all would skip over several bytes. After a lot of digging in C land, f0xtr0t pinned this on the fact that marshal reads and uses opcodewhile we were modifying opcode and dis. Turns out running Python code that does stuff with opcode while we’re updating opcode with our guesses isn’t a great idea.

A few hours into us working on the problem, f0xtr0t opted to switch to a manually-written disassembler, as he figured it’d be a lot less work than dealing with the marshal madness. (It was!) The custom guess-disassembler can be found here in manual_disassemble.py. This tooling was absolutely key to making it past a certain stage, providing information that clued us into certain opcodes’ purposes.

You’ll need to run it with Python 3.11. f0xtr0t showed me pyenv, a Python version management tool, which was pretty useful. Use pyenv install 3.11.0b3 to install and pyenv local 3.11.0b3 to use.*

I provided the code object bytecode offsets and sizes to pull from x.pyc:

Name Offset Size
<module> 0x2a 0x48
ks 0x94 0x76
cry 0x20e 0xa0
fail 0x370 0x70
game1 0x489 0x2fc
w 0x7a2 0x1e
game2 0xc0c 0x1ba
game3 0x105f 0x2ae
main 0x1654 0x1a4

Pattern Recognition Magic & Thinking Like a Programmer

Bear with me as I attempt to make sense of my brain’s thought process.

Python Scripts Typically Start with import Statements

The string dump of x.pyc reveals a few module names – random, sys, and time. We can write a test script that imports them, and then compile and dump it with our tool.

1
2
import random
import sys
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
code b'9700640064016c005a00640064016c015a0164015300'
              0   151 RESUME                            0		        # 0x97 0x00

  1           2   100 LOAD_CONST                        0 (0)			# 0x64 0x00
              4   100 LOAD_CONST                        1 (None)		# 0x64 0x01
              6   108 IMPORT_NAME                       0 (random)		# 0x6c 0x00
              8    90 STORE_NAME                        0 (random)		# 0x5a 0x00
             10   100 LOAD_CONST                        1 (None)		# 0x64 0x01
  2          10   100 LOAD_CONST                        0 (0)			# 0x64 0x00
             12   100 LOAD_CONST                        1 (None)		# 0x64 0x01
             14   108 IMPORT_NAME                       1 (sys)			# 0x6c 0x01
             16    90 STORE_NAME                        1 (sys)			# 0x5a 0x01
             18   100 LOAD_CONST                        1 (None)		# 0x64 0x01
             20    83 RETURN_VALUE                        		        # 0x53 0x00

Each import corresponds to the following bytecode, where N is the Nth import:

1
2
3
4
LOAD_CONST 0
LOAD_CONST 1
IMPORT_NAME N
STORE_NAME N

This pattern repeats 3 times.

Repeating pattern of bytes that might be import

So we have guesses for the new values of LOAD_CONST, IMPORT_NAME, and STORE_NAME:

Opcode New Value Original Value
LOAD_CONST 96 (0x60) 100 (0x64)
IMPORT_NAME 171 (0xAB) 108 (0x6c)
STORE_NAME 150 (0x96) 90 (0x5a)

Scripts Usually Have Functions Like main

Glancing back at the strings dump, we see mention of main. We can make an educated guess that there’s at least one function, main, maybe more. What does this look like in compiled bytecode?

1
2
def main():
    return 123
1
2
3
4
5
6
7
8
9
10
11
12
13
14
              0   151 RESUME                            0

  1           2   100 LOAD_CONST                        0 (<code object main at 0x7ff22d4a66b0, file "test_02.py", line 1>)
              4   132 MAKE_FUNCTION                     0
              6    90 STORE_NAME                        0 (main)
              8   100 LOAD_CONST                        1 (None)
             10    83 RETURN_VALUE

Disassembly of <code object main at 0x7ff22d4a66b0, file "test_02.py", line 1>:
  1           0   151 RESUME                            0

  2           2   100 LOAD_CONST                        1 (123)
              4    83 RETURN_VALUE
   consts

Functions start with a RESUME 0 and end with a RETURN_VALUE. This checks out with the dis documentation where RESUME 0 indicates the start of a function for “internal tracing, debugging, and optimization check” purposes.RETURN_VALUE will return the value at the top of the stack to the calling function.

The <module> code object creates all the functions in the script, storing their code objects under their names. This happens at the very beginning of <module> after modules are imported.

1
2
3
LOAD_CONST 0		# Arg: index into consts table for function code object
MAKE_FUNCTION 0		# 
STORE_NAME 0		# Arg: index into name table for function name

We know the values for LOAD_CONST and STORE_NAME, so we can easily looking for this pattern to figure out the value of MAKE_FUNCTION. We find 7 instances of it, suggesting that there are 7 functions, including mainks, cry, fail, game1, game2, game3, and main.

Pattern of bytes that may be function creation

We know the values of RESUME, MAKE_FUNCTION, and RETURN_VALUE:

Opcode New Value Original Value
RESUME 97 (0x61) 151 (0x97)
MAKE_FUNCTION 99 (0x63) 132 (0x84)
RETURN_VALUE 49 (0x31) 83 (0x53)

Some Functions Like print and sys.exit Have a Well-Known Structure

Functions’ code objects are followed by the const and name tables. We can see one of the code objects has some interesting values:

Constant strings

With "Thanks for playing!", print, sys, and exit, we can guess that the structure of this function looks something like the following:

1
2
3
4
5
6
7
import sys

def func():
    print("Thanks for playing!")
    sys.exit(0)
    
func()

Decompiling this with our dump tool, we get the following for func:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
  3           0   151 RESUME                            0

  4           2   116 LOAD_GLOBAL                       1 (NULL + print)
             14   100 LOAD_CONST                        1 ('Thanks for playing!')
             16   166 PRECALL                           1
             20   171 CALL                              1
             30     1 POP_TOP

  5          32   116 LOAD_GLOBAL                       3 (NULL + sys)
             44   106 LOAD_ATTR                         2 (exit)
             54   100 LOAD_CONST                        2 (0)
             56   166 PRECALL                           1
             60   171 CALL                              1
             70     1 POP_TOP
             72   100 LOAD_CONST                        0 (None)
             74    83 RETURN_VALUE
   consts
      None
      'Thanks for playing!'
      0
   names ('print', 'sys', 'exit')

It seems like the way print and sys.exit’s code objects are grabbed is different, but the actual calling of the function is the same:

1
2
3
PRECALL N			# Arg: Number of function arguments (N)
CALL N				# Arg: Number of function arguments (N)
POP_TOP 			# Remove whatever is at the top of the stack after the function returns

NOTE: We decided that 0x20 must be padding or NOP-like opcodes, as there were a lot of them — too many to make sense to be anything else. f0xtr0t set up our guess-decompiler to ignore them.

Knowing that both print and sys.exit should take 1 argument each, the pattern to look for was PRECALL 1; CALL 1; POP_TOP 0. We see this repeated at least twice with BF 01 8E 01 01 00.

Pattern of bytes that may be function calls

Both this print and sys.exit would have static values. These values would be grabbed from the constants table and pushed onto the stack immediately before PRECALL using the LOAD_CONST opcode. We already know that LOAD_CONST is 0x60 and can see the 0x60 0x01 and 0x60 0x02. We can assume that these are respectively "Thanks for playing!" and 0.

Now the last thing we need to figure out is the way that print and sys.exit’s code objects are grabbed. From our example, we see print is only a LOAD_GLOBAL, whereas sys.exit is a LOAD_GLOBAL followed by a LOAD_ATTR. We see this pattern with B9 00 for print("Thanks for playing!")) and B9 02 BC 02 for sys.exit(0).

Now we have the following new opcode values:

Opcode New Value Original Value
LOAD_GLOBAL 185 (0xb9) 116 (0x74)
LOAD_ATTR 188 (0xbc) 106 (0x6a)
PRECALL 191 (0xbf) 166 (0xa6)
CALL 142 (0x8e) 171 (0xab)
POP_TOP 1 (0x1) 1 (0x1)

game2 likely deals with a lot of math operations

The strings for the game2 function mention several math operations: sum, difference, product, ratio, and remainder from division. We can write a function to perform these operations.

Some strings that suggest math operations

1
2
3
4
5
6
def game2(a,b):
    print(a+b)
    print(a-b)
    print(a*b)
    print(a//b)
    print(a%b)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
  1           0   151 RESUME                            0

  2           2   116 LOAD_GLOBAL                       1 (NULL + print)
             14   124 LOAD_FAST                         0 (a)
             16   124 LOAD_FAST                         1 (b)
             18   122 BINARY_OP                         0 (+)
             22   166 PRECALL                           1
             26   171 CALL                              1
             36     1 POP_TOP

  3          38   116 LOAD_GLOBAL                       1 (NULL + print)
             50   124 LOAD_FAST                         0 (a)
             52   124 LOAD_FAST                         1 (b)
             54   122 BINARY_OP                        10 (-)
             58   166 PRECALL                           1
             62   171 CALL                              1
             72     1 POP_TOP

  4          74   116 LOAD_GLOBAL                       1 (NULL + print)
             86   124 LOAD_FAST                         0 (a)
             88   124 LOAD_FAST                         1 (b)
             90   122 BINARY_OP                         5 (*)
             94   166 PRECALL                           1
             98   171 CALL                              1
            108     1 POP_TOP

  5         110   116 LOAD_GLOBAL                       1 (NULL + print)
            122   124 LOAD_FAST                         0 (a)
            124   124 LOAD_FAST                         1 (b)
            126   122 BINARY_OP                         2 (//)
            130   166 PRECALL                           1
            134   171 CALL                              1
            144     1 POP_TOP

  6         146   116 LOAD_GLOBAL                       1 (NULL + print)
            158   124 LOAD_FAST                         0 (a)
            160   124 LOAD_FAST                         1 (b)
            162   122 BINARY_OP                         6 (%)
            166   166 PRECALL                           1
            170   171 CALL                              1
            180     1 POP_TOP
            182   100 LOAD_CONST                        0 (None)
            184    83 RETURN_VALUE

Rather than using individual opcodes per each arithmetic operation, the Python compiler opts to use the BINARY_OP for each, with the argument specifying the operation. We can see the values for each operation listed in opcode.h. So we should expect to see an opcode with arguments 0x00 (sum), 0x0a (difference), 0x05 (product), 0x02 (ratio), and 0x06 (remainder from division). There are two instances of 0x0a as an opcode argument: C6 0A and 60 0A. We already know the latter is LOAD_CONST, so it’s very likely 0xc6 is BINARY_OP.

Patterns of bytes that may be math operations

We see that there’s a cluster of 6 instances of this opcode with the expected arguments, and two more uses later on, probably some additional calculations (ADD and INPLACE_ADD).

Opcode New Value Original Value
BINARY_OP 198 (0xc6) 122 (0x7a)

Functions within Functions

At this point, while f0xtr0t was writing a custom decompiler, he noticed that there may be a stray code object, w in code1 that didn’t match up to any of the 7 functions.

Code object as a constant

He theorized that it could be a function defined within a function:

1
2
3
4
def game1():
    def w(a,b):
        return a+b
    w(1,2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
  1           0   151 RESUME                            0

  2           2   100 LOAD_CONST                        1 (<code object w at 0x7f8fb319e6b0, file "test_05.py", line 2>)
              4   132 MAKE_FUNCTION                     0
              6   125 STORE_FAST                        0 (w)

  4           8     2 PUSH_NULL
             10   124 LOAD_FAST                         0 (w)
             12   100 LOAD_CONST                        2 (1)
             14   100 LOAD_CONST                        3 (2)
             16   166 PRECALL                           2
             20   171 CALL                              2
             30     1 POP_TOP
             32   100 LOAD_CONST                        0 (None)
             34    83 RETURN_VALUE

Disassembly of <code object w at 0x7f8fb319e6b0, file "test_05.py", line 2>:
  2           0   151 RESUME                            0

  3           2   124 LOAD_FAST                         0 (a)
              4   124 LOAD_FAST                         1 (b)
              6   122 BINARY_OP                         0 (+)
             10    83 RETURN_VALUE
   consts
      None
...
   names ()
   varnames ('a', 'b')

In the test w, we can see the number of arguments are defined at a set distance (0x19 bytes) away from the start of the bytecode.

Testing function with 2 arguments

We see that w in x.pyc has 3 arguments.

Testing function with 3 arguments - this value changes

Our test w starts out by loading arguments via LOAD_FAST. We see a pattern of 62 00 62 01 immediately after RESUME, suggesting that these are LOAD_FAST opcodes for the first two arguments. Sometime later, we see 62 02 to load that last argument.

Going back to game1, the function w is defined with a combination of:

1
2
3
LOAD_CONST ?		# Arg: Index into consts tables for code object
MAKE_FUNCTION 0		# 
STORE_FAST ?		# Arg: Store in variable

Since we already know LOAD_CONST and MAKE_FUNCTION, we can look for this pattern towards the beginning of game1’s bytecode.

Pattern of bytes for defining function within function

We see this pattern at the beginning of game1 and considering we haven’t figured out what opcode 0x8d is, it’s likely that it’s STORE_FAST.

Opcode New Value Original Value
LOAD_FAST 98 (0x62) 124 (0x7c)
STORE_FAST 141 (0x8d) 125 (0x7d)

With this, we have everything we need to fully decompile w:

1
2
def w(m,i,j):
        return (m >> (i*10 + j)) & 1

The Rest of It

At this point, f0xtr0t upgraded his customer decompiler to dump additional code object information - name, # of arguments, bytecode, constants, names, variable names, etc. This was a tremendous boon to the rest of our reversing. From here on out, we were able to figure out the rest of the opcodes by disassembling and decompiling x.pyc function by function, and attempting to identify the few unknown opcodes left.

Many of the sequence, string, and list-related opcodes were amongst the last to be figured out, as they oftentimes required some context of the function.

For example, context from the beginning of game3 being decompiled led us to identify LOAD_METHOD

1
2
3
4
5
6
7
8
9
10
11
12
13
14
def game3():
    print("Speed typing game.")
    t = time.time()
    text = '''
  Text: Because of its performance advantage, today many language implementations
  execute a program in two phases, first compiling the source code into bytecode,
  and then passing the bytecode to the virtual machine.
  '''
  words = text ???	
  # 62 01	LOAD_FAST 01 (text)
  # 92 02	???
  # bf 00	PRECALL 00
  # 8e 00	CALL 00
  # 8d 02	STORE_FAST 02 (words)

Thinking like a programmer, it’s likely that text is being split into separate words. Sure enough, split is in the names table and matches up to index 2.

Opcode New Value Original Value
LOAD_METHOD 92 (0x5c) 160 (0xa0)
BINARY_SUBSCR 82 (0x52) 25 (0x19)
UNPACK_SEQUENCE 143 (0x8f) 92 (0x5c)
BUILD_TUPLE 130 (0x82) 102 (0x66)
CONTAINS_OP 117 (0x75) 118 (0x76)
BUILD_LIST 116 (0x74) 103 (0x67)
NOP 27 (0x1b) 9 (0x09)
BUILD_CONST_KEY_MAP 177 (0xb1) 156 (0x9c)
BUILD_STRING 138 (0x8a) 157 (0x9d)

Control Flow

As I started decompiling the functions as best as I could, f0xtr0t worked on identifying and adding control flow, adding the concept of jump markers to the custom disassembler. 0x50 acted as a nop, but seem to denote where a jump/goto would land. He also noticed the use of yield & iters, and added those related opcodes in.

Similar to BINARY_OP, COMPARE_OP performs a certain compare operation dependent on the opcode argument: ('<', '<=', '==', '!=', '>', '>='), as we determined from testing.

Opcode New Value Original Value
<<MARKER>> 50 (0x32) -
COMPARE_OP 104 (0x68) 107 (0x6b)
POP_JUMP_BACKWARD_IF_TRUE 175 (0xaf) 115 (0x73)
POP_JUMP_FORWARD_IF_FALSE 186 (0xba) 114 (0x72)
JUMP_FORWARD 94 (0x5e) 110 (0x6e)
YIELD_VALUE 69 (0x45) 86 (0x56)
JUMP_BACKWARD 123 (0x7b) 140 (0x8c)
GET_ITER 31 (0x1f) 68 (0x44)
FOR_ITER 182 (0xb6) 93 (0x5d)

Decompiled Script

The gist of the script was to play and win each of the 3 games. Each game would return a value that was appended onto the end of seed, which started as "seed:". The final seed would be run through cry with 50 bytes to produce the flag.

Name Goal Returned Value
game1 Navigate maze with your car without crashing or running out of fuel. Fuel stops will give you more fuel. Return each key pressed joined together.
game2 Math quiz. Answer 5 different math questions. Return each solution joined together with a _.
game3 Type a whole paragraph before 20 seconds are up. Return each word uppercased joined together with a _.
1
seed:sssddwwddwddsssdssaaawwssaaaassddddddd:_17_31_72_3_2_:_BECAUSE_OF_ITS_PERFORMANCE_ADVANTAGE,_TODAY_MANY_LANGUAGE_IMPLEMENTATIONS_EXECUTE_A_PROGRAM_IN_TWO_PHASES,_FIRST_COMPILING_THE_SOURCE_CODE_INTO_BYTECODE,_AND_THEN_PASSING_THE_BYTECODE_TO_THE_VIRTUAL_MACHINE._
1
2
3
4
5
6
7
8
9
10
11
12
def ks(seed):
  random.seed(seed)
  while True:
    yield (random.randint(0,255) * 13 + 17) % 256

def cry(s, seed):
  r = []
  for x, y in zip(ks(seed), s):
    r.append(x^y)
  return bytes(r)

print(cry(b'\xa0?n\xa5\x7f)\x1f6Jvh\x95\xcc!\x1e\x95\x996a\x11\xf6OV\x88\xc1\x9f\xde\xb50\x9d\xae\x14\xde\x18YHI\xd8\xd5\x90\x8a\x181l\xb0\x16^O;]', seed).decode())

Flage & Conclusion

While we were cleaning up the script, I decided to yeethaw it and see if my current decompiled version worked well enough. And it did!

All in all, this was a pretty fun challenge that required f0xtr0t’s and my combined expertise to complete. 😁

🚩 CTF{4t_l3ast_1t_w4s_n0t_4n_x86_opc0d3_p3rmut4tion} 🚩

Resources

This post is licensed under CC BY 4.0 by the author.