Zero-Width Character Steganography
Zero-width character steganography is one of my favorite steganography techniques due to its genius simplicity. For the uninitiated, steganography is the act of hiding data, such as a plaintext message, inside some form of cover data. Zero-width character steganography can embed any encodable data into text-based cover data by taking advantage of some of Unicode’s zero-width characters:
0x200B: Zero-width space
0x200C: Zero-width non-joiner
0x200D: Zero-width joiner
0x2060: Word joiner
0xFEFF: Zero-width no-break space
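As a quick sanity check, the sketch below uses Python's standard unicodedata module to look up the official names of the code points listed above, and shows that such characters still count towards a string's length even though they take up no space on screen. The variable names are only illustrative.

```python
import unicodedata

# Code points listed above.
ZERO_WIDTH = [0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF]

# Look up the official Unicode name of each code point.
names = {cp: unicodedata.name(chr(cp)) for cp in ZERO_WIDTH}
for cp, name in names.items():
    print(f"0x{cp:04X}  {name}")

# The hidden characters still count towards the string's length.
sample = "HI" + "".join(chr(cp) for cp in ZERO_WIDTH)
print(len(sample))  # 7, although only "HI" is visible
```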
Unicode’s zero-width characters are, well, zero-width, making them a perfect vessel for steganography: they are invisible to the eye and go unnoticed without further analysis. Encoding arbitrary data as runs of adjacent zero-width characters requires at least two distinct zero-width characters and places no restriction on the covertext size. With a clever trick, a single zero-width character is enough if the characters are placed non-adjacently, but the covertext size then limits how much data can be hidden.
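This post does not spell out the single-character trick. One plausible reading, and it is only an assumption on my part, is that each gap between two covertext characters carries one bit: the zero-width character is present for 1 and absent for 0. That would explain the size restriction, since n visible characters offer only n - 1 gaps. A minimal sketch under that assumption, with hypothetical function names:

```python
ZW = "\u200b"  # the single zero-width character used as a marker

def embed_gaps(cover: str, bits: str) -> str:
    """Hide bits in the gaps between cover characters (assumed scheme)."""
    if not cover:
        raise ValueError("cover must not be empty")
    gaps = len(cover) - 1
    if len(bits) > gaps:
        raise ValueError(f"cover has {gaps} gaps but {len(bits)} bits were given")
    out = [cover[0]]
    for i, visible in enumerate(cover[1:]):
        # A marker before this character means 1; no marker means 0.
        if i < len(bits) and bits[i] == "1":
            out.append(ZW)
        out.append(visible)
    return "".join(out)

def extract_gaps(stego: str) -> str:
    """Read one bit per gap: 1 if a marker sits in it, otherwise 0."""
    bits = []
    marker_seen = False
    for ch in stego[1:]:
        if ch == ZW:
            marker_seen = True
        else:
            bits.append("1" if marker_seen else "0")
            marker_seen = False
    return "".join(bits)

# "HELLO WORLD" has 11 characters, so it can carry at most 10 bits this way.
stego = embed_gaps("HELLO WORLD", "0100110100")
print(extract_gaps(stego))
```

Note that this decoder always returns one bit per gap, so a shorter message comes back padded with trailing zeros; a real scheme would need some agreed way to mark the end of the data.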
Let’s suppose that we wanted to steganographically embed the plaintext string MEET AT DAWN in the covertext HELLO WORLD using Unicode’s zero-width space (0x200B) and zero-width no-break space (0xFEFF).
We begin by defining our characters. Notice that we are using two zero-width characters, which lets us represent a binary numbering system.
Let 0x200B = 0;
Let 0xFEFF = 1;
Then, we encode our plaintext string as 8-bit ASCII, written in our binary numbering system.
MEET AT DAWN
= 01001101 01000101 01000101 01010100 00100000 01000001 01010100 00100000 01000100 01000001 01010111 01001110
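The binary string above can be reproduced with a couple of lines of Python. This is just a sketch of the encoding step, assuming the plaintext is pure ASCII:

```python
plaintext = "MEET AT DAWN"

# Encode as ASCII, then render each byte as eight binary digits.
bits = " ".join(f"{byte:08b}" for byte in plaintext.encode("ascii"))
print(bits)
```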
We now perform a series of substitutions using our character definitions.
01001101 01000101 01000101 01010100 00100000 01000001 01010100 00100000 01000100 01000001 01010111 01001110
= 0x200B 0xFEFF 0x200B 0x200B 0xFEFF 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0xFEFF 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0xFEFF 0xFEFF 0xFEFF 0x200B
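Doing 96 substitutions by hand is tedious, so here is a small sketch of the same step in Python. The BIT_TO_CHAR name is my own; the mapping itself follows the character definitions given earlier (0x200B = 0, 0xFEFF = 1).

```python
# Character definitions from the text: 0x200B = 0 and 0xFEFF = 1.
BIT_TO_CHAR = {"0": "\u200b", "1": "\ufeff"}

bits = "".join(f"{byte:08b}" for byte in "MEET AT DAWN".encode("ascii"))
payload = "".join(BIT_TO_CHAR[bit] for bit in bits)

# Print the code points, since the characters themselves are invisible.
print(" ".join(f"0x{ord(ch):04X}" for ch in payload))
```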
The covertext HELLO WORLD in Unicode (UTF-16) is:
HELLO WORLD
= 0x0048 0x0045 0x004c 0x004c 0x004f 0x0020 0x0057 0x004f 0x0052 0x004c 0x0044
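The same listing can be generated in Python. The sketch relies on the fact that every character in this covertext sits in the Basic Multilingual Plane, so each one is a single UTF-16 code unit equal to its code point:

```python
covertext = "HELLO WORLD"

# Each character here is one UTF-16 code unit equal to its code point.
units = [f"0x{ord(ch):04x}" for ch in covertext]
print(" ".join(units))
```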
Our encoded plaintext from before can now be embedded into the covertext data, here between the W and the O of WORLD.
0x0048 0x0045 0x004c 0x004c 0x004f 0x0020 0x0057 0x200B 0xFEFF 0x200B 0x200B 0xFEFF 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0xFEFF 0x200B 0x200B 0x200B 0x200B 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0x200B 0xFEFF 0xFEFF 0xFEFF 0x200B 0xFEFF 0x200B 0x200B 0xFEFF 0xFEFF 0xFEFF 0x200B 0x004f 0x0052 0x004c 0x0044
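Putting the steps together, a minimal encoder might look like the sketch below. The embed function and its position parameter are hypothetical names of mine, and the code assumes an ASCII plaintext.

```python
BIT_TO_CHAR = {"0": "\u200b", "1": "\ufeff"}

def embed(cover: str, secret: str, position: int) -> str:
    """Insert the zero-width encoding of secret into cover at position."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("ascii"))
    payload = "".join(BIT_TO_CHAR[bit] for bit in bits)
    return cover[:position] + payload + cover[position:]

# Position 7 places the payload right after the "W", as in the listing above.
stego = embed("HELLO WORLD", "MEET AT DAWN", 7)
print(len(stego))  # 107: 11 visible characters plus 96 hidden ones
```

The insertion point is arbitrary; any position in the covertext works equally well for this scheme.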
The following page contains the result of our steganographic operations: Demo
On the surface, it looks like nothing more than the words “HELLO WORLD,” but viewing the source code reveals more than meets the eye:
<p>HELLO W​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​ORLD</p>
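If a source viewer isn't at hand, the hidden characters can also be surfaced programmatically. The sketch below swaps each zero-width character for a visible tag; reveal is a hypothetical helper of mine, not something the demo page provides.

```python
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def reveal(text: str) -> str:
    """Replace each zero-width character with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ch in ZERO_WIDTH else ch for ch in text
    )

print(reveal("HELLO W\u200b\ufeffORLD"))
# HELLO W<U+200B><U+FEFF>ORLD
```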
The following Python script can be used to retrieve this data and extract the embedded plaintext:
#!/usr/bin/env python3
import html
import re
import requests

# Define the steganographic encoding.
ENCODING = {
    "\u200b": "0",  # Zero-width space
    "\ufeff": "1",  # Zero-width no-break space
}

# Define the number base.
BASE = 2

def main():
    # Get the page and extract the steganographic data.
    page = requests.get("https://shawnd.xyz/blog/uploads/2021-04-01/demo.html").content.decode()
    steg = "".join(re.findall("<p>(.+?)</p>", page))
    # Resolve character references (e.g. &#8203;) in case the page uses
    # them instead of literal zero-width characters.
    steg = html.unescape(steg)
    # Extract the zero-width characters and decode them.
    extracted = "".join(ENCODING[char] for char in steg if char in ENCODING)
    # Split the extracted data into 8-bit bytes.
    chunks = [extracted[8*i:8*(i+1)] for i in range(len(extracted)//8)]
    # Convert the binary to an ASCII plaintext.
    plaintext = "".join(chr(int(chunk, BASE)) for chunk in chunks)
    # Print out the plaintext.
    print(plaintext)

if __name__ == "__main__":
    main()
Running this script on the target steganographic data yields the following output:
[skat@osiris:~/dl] $ ./script.py
MEET AT DAWN
Awesome! Our proof of concept worked, and we’ve demonstrated how zero-width characters can be used for steganographic operations.
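The script above depends on the demo page being reachable. For experimenting offline, here is a minimal sketch of just the decoding logic applied to a local string. The extract function is a hypothetical name, and the tiny sample is built by hand rather than fetched.

```python
ENCODING = {"\u200b": "0", "\ufeff": "1"}

def extract(stego: str) -> str:
    """Recover the hidden ASCII text from a string carrying the payload."""
    bits = "".join(ENCODING[ch] for ch in stego if ch in ENCODING)
    # Ignore any trailing bits that do not fill a whole byte.
    usable = len(bits) - len(bits) % 8
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, usable, 8))

# Build a small sample by hand: the 16 bits of "HI" hidden inside "OK".
hidden = "".join("\ufeff" if b == "1" else "\u200b" for b in "0100100001001001")
print(extract("O" + hidden + "K"))  # HI
```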
Note that this example used two zero-width characters and a binary numbering system, but this form of steganography can be implemented with as many zero-width characters as Unicode provides, using a numbering system whose base matches the number of zero-width characters used.
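To make the generalization concrete, here is a sketch that uses all five zero-width characters listed at the top of this post as a base-5 alphabet. The digit assignment and the fixed number of digits per byte are my own assumptions; any consistent convention would do.

```python
# Five zero-width characters give a base-5 numbering system.
ALPHABET = ["\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"]
BASE = len(ALPHABET)

# Digits needed per byte: the smallest WIDTH with BASE**WIDTH >= 256.
WIDTH = 1
while BASE ** WIDTH < 256:
    WIDTH += 1

def encode(data: bytes) -> str:
    """Write each byte as WIDTH base-BASE digits, most significant first."""
    out = []
    for byte in data:
        digits = []
        for _ in range(WIDTH):
            byte, digit = divmod(byte, BASE)
            digits.append(ALPHABET[digit])
        out.extend(reversed(digits))
    return "".join(out)

def decode(payload: str) -> bytes:
    """Inverse of encode for payloads produced by it."""
    values = [ALPHABET.index(ch) for ch in payload]
    result = bytearray()
    for i in range(0, len(values) - WIDTH + 1, WIDTH):
        byte = 0
        for digit in values[i:i + WIDTH]:
            byte = byte * BASE + digit
        result.append(byte)
    return bytes(result)

payload = encode(b"MEET AT DAWN")
print(len(payload))  # 48 hidden characters instead of 96 in binary
```

A larger base shrinks the payload, at the cost of relying on more distinct zero-width characters surviving whatever channel carries the text.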
This technique’s genius simplicity is what makes it one of my favorites, and it also gives beginners a peek into the kinds of steganography that exist and the general principles driving these techniques.
Happy hacking!