Base64 basics – What? Why? How?

What is base64 encoding?

Base64 encoding schemes are commonly used to represent binary data as ASCII text.
Any binary data can be written as text using only 64 characters from the ASCII set. The particular set of 64 characters chosen to represent the 64 place-values of the base varies between implementations. The general strategy is to choose 64 characters that are both printable and members of a subset common to most character encodings. This combination makes the data unlikely to be modified in transit through systems, such as email, that were traditionally not 8-bit clean. For example, MIME’s Base64 implementation uses A–Z, a–z, and 0–9 for the first 62 values. Other variations share this property but differ in the symbols chosen for the last two values.
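To illustrate how variants differ only in the last two symbols, here is a small sketch using Python's standard `base64` module. The two input bytes are an arbitrary choice (not from the text above), picked so that the output uses values 62 and 63.

```python
import base64

# Two bytes chosen so that the encoded output uses alphabet values 62 and 63.
data = bytes([0xFB, 0xFF])

standard = base64.b64encode(data)          # standard alphabet: '+' and '/'
url_safe = base64.urlsafe_b64encode(data)  # URL-safe alphabet: '-' and '_'

print(standard)  # b'+/8='
print(url_safe)  # b'-_8='
```

The first 62 symbols are identical in both variants, so only the first two output characters change.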

Why Base64?

When you want to transfer binary data across a network, sending the raw bytes may not be a good option, because some media are designed for streaming text. Some protocols may interpret bytes in your binary data as control characters, so the data could get corrupted when the underlying protocol thinks you have entered a special character combination. Encoding binary data as base64 ensures that the data remains intact, without modification, during transport.

How does base64 encoding work?

64 characters from the ASCII set are used as the base64 digits. The standard alphabet uses A-Z, a-z, 0-9, + and /, with = as a padding character. As there are 64 characters, 6 bits are enough to represent any character of the base64 set (2^6 = 64).

The following mapping shows the base64 characters and their assigned values from 0-63.

A-Z = 0-25, a-z = 26-51, 0-9 = 52-61, + = 62, / = 63
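As a rough sketch, the standard mapping can be built in Python as a 64-character string whose index is the 6-bit value. The name `ALPHABET` is just an illustrative choice.

```python
import string

# Standard base64 alphabet: the position in the string is the 6-bit value.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

print(len(ALPHABET))  # 64

# Print the boundary values of each range in the mapping.
for value in (0, 25, 26, 51, 52, 61, 62, 63):
    print(value, ALPHABET[value])
```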

The binary data is read in groups of three 8-bit bytes (24 bits), and each group is split into four 6-bit values, each of which maps to one base64 character. In other words, base64 encoding converts every three octets into four encoded characters.
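The split of one three-byte group into four 6-bit values could be sketched with shifts and masks as below. The function name `split_group` is hypothetical, and the sample bytes 48 65 6C are the first three bytes of ‘Hello’.

```python
def split_group(b0: int, b1: int, b2: int) -> list:
    """Split three 8-bit bytes into four 6-bit values."""
    # Join the three bytes into one 24-bit integer.
    block = (b0 << 16) | (b1 << 8) | b2
    # Take 6 bits at a time, most significant first (0x3F = 0b111111).
    return [(block >> shift) & 0x3F for shift in (18, 12, 6, 0)]

print(split_group(0x48, 0x65, 0x6C))  # [18, 6, 21, 44]
```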

For example, consider the text ‘Hello’. In UTF-8 this is 5 bytes (shown in hexadecimal):-
48 65 6C 6C 6F

binary representation is:-
01001000 01100101 01101100 01101100 01101111

grouping three bytes will give:-
(010010000110010101101100) (0110110001101111)

for each group get four 6-bit base64 characters:-
(010010 000110 010101 101100) (011011 000110 111100)  (in the second group, two zero bits are appended to the last 4 bits to make them 6 bits)

values for base64 lookup:-
18 6 21 44 27 6 60

replacing by characters will produce the following text:-
SGVs bG8

Append ‘=’ to the last group so that it contains four characters, the full set for a three-byte group. This is called padding.

So the base64 representation of the word ‘Hello’ is:-
SGVsbG8=
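The walkthrough above can be reproduced with a minimal Python sketch. It works on a string of bits to mirror the manual steps rather than to be efficient, and the names `ALPHABET` and `encode_base64` are illustrative. The result is compared against the standard library's `base64.b64encode`.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_base64(data: bytes) -> str:
    """Encode bytes with the standard alphabet, following the manual steps."""
    # 1. Write every byte as 8 bits and join them into one bit string.
    bits = "".join(f"{byte:08b}" for byte in data)
    # 2. Append zero bits so the length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    # 3. Look up the character for each 6-bit value.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # 4. Append '=' so the output length is a multiple of 4.
    chars.append("=" * (-len(chars) % 4))
    return "".join(chars)

print(encode_base64(b"Hello"))  # SGVsbG8=
print(encode_base64(b"Hello") == base64.b64encode(b"Hello").decode("ascii"))  # True
```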

How Padding works:-
For the last group, when the number of bytes to encode is not divisible by three (that is, if there are only one or two bytes of input for the last 24-bit block), then the following action is performed:

If there is only one significant input byte, only the first two base64 digits are used (12 bits): the 8 input bits are followed by four zero bits, so the four least significant bits of the final 6-bit block are set to zero. Two ‘=’ characters are added as padding for the remaining two characters.
For example, encoding the text ‘H’ goes as follows:-

binary representation:-
01001000

get 6-bit base64 characters:-
010010 000000     (Add 4 zeros to the second 2-bit set to make it 6-bit.)

characters from base64 lookup (values 18 and 0):-
S A

So the final base64 representation is:-
SA==      (Add 2 ‘=’ to complete the 4 character set per three 8-bit bytes group.)

Similarly, if there are two significant input bytes, the first three base64 digits are used (18 bits): the 16 input bits are followed by two zero bits, so the two least significant bits of the final 6-bit block are set to zero. One ‘=’ character is added so that the last block contains four base64 characters. For example, ‘He’ encodes to SGU=.
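The three padding cases can be checked quickly with Python's standard `base64` module. The inputs ‘H’, ‘He’ and ‘Hel’ are prefixes of the earlier example, and `results` is just an illustrative name.

```python
import base64

# One, two and three input bytes: two, one and zero '=' characters of padding.
results = {
    text: base64.b64encode(text).decode("ascii")
    for text in (b"H", b"He", b"Hel")
}

for text, encoded in results.items():
    print(text, "->", encoded, "padding:", encoded.count("="))
```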

For base64 encoding, the ratio of output bytes to input bytes is 4:3 (about 33% overhead). Specifically, given an input of n bytes, the output will be 4⌈n/3⌉ bytes long, including padding characters; n/3 is rounded up because a partial last group still produces four characters.
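As a small check of the length rule, with n/3 rounded up to the next whole group, the sketch below compares a computed length against the standard library for a few input sizes. The helper name `encoded_length` is hypothetical.

```python
import base64

def encoded_length(n: int) -> int:
    """Length of padded base64 output for n input bytes: 4 * ceil(n / 3)."""
    return 4 * ((n + 2) // 3)

# Compare with the real encoder for inputs of 0 to 9 zero bytes.
for n in range(10):
    assert encoded_length(n) == len(base64.b64encode(bytes(n)))

print(encoded_length(5))  # 8, matching the length of 'SGVsbG8='
```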

 

References

https://en.wikipedia.org/wiki/Base64
