Computer files can be divided into two broad categories:
binary and text. The distinction is not absolute because, at the
lowest level, every file is a sequence of bits. For instance, to
the circuits which handle information read from or written to a
disk, there is no distinction between text data and any other
sort. The software that drives those circuits likewise makes no
such distinction. Humans, on the other hand, do care about it,
because only text files can be read directly.
Text files (plain text files) are files in which the bytes
generally correspond one-to-one to ordinary readable characters
such as letters and digits, so any simple file viewer can display
them in human-readable form. Generally, they contain
ASCII characters and some control characters such as tabs,
line feeds and carriage returns, without any embedded information
such as font information, hyperlinks or inline images. Text files
can contain more than ASCII characters, however, if they use an
East Asian encoding such as Shift JIS (SJIS) or a Unicode
encoding. For Unicode text, a UTF encoding form such as UTF-8
defines how characters are stored as bytes. Although text
files are generally human-readable, they can of course be used
for data storage by computer programs. This may be done because
text files avoid problems which may arise with binary files, such
as problems of endianness or the byte-length of integers.
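
The endianness problem mentioned above can be illustrated with a
short Python sketch. The value 1025 and the 4-byte field width are
arbitrary choices made for this example; they do not come from any
particular file format.

```python
import struct

value = 1025  # 0x00000401, an arbitrary example value

# The same integer written as a 4-byte binary field in two byte orders.
little = struct.pack("<I", value)  # little-endian: low byte first
big = struct.pack(">I", value)     # big-endian: high byte first

# Written as text, the value is simply its decimal digits.
as_text = str(value).encode("ascii")

print(little)   # b'\x01\x04\x00\x00'
print(big)      # b'\x00\x00\x04\x01'
print(as_text)  # b'1025'

# A reader that assumes the wrong byte order recovers a different number.
misread = struct.unpack(">I", little)[0]
print(misread)  # 17039360
```

The text form reads the same on any machine, whereas the binary form
is only meaningful if writer and reader agree on byte order and
field width.
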
Text files can have the MIME type "text/plain", often
with a charset parameter indicating the encoding, as in
"text/plain; charset=UTF-8". Common encodings for plain text
include Unicode UTF-8, Unicode UTF-16, ISO 8859, and ASCII.
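
As a rough illustration of how the choice of encoding changes the
bytes stored in a text file, the following Python sketch encodes one
short string in several ways. The sample string and the helper name
encoded_sizes are made up for this example.

```python
# Three kanji followed by ASCII text; escapes keep this source ASCII-only.
sample = "\u65e5\u672c\u8a9e text"

def encoded_sizes(text, encodings=("utf-8", "utf-16", "shift_jis")):
    """Return the number of bytes the text occupies in each encoding."""
    return {name: len(text.encode(name)) for name in encodings}

sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(name, size)

# Pure ASCII text yields identical bytes in ASCII and in UTF-8.
same_bytes = "text".encode("ascii") == "text".encode("utf-8")
print(same_bytes)

# Text outside the ASCII range cannot be stored as ASCII at all.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

Python's "utf-16" codec prepends a two-byte byte order mark, which
is counted in the size it reports.
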
Plain text is textual material, usually in a disk file, that
is (largely) unformatted. A web page with formatted text is not
plain text in this sense, but its HTML source is. The distinction
is usually not clear-cut.
The source code of a computer program is usually written as a
text file, but once compiled, it is turned into a binary file as
described below.
Transferring text files between Unix, Macintosh,
and Microsoft Windows or DOS computers can be problematic,
as each platform uses different characters to signify a line break:
Unix uses a line feed, the classic Mac OS a carriage return, and
DOS and Windows a carriage return followed by a line feed.
See newline for a discussion of this issue. Further cross-platform
confusion occurs because many non-Unix systems have traditionally
used an Extended ASCII character encoding, where the first
128 byte values conform to ASCII and where the upper 128
byte values are mapped to textual or punctuation characters,
such as curly quotes or characters having a diacritical mark.
These upper mappings differ from one platform to another, so the
same byte can appear as a different character on each.
Prior to the advent of Mac OS X, Macintosh users
would call a document a text file so long as all of its non-whitespace
bytes were printable in the Macintosh environment.
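
The two cross-platform problems described above can be sketched in
Python. The function name normalize_newlines is hypothetical, and
the choice of byte 0x93 and of the cp1252 (Windows) and mac_roman
(classic Mac OS) codecs is only an example of two Extended ASCII
encodings.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR LF and lone CR line breaks to LF."""
    # Replace CR LF first so each pair becomes one LF, not two.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

unix_text = b"one\ntwo\n"    # LF
dos_text = b"one\r\ntwo\r\n"  # CR LF
mac_text = b"one\rtwo\r"      # CR

for sample in (unix_text, dos_text, mac_text):
    print(normalize_newlines(sample))

# One upper-range byte decoded under two different 8-bit encodings.
upper_byte = b"\x93"
windows_char = upper_byte.decode("cp1252")
mac_char = upper_byte.decode("mac_roman")
print(ascii(windows_char), ascii(mac_char))
```

After normalization all three samples are byte-for-byte identical,
while the single upper-range byte decodes to a different character
under each encoding.
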
The related term, plaintext, is most commonly used in a cryptographic
context, while cleartext usually refers to lack of protection
from eavesdropping. These terms are often confused with one
another, especially by those new to computers, cryptography, or
data communications.
Binary files, in contrast, usually contain bytes that do not
correspond to printable characters, and may contain any byte value
at all. They are generally used to store data rather than textual
material in plain text form. Computer programs are typical
examples, as the data and CPU instructions they contain can in
principle be any binary value. As a result, compiled applications
are often simply referred to as binaries, as opposed to source
code, which is contained in plain text files. But binary files can
also be image files, sound files, compressed files, etc.; in short,
any file content whatsoever, including plain text. Usually the
specification of a binary file's format indicates how the file's
contents are to be interpreted.
Binary files are often encoded into a plain text representation
to improve survivability during transit, using encoding schemes
such as Base64.
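
A minimal Python sketch of such an encoding, using the standard
base64 module. The six input bytes are arbitrary and were chosen to
include values that are not printable characters.

```python
import base64

# Arbitrary binary data, including NUL, control and upper-range bytes.
raw = bytes([0, 255, 128, 10, 13, 7])

# Encode to a representation that uses only printable ASCII characters.
encoded = base64.b64encode(raw)
print(encoded)  # b'AP+ACg0H'

# Decoding recovers the original bytes exactly.
decoded = base64.b64decode(encoded)
print(decoded == raw)  # True
```

The encoded form is larger than the original (four characters for
every three bytes), which is the price paid for restricting the
output to plain text.
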
It is a common misconception that a binary file is something
only experts can read. In fact, binary is nothing more than a number
system, and software can interpret the same file in any of a number
of ways. Binary files are usually organized in bytes, which means
the binary digits are grouped in eights. If you open a binary file
in a text editor such as Notepad, each byte (or group of bytes,
depending on the encoding) will be translated into a character, and
you will see the file as if it were text (see above), usually as a
mostly meaningless jumble of symbols. If, however, you open it in
some other application, that application will have its own use for
each byte: it may, for example, treat each byte as a number and
output a stream of numbers between 0 and 255. If the file is an EXE
file, Windows will treat it as a program and execute the machine
instructions encoded in its bytes.
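
The idea that one sequence of bytes can be read in several ways can
be sketched in Python. The four sample bytes are arbitrary, and the
little-endian integer reading is just one of many interpretations
an application might choose.

```python
data = b"Hi!\n"  # four arbitrary bytes

# Read as text: each byte becomes one ASCII character.
as_text = data.decode("ascii")

# Read as numbers: each byte is a value between 0 and 255.
as_numbers = list(data)

# Read as raw bits: each byte is a group of eight binary digits.
as_bits = " ".join(f"{byte:08b}" for byte in data)

# Read as a single 32-bit little-endian integer.
as_integer = int.from_bytes(data, "little")

print(repr(as_text))  # 'Hi!\n'
print(as_numbers)     # [72, 105, 33, 10]
print(as_bits)        # 01001000 01101001 00100001 00001010
print(as_integer)     # 169961800
```

None of these readings is more correct than the others; which one
is meaningful depends on what the file was intended to hold.
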