Thursday, December 8, 2011

Compilers

Compiling is a term heard by almost everyone associated with programming, even if only remotely. A compiler is a program which converts a high-level language program (source code) into binary instructions (machine language) that the computer can interpret, understand and execute.
Let us take an example to understand what the above definition means.
If you ask any person associated with programming what the first program he or she wrote was, the obvious answer would be “Hello World”. So let us also start with the same.
#include <stdio.h>

int main(void)
{
    printf("Hello, world!\n");
    return 0;
}
 
This most basic program prints the words "Hello, world!" on the computer screen. But it is not that simple: behind the curtain a lot of complex things are going on. Let us peep inside them. The hard truth is that our computer cannot understand the commands contained in the source file (helloworld.c), because C is a high-level language, which means it contains characters, symbols and words (for example printf, main and the header files) that stand for complex, number-based instructions. The only instructions a computer can execute are those written in machine language, consisting entirely of numbers, that is, binary code made up of 0s and 1s. Before our computer can run our C program, a compiler must convert helloworld.c into an object file; then a program called a linker must convert the object file into an executable file.
The drawing below illustrates the process.
[Figure: Compilers – the source file is compiled into an object file, which the linker turns into an executable.]

What is a compiler?
What we have defined earlier is only half true. The notion of a compiler as a program which converts input (a high-level language) into output (assembly language or machine code) for some processor is a very limited definition. In a broader sense, a compiler takes a string and outputs another string. This definition covers all manner of software which converts one string to another, such as text formatters which convert an input language into printable output, programs which convert among file formats or between programming languages, and even web browsers.
History
In earlier times only machine-dependent programming languages were used, and hence a program which ran on one machine could not run on any other, as it was specific to that machine. When high-level, machine-independent languages were first conceived in the 1940s and 1950s, no compilers had yet been written. The first compiler was written by Grace Hopper in the early 1950s, and the FORTRAN team led by John Backus at IBM introduced the first complete compiler in 1957.
Development
Early compilers were complex in their code and also had long compile times. With time, several developments took place which led to more advanced compilers. The main reason behind this was the splitting of the compiler into parts. In broad terms, a compiler can be divided into 3 parts:
·   The front end - It understands the syntax of the source language.
·   The mid-end - Its role is to perform the high level optimizations.
·   The back end - It produces the assembly language.
Front end
The front end's job is to analyze the source code file and build an internal picture of the program, called the intermediate representation or IR. It also manages the symbol table, a data structure which maps each symbol in the code to its corresponding information (location and type are a few examples). This is done over several phases:
·         Line reconstruction - This phase converts the input character sequence into a form that is ready for the parsing phase.
·         Lexical analysis - This phase divides the code into small pieces known as tokens (for example a keyword, identifier or symbol name). This phase is also called lexing or scanning, and the corresponding software is called a lexical analyzer or scanner.
·         Preprocessing - Some languages, for example C, require a preprocessing phase which allows macro substitution and sometimes conditional compilation.
·         Syntax analysis - This phase parses the tokens to identify the syntactic structure of the code and builds a parse tree according to the rules of the grammar of the language.
.   Grammar: The grammar of a language is needed to get a meaningful outcome from the language. It does not define the meaning itself; rather, it defines the rules by which a meaning is derived. Grammars are specified using "productions." A few examples of productions are as follows:
Statement -> if (expression) statement else statement
Statement -> while (expression) statement
.   Tree: The tree is the data structure used by the compiler internally to represent the structure of the code. It is built during the parsing phase, in which the code is matched against the grammar of the language (a minimal node structure is sketched after this list).
·         Semantic analysis - This phase adds semantic information to the parse tree and builds the symbol table. It performs semantic checks and rejects incorrect programs or issues warnings.
.   Symbol Table: Every subroutine and variable has some information associated with it.
Variable names – information regarding type, storage location and scope.
Subroutine names – information regarding location, arguments and types. The information is associated with the help of a "hash map", which is a table that associates a string with the corresponding information (a minimal entry structure is sketched below).
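To make the parse tree and the symbol table a little more concrete, here is a minimal sketch in C of the kind of data structures a front end might declare. The type and field names are invented purely for illustration; a real compiler uses much richer representations.

#include <stddef.h>

/* A node of a (much simplified) parse tree: each node records what kind of
   construct it represents and points to its children. */
struct tree_node {
    const char       *kind;        /* e.g. "if-statement", "identifier"      */
    const char       *text;        /* the token text, if any                 */
    struct tree_node *children[4]; /* a fixed fan-out keeps the sketch small */
    size_t            child_count;
};

/* One entry of a symbol table: a name mapped to the information the compiler
   needs about it (type, storage location, scope, ...). A real implementation
   would typically hash the name into a bucket array. */
struct symbol {
    const char    *name;           /* e.g. "str"                         */
    const char    *type;           /* e.g. "char[]"                      */
    int            scope_level;    /* 0 = global, 1 = function body, ... */
    size_t         storage_offset; /* location within its storage area   */
    struct symbol *next;           /* next entry in the same hash bucket */
};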
 

Back end
In layman's terms, one can say that the back end is associated with the generation of the target code, that is, code that is specific to the machine.
The main phases of the back end include the following:
·        Analysis: This phase is the base for the optimization of the code. Typical analysis methods are data flow analysis, dependence analysis, pointer analysis, etc. The call graph and control-flow graph are also built during this phase.
·        Optimization: This intermediate phase transforms the representation into an equivalent but faster form. The various optimization methods include inline expansion, dead code elimination, loop transformations, register allocation, etc. (a small before/after example follows this list).
·        Code generation: This is generally the last phase in the process, as it produces the output language, that is, the machine language. It involves resource and storage decisions. Debug data, if required, is also generated during this phase.
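As a small illustration of what the optimization phase can achieve, here is a hand-written before/after sketch in C. The function names are made up, and a real compiler performs these transformations on its intermediate representation rather than on the source text, but the effect of dead code elimination combined with simple constant folding is the same.

/* Before optimization: the 'if' branch can never execute, and the constant
   expression 60 * 60 * 24 can be evaluated at compile time. */
int seconds_per_day_unoptimized(void)
{
    int debug = 0;
    if (debug) {                 /* dead code: 'debug' is provably 0 */
        return -1;
    }
    return 60 * 60 * 24;         /* constant expression */
}

/* After optimization (conceptually): the dead branch has been eliminated and
   the constant folded, leaving a single return of the precomputed value. */
int seconds_per_day_optimized(void)
{
    return 86400;
}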
If we break these three levels down further, there are a total of seven phases, as shown below.
[Figure: the seven phases of compilation]
A few of the major phases are explained below:
The Lexer (or lexical analyzer)
The lexer runs in the first phase of compiling. Its purpose is to decompose the stream of input characters into discrete units known as "tokens".
Let us take an example to understand it:
char str[] = "Compiler.";
Decomposes into:
token 1: Keyword, "char"
token 2: Identifier, "str"
token 3: Left square bracket
token 4: Right square bracket
token 5: Equals sign
token 6: String, "Compiler."    (a whole string literal is a single token)
token 7: Semicolon
From the above example we can see that a token is a string of characters categorized according to a rule: it may be an identifier, a number, a comma, and so on. The role of the lexer is to categorize each token according to its symbol type. Tokens make the further processing of the source text much easier. A minimal scanner for this example is sketched below.
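The following is a minimal, hypothetical scanner written in C which reproduces the token list above for this single example. It only recognizes identifiers, the keyword "char", square brackets, the equals sign, string literals and the semicolon; a real lexical analyzer handles many more token kinds and also records source positions for error reporting.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Print one token per line for a tiny subset of C. */
static void scan(const char *src)
{
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            int len = (int)(p - start);
            /* crude keyword check: only "char" is known in this sketch */
            if (len == 4 && strncmp(start, "char", 4) == 0)
                printf("Keyword, \"%.*s\"\n", len, start);
            else
                printf("Identifier, \"%.*s\"\n", len, start);
        } else if (*p == '"') {
            const char *start = ++p;            /* skip opening quote */
            while (*p && *p != '"') p++;
            printf("String, \"%.*s\"\n", (int)(p - start), start);
            if (*p) p++;                        /* skip closing quote */
        } else {
            switch (*p) {
            case '[': puts("Left square bracket");  break;
            case ']': puts("Right square bracket"); break;
            case '=': puts("Equals sign");          break;
            case ';': puts("Semicolon");            break;
            default:  printf("Unknown character '%c'\n", *p); break;
            }
            p++;
        }
    }
}

int main(void)
{
    scan("char str[] = \"Compiler.\";");
    return 0;
}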

The Parser
In today's world, when dependency on machines is constantly increasing and many complicated tasks are performed by them, the parsing phase is very important. It is what allows the computer to understand the structure of the code and act on it.
Parsing is the process of understanding the syntax of a language by representing the code in data structures understood by the compiler.
Generally there are two main methods of parsing: top-down parsing and bottom-up parsing.
·         Top-down parsing partitions a program top-down: programs into modules, modules into subroutines, subroutines into blocks (a small recursive-descent sketch follows this list).
·         Bottom-up parsing groups tokens together into terms, then expressions, then statements, then blocks and subroutines.
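As a sketch of the top-down approach, the following recursive-descent parser handles a toy grammar for sums of single digits (both the grammar and the function names are invented purely for illustration). Each grammar rule becomes one C function, which is the hallmark of top-down parsing.

#include <ctype.h>
#include <stdio.h>

/* Toy grammar:
     expression -> term { '+' term }
     term       -> digit                    */
static const char *input;

static int parse_term(void)
{
    if (isdigit((unsigned char)*input))
        return *input++ - '0';              /* consume one digit */
    fprintf(stderr, "syntax error: digit expected at '%s'\n", input);
    return 0;
}

static int parse_expression(void)
{
    int value = parse_term();
    while (*input == '+') {
        input++;                            /* consume '+' */
        value += parse_term();
    }
    return value;
}

int main(void)
{
    input = "1+2+3";
    printf("1+2+3 evaluates to %d\n", parse_expression());
    return 0;
}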
Error detection
Error detection is basically not a phase but a process that keeps going on in the background throughout the various phases:
·         Lexer - detects malformed tokens.
·         Parser - detects syntax errors.
·         Tree - detects annotation type mismatches. 
Types of Compilers
Compilers are classified according to the machine they run on and their input and output. Some of the types of compilers are:
·         One-pass compiler
·         Threaded code compiler
·         Incremental compiler
·         Stage compiler
·         Just-in-time compiler
·         Retargetable compiler
·         Parallelizing compiler
 Compiler Benefits
·         The main benefit of a compiler is that it allows you to write code that is not machine dependent.
·         The compiler converts a high-level language into machine code, and it also analyzes the source code to make the result efficient (by collecting, reorganizing and generating a new set of instructions so that the program runs faster on the computer).
·         Compilers help in debugging the code by displaying warning and error messages; the font coloring and indentation that also help a programmer catch errors are provided by the editor or IDE rather than the compiler itself.
