#+Title: 2 ─ Formal Descriptions
#+Subtitle: Of programming language syntax and semantics
#+Author: Mark Armstrong, PhD Candidate, McMaster University
#+Date: Fall, 2019
#+Description: Applying formal tools to the study of programming languages.
#+Options: toc:nil
#+Reveal_root: http://cdn.jsdelivr.net/reveal.js/3.0.0/
#+Reveal_init_options: width:1600, height:900, controlsLayout:'edges',
#+Reveal_init_options: margin: 0.1, minScale:0.125, maxScale:5
#+Reveal_extra_css: local.css
#+html:

* LaTeX Header :ignore:

#+LaTeX_header: \usepackage{amsthm}
#+LaTeX_header: \theoremstyle{definition}
#+LaTeX_header: \newtheorem{definition}{Definition}[section]
#+LaTeX_header: \newenvironment{NOTES}{
#+LaTeX_header:   \begin{center}
#+LaTeX_header:   \begin{tabular}
#+LaTeX_header:   {|p{0.9\textwidth}|}
#+LaTeX_header:   \hline\\
#+LaTeX_header:   \begin{scriptsize}
#+LaTeX_header: }{
#+LaTeX_header:   \end{scriptsize}
#+LaTeX_header:   \\\\\hline
#+LaTeX_header:   \end{tabular}
#+LaTeX_header:   \end{center}
#+LaTeX_header: }

* Preamble

** Notable references

- Concepts of Programming Languages, Sebesta, 10th ed.
  - Chapter 3 — Describing Syntax and Semantics
- Programming Languages and Operational Semantics: A Concise Overview, Fernández
  - Section 1.3.3 — Syntax
  - Section 1.3.4 — Semantics
- Concepts, Techniques, and Models of Computer Programming, Van Roy & Haridi
  - Section 2.1 — Defining practical programming languages

** Update history

- Sept. 16 ::
  - Original version posted

** Table of contents

# The table of contents is added using org-reveal-manual-toc,
# and so must be updated upon changes or added last.
#+HTML:
#+begin_scriptsize
- [[Preamble][Preamble]]
  - [[Notable references][Notable references]]
  - [[Update history][Update history]]
  - [[Table of contents][Table of contents]]
- [[Expressions and statements; side effects and impurity][Expressions and statements; side effects and impurity]]
  - [[Our definition of expression and statement][Our definition of expression and statement]]
  - [[Side effects][Side effects]]
  - [[“Code written in Haskell is guaranteed to have no side effects”][“Code written in Haskell is guaranteed to have no side effects”]]
  - [[Purity][Purity]]
  - [[Pros and cons of side effects and impurity][Pros and cons of side effects and impurity]]
  - [[Don't take “side effect free” too literally][Don't take “side effect free” too literally]]
- [[Formal tools in the study of programming languages][Formal tools in the study of programming languages]]
  - [[Blurring the lines between syntax and semantics: static semantics][Blurring the lines between syntax and semantics: static semantics]]
- [[Describing syntax][Describing syntax]]
  - [[Tokens and lexemes][Tokens and lexemes]]
  - [[Tokenising][Tokenising]]
  - [[Extended Backus-Naur form][Extended Backus-Naur form]]
  - [[Ambiguity][Ambiguity]]
  - [[Enforcing precedence and associativity with grammars][Enforcing precedence and associativity with grammars]]
  - [[Is addition associative?][Is addition associative?]]
  - [[Abstract syntax][Abstract syntax]]
  - [[Beyond context-free grammars: “static semantics”][Beyond context-free grammars: “static semantics”]]
  - [[An example attribute grammar][An example attribute grammar]]
- [[Describing semantics][Describing semantics]]
  - [[The kernel language approach][The kernel language approach]]
  - [[More to come...][More to come...]]
#+end_scriptsize
#+HTML:

* Expressions and statements; side effects and impurity

Before we can discuss syntax and semantics,
we must differentiate the notions of
- an expression and
- a statement.

Recall that, in mathematics,
- the term “expression” refers to a (syntactically correct, finite)
  combination of symbols.
  - Another term used for this is “term”.

Whereas
- the term “statement” refers to a specific kind of expression
  which denotes a truth value.
  - Another term used for this is “formula”.

** Our definition of expression and statement

In the discussion of programming languages,
- the term “expression” refers to a (syntactically correct, finite)
  combination of lexemes (symbols), and
- the term “statement” refers to a specific kind of expression
  which has a /side effect/.
  - I.e., we understand the term “statement” as “imperative statement”.
  - Another term used for this is “command”.

Usually we specifically use the term
- /expression/ to mean an expression for which we are interested in its value,
and specifically say
- /statement/ when we are more interested in the expression's effect.

Sometimes we are interested in both the value and the effect.

** Side effects

A /side effect/ of a program is any change to non-local state.
- Changes to the program's memory space are not side effects (of the program).
- We can also talk about side effects of /parts/ of programs.
  - In this case changes to the program's memory space may be side effects,
    if the change is to variables not local to the part under consideration.

Examples of side effects:
- Modifying files.
- Writing to standard output or writing to the display.
- Sending messages over a network.

** “Code written in Haskell is guaranteed to have no side effects”

#+begin_quote
[[./images/xkcd--haskell.png]]

The problem with Haskell is that it's a language built on lazy evaluation
and nobody's actually called for it.

— [[https://xkcd.com/1312/][xkcd 1312 - Haskell]]
#+end_quote

** Purity

While we are discussing side effects, let us further discuss /pure code/.

A unit of code is called /pure/ if, in addition to having no side effects,
the result of its evaluation is /always the same/.

We are usually interested in /pure functions/ in particular.
- A pure function's output depends only on its input.
  - In mathematics, all functions are pure.
- If a function's output depends upon other factors,
  we could make those factors input to increase the function's purity.
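#+REVEAL: split:t

To make the distinction concrete, here is a minimal Haskell sketch;
~square~ and ~greet~ are names invented for this example,
while ~getLine~, ~putStrLn~ and ~print~ are standard Prelude actions.

#+begin_src haskell
-- A pure function: its result depends only on its argument,
-- so every call with the same input yields the same output.
square :: Integer -> Integer
square x = x * x

-- An impure action: reading a line is a side effect (input from the
-- outside world), and the result can differ on every run.
-- Haskell marks this in the type with IO.
greet :: IO ()
greet = do
  name <- getLine
  putStrLn ("Hello, " ++ name ++ "!")

main :: IO ()
main = do
  print (square 5)   -- always prints 25
  greet              -- behaviour depends on what the user types
#+end_src

Note how the ~IO~ in the type marks the side-effectful part;
this is one way a language can support the split into a pure back-end
and an impure front-end discussed next.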
** Pros and cons of side effects and impurity

Side effects and impurity may be considered a “necessary evil”.
- (Interesting) programs /must/ have some impurity and side effects.
  - Otherwise, they compute for no reason; we can never “see” any results.
  - We cannot work entirely in a vacuum!
- Impurity and side effects make reasoning about code /much/ more difficult, though!

A solution to this problem:
- Partition your code into
  - a /pure/ back-end and
  - an /impure/, /side-effectful/ front-end.

** Don't take “side effect free” too literally

How “side effect free” can we be, exactly?
- Executing /any/ code has an effect on the machine executing it.
  - The program counter and the contents of main memory and the caches
    are changed, energy is expended by the processor, etc.
- But these inevitabilities are rarely our concern.
  - We usually reason about an abstract machine,
    which hides such physical considerations from us.
  - How abstract the machine is varies throughout this course,
    based on the concept we're discussing.

* Formal tools in the study of programming languages

We've said before that, in order to facilitate running code on machines,
programming languages must be /formal/.

There are two portions to this requirement.
- Describing the /syntax/; the form of expressions/statements.
- Describing the /semantics/; the meaning of expressions/statements.

Formal descriptions of semantics in particular are not always given!
- This decreases the reliability of the language.
  - It can never be clear if implementations (compilers, interpreters)
    correctly implement the language.
- Realistically, a “good enough” description may suffice.

** Blurring the lines between syntax and semantics: static semantics

Unfortunately, we cannot cleanly divide elements of programming languages
into the categories of syntax and semantics.
- The categorisation may depend upon the features of the language.
  - /Typing/ and /scope/ may be considered syntactic or semantic,
    depending upon whether they are /static/ or /dynamic/.
  - Even if they are static, and could be considered syntax,
    such features are often called /static semantics/.

We will see an additional reason to segregate such features:
- static semantic rules cannot be expressed by the formal tools we use for syntax.
  - It's either impossible or prohibitively expensive.

* Describing syntax

The standard tools for describing programming language syntax are
/regular expressions/ and /context-free grammars/.

Regular expressions may be used to describe /tokens/,
(categories of) the smallest syntactic units of a language.

Context-free grammars are used to describe (most of) the form of programs.
- Static semantics cannot be described by CFGs.

Specifically, grammars in /extended Backus-Naur form/ (EBNF) are usually used.
- A particular notation for grammars, extended with convenient features
  which do not increase expressivity (syntactic sugar).

As a brief example of EBNF (which we'll dissect [[Extended Backus-Naur form][soon]]),
#+begin_src text
⟨digit⟩ ∷= 1 ∣ 2 ∣ 3 ∣ 4 ∣ 5 ∣ 6 ∣ 7 ∣ 8 ∣ 9 ∣ 0

⟨nat⟩ ∷= ⟨digit⟩ { ⟨digit⟩ }

⟨int⟩ ∷= [ - ] ⟨nat⟩

⟨exp⟩ ∷= ⟨int⟩ ∣ ⟨exp⟩ (+ ∣ *) ⟨exp⟩
#+end_src

** Tokens and lexemes

The smallest syntactic units of a programming language are called /lexemes/.
- Think of them as the /words/ of the language.
- E.g., ~while~, ~if~, ~int~, ~+~, ~some-variable-name~, ~a-function-name~, etc.

/Categories/ of lexemes are called /tokens/.
- Comparing with natural languages,
  think of “prepositions”, “pronouns”, “conjunctions”, etc.
- In programming, we have, e.g., /identifier/, /literal/, /operator/, /type name/.
- Some categories have only a single member, without any additional information,
  e.g. /while/, /if/.

The first step in parsing a program is to convert it
from plaintext to a list of tokens.
- /Tokenising/.
- At this stage, details are abstracted away; e.g., every identifier becomes
  just an identifier token (with its name attached in some way for later steps).
- Tokenising discards unnecessary information (whitespace, comments, etc.)

** Tokenising

#+begin_src text
x = 0;
r = 1;
while (x < n) {
  r = r * x;
  x++;
}
#+end_src

#+begin_center
#+attr_html: :style text-align:center
⇓
#+end_center

#+begin_src text
id(x) eq lit(0) end_stmt
id(r) eq lit(1) end_stmt
while openbr id(x) op(<) id(n) closebr open_block
id(r) eq id(r) op(*) id(x) end_stmt
id(x) op(++) end_stmt
close_block
#+end_src

Disclaimer: this example is purely made up; it's not intended to be
a completely accurate depiction of tokenising any particular language.
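#+REVEAL: split:t

As a rough illustration only (not how a production lexer is actually specified),
the Haskell sketch below tokenises the snippet above into a ~Token~ type
whose constructors mirror the made-up token names on this slide.

#+begin_src haskell
-- A minimal sketch of a tokeniser for the made-up token language above.
-- A real lexer would typically be generated or built from regular expressions.
import Data.Char (isAlpha, isDigit)

data Token
  = Id String | Lit Integer | Op String
  | Eq | EndStmt | While | OpenBr | CloseBr | OpenBlock | CloseBlock
  deriving Show

tokenise :: String -> [Token]
tokenise [] = []
tokenise (c:cs)
  | c `elem` " \n\t" = tokenise cs                     -- discard whitespace
  | c == ';'  = EndStmt    : tokenise cs
  | c == '('  = OpenBr     : tokenise cs
  | c == ')'  = CloseBr    : tokenise cs
  | c == '{'  = OpenBlock  : tokenise cs
  | c == '}'  = CloseBlock : tokenise cs
  | c == '='  = Eq         : tokenise cs               -- the slide's “eq” token
  | c `elem` "+*<" =                                   -- operators, including ++
      let (ops, rest) = span (`elem` "+*<") (c:cs)
      in Op ops : tokenise rest
  | isDigit c =                                        -- integer literals
      let (ds, rest) = span isDigit (c:cs)
      in Lit (read ds) : tokenise rest
  | isAlpha c =                                        -- identifiers and keywords
      let (w, rest) = span isAlpha (c:cs)
      in (if w == "while" then While else Id w) : tokenise rest
  | otherwise = error ("unexpected character: " ++ [c])

main :: IO ()
main = mapM_ print (tokenise "x = 0; r = 1; while (x < n) { r = r * x; x++; }")
#+end_src

Running ~main~ prints the token stream one token per line,
with identifier names and literal values carried inside the tokens for later steps.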
** Extended Backus-Naur form

Consider again the example grammar using extended Backus-Naur form.
#+begin_src text
⟨digit⟩ ∷= 1 ∣ 2 ∣ 3 ∣ 4 ∣ 5 ∣ 6 ∣ 7 ∣ 8 ∣ 9 ∣ 0

⟨int⟩ ∷= ⟨digit⟩ { ⟨digit⟩ }

⟨exp⟩ ∷= ⟨int⟩ ∣ ⟨exp⟩ (+ ∣ *) ⟨exp⟩
#+end_src

- Nonterminals are written in angle brackets ~⟨...⟩~.
- The symbol ~∷=~ is used to begin a list of productions,
  rather than ~→~ or ~⟶~.
- Braces ~{ ... }~ indicate their contents may be repeated 0 or more times.
  - We may write ~{ ... }⁺~ to indicate /1 or more/ repetitions.
- Brackets ~[ ... ]~ indicate their contents are optional.
- The “mid” symbol ~∣~ may be used inside parentheses, ~(... ∣ ...)~,
  to indicate choice in /part/ of a production.

Notations differ!
- There is an [[https://www.cl.cam.ac.uk/~mgk25/iso-14977.pdf][ISO standard]].
- We will not write a great number of grammars,
  so we will only use the notations introduced here.

** Ambiguity

Recall that parsing a string (or deriving a string) using a grammar
gives rise to a /parse tree/ or /derivation tree/.

It is desirable to have a single parse tree for every program.
- We should not admit two syntactic interpretations for a program!

Three tools for removing ambiguity are
- requiring parentheses,
- introducing precedence rules, and
- introducing associativity rules.

** Enforcing precedence and associativity with grammars

To enforce precedence using a grammar:
- Create a hierarchy of non-terminals.
- Higher-precedence operators are produced lower in the hierarchy.
- For instance,
  - an additive term is an addition of multiplicative terms,
    a multiplicative term is a multiplication of literals,
    and a literal can be the negation of a constant, variable,
    or parenthesised term.

To enforce associativity using a grammar:
- Left associative operators should be produced by left recursive non-terminals.
  - And right associative operators by right recursive non-terminals.
- Operators of the same precedence must associate the same way!

** Is addition associative?

Recall that addition is an associative operator.
- Meaning it is both left and right associative.

So the choice of whether addition in a language associates
to the right or to the left may seem arbitrary.
- But numerical types in programming are not necessarily the same
  as numerical types in math!
- Addition of floating point numbers /is not associative/.
  - Consider a binary representation with two-digit coefficients.
  - 1.0₂ × 2⁰ + 1.0₂ × 2⁰ + 1.0₂ × 2² has a different value
    depending upon parenthesisation.
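The same phenomenon is visible with ordinary double-precision numbers;
the small Haskell check below shows the two groupings of ~0.1 + 0.2 + 0.3~
producing different results.

#+begin_src haskell
-- Floating point addition is not associative:
-- the two groupings round differently, so the results are not equal.
main :: IO ()
main = do
  let left  = (0.1 + 0.2) + 0.3 :: Double
      right = 0.1 + (0.2 + 0.3) :: Double
  print left              -- 0.6000000000000001
  print right             -- 0.6
  print (left == right)   -- False
#+end_src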
** Abstract syntax

“Simple”, ambiguous grammars do have a place
in describing programming language syntax.
- Such grammars describe the /abstract syntax/ of the language.
  - As opposed to /concrete syntax/.
- Consider programs as /trees/ generated by the grammar
  for the abstract syntax of the language.
  - Trees do not admit ambiguity!
- Such trees more efficiently represent programs.
  - The shape of the tree expresses structure.
  - Other unnecessary details may be left out.

** Beyond context-free grammars: “static semantics”

For most interesting languages, context-free grammars are not quite sufficient
to describe well-formed programs.
- They cannot express conditions such as
  “variables must be declared before use”, and typing rules.
  - It has been /proven/ that CFGs are not sufficient.
  - At least some typing rules are possible to express,
    but doing so is prohibitively difficult.

Recall the Chomsky hierarchy of languages.
#+begin_src text
Regular ⊂ Context-free ⊂ Context-sensitive ⊂ Recursive ⊂ Recursively enumerable
#+end_src
- The properties we need could be described by /context-sensitive/ grammars.
  - But they are unwieldy!
- Instead, use /attribute grammars/; a relatively small augmentation to CFGs.
  - Each non-terminal and terminal may have
    a collection of /attributes/ (named values).
  - Each production may have a collection of rules
    defining the values of the attributes
    and a collection of predicates reasoning about those attributes.

** An example attribute grammar

Consider this simple grammar.
#+begin_src text
⟨S⟩ ∷= ⟨A⟩ ⟨B⟩ ⟨C⟩

⟨A⟩ ∷= ε ∣ a ⟨A⟩

⟨B⟩ ∷= ε ∣ b ⟨B⟩

⟨C⟩ ∷= ε ∣ c ⟨C⟩
#+end_src

Suppose we want to allow only strings of the form ~aⁿbⁿcⁿ~.
There is no CFG that can produce exactly such strings.
But we can enforce this condition using the above grammar
augmented with attributes.
- Each of the non-terminals ~⟨A⟩~, ~⟨B⟩~ and ~⟨C⟩~ is given an attribute ~length~.
- To each production with ~⟨A⟩~, ~⟨B⟩~ or ~⟨C⟩~ on the left side,
  we attach a rule to compute the ~length~.
- The production ~⟨S⟩ ∷= ⟨A⟩ ⟨B⟩ ⟨C⟩~ enforces the condition with a predicate.

#+REVEAL: split:t

#+begin_src text
⟨S⟩ ∷= ⟨A⟩ ⟨B⟩ ⟨C⟩
  Predicate: ⟨A⟩.length = ⟨B⟩.length = ⟨C⟩.length

⟨A⟩ ∷= ε
  Rule: ⟨A⟩.length ≔ 0
⟨A⟩₁ ∷= a ⟨A⟩₂
  Rule: ⟨A⟩₁.length ≔ ⟨A⟩₂.length + 1

⟨B⟩ ∷= ε
  Rule: ⟨B⟩.length ≔ 0
⟨B⟩₁ ∷= b ⟨B⟩₂
  Rule: ⟨B⟩₁.length ≔ ⟨B⟩₂.length + 1

⟨C⟩ ∷= ε
  Rule: ⟨C⟩.length ≔ 0
⟨C⟩₁ ∷= c ⟨C⟩₂
  Rule: ⟨C⟩₁.length ≔ ⟨C⟩₂.length + 1
#+end_src

In productions with multiple occurrences of the same non-terminal,
we number the occurrences so we can easily refer to them in the rules/predicates.
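#+REVEAL: split:t

To see the attribute grammar at work, here is a small Haskell sketch
(the types and function names are invented for this slide)
which represents parse trees of the context-free grammar,
computes the ~length~ attribute using the rules above,
and checks the predicate from the ~⟨S⟩~ production.

#+begin_src haskell
-- Parse trees for the context-free grammar above: each of A, B, C is
-- either empty (ε) or one more terminal followed by a smaller tree.
data A = AEmpty | ACons A   -- ⟨A⟩ ∷= ε ∣ a ⟨A⟩
data B = BEmpty | BCons B   -- ⟨B⟩ ∷= ε ∣ b ⟨B⟩
data C = CEmpty | CCons C   -- ⟨C⟩ ∷= ε ∣ c ⟨C⟩
data S = S A B C            -- ⟨S⟩ ∷= ⟨A⟩ ⟨B⟩ ⟨C⟩

-- The length attribute, computed by the rules attached to each production.
lengthA :: A -> Int
lengthA AEmpty    = 0                -- ⟨A⟩.length ≔ 0
lengthA (ACons a) = lengthA a + 1    -- ⟨A⟩₁.length ≔ ⟨A⟩₂.length + 1

lengthB :: B -> Int
lengthB BEmpty    = 0
lengthB (BCons b) = lengthB b + 1

lengthC :: C -> Int
lengthC CEmpty    = 0
lengthC (CCons c) = lengthC c + 1

-- The predicate attached to the ⟨S⟩ production: all three lengths agree.
wellFormed :: S -> Bool
wellFormed (S a b c) = lengthA a == lengthB b && lengthB b == lengthC c

main :: IO ()
main = do
  let good = S (ACons (ACons AEmpty)) (BCons (BCons BEmpty)) (CCons (CCons CEmpty))  -- “aabbcc”
      bad  = S (ACons (ACons AEmpty)) (BCons BEmpty) (CCons CEmpty)                  -- “aabc”
  print (wellFormed good)  -- True: all three lengths are 2
  print (wellFormed bad)   -- False: the CFG derives it, but the predicate rejects it
#+end_src

The ~length~ values flow bottom-up from the leaves,
exactly as the rules in the grammar above read.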
* Describing semantics

Unlike with syntax, there is not one universally used tool
for describing programming language semantics.

In this course we will primarily consider /operational semantics/.
- A formal description of the meaning of programs
  as a series of computation steps on an abstract machine.
- The machine should be more abstract, and more easily understood,
  than assembly language.
  - But still “simpler” than the language.
  - Stack machines and state diagrams are good candidates.

Additional approaches include
- Denotational semantics.
  - The meaning of a program is /denoted/ by a mathematical object.
    - Such as a partial function.
  - We have to consider /limits/ and non-termination.
- Axiomatic semantics.
  - The meaning of a program is given by a precondition/postcondition calculus.
    - Such as ~wp~, the “weakest precondition” calculus.
  - Very useful for specification.

** The kernel language approach

The “kernel language” approach to semantics can be used
for languages with many features and constructs.
- Choose a small “kernel” set of features/constructs.
- Describe the remainder of the language in terms of that kernel language.
- The kernel language may be described using the formal approaches mentioned.
- /Concepts, Techniques, and Models of Computer Programming/ takes this approach.

** More to come...

We will return to the discussion of semantics later in the course.
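#+REVEAL: split:t

As a small preview (an illustrative sketch only, not the formulation
we will use later), the ~⟨exp⟩~ language from the EBNF example can be given
an abstract syntax tree type and a big-step-style evaluator in Haskell,
where the meaning of an expression is the integer it computes.

#+begin_src haskell
-- Abstract syntax for the ⟨exp⟩ grammar from the EBNF example:
-- an expression is an integer literal, a sum, or a product.
data Exp
  = IntLit Integer
  | Add Exp Exp
  | Mul Exp Exp
  deriving Show

-- A big-step-style evaluator: the meaning of an expression is the
-- integer it computes, defined by recursion on the abstract syntax tree.
eval :: Exp -> Integer
eval (IntLit n)  = n
eval (Add e1 e2) = eval e1 + eval e2
eval (Mul e1 e2) = eval e1 * eval e2

main :: IO ()
main = do
  -- The tree for “1 + 2 * 3”, with multiplication grouped tighter:
  -- the tree itself fixes the grouping, so no ambiguity can arise.
  let e = Add (IntLit 1) (Mul (IntLit 2) (IntLit 3))
  print (eval e)   -- 7
#+end_src

A proper operational semantics would make each such step explicit
on an abstract machine; that is the treatment we return to later.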