Groovy Language Specification  
First Draft  
Chapter 3

Lexical Structure

The organization of this chapter parallels the chapter on Lexical Structure in the Java Language Specification (second edition), which begins as follows:
This chapter specifies the lexical structure of the Java programming language.
Programs are written in Unicode (§3.1, JLS), but lexical translations are provided (§3.2, JLS) so that Unicode escapes (§3.3, JLS) can be used to include any Unicode character using only ASCII characters. Line terminators are defined (§3.4, JLS) to support the different conventions of existing host systems while maintaining consistent line numbers.
The Unicode characters resulting from the lexical translations are reduced to a sequence of input elements (§3.5, JLS), which are white space (§3.6, JLS), comments (§3.7, JLS), and tokens. The tokens are the identifiers (§3.8, JLS), keywords (§3.9, JLS), literals (§3.10, JLS), separators (§3.11, JLS), and operators (§3.12, JLS) of the syntactic grammar.

3.1 Unicode

(Cf. JLS. §3.1.)

Deletion: Groovy has no character literals.

3.2 Lexical Translations

(Cf. JLS. §3.2.)

Unchanged.

3.3 Unicode Escapes

(Cf. JLS. §3.3.)

Unchanged.

3.4 Line Terminators

(Cf. JLS. §3.4.)

Unchanged.

3.5 Input Elements and Tokens

(Cf. JLS. §3.5.)

Addition: The definition of Token includes StringConstructor, as shown below.

Token:
    Identifier
    Keyword
    Literal
    StringConstructor
    Separator
    Operator

It is also noted that line terminators (as defined by §3.4) may be classified as either separators (§3.11) or white space (§3.6), according to the rules of §3.11.

3.6 White Space

(Cf. JLS. §3.6.)

Addition: Some line terminators are transformed into separators, according to the rules defined in §3.11, below.

3.7 Comments

(Cf. JLS. §3.7.)

Addition: If the first two characters of a Groovy program are the ASCII sharp sign and exclamation point ('#' followed by '!'), the whole line is treated as a comment. In other words, the program is treated exactly as if two slash characters ('//') were inserted before the sharp sign. This unusual rule makes it easier to write Groovy scripts on some systems.
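As an illustration, a script that starts with an interpreter line might look like the following sketch. The interpreter path (/usr/bin/env groovy) and the variable name are assumptions made for the example; this chapter does not specify them.

```groovy
#!/usr/bin/env groovy
// The first line is skipped exactly as if it began with '//'.
def greeting = 'Hello from a script'
println greeting
```

On systems that honor interpreter lines, such a file can be marked executable and run directly.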

Addition: Note that the newline terminating an "end of line" comment can be significant.

3.8 Identifiers

(Cf. JLS. §3.8.)

Change: Groovy identifiers differ from Java identifiers in that the ASCII dollar character '$' is not a legal identifier character.

This restriction applies in practice only to the spelling of unqualified names, since Groovy provides a way to use any Unicode string whatever as a member name. (See Chapter 6.)

(The dollar sign is sometimes used internally by Groovy to mangle non-Java identifiers which must be converted to Java names. For this reason, it would be confusing to allow unescaped dollar signs as Groovy identifier constituents.)

3.9 Keywords

(Cf. JLS. §3.9.)

Addition: The following ASCII character sequences are keywords in Groovy but not in Java:
    any     as      def     in      with

Addition: In addition to const and goto, the following keywords are reserved in Groovy, but are not currently used.
    do      strictfp

3.10 Literals

(Cf. JLS. §3.10.)

Change: A literal is the source code representation of a value of a numeric type (§3.10.1 and §3.10.2), the boolean type (§3.10.3), the string type (§3.10.5), or the null type (§3.10.7). String constructors (§3.10.5.1) are closely related to string literals and may contain non-constant parts. Groovy also builds on the Java numeric literal syntaxes to support literal constants of type BigInteger and BigDecimal.

3.10.1 Integer Literals

(Cf. JLS. §3.10.1.)

The production IntegerTypeSuffix: g G is added, allowing BigInteger constants.

TO DO: 123i allowed? Other literal syntaxes?

Because numbers are objects in Groovy, numeric literals are not allowed to begin or end with a decimal point. Java floating point literals such as .01, 1.f, and 1.e10 must be padded with zero digits, as in 0.01, 1.0f, and 1.0e10.

assert '123' == 123.toString()
assert '1.23' == 1.23.toString()

3.10.2 Floating-Point Literals

(Cf. JLS. §3.10.2.)

The production FloatTypeSuffix: g G is added, allowing BigDecimal constants.
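The following sketch shows how the g suffix might be used on both integer and floating-point literals. It assumes an implementation in which the suffixed literals denote java.math.BigInteger and java.math.BigDecimal values, as described above; the variable names are illustrative only.

```groovy
// Assumes the g/G suffix behaves as described in §3.10.1 and §3.10.2.
def big = 123g
def precise = 1.25G
assert big instanceof BigInteger
assert precise instanceof BigDecimal

// A suffixed integer literal is not limited to the range of long.
def huge = 123456789012345678901234567890g
assert huge.toString() == '123456789012345678901234567890'
```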

3.10.3 Boolean Literals

(Cf. JLS. §3.10.3.)

Unchanged.

3.10.4 Character Literals

(Cf. JLS. §3.10.4.)

Deletion: Groovy has no CharacterLiteral token. All literals with character data in them denote strings. Constant strings of unit length serve in the place of character literals, since they coerce properly to character constants.
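A minimal sketch of how a unit-length string can stand in for a character literal is given below. It assumes the string-to-character coercion behaves as in current Groovy implementations; the variable names are illustrative.

```groovy
// A unit-length string supplies the value where a char is expected.
char c = 'a'
assert c instanceof Character
def code = (int) c
assert code == 97

// An explicit coercion works the same way.
def d = 'b' as char
assert d instanceof Character
```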

TODO: include reference to the section of Chapter 5 "Conversions" that specifies that unit length strings coerce to characters

3.10.5 String Literals

(Cf. JLS. §3.10.5.)

Change: The following text completely replaces JLS. §3.10.5, and adds §3.10.5.1.

Groovy string literals have a syntax inspired by other scripting languages. A string literal may be delimited by single or double quote marks. When following certain operators and punctuation characters, a string literal may also be delimited by forward-slash characters, in which case it is also called a regular expression literal.

Double-quoted literals and regular expression literals may incorporate substring substitution expressions, which are introduced by unescaped dollar signs, and these are specified by §3.10.5.1, below.

Independently, single or double quote marks may be tripled, allowing the string to span multiple lines. Whenever string delimiters are used singly, the string may not contain an unescaped line terminator.

However a line terminator found inside a string literal or constructor is spelled, it is always taken to represent a newline character, independent of the local newline conventions.
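The treatment of line terminators in tripled quotes can be sketched as follows. The example assumes the normalization to a single newline character described above; the variable name is illustrative.

```groovy
// A tripled quote lets the literal span lines; each line break becomes '\n'.
def text = '''line one
line two'''
assert text == 'line one\nline two'
assert text.readLines().size() == 2
```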

To avoid conflict with comment syntaxes, a regular expression literal cannot be empty and cannot begin with a star character. There is also a potential conflict with division operators (those beginning with a forward slash). Division operators are only recognized after certain tokens: Identifiers, keywords, numeric and string constants, right brackets, and unary increment operators (++, --). Any amount of whitespace and line terminators may also occur before a division operator. In other contexts, a slash introduces a regular expression literal.
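The two readings of a forward slash might be contrasted as in the sketch below. It assumes the context rule just described, and the variable names are illustrative.

```groovy
def total = 10
def count = 4

// After an identifier, a slash is a division operator.
def mean = total / count
assert mean == 2.5

// After '=', a slash opens a regular expression literal.
def pattern = /a+b/
assert 'aaab'.matches(pattern)
```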

(Note: The grammar of single-quoted and triple-quoted strings differs only in their processing of line terminators. We express this by means of a grammatical parameter LT, which is true if line terminators are allowed.)

StringLiteral:
  '   (CStringCharacter[LT=false])*   '
  "   (DStringCharacter[LT=false])*   "
  ''' (CStringCharacter[LT=true] (')? (')? )*  '''
  """ (DStringCharacter[LT=true] (")? (")? )* """
  /   (RStringCharacter (*)*)+   /

CStringCharacter[LT]:
  InputCharacter but not ' or \
  EscapeSequence[LT]
  LineTerminator when(LT)

DStringCharacter[LT]:
  InputCharacter but not " or \ or $
  EscapeSequence[LT]
  LineTerminator when(LT)

EscapeSequence[LT]:
  (Use the Java EscapeSequence definition, plus the following productions)
  \ $
  \ LineTerminator

RStringCharacter:
  InputCharacter but not a closing quote or \ or $
  RStringEscape

RStringEscape:
  $ (unless part of a match for GStringValuePart)
  \ LineTerminator
  \ InputCharacter (but not r or n)

Note that a triple-quoted string may contain isolated single or double occurrences of its close-quote. For example:

assert """x""" == 'x'
assert """""" == ''
assert '''''"''' == "''" + '"'

3.10.5.1 String Constructors

If a double-quoted string literal or regular expression literal contains an unescaped dollar character ('$'), then it is not a string literal, but rather a string constructor, which is a sequence of tokens that comprise an expression for a string containing constant and non-constant parts. Each unescaped dollar character must be followed by a value part, which is a Groovy expression in the form of a name or a block. There may also be a star * between the dollar and the expression, which serves as a spread operator that modifies the manner in which the value is inserted. (The semantics of spreading are defined by the GString class and its users.)

The value part is parsed and evaluated as an ordinary expression. If it is in the form of a block without explicit closure arguments, it is taken to be a closure of no parameters, and is immediately called. In the simplest case, an expression surrounded by braces stands for the value of the expression itself. Apart from such dollars and value parts, every other character between the opening and closing string quotes is treated as in the case of normal string literals.

Thus, a string constructor expression consists of alternating literal parts and values. During the tokenization phase of translation, the end of a literal part is determined by the occurrence of an unescaped dollar character, or (at the end) by the appropriate closing quote. The end of a value part is determined either by eagerly parsing a series of dot-separated identifiers, or by parsing a block, with its balanced curly braces.

In a double-quoted string constructor, an unescaped dollar sign must be followed either by a dot-separated series of keywords or identifiers, or by a block. In a regular expression literal, if a dollar sign is not followed by an identifier character or a left curly brace, perhaps after an intervening star, the dollar is deemed to be escaped. (Note that in standard regular expression syntaxes, the dollar sign is never meaningfully followed by a star, a brace, or a letter, so these notations are available for use as string constructor value parts.)

String x = "X"
def xx = [length:1]
assert "$x" == x
assert "${x}" == x
assert "${x}o$x" == "XoX"
assert "\$x" == '$'+'x'
assert "$xx.length" == "1"
assert "$xx.length." == "1."
assert "$xx.length()" == "1()"
assert "$xx.length+2" == "1+2"
assert "$xx . length" == "[length:1] . length"
assert "zXY;" == """z${
    String y = "Y"
    ("$x$y") };""" as String
assert 'euouae'.matches(/^[aeiou]*$/)
assert 'football'.replaceAll(/foo/, "Bar").startsWith('Bart')

The process of parsing an embedded block may be viewed as an approximate parse which attends only to the balancing of curly brace tokens. (Some implementations may be able to perform string constructor tokenization and parsing in one coroutined pass, but the specification does not require this.) The tokenization of a string constructor must follow the format of a GStringLexicalForm. This is not a production of the Groovy expression grammar, but rather a syntax which must govern the separation of literal parts from expressions.

(Note: The term GString refers to a class of string-like objects created by a string constructor expression.)

GStringLexicalForm[LT,RE]:
  GStringStart[LT,RE] GStringValuePart[LT,RE]
        (GStringMiddle[LT,RE] GStringValuePart[LT,RE])*
              GStringEnd[LT,RE]

GStringStart[LT,RE]:
  " (DStringCharacter[LT])* when(!LT & !RE)
  """ (DStringCharacter[LT])* when(LT & !RE)
  / (DStringCharacter[LT])* when(RE & !LT)

GStringMiddle[LT,RE]:
  (DStringCharacter[LT])*

GStringEnd[LT,RE]:
  (DStringCharacter[LT])* " when(!LT & !RE)
  (DStringCharacter[LT])* """ when(LT & !RE)
  (DStringCharacter[LT])* / when(RE & !LT)

GStringValuePart[LT,RE]:
  $ (*)? RawIdentifier ( . RawIdentifier )*
  $ (*)? { (GStringToken[LT])* }

GStringLiteralPart[LT,RE]:
  GStringStart[LT,RE]
  GStringMiddle[LT,RE]
  GStringEnd[LT,RE]

GStringToken[LT,RE]:
  InputElement but not { } LineTerminator
  { GStringToken[LT,RE]* }
  GStringLexicalForm[LT=false,RE=false]
  GStringLexicalForm[LT=false,RE=true]
  GStringLexicalForm[LT=true,RE=false] when(LT)
  LineTerminator when(LT)

In this way, string constructors are recognized lexically as a complex of GStringLiteralParts and other tokens, according to the grammar of GStringLexicalForm, which does not recognize statement or expression structure, except to balance brackets.

In the resulting mix of tokens, the various literal parts are left as-is, for later parsing as a StringConstructorExpression by the grammar. Each RawIdentifier is reinterpreted as a normal Identifier, or as a keyword, if its spelling is recognized as such. At that point, the dollar signs are insignificant, but are left in the grammar for clarity.

Token:
  GStringLiteralPart[LT=false,RE=false]
  GStringLiteralPart[LT=true,RE=false]
  GStringLiteralPart[LT=false,RE=true]

StringConstructorExpression[LT,RE]:
  GStringStart[LT,RE] $ Expression ( GStringMiddle[LT,RE] $ Expression )* GStringEnd[LT,RE]

Identifiers and dots after a dollar sign are parsed into GStringTokens eagerly, even though their characters could also be validly parsed as string characters inside a following GStringValuePart. Note that a dot is taken to be part of an embedded expression only if it is followed by a letter or underscore. Also, a whitespace character always interrupts the eager parsing of a name.

When eagerly parsing an identifier after a dollar, keyword recognition is not performed. The following program fragment may not be accepted as a valid Groovy program, because the token 'int' may not follow a dot '.':

String x = "", y = "$x.int"

A string constructor expression which is introduced with a single double-quote character or forward-slash character must occur all on one line. It is an error for the value parts of such an expression to contain line terminators of any sort. The following program fragment may not be accepted as a valid Groovy program, unless the double quotes are tripled:

println "Hello, ${
        'world'}."

Reference: http://archive.groovy.codehaus.org/jsr/threads/iakbeiefedohmiddhked

3.10.6 Escape Sequences for Character and String Literals

(Cf. JLS. §3.10.6.)

Within the various kinds of strings, most characters stand for themselves, but a few have special functions. The backslash character serves as an escape, which turns off the special function of the following character. In quoted string literals, an escaping backslash itself is removed, while in regular expression literals both the backslash and the escaped character are retained. In quoted string literals, a backslash may only be followed by a limited set of characters: the ones which need escaping, or ones which when escaped serve as symbols for certain non-printing characters, such as newline. A backslash may be followed by one of the letters nrtbf, either of the quotes '", another backslash, a dollar sign, a line terminator, or an octal digit. Any other character following a backslash is reserved for future use.

In regular expression literals, an escaping backslash can be followed by any character, and both characters are preserved as significant.

If a line terminator is escaped by a backslash, both the backslash and the line terminator are disregarded, except within a regular expression literal, where only the backslash is disregarded.
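The escape rules above might be exercised as in the following sketch. It assumes the behavior of current Groovy implementations for quoted and regular expression literals; the variable names are illustrative.

```groovy
// In quoted literals the escaping backslash is removed.
assert 'a\tb'.length() == 3
assert "\$x" == '$' + 'x'
assert 'it\'s' == "it's"

// In a regular expression literal the backslash is retained.
def digits = /\d+/
assert digits.length() == 3
assert '2024'.matches(digits)

// An escaped line terminator in a quoted literal is disregarded.
def joined = '''ab\
cd'''
assert joined == 'abcd'
```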

3.10.7 The Null Literal

(Cf. JLS. §3.10.7.)

Unchanged.

3.11 Separators

(Cf. JLS. §3.11.)

Addition: Groovy separators include all the Java separators, plus significant newline (described below) and ->.

3.11.1 Significant Newlines

Addition: This entire section is added to the specification.

As tokenization proceeds from left to right, some line terminators are reclassified as significant newlines, which then serve as alternatives to semicolons in the grammar of statements.

As in Java, a LineTerminator token can occur inside a TraditionalComment, and is always insignificant.

A line terminator inside a string literal is always lexically significant. A line terminator which occurs within an expression embedded in a string constructor may also be significant.

Otherwise, a LineTerminator is deemed to be insignificant if it is enclosed in parentheses or square bracket tokens, but not (in a certain precise sense) more closely in curly braces. Specifically, the left context of a LineTerminator, viewed in isolation, is converted to tokens. All tokens other than unmatched separators are removed, leaving (if the program is well-formed) a sequence of left parentheses, left square brackets, and left curly braces. The LineTerminator is regarded as significant if and only if the resulting sequence of separators is either empty, or ends in a curly brace.
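The bracket rule above can be sketched as a small function. This is only an illustration under simplifying assumptions: the helper isSignificantNewline is hypothetical, it receives just the bracket tokens of the left context (all other tokens already discarded), and it assumes a well-formed program.

```groovy
// Hypothetical helper, not part of any Groovy API: decides whether a line
// terminator is significant, given the bracket tokens to its left.
boolean isSignificantNewline(List<String> leftBrackets) {
    def closerToOpener = [')': '(', ']': '[', '}': '{']
    def unmatched = new ArrayDeque<String>()
    for (tok in leftBrackets) {
        if (closerToOpener.containsKey(tok)) {
            // A closer cancels the nearest unmatched opener of the same kind.
            if (unmatched.peek() == closerToOpener[tok]) unmatched.pop()
        } else {
            unmatched.push(tok)
        }
    }
    // Significant if nothing is open, or the innermost open bracket is '{'.
    return unmatched.isEmpty() || unmatched.peek() == '{'
}

assert isSignificantNewline([])
assert !isSignificantNewline(['('])
assert isSignificantNewline(['(', '{'])
assert !isSignificantNewline(['{', '['])
assert isSignificantNewline(['(', ')'])
```

The stack holds exactly the unmatched separators that the rule describes, so only its innermost element needs to be inspected.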

Significant newlines are specifically allowed to occur inside non-parenthesized expressions and in end-of-line comments.

(As in Java, no newline of any sort may occur within plain string quotes. The meaning of newlines in strings is described in §3.10.5.)

A significant newline is lexically significant, in that the grammar must somehow account for it in the token stream. However, in many places in the grammar, significant newlines are accepted but discarded.

Many tokens can be regarded as "leaning rightward" (i.e., they are linguistically proclitic) because they clearly require additional tokens to complete a statement. For example, a colon or plus sign never ends a statement, but always requires further tokens. This notion is formalized in the Groovy grammar, where significant newlines are allowed and discarded after such rightward-leaning tokens. Such tokens include prefix and infix expression operators; comma, colon, and dot; and certain keywords (such as "throws") which contribute to declaration syntax. In a few cases, tokens lean rightward only in some contexts. For example, the ++ operator leans rightward only if it is a prefix operator.

These rules provide for easy continuation of long statements or expressions onto multiple lines, without a need to explicitly escape the intermediate line terminators. Generally, expression statements will terminate at newlines if the expression is complete, even if further tokens on the next line would add to the expression. If a programmer breaks a long expression by inserting a newline after an operator, the Groovy and Java languages will agree on the statement continuation. Following Java formatting habits, a programmer may also insert a line-breaking newline before an operator. In such cases, the Groovy parser will detect the error, since the following line fragment, even if it happens to parse as an expression, will amount to an illegal expression statement (§14.8). If there is doubt about a specific expression, enclose it in parentheses to make clear the intended grouping, and disable significant newlines.
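The two safe ways of continuing an expression might look like the sketch below. It assumes the rules above for rightward-leaning tokens and for parentheses; the variable names are illustrative.

```groovy
def x = 1
def y = 2

// The newline after the infix '+' is discarded; the statement continues.
def sum = x +
    y
assert sum == 3

// Inside parentheses no newline is significant, so the operator may lead.
def grouped = (x
    + y)
assert grouped == 3
```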

3.11.2 Grammatical Significance of Newlines

Addition: This entire section is added to the specification.

For simplicity in subsequent chapters, the following two grammar rules define all occurrences of significant newlines.

NLS:
  (SignificantNewline)*
SEP:
  ';' NLS
  SignificantNewline (';')? NLS

Wherever the grammar allows a semicolon separator token, the grammar also accepts any number of significant newlines instead of or in addition to the semicolon. This is always indicated in subsequent chapters by an occurrence of the SEP nonterminal.

Syntactically insignificant (and optional) newlines are indicated by occurrences of the NLS nonterminal. There are no other uses of the SignificantNewline element.

Unlike Java, but like Pascal, a semicolon functions as a statement separator, not a statement terminator. A statement just before an enclosing right bracket is terminated with or without a final separator token. As with certain scripting languages such as sh and awk, semicolons and significant newlines are interchangeable as statement separators.

println x      //SIG
println x;     //insig
println x /*…  insig
  ...*/ + y
println x +    //insig
  y
println (x     //insig
  + y)
println ({ x   //SIG
  y })

3.12 Operators

(Cf. JLS. §3.12.)

Addition: Groovy includes all Java operators, plus:

Operator:
    <=>     ..      ..<     ...     *.      ?.      .&
    =~      ==~     **      **=     .@
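A few of these operators are shown in use in the sketch below. Their semantics are defined in later chapters; the results here assume the behavior of current Groovy implementations, and the ... and .@ operators are left out of the sketch.

```groovy
assert (1 <=> 2) == -1                   // three-way comparison
assert (1..3).toList() == [1, 2, 3]      // inclusive range
assert (1..<3).toList() == [1, 2]        // half-open range
assert ['a', 'bb']*.size() == [1, 2]     // spread-dot
String missing = null
assert missing?.length() == null         // safe navigation
assert 2 ** 3 == 8                       // power
assert 'abc' ==~ /a.c/                   // whole-string match
def matcher = 'abc' =~ /b/               // matcher creation
assert matcher.find()
def upper = 'abc'.&toUpperCase           // method reference
assert upper() == 'ABC'
```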


The original of this specification is at http://docs.codehaus.org/display/GroovyJSR.