NLP Lecture 05: Introduction to Syntax and Formal Languages

Keywords: Introduction to Syntax and Formal Languages.

Syntax

Syntax as an Interface

  • Syntax can be seen as the interface between morphology (structure of words) and semantics.
  • Can judge if a sentence is grammatical or not, even if it doesn’t make sense semantically.

Key Concepts of Syntax

  • Constituency and Recursion.
  • Dependency.
  • Grammatical Relations.
  • Subcategorization.
  • Long-distance dependencies

Constituents

A constituent is a group of words that behave as a single unit (within a hierarchical structure).

Constituency

There is a great number of constituency tests. They typically involve moving constituents around or replacing them.

  • Topicalization
  • Pro-form Substitution
  • Wh-question test.

Constituent Labels

  • Choose constituents so each one has one non-bracketed word: the head.
  • Category of Constituent: XP, where X is the part-of-speech of the head: NP, VP, AdjP, AdvP, DetP

Recursion in Language

  • One of the most important attributes of Natural Languages is that they are recursive.
  • There are infinitely many sentences in a language, but in predictable structures.

Context Free Grammars (CFG)

A context free grammar is defined by:

  • Set of terminal symbols \(\Sigma\).
  • Set of non-terminal symbols \(N\).
  • A start symbol \(S \in N\).
  • Set \(R\) of productions of the form \(A \rightarrow \beta\), where \(A \in N\) and \(\beta \in (\Sigma \cup N)^*\), i.e. \(\beta\) is a string of terminals and non-terminals.

Language of a CFG

  • Given a CFG \(G=(N, \Sigma,R,S)\):
    • Given a string \(\alpha A \gamma\), where \(A \in N\), we can derive \(\alpha \beta \gamma\) if there is a production \(A \rightarrow \beta \in R\).
    • \(\alpha \implies \beta\) means that \(G\) can derive \(\beta\) from \(\alpha\) in a single step.
    • \(\alpha \implies \beta^*\) means that \(G\) can derive \(\beta\) from \(\alpha\) in a finite number of steps.
    • The language of G is defined as the set of all terminal strings that can be derived from the start symbol.
    • \[L(G) = \{\beta \in \Sigma ^*, s.t. S\implies ^* \beta\}\]

Derivations and Derived Strings

  • CFG is a string rewriting formalism, so the derived objects are strings.
  • A derivation is a sequence of rewriting steps.
  • CFGs are context free: applicability of a rule depends only on the nonterminal symbol, not on its context.
  • Therefore, the order in which multiple non-terminals in a partially derived string are replaced does not matter. We can represent identical derivations in a derivation tree.
  • The derivation tree implies a parse tree.

Regular Grammars

A regular grammar is defined by:

  • Set of terminal symbols \(\Sigma\).
  • Set of non-terminal symbols \(N\).
  • A start symbol \(S \in N\).
  • Set \(R\) of productions of the form \(A \rightarrow aB\) or \(A \rightarrow a\), where \(A,B \in N\) and \(a \in \Sigma\).

Finite State Automata

  • Regular grammars can be implemented as finite state automata.
  • The set of all regular languages is strictly smaller than the set of context-free languages.

Center Embeddings

  • Problem: Regular grammars cannot capture long-distance dependencies.

Formal Grammar and Parsing

  • Formal Grammars are used in linguistics, NLP, programming languages.
  • We want to build a compact model that describes a complete language.
  • Need efficient algorithms to determine if a sentence is in the language or not (recognition problem).
  • We also want to recover the structure imposed by the grammar (parsing problem).

© 2022 Jiawei Lu. All rights reserved.