NLP Lecture 05: Introduction to Syntax and Formal Languages
Keywords: Introduction to Syntax and Formal Languages.
Syntax
Syntax as an Interface
- Syntax can be seen as the interface between morphology (structure of words) and semantics.
- Can judge if a sentence is grammatical or not, even if it doesn’t make sense semantically.
Key Concepts of Syntax
- Constituency and Recursion.
- Dependency.
- Grammatical Relations.
- Subcategorization.
- Long-distance dependencies
Constituents
A constituent is a group of words that behave as a single unit (within a hierarchical structure).
Constituency
There is a great number of constituency tests. They typically involve moving constituents around or replacing them.
- Topicalization
- Pro-form Substitution
- Wh-question test.
Constituent Labels
- Choose constituents so each one has one non-bracketed word: the head.
- Category of Constituent: XP, where X is the part-of-speech of the head: NP, VP, AdjP, AdvP, DetP
Recursion in Language
- One of the most important attributes of Natural Languages is that they are recursive.
- There are infinitely many sentences in a language, but in predictable structures.
Context Free Grammars (CFG)
A context free grammar is defined by:
- Set of terminal symbols \(\Sigma\).
- Set of non-terminal symbols \(N\).
- A start symbol \(S \in N\).
- Set \(R\) of productions of the form \(A \rightarrow \beta\), where \(A \in N\) and \(\beta \in (\Sigma \cup N)^*\), i.e. \(\beta\) is a string of terminals and non-terminals.
Language of a CFG
- Given a CFG \(G=(N, \Sigma,R,S)\):
- Given a string \(\alpha A \gamma\), where \(A \in N\), we can derive \(\alpha \beta \gamma\) if there is a production \(A \rightarrow \beta \in R\).
- \(\alpha \implies \beta\) means that \(G\) can derive \(\beta\) from \(\alpha\) in a single step.
- \(\alpha \implies \beta^*\) means that \(G\) can derive \(\beta\) from \(\alpha\) in a finite number of steps.
- The language of G is defined as the set of all terminal strings that can be derived from the start symbol.
- \[L(G) = \{\beta \in \Sigma ^*, s.t. S\implies ^* \beta\}\]
Derivations and Derived Strings
- CFG is a string rewriting formalism, so the derived objects are strings.
- A derivation is a sequence of rewriting steps.
- CFGs are context free: applicability of a rule depends only on the nonterminal symbol, not on its context.
- Therefore, the order in which multiple non-terminals in a partially derived string are replaced does not matter. We can represent identical derivations in a derivation tree.
- The derivation tree implies a parse tree.
Regular Grammars
A regular grammar is defined by:
- Set of terminal symbols \(\Sigma\).
- Set of non-terminal symbols \(N\).
- A start symbol \(S \in N\).
- Set \(R\) of productions of the form \(A \rightarrow aB\) or \(A \rightarrow a\), where \(A,B \in N\) and \(a \in \Sigma\).
Finite State Automata
- Regular grammars can be implemented as finite state automata.
- The set of all regular languages is strictly smaller than the set of context-free languages.
Center Embeddings
- Problem: Regular grammars cannot capture long-distance dependencies.
Formal Grammar and Parsing
- Formal Grammars are used in linguistics, NLP, programming languages.
- We want to build a compact model that describes a complete language.
- Need efficient algorithms to determine if a sentence is in the language or not (recognition problem).
- We also want to recover the structure imposed by the grammar (parsing problem).