Creating a grammar (legacy Tree-sitter)
Pulsar also has a syntax highlighting and code folding system powered by an earlier version of Tree-sitter.
Getting started
There are two components required to use Tree-sitter in Pulsar: a parser and a grammar file.
The parser
Tree-sitter generates parsers based on context-free grammars that are typically written in JavaScript. The generated parsers are C libraries that can be used in other applications as well as Pulsar.
They can also be developed and tested at the command line, separately from Pulsar. Tree-sitter has its own documentation page on how to create these parsers. The Tree-sitter GitHub organization also contains a lot of example parsers that you can learn from, each in its own repository.
Once you have created a parser, you need to publish it to the NPM registry
to use it in Pulsar. To do this, make sure you have a name and version in
your parser’s package.json:
{
"name": "tree-sitter-mylanguage",
"version": "0.0.1",
// ...
}
then run the command npm publish.
The package
Once you have a Tree-sitter parser that is available on npm, you can use it in
your Pulsar package. Packages with grammars are, by convention, always named
starting with language. You’ll need a folder with a package.json, a
grammars subdirectory, and a single json or cson file in the grammars
directory, which can be named anything.
language-mylanguage
├── LICENSE
├── README.md
├── grammars
│ └── mylanguage.cson
└── package.json
The grammar file
The mylanguage.cson file specifies how Pulsar should use the parser you created.
Basic fields
It starts with some required fields:
name: 'My Language'
scopeName: 'mylanguage'
type: 'tree-sitter'
parser: 'tree-sitter-mylanguage'
scopeName- A unique, stable identifier for the language. Pulsar users will use this in configuration files if they want to specify custom configuration based on the language.name- A human readable name for the language.parser- The name of the parser node module that will be used for parsing. This string will be passed directly torequire()in order to load the parser.type- This should have the valuetree-sitterto indicate to Pulsar that this is a Tree-sitter grammar and not a TextMate grammar.
Language recognition
Next, the file should contain some fields that indicate to Pulsar when this language should be used. These fields are all optional.
fileTypes- An array of filename suffixes. The grammar will be used for files whose names end with one of these suffixes. Note that the suffix may be an entire filename.firstLineRegex- A regex pattern that will be tested against the first line of the file. The grammar will be used if this regex matches.contentRegex- A regex pattern that will be tested against the contents of the file in order to break ties in cases where multiple grammars matched the file using the above two criteria. If thecontentRegexmatches, this grammar will be preferred over another grammar with nocontentRegex. If thecontentRegexdoes not match, a grammar with nocontentRegexwill be preferred over this one.
Syntax highlighting
The HTML classes that Pulsar uses for syntax highlighting do not correspond
directly to nodes in the syntax tree. Instead, Tree-sitter grammar files specify
scope mappings that specify which classes should be applied to which syntax
nodes. The scopes object controls these scope mappings. Its keys are CSS
selectors that select nodes in the syntax tree. Its values can be of several
different types.
Here is a simple example:
scopes:
'call_expression > identifier': 'entity.name.function'
This entry means that, in the syntax tree, any identifier node whose parent is
a call_expression should be highlighted using three classes: syntax--entity,
syntax--name, and syntax--function.
Note that in this selector, we’re using the immediate child combinator
(>). Arbitrary descendant selectors without this combinator (for example
'call_expression identifier', which would match any identifier occurring
anywhere within a call_expression) are currently not supported.
Advanced selectors
The keys of the scopes object can also contain multiple CSS selectors,
separated by commas, similar to CSS files. The triple-quote syntax in CSON makes
it convenient to write keys like this on multiple lines:
scopes:
'''
function_declaration > identifier,
call_expression > identifier,
call_expression > field_expression > field_identifier
''': 'entity.name.function'
You can use the :nth-child pseudo-class
to select nodes based on their order within their parent. For example, this
example selects identifier nodes which are the fourth (zero-indexed) child of
a singleton_method node.
scopes:
'singleton_method > identifier:nth-child(3)': 'entity.name.function'
Finally, you can use double-quoted strings in the selectors to select
anonymous tokens in the syntax tree, like ( and :. See
the Tree-sitter documentation
for more information about named vs anonymous tokens.
scopes:
'''
"*",
"/",
"+",
"-"
''': 'keyword.operator'
Text-based mappings
You can also apply different classes to a syntax node based on its text. Here are some examples:
scopes:
# Apply the classes `syntax--builtin` and `syntax--variable` to all
# `identifier` nodes whose text is `require`.
'identifier': {exact: 'require', scopes: 'builtin.variable'},
# Apply the classes `syntax--type` and `syntax--integer` to all
# `primitive_type` nodes whose text starts with `int` or `uint`.
'primitive_type': {match: /^u?int/, scopes: 'type.integer'},
# Apply the classes `syntax--builtin`, `syntax--class`, and
# `syntax--name` to `constant` nodes with the text `Array`,
# `Hash` and `String`. For all other `constant` nodes, just
# apply the classes `syntax--class` and `syntax--name`.
'constant': [
{match: '^(Array|Hash|String)$', scopes: 'builtin.class.name'},
'class.name'
]
In total there are four types of values that can be associated with selectors in
scopes:
- Strings - Each class name in the dot-separated string will be prefixed with
syntax--and applied to the selected node. - Objects with the keys
exactandscopes- If the node’s text equals theexactstring, thescopesstring will be used as described above. - Objects with the keys
matchandscopes- If the node’s text matches thematchregex pattern, thescopesstring will be used as described above. - Arrays - The elements of the array will be processed from beginning to end. The first element that matches the selected node will be used as describe above.
Specificity
If multiple selectors in the scopes object match a node, the node’s classes
will be decided based on the most specific
selector. Note that the exact and match rules do not affect specificity,
so you may need to supply the same exact or match rules for multiple
selectors to ensure that they take precedence over other selectors. You can use
the same selector multiple times in a scope mapping, within different
comma-separated keys:
scopes:
'call_expression > identifier': 'entity.name.function'
# If we did not include the second selector here, then this rule
# would not apply to identifiers inside of call_expressions,
# because the selector `call_expression > identifier` is more
# specific than the selector `identifier`.
'identifier, call_expression > identifier': [
{exact: 'require', scopes: 'builtin.variable'},
{match: '^[A-Z]', scopes: 'constructor'},
]
Language injection
Sometimes, a source file can contain code written in several different
languages. Tree-sitter grammars support this situation using a two-part process
called language injection. First, an 'outer' language must define an
injection point - a set of syntax nodes whose text can be parsed using a
different language, along with some logic for guessing the name of the other
language that should be used. Second, an 'inner' language must define an
injectionRegex - a regex pattern that will be tested against the language name
provided by the injection point.
For example, in JavaScript, tagged template literals sometimes contain code written in a different language, and the name of the language is often used in the 'tag' function, as shown in this example:
// HTML in a template literal
const htmlContent = html`<div>Hello ${name}</div>`;
The tree-sitter-javascript parser parses this tagged template literal as a
call_expression with two children: an identifier and a template_literal:
(call_expression
(identifier)
(template_literal
(interpolation
(identifier))))
Here is an injection point that would allow syntax highlighting inside of template literals:
atom.grammars.addInjectionPoint("source.js", {
type: "call_expression",
language(callExpression) {
const { firstChild } = callExpression;
if (firstChild.type === "identifier") {
return firstChild.text;
}
},
content(callExpression) {
const { lastChild } = callExpression;
if (lastChild.type === "template_string") {
return lastChild;
}
},
});
The language callback would then be called with every call_expression node
in the syntax tree. In the example above, it would retrieve the first child of
the call_expression, which is an identifier with the name "html". The
callback would then return the string "html".
The content callback would then be called with the same call_expression node
and return the template_string node within the call_expression node.
In order to parse the HTML within the template string, the HTML grammar file
would need to specify an injectionRegex:
injectionRegex: 'html|HTML'
Code folding
The next field in the grammar file, folds, controls code folding. Its value is
an array of fold pattern objects. Fold patterns are used to decide whether or
not a syntax node can be folded, and if so, where the fold should start and end.
Here are some example fold patterns:
folds: [
# All `comment` nodes are foldable. By default, the fold starts at
# the end of the node's first line, and ends at the beginning
# of the node's last line.
{
type: 'comment'
}
# `if_statement` nodes are foldable if they contain an anonymous
# "then" token and either an `elif_clause` or `else_clause` node.
# The fold starts at the end of the "then" token and ends at the
# `elif_clause` or `else_clause`.
{
type: 'if_statement',
start: {type: '"then"'}
end: {type: ['elif_clause', 'else_clause']}
}
# Any node that starts with an anonymous "(" token and ends with
# an anonymous ")" token is foldable. The fold starts after the
# "(" and ends before the ")".
{
start: {type: '"("', index: 0},
end: {type: '")"', index: -1}
}
]
Fold patterns can have one or more of the following fields:
type- A string or array of strings. In order to be foldable according to this pattern, a syntax node’s type must match one of these strings.start- An object that is used to identify a child node after which the fold should start. The object can have one or both of the following fields:type- A string or array of strings. To start a fold, a child node’s type must match one of these strings.index- a number that’s used to select a specific child according to its index. Negative values are interpreted as indices relative the last child, so that-1means the last child.
end- An object that is used to identify a child node before which the fold should end. It has the same structure as thestartobject.
Comments
The last field in the grammar file, comments, controls the behavior of
Pulsar’s Editor: Toggle Line Comments command. Its value is an object with a
start field and an optional end field. The start field is a string that
should be prepended to or removed from lines in order to comment or uncomment
them.
In JavaScript, it looks like this:
comments:
start: '// '
The end field should be used for languages that only support block comments, not line comments. If present, it will be appended to or removed from the end of the last selected line in order to comment or un-comment the selection.
In CSS, it would look like this:
comments:
start: '/* '
end: ' */'
Example Packages
More examples of all of these features can be found in the Tree-sitter grammars bundled with Pulsar: