Towards tackling SPARQL heterogeneity through modular parsing

Jitse De Smet; Ruben Taelman

Introduction

The SPARQL query language [1], a cornerstone of the Semantic Web stack, has evolved through both standardisation and real-world innovation. While SPARQL 1.1 defines a clear and extensible foundation, the ecosystem has gradually diverged as implementers introduced powerful but engine-specific extensions. For example, Virtuoso offers full-text search capabilities [2], Apache Jena supports CONSTRUCT QUAD queries [3], and Oxigraph provides extended date-time functionality including the ADJUST function [4]. These features are often highly valuable, but also mutually incompatible, creating a heterogeneous landscape where queries that run on one engine may fail on another.

This diversity presents a serious challenge for SPARQL portability, tooling, and federated querying. With the finalisation of the SPARQL 1.2 specification [5], the gap between supported language features is likely to widen further, since migration to SPARQL 1.2 is not trivial: it requires substantial updates to the dataset representation and the underlying RDF store [6]. The update from RDF 1.1 to RDF 1.2 is substantial mainly because it introduces triple terms: the object of a triple can now itself be a triple, which makes the definition of triples recursive, since the nested triple can again have a triple in its object position.
Moreover, the working group has announced that after SPARQL 1.2 finalisation, they plan to move toward a more agile “maintenance and new features” mode, which hints at even faster iteration cycles in the future. As a result, there is a growing need for tooling that embraces extensibility and modularity by design.
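The recursive shape of triple terms mentioned above can be made concrete with a small type sketch. The names below (IriTerm, LiteralTerm, TripleTerm, ObjectTerm, nestingDepth) are hypothetical and deliberately simplified (blank nodes, datatypes, and language tags are omitted); this is not the RDF/JS data model or any engine's internal representation.

```typescript
// Simplified, hypothetical term model (blank nodes, datatypes and language tags omitted).
interface IriTerm { kind: 'iri'; value: string }
interface LiteralTerm { kind: 'literal'; value: string }
interface TripleTerm {
  kind: 'triple';
  subject: IriTerm;
  predicate: IriTerm;
  // The recursion: the object position may itself hold a triple term.
  object: ObjectTerm;
}
type ObjectTerm = IriTerm | LiteralTerm | TripleTerm;

// Counts how many triple terms are nested along the object position.
function nestingDepth(term: ObjectTerm): number {
  return term.kind === 'triple' ? 1 + nestingDepth(term.object) : 0;
}

const iriTerm = (value: string): IriTerm => ({ kind: 'iri', value });

const innerTriple: TripleTerm = {
  kind: 'triple',
  subject: iriTerm('ex:alice'),
  predicate: iriTerm('ex:knows'),
  object: iriTerm('ex:bob'),
};

// A statement about a statement: the object is the triple defined above.
const outerTriple: TripleTerm = {
  kind: 'triple',
  subject: iriTerm('ex:carol'),
  predicate: iriTerm('ex:claims'),
  object: innerTriple,
};
```

Any store or parser that assumed a flat subject, predicate, object layout has to account for this unbounded nesting, which is one reason the migration is costly.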

In this work, we show the need for a modular parser and what such a parser could look like. Unlike parsers produced by traditional parser generators such as ANTLR [7] or Bison [8], which rely on Domain Specific Languages (DSLs) and generate static parsing code, our parser should be defined entirely within a host programming language. This would eliminate the compile step, enable programmatic extension, and leverage strong typing to provide a safer, more developer-friendly API. The parser should not be handwritten either; instead, it should use declarative rules such as those offered by the TypeScript-based Chevrotain parser toolkit [9].

A modular parser that allows grammar fragments to be added, overridden, or swapped at runtime would empower both researchers and practitioners to create a new generation of language-aware SPARQL tools. This opens the door to use cases such as heterogeneous query tooling (e.g., adapting editors like YASGUI [10] to custom SPARQL dialects), while keeping maintainability in check. Additionally, it would allow SPARQL version translation and rapid experimentation with new language features. In an ecosystem where SPARQL flavors are growing rather than converging, we believe modularity is not just a nicety but a necessity.

The next section briefly reviews related work, while Section 3 describes the system architecture. Section 4 sketches the demonstration that we will give at the workshop. Section 5 concludes with future work and the desired impact of this research.

Related Work

In this section, we examine prominent software packages in the SPARQL ecosystem that implement parsing capabilities. Our findings are summarized in Fig. 1.

Notably, all of the major open-source SPARQL parsers discussed rely on either parser generators or parser-building toolkits to define their grammars. In compiled languages such as Rust or Java, the parser generation step can be integrated directly into the main build step; for example, Oxigraph uses rust-peg for this purpose. Interestingly, in our survey only Stardog’s Millan does not use a parser generator. Instead, it uses the Chevrotain parser-building toolkit without constructing an Abstract Syntax Tree (AST); it appears to focus solely on validation rather than full syntactic analysis.

This highlights a broader pattern: while parser generators dominate SPARQL tooling, few systems are designed with modularity or extensibility as a first-class concern. In particular, full modularity—including the ability to remove grammar rules—is not supported in current public implementations, making adaptation or evolution of these parsers difficult.

Software Package	Parsing Software	Parser Generator
Comunica	SPARQL.JS ^proof	Jison ^proof
Yasgui		SWI Prolog ^proof
Apache Jena		JavaCC ^proof
Oxigraph		rust-peg ^proof
Stardog - Millan		Chevrotain ^proof
Virtuoso		Bison ^proof
Blazegraph		JavaCC ^proof
GraphDB	RDF4J ^proof	JavaCC ^proof

Fig. 1: Each row lists a widely used software package, its associated parsing library, and the parser generator employed. When the parsing software is omitted, the parser is implemented directly within the project. For each usage claim, we provide a link to back up the claim.

Software Architecture

Parsers are typically implemented in one of three ways:

Hand-built parsers: These are manually implemented parsers tailored to a specific grammar. While they can be highly performant due to low-level optimizations and language-specific design, they are often difficult to maintain, extend, or modularize.
Parser generators: Tools such as ANTLR [7] and Bison [8] use a Domain Specific Language (DSL), typically based on Extended Backus–Naur Form (EBNF), to define a grammar. These grammars are then compiled into standalone parser code. While powerful, such approaches introduce a compile step and tend to be rigid, making modular extensions cumbersome.
Parser building toolkits: Libraries such as Chevrotain [9] offer a hybrid approach, enabling declarative grammar specification within a host programming language. These toolkits eliminate the compile step and allow for flexible, programmatic grammar definitions with fine-grained control over behavior and integration.

To support modularity while keeping the mental model approachable, a modular parser should be built using a parser building toolkit. Parsing itself is typically divided into multiple phases [11], of which the following are relevant to this work:

Lexical Analysis (scanning): A lexer transforms a character stream into a token stream.
Syntax Analysis (parsing): A parser transforms the token stream into an abstract syntax tree (AST).
Semantic Analysis: Performed during or after parsing, this phase validates constraints not enforced by grammar alone. For instance, SPARQL forbids binding to a variable which is already in scope.
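The three phases listed above can be illustrated with a minimal sketch. The toy language here is an assumption made purely for illustration and is not SPARQL: it consists of whitespace-separated `use ?v` and `bind ?v` statements, and the semantic check loosely mirrors the rule that a variable already in scope may not be bound again. All names (scanToy, parseToy, checkToy, analyseToy) are hypothetical.

```typescript
// Toy language (NOT SPARQL): whitespace-separated `use ?v` and `bind ?v` statements.
interface ToyToken { type: 'keyword' | 'variable'; image: string }
interface ToyStatement { op: 'use' | 'bind'; variable: string }

// Phase 1, lexical analysis: character stream -> token stream.
function scanToy(input: string): ToyToken[] {
  const tokens: ToyToken[] = [];
  const words = input.split(/\s+/).filter(word => word.length > 0);
  for (const word of words) {
    if (word === 'use' || word === 'bind') {
      tokens.push({ type: 'keyword', image: word });
    } else if (/^\?\w+$/.test(word)) {
      tokens.push({ type: 'variable', image: word });
    } else {
      throw new Error(`Lexical error: unexpected "${word}"`);
    }
  }
  return tokens;
}

// Phase 2, syntax analysis: token stream -> (very small) AST.
function parseToy(tokens: ToyToken[]): ToyStatement[] {
  const statements: ToyStatement[] = [];
  for (let i = 0; i < tokens.length; i += 2) {
    const keyword = tokens[i];
    const variable = tokens[i + 1];
    if (keyword === undefined || keyword.type !== 'keyword' ||
        variable === undefined || variable.type !== 'variable') {
      throw new Error(`Syntax error near token ${i}`);
    }
    statements.push({ op: keyword.image === 'bind' ? 'bind' : 'use', variable: variable.image });
  }
  return statements;
}

// Phase 3, semantic analysis: a constraint the grammar alone cannot express.
function checkToy(statements: ToyStatement[]): string[] {
  const inScope: Record<string, boolean> = {};
  const errors: string[] = [];
  for (const statement of statements) {
    if (statement.op === 'bind' && inScope[statement.variable] === true) {
      errors.push(`${statement.variable} is already in scope`);
    }
    inScope[statement.variable] = true;
  }
  return errors;
}

const analyseToy = (input: string): string[] => checkToy(parseToy(scanToy(input)));
```

Note that `use ?x bind ?x` is syntactically valid in this toy grammar; only the third phase rejects it, which is the distinction the list above draws.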

Inspired by the codebase of the Comunica modular query engine [12], the codebase of a modular parser should not be one large monolith but should instead consist of many smaller packages that can be tied together to serve a larger purpose. To keep many small packages maintainable, a monorepo structure could be considered. Within the Comunica codebase, the use of small packages makes it possible to define many different builds (e.g., a minimal build for the web, and a general build with or without file system access). Similar benefits can be expected from adopting such a structure within the modular parser:

Engines: These are prebuilt, ready-to-use components such as SPARQL 1.1 and 1.2 parsers or generators.
Non-engine packages: These expose modular building blocks used to construct engines, such as grammar fragments or core construction utilities.

However, unlike Comunica, which uses Components.js, a dependency injection framework based on RDF configuration files, the modular parser can be configured within the host language itself, since its components share a similar interface. We propose that a parser be built using a builder pattern and that parser packages export the builder they use, so that others may extend it. With a builder pattern, you can take the builder used to build one parser and manipulate its grammar rules to construct a new parser.

Concretely, we propose a builder which allows rules to be registered by name into a rule map, thereby creating a loose coupling between registered rules. Each rule is defined as a ParserRule object, containing both a rule name and a rule implementation. Rule implementations can be expressed declaratively using Chevrotain’s grammar definition functions like:

SUBRULE: invokes another rule, registered under some name in the current parser,
MANY: matches zero or more occurrences of a pattern,
OR: matches one of several alternatives.

We propose that each rule implementation return a function that, when invoked, receives the parsing context and any parameters, and outputs part of the final syntax tree. Listing 1 shows an example parser rule definition. The ParserBuilder can then be used for compositional construction and extension through methods like addRule, deleteRule, merge, and typePatch. The typePatch utility would enable type updates to existing rules, which is particularly useful when extending or modifying a dependent rule without altering the original rule’s implementation. Once the parser has been constructed, it can be built, as shown in Listing 2. The resulting parser can start parsing a string from any of the rules added to the builder, a property inherited from the underlying parser building toolkit.

import type { SparqlRule } from '@traqula/core';
// `iri` (another parser rule) and `nilToken` (a lexer token) are assumed to be defined elsewhere.
const iriOrNil: SparqlRule<'iriOrNil', URL | null> = <const>{
  name: 'iriOrNil',
  impl: ({SUBRULE, CONSUME, OR}) => () => OR<URL | null>([
    {ALT: () => SUBRULE(iri, undefined)},
    {ALT: () => {
        CONSUME(nilToken);
        return null;
      } },
  ]),
};

Listing 1: The definition of a parser rule that parses either an IRI or the nil token, returning the parsed IRI or null respectively.

import { ParserBuilder } from '@traqula/core';
const parser = ParserBuilder
  .create([ iriOrNil, rule1 ])
  .addRule(rule2)
  .patchRule(rule1Alternative)
  .build({
    tokenVocabulary: myLexerBuilder.tokenVocabulary,
  });
// The argument and return types of the function are known,
// ast will thus be inferred to have the type `URL | null`.
const ast = parser.iriOrNil(myString, myContext, myParameters);

Listing 2: The construction of a parser including the iriOrNil rule constructed in Listing 1. It also shows how to parse using the iriOrNil rule as the starting rule.
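To make the loose coupling through a rule map more tangible, the following is a minimal, self-contained sketch of how such a builder could work internally. It is an assumption-laden illustration and not the `@traqula/core` implementation: the names SketchBuilder, SketchRule, RuleContext, and PEEK are hypothetical, tokens are plain strings, and the merge and typePatch operations as well as static typing of rule results are left out.

```typescript
// Hypothetical sketch: this is NOT the '@traqula/core' implementation.
interface RuleContext {
  SUBRULE: (name: string) => unknown;     // invoke a rule by its registered name
  CONSUME: (expected: string) => string;  // consume the next token or throw
  PEEK: () => string | undefined;         // look at the next token without consuming it
}
interface SketchRule { name: string; impl: (ctx: RuleContext) => unknown }

class SketchBuilder {
  private readonly rules = new Map<string, SketchRule>();

  static create(rules: SketchRule[]): SketchBuilder {
    const builder = new SketchBuilder();
    rules.forEach(rule => builder.addRule(rule));
    return builder;
  }

  addRule(rule: SketchRule): this {
    if (this.rules.has(rule.name)) throw new Error(`Rule "${rule.name}" already exists`);
    this.rules.set(rule.name, rule);
    return this;
  }

  patchRule(rule: SketchRule): this {
    if (!this.rules.has(rule.name)) throw new Error(`Rule "${rule.name}" does not exist`);
    this.rules.set(rule.name, rule);
    return this;
  }

  deleteRule(name: string): this {
    this.rules.delete(name);
    return this;
  }

  // Snapshots the rule map, so later builder changes do not affect built parsers.
  build(): (start: string, tokens: string[]) => unknown {
    const rules = new Map<string, SketchRule>();
    this.rules.forEach((rule, name) => rules.set(name, rule));
    return (start, tokens) => {
      let position = 0;
      const ctx: RuleContext = {
        SUBRULE: name => {
          const rule = rules.get(name);
          if (rule === undefined) throw new Error(`Unknown rule "${name}"`);
          return rule.impl(ctx);
        },
        CONSUME: expected => {
          if (tokens[position] !== expected) throw new Error(`Expected "${expected}" at token ${position}`);
          position += 1;
          return expected;
        },
        PEEK: () => tokens[position],
      };
      const result = ctx.SUBRULE(start);
      if (position !== tokens.length) throw new Error('Unconsumed input');
      return result;
    };
  }
}

const iriOrNilSketch: SketchRule = {
  name: 'iriOrNil',
  impl: ({ SUBRULE, CONSUME, PEEK }) => {
    if (PEEK() === 'NIL') {
      CONSUME('NIL');
      return null;
    }
    return SUBRULE('iri'); // resolved by name when the parser runs
  },
};
const iriSketch: SketchRule = {
  name: 'iri',
  impl: ({ CONSUME, PEEK }) => {
    const next = PEEK();
    if (next === undefined || !/^<.*>$/.test(next)) throw new Error('Expected <iri>');
    return CONSUME(next).slice(1, -1);
  },
};
// Alternative `iri` rule that additionally accepts prefixed names such as ex:a.
const iriWithPrefixSketch: SketchRule = {
  name: 'iri',
  impl: ctx => {
    const next = ctx.PEEK();
    if (next !== undefined && /^\w+:\w+$/.test(next)) return ctx.CONSUME(next);
    return iriSketch.impl(ctx);
  },
};

const sketchBuilder = SketchBuilder.create([iriOrNilSketch, iriSketch]);
const parseBase = sketchBuilder.build();
const parseExtended = sketchBuilder.patchRule(iriWithPrefixSketch).build();
```

Because `iriOrNil` refers to `iri` only by name, patching `iri` changes what `iriOrNil` accepts without touching its definition. In this sketch, `parseExtended('iriOrNil', ['ex:a'])` succeeds while `parseBase` rejects the same input.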

The lexer should follow a similar approach to the parser. Tokens should be coupled loosely through a name-definition map. Consuming a token in a rule then means consuming whichever token is registered under that name in the lexer in use. Beyond that, our only requirement is that each token can be defined by a regular expression.
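A name-definition map for tokens could be sketched as follows. This is a hedged illustration only: LexerSketch, LexToken, and the token names are hypothetical, whitespace is simply skipped, and the first registered definition that matches wins, which is a simplification of what a real lexer toolkit offers.

```typescript
// Hypothetical sketch; `LexerSketch` and the token names are illustrative only.
interface LexToken { name: string; image: string }

class LexerSketch {
  // Name -> definition map: parser rules refer to tokens by name only.
  private readonly definitions = new Map<string, RegExp>();

  add(name: string, pattern: RegExp): this {
    // Store a sticky copy so that matching always starts at the current offset.
    const flags = pattern.ignoreCase ? 'iy' : 'y';
    this.definitions.set(name, new RegExp(pattern.source, flags));
    return this;
  }

  remove(name: string): this {
    this.definitions.delete(name);
    return this;
  }

  tokenize(input: string): LexToken[] {
    const names = Array.from(this.definitions.keys());
    const tokens: LexToken[] = [];
    let offset = 0;
    while (offset < input.length) {
      if (/\s/.test(input.charAt(offset))) {
        offset += 1; // skip whitespace
        continue;
      }
      let image = '';
      let matchedName = '';
      for (const name of names) {
        const pattern = this.definitions.get(name);
        if (pattern === undefined) continue;
        pattern.lastIndex = offset;
        const result = pattern.exec(input);
        if (result !== null && result[0].length > 0) {
          image = result[0];
          matchedName = name;
          break; // the first registered definition that matches wins
        }
      }
      if (image === '') throw new Error(`Unexpected character at offset ${offset}`);
      tokens.push({ name: matchedName, image });
      offset += image.length;
    }
    return tokens;
  }
}

const baseLexer = (): LexerSketch => new LexerSketch()
  .add('Optional', /OPTIONAL\b/i)
  .add('LCurly', /\{/)
  .add('RCurly', /\}/)
  .add('Variable', /\?\w+/);

// Dialects are derived by adding or removing named definitions.
const withAdjust = baseLexer().add('Adjust', /ADJUST\b/i);
const withoutOptional = baseLexer().remove('Optional');

const tokenNames = (lexer: LexerSketch, input: string): string[] =>
  lexer.tokenize(input).map(token => token.name);
```

A rule that consumes the token named `Adjust` works with any lexer that registers a definition under that name, regardless of which regular expression backs it.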

Demonstration

In the workshop demonstration, we will showcase how our proof-of-concept modular parser builder enables straightforward modification and extension of existing parsers. Starting from a prebuilt SPARQL 1.1 parser, we will incrementally evolve the grammar in four small steps using the described builder-based architecture. Each change will be demonstrated live, with code edits performed in an IDE and parser behavior verified in a browser-based UI. Specifically, we will:

extend SPARQL to support the ADJUST function [4],
add support for CONSTRUCT QUAD queries [3],
introduce full-text search capabilities [2], and
remove support for the OPTIONAL clause due to its impact on query complexity [13].

These modifications will demonstrate how the modular parser architecture, built around builders, enables safe and modular grammar changes with minimal effort. The focus will be on how individual grammar components can be extended or replaced without touching unrelated parts of the parser. We will also highlight how the use of strong typing improves the developer experience by surfacing integration errors at compile time.

For each of the extensions, we alter the grammar rules in accordance with the SPARQL 1.1 specification [1] (rule numbers shown in parentheses):

ADJUST function: We add an ‘ADJUST’ token to the lexer and add a grammar rule for it, then patch the BuiltInCall (121) rule.
CONSTRUCT QUAD queries: Following Jena’s approach, we patch the ConstructQuery (10) and ConstructTriples (74) rules and introduce a ConstructQuads rule.
Full-text search: We patch the ObjectPath (87) and Object (80) rules to allow an ‘OPTION’ keyword followed by a scoring clause like ‘( score Expression )’.
Dropping OPTIONAL: This involves deleting the OptionalGraphPattern (57) rule, patching the GraphPatternNotTriples (56) rule, and removing the ‘OPTIONAL’ token from the lexer.
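The last step shows why deleting a rule usually goes together with patching the rules that refer to it. The sketch below models only the fragment of the SPARQL 1.1 grammar around GraphPatternNotTriples (56) as a map from rule names to the rule names they reference. The referenced rules are modelled as leaves for brevity, and the helper names (sparql11Fragment, danglingReferences) are hypothetical.

```typescript
// Rule name -> names of the rules it references. Only the fragment around rule 56 is modelled.
function sparql11Fragment(): Map<string, string[]> {
  const alternatives = [
    'GroupOrUnionGraphPattern', 'OptionalGraphPattern', 'MinusGraphPattern',
    'GraphGraphPattern', 'ServiceGraphPattern', 'Filter', 'Bind', 'InlineData',
  ];
  const grammar = new Map<string, string[]>();
  grammar.set('GraphPatternNotTriples', alternatives);
  // Simplification: the referenced rules are treated as leaves without references.
  alternatives.forEach(name => grammar.set(name, []));
  return grammar;
}

// Lists references to rules that are no longer registered.
function danglingReferences(grammar: Map<string, string[]>): string[] {
  const dangling: string[] = [];
  grammar.forEach((references, rule) => {
    references.forEach(reference => {
      if (!grammar.has(reference)) dangling.push(`${rule} -> ${reference}`);
    });
  });
  return dangling;
}

// Deleting OptionalGraphPattern (57) alone leaves rule 56 pointing at a missing rule.
const deletedOnly = sparql11Fragment();
deletedOnly.delete('OptionalGraphPattern');

// Patching GraphPatternNotTriples (56) as well removes the dangling alternative.
const deletedAndPatched = sparql11Fragment();
deletedAndPatched.delete('OptionalGraphPattern');
deletedAndPatched.set(
  'GraphPatternNotTriples',
  (deletedAndPatched.get('GraphPatternNotTriples') || []).filter(name => name !== 'OptionalGraphPattern'),
);
```

A check of this kind is one way a builder could report an incomplete modification when the parser is built rather than when a query is parsed; whether the prototype performs such a check is not claimed here.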

While the demo is not interactive for attendees, all code and tooling will be made available for experimentation after the session. The demo will serve to illustrate how a modular parser builder enables a new generation of language-aware SPARQL tools with modular, declarative grammar support and a strong developer experience.

Conclusion

In this paper, we presented the need for a modular parser, and offered an initial prototype to cover this need. Our prototype uses a builder-based architecture for constructing extensible SPARQL parsers. By embracing runtime modularity, declarative rule definitions, and strong typing, our approach enables a new class of SPARQL tools that can evolve alongside a rapidly diversifying query ecosystem. Through our demonstration, we showed that parser modification can be performed with minimal overhead and high confidence in correctness.

Looking ahead, several important challenges remain.

In order to bootstrap the adoption of the modular parser, a robust, default parser with a well-defined Abstract Syntax Tree (AST) format should be created. This AST should support round-tripping—ensuring that a query parsed into the AST and then regenerated from it yields a string-identical query. This requirement on the AST will facilitate the creation of language tools such as linters.
To support such round-tripping, we will need to design a corresponding generator. This generator could follow architectural patterns established by the Babel JavaScript compiler [14] combined with the builder pattern described in this work.
We envision the need for a flexible AST transformer system that makes it easy to map the AST into alternative representations. Such a transformer will facilitate static analysis, query optimization, and translation to other query languages.
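The round-tripping requirement above can be stated as a simple property: generating from the parsed result must reproduce the input string exactly. The sketch below illustrates the property with a deliberately trivial, flat "AST" that keeps whitespace and comments as trivia nodes. It is an assumption for illustration only: the names are hypothetical, the splitting is not a real SPARQL lexer, and it is not modelled on Babel or on the prototype's AST format.

```typescript
// Hypothetical, flat "AST": every character of the input ends up in exactly one node.
interface RoundTripNode { kind: 'trivia' | 'word'; text: string }

function parseLossless(query: string): RoundTripNode[] {
  // Whitespace runs, `#` comments up to the end of the line, or runs of other characters.
  const pieces: string[] = query.match(/\s+|#[^\n]*|[^\s#]+/g) || [];
  return pieces.map((text): RoundTripNode => ({
    kind: /^(\s|#)/.test(text) ? 'trivia' : 'word',
    text,
  }));
}

// Lossless generation: concatenating all nodes yields the original string.
function generateLossless(nodes: RoundTripNode[]): string {
  return nodes.map(node => node.text).join('');
}

// Lossy generation for contrast: trivia is dropped and words are re-spaced.
function generateCompact(nodes: RoundTripNode[]): string {
  return nodes.filter(node => node.kind === 'word').map(node => node.text).join(' ');
}

const exampleQuery = 'SELECT  *\nWHERE { ?s ?p ?o } # all triples\n';
const exampleNodes = parseLossless(exampleQuery);
```

A linter or formatter can only report and rewrite precise source locations if the AST retains this kind of information, which is why the property is placed on the AST format rather than on the generator alone.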

Together, these next steps would complete a robust pipeline: from parsing, through transformation, to code generation—all powered by modular, declarative components. We hope this work provides a foundation for building SPARQL tools that are not only adaptable to change, but actively enable it.

Acknowledgements. Jitse De Smet is a predoctoral fellow of the Research Foundation – Flanders (FWO) (1SB8525N). Ruben Taelman is a postdoctoral fellow of the Research Foundation – Flanders (FWO) (1202124N).

Declaration on Generative AI

During the preparation of this work, the author(s) used Chat-GPT-4 in order to: paraphrase and reword, improve writing style, perform grammar and spelling checks, and draft the abstract. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.