ELICO: A diachronic database of linguistic evidence
- The nature and role of the ELICO database
- Specific properties
The ELICO project addressed the general question of language change in the specific area of determiners, focusing on the evolution of French determiners from the 13th to the 18th century. To this end, ELICO has assembled a database of linguistic evidence built on a diachronic corpus of literary texts. The database consists of two collections of records: linguistic records, which contain information about slightly more than 20,000 occurrences of determiners in French, each presented with a citation context, and textual records, which contain information about the excerpts from which the occurrences are taken. It is the context that has been extensively annotated with linguistic information, not so much the determiners themselves. For instance, overt information is provided on whether a given determiner occurs within a question, or in a sentence containing a modal verb, or in a noun phrase headed by a mass noun.
The main originality of the ELICO database lies in its content: the annotation covers the determiners and also their linguistic and textual contexts. This type of annotation allows occurrences to be classified by features, which gives them the status of sets of linguistic observations. Moreover, these observations can be linked to text types. For instance, queries sent to ELICO can help the user determine not only whether a given type of text contains significantly more or fewer occurrences of tout, but also whether it contains significantly more or fewer occurrences of tout in certain environments, such as interrogative or negative sentences. It is thus possible to study the evolution of determiners through their specific uses in specific types of text, extending the traditional approach based primarily on frequency in texts, or on the first occurrence or disappearance of a unit. Scholars can also evaluate whether ignoring text types undermines the validity of a linguistic hypothesis. Because of the way the database codes information on the context, it does not impose a specific analysis. It can therefore support advanced research carried out within different theoretical frameworks and can be used as a test bench for linguistic hypotheses.
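The kind of comparison described here, occurrences of a determiner broken down by text type and by environment, can be illustrated with a minimal sketch. The records, field names (`lemma`, `text_type`, `sentence_type`) and values below are invented for illustration and are assumptions, not the actual ELICO schema or data.

```python
from collections import Counter

# Hypothetical, hand-made records: field names and values are
# illustrative assumptions, not the actual ELICO schema or data.
records = [
    {"lemma": "tout", "text_type": "narrative", "sentence_type": "declarative"},
    {"lemma": "tout", "text_type": "narrative", "sentence_type": "interrogative"},
    {"lemma": "tout", "text_type": "legal", "sentence_type": "declarative"},
    {"lemma": "tout", "text_type": "legal", "sentence_type": "declarative"},
    {"lemma": "aucun", "text_type": "legal", "sentence_type": "negative"},
]


def cross_tabulate(records, lemma):
    """Count occurrences of a lemma per (text type, sentence type) pair."""
    counts = Counter()
    for rec in records:
        if rec["lemma"] == lemma:
            counts[(rec["text_type"], rec["sentence_type"])] += 1
    return counts


tout_counts = cross_tabulate(records, "tout")
for (text_type, sentence_type), n in sorted(tout_counts.items()):
    print(f"{text_type:10} {sentence_type:14} {n}")
```

Raw counts of this kind would still have to be normalised by the size of each text type and submitted to a statistical test before one could speak of significantly more or fewer occurrences.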
Rallying the semantic descriptive tradition in linguistics to a position more characteristic of the formal tradition, the ELICO project has assumed that the term 'determiner' in general applies to simple articles and demonstratives, e.g. le (the, masc.), la (the, fem.), un (a, masc.), cet (this), to units such as tout (all), certains (certain), plusieurs (several), etc., and to complex forms like beaucoup de (much/many), un quelconque (any), un certain (a certain), tous les (all the, pl.), n'importe quel (whichever), etc. The study of pronouns, namely of units that can replace an NP (e.g. il (he), elle (she), celui-ci (this one), le sien (hers), etc.), has been included only occasionally, for forms that have a double use.
The initial corpus used to build the ELICO database covers six centuries (13th to 18th) and is composed of excerpts from texts of various genres, giving a total of slightly more than one million words. The texts have been organised into temporal slices of 50 years each, with no claim of covering the periods in a statistically significant way. The initial 361 texts, in prose, verse and mixed form, are categorised into eight types (dialectical texts, letters, didactic texts, legal texts, narrative texts, poetry, proverbs, theatre). From these texts, 435 excerpts of three thousand words each were produced, taking care to vary the point of extraction from the text when the excerpts were smaller than complete texts.
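The organisation into 50-year slices can be sketched as a simple mapping from a date to a slice. The slice boundaries are an assumption: the sketch aligns the slices on the year 1200 and ends them at 1799, whereas the actual boundaries used by ELICO are not stated here. The function name `time_slice` is likewise hypothetical.

```python
def time_slice(year, start=1200, end=1800, width=50):
    """Return (first_year, last_year) of the slice containing `year`.

    Assumption: slices are aligned on `start` and cover [start, end);
    the real ELICO slice boundaries may differ.
    """
    if not start <= year < end:
        raise ValueError(f"{year} is outside the period {start}-{end - 1}")
    first = start + ((year - start) // width) * width
    return (first, first + width - 1)


# Enumerate all slices over the assumed period.
all_slices = sorted({time_slice(y) for y in range(1200, 1800)})
print(len(all_slices), all_slices[0], all_slices[-1])
```

Under these assumptions, six centuries yield twelve slices.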
The excerpts are associated with a file that states their textual categorisation and provides information on the author and the opus.
The various forms each determiner may have had in the six centuries under examination, including typographical variants, and the set of its inflected forms, together constitute its 'manifestations' in diachrony and have been targeted by the annotation. For each determiner, these forms were collected in a list associated with the contemporary masculine singular form, which can be used as a lemma by the query interface. The set of determiners selected for extensive annotation corresponds to the following lemmas: 'aucun', 'chacun', 'ledit', 'le moindre', 'maint', 'moult', 'plusieurs', 'quelque' and 'tout'. Annotating a determiner in the ELICO database consists of associating all its occurrences, in their various morphological manifestations in each text of the corpus, with an overt representation of specific syntactic and semantic information pertaining to the context, organised in the form of a set of features, i.e. as attribute-value pairs. The information provided covers grammatical features, such as the number and gender of the noun the determiner combines with; properties with a strong semantic impact, such as whether the noun is abstract (i.e. it names an event, an action, a feeling or a quality) or concrete, whether it is countable or uncountable, and whether it is modified; and features relative to the context, such as the grammatical status in the clause of the NP that hosts the determiner, and properties of the verb phrase in the local and main clauses.
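As an illustration of this organisation, a linguistic record can be pictured as an occurrence paired with a set of attribute-value pairs, with the lemma recovered from a list of forms. The sketch below is a minimal model under stated assumptions: the class `Occurrence`, the attribute names, the citation context and the (incomplete) list of forms are all hypothetical and do not reproduce the actual ELICO records.

```python
from dataclasses import dataclass, field

# Hypothetical, incomplete list of manifestations for one lemma.
FORMS_BY_LEMMA = {
    "tout": ["tout", "toute", "tous", "toutes", "tuit", "toz"],
}

# Reverse index: form -> lemma.
LEMMA_BY_FORM = {
    form: lemma
    for lemma, forms in FORMS_BY_LEMMA.items()
    for form in forms
}


@dataclass
class Occurrence:
    form: str      # the form as it appears in the excerpt
    context: str   # citation context (invented here, not a corpus citation)
    features: dict = field(default_factory=dict)  # attribute-value pairs

    @property
    def lemma(self):
        """Contemporary masculine singular form, or None if unknown."""
        return LEMMA_BY_FORM.get(self.form.lower())


occ = Occurrence(
    form="toutes",
    context="toutes les choses",
    features={
        "noun_number": "plural",
        "noun_gender": "feminine",
        "noun_abstract": False,
        "noun_countable": True,
        "noun_modified": False,
        "np_function": "subject",
    },
)
print(occ.lemma, occ.features["noun_number"])
```

Keeping the features in a plain dictionary mirrors the attribute-value format and leaves the interpretation of each value to the user.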
The information contained in the linguistic records is the explicit representation of implicit information that can provide evidence relevant to linguistic analysis. The database can be searched using features of the textual or linguistic collections of annotations, or a combination of both, as search criteria. The use of a single query form for searching the database enables the user to vary the criteria at any time and offers the possibility to make one set of features dependent on the other. For example, it is possible to formulate a query about a specific form, such as toutes, the feminine plural form of the universal quantifier, or about all the occurrences of toutes in a particular type of text or time slice, or to collect all the forms of a determiner across time by querying the whole group of forms, for instance by using the lemma 'tout'. It is also possible to formulate more complex queries by combining several criteria. The free-text field of comments present in each linguistic record, where annotators have stored information not coded by the features, can also be exploited in the search. Cases where the universal quantifier tout is floated, for instance, are noted here, but not in an exhaustive way.
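The combination of textual and linguistic criteria, together with a search in the comment field, can be sketched as a simple filter. This is not the ELICO query interface: the function `search`, the record layout and all field names and values are hypothetical, and the sample records are invented.

```python
# Invented sample records; the layout and values are assumptions.
records = [
    {"id": 1, "text": {"type": "narrative", "slice": "1300-1349"},
     "features": {"form": "toutes", "lemma": "tout",
                  "sentence_type": "interrogative"},
     "comment": ""},
    {"id": 2, "text": {"type": "legal", "slice": "1300-1349"},
     "features": {"form": "tous", "lemma": "tout",
                  "sentence_type": "declarative"},
     "comment": "floated quantifier"},
    {"id": 3, "text": {"type": "narrative", "slice": "1650-1699"},
     "features": {"form": "toutes", "lemma": "tout",
                  "sentence_type": "declarative"},
     "comment": ""},
    {"id": 4, "text": {"type": "narrative", "slice": "1650-1699"},
     "features": {"form": "aucun", "lemma": "aucun",
                  "sentence_type": "negative"},
     "comment": ""},
]


def search(records, textual=None, linguistic=None, comment_contains=None):
    """Return the records that satisfy every given criterion."""
    textual = textual or {}
    linguistic = linguistic or {}
    hits = []
    for rec in records:
        # Textual criteria: properties of the excerpt.
        if any(rec["text"].get(k) != v for k, v in textual.items()):
            continue
        # Linguistic criteria: features of the occurrence and its context.
        if any(rec["features"].get(k) != v for k, v in linguistic.items()):
            continue
        # Free-text comment field, matched case-insensitively.
        if comment_contains and \
                comment_contains.lower() not in rec.get("comment", "").lower():
            continue
        hits.append(rec)
    return hits


def ids(hits):
    return [rec["id"] for rec in hits]


by_form = ids(search(records, linguistic={"form": "toutes"}))
by_lemma = ids(search(records, linguistic={"lemma": "tout"}))
combined = ids(search(records,
                      textual={"type": "narrative"},
                      linguistic={"form": "toutes",
                                  "sentence_type": "interrogative"}))
floated = ids(search(records, comment_contains="floated"))
print(by_form, by_lemma, combined, floated)
```

Querying by form returns only that form, whereas querying by lemma collects all the forms grouped under it; adding a textual criterion restricts either query to a type of text or a time slice.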
The database has been annotated manually and part of the annotation protocol is available as on-line documentation about all the features of the linguistic record and the types of texts, directly accessible from the page with the query form. Each feature is presented with examples of its use.