CLDF for dummies (v1.0)

 

I wrote a little document called "CLDF for dummies", based on what I know about CLDF, that I think may be helpful to other researchers working on language and cultural diversity and evolution. I am NOT a CLDF-developer or editor; this is all from an end-user perspective.

I'll keep a full and updated version here. Here is version 1.0:

CLDF for dummies

This document outlines some of the very basics of the Cross-Linguistic Data Formats (CLDF) for researchers who want to use the data sets for analysis, comparison or plotting. CLDF is a way of organizing language data, in particular data sets that cover many different languages. The basic organisation is a set of tables, usually csv-sheets (languages.csv, forms.csv etc). These files are linked to each other in a specific way, which makes it possible to combine them into an interlinked database. The files are all governed by standards, and there are sanity-checks to make sure everything lines up right. Because they are often just plain csv-sheets, they can easily be read by most data analysis tools like Python, R and Julia, or by regular spreadsheet programs like LibreOffice or Microsoft Excel. It is not necessary to use FileMaker Pro, Microsoft Access or similar programs.

It’s plain, flat and simpler than you might think. In this document, you will learn the very basics on how it works.

The data format was first published in 2018 [1] and has since expanded to include a large number of different data sets.

CLDF is well-documented. This document is a very basic intro; for more advanced queries go to https://github.com/cldf/cldf/#readme and https://cldf.clld.org/

Before we start

Good things to keep in mind:

How to know if you’re dealing with a CLDF-dataset

You are dealing with a CLDF-dataset if there is a file with the extension “json” that identifies a CLDF-dataset type near the top. For example, it could contain "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset". (There is one exception, see “Good to know” below.)

Typically, there is a folder called “cldf” with files like “languages.csv”, “values.csv” and “StructureDataset-metadata.json” in it. The last file will be different depending on the type of data set.
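The check described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard json module: the function name cldf_module and the tiny example string are made up for this sketch, and a real metadata file contains much more than this.

```python
import json

# Prefix shared by all CLDF v1.0 terms, as seen in the example above.
CLDF_TERMS = "http://cldf.clld.org/v1.0/terms.rdf#"

def cldf_module(metadata_text):
    """Return the CLDF module name (e.g. 'StructureDataset'), or None."""
    metadata = json.loads(metadata_text)
    conforms_to = metadata.get("dc:conformsTo", "")
    if isinstance(conforms_to, str) and conforms_to.startswith(CLDF_TERMS):
        return conforms_to[len(CLDF_TERMS):]
    return None

# Hypothetical, heavily shortened metadata file content.
example = '{"dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset", "tables": []}'
print(cldf_module(example))
```

In practice you would read the text from the json file in the “cldf” folder instead of using a hard-coded string.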

Here are some examples of data sets that are available in CLDF that you may have encountered:

  • WALS (World Atlas of Language Structures)

  • PHOIBLE (Phonetics Information Base and Lexicon)

  • D-PLACE (Database of Places, Language, Culture and Environment)

  • Glottolog

  • Lexibank

  • Grambank

Good to know: It is possible for a CLDF-dataset to only consist of one file. No json, no set of csvs. Just one file, for example values.csv. In such cases, the file doesn’t have any meta-data specified and just conforms to all the default settings. You can’t tell by a json that it’s a CLDF-dataset because there isn’t one. This type of CLDF-data set is rare, and will not be dealt with further here.

Types of CLDF-datasets

There are five types of CLDF-datasets. They are also known as “modules”.

  • Wordlist (lexicon, has Forms and often Cognates)
  • Structure dataset (grammar or other types of information with one value for a Language and a Parameter, i.e. a feature; has Values)
  • Dictionary (particular kind of lexicon, has Entries and Senses)
  • Parallel text (collections of paragraphs of the same text in different languages, has Forms, Segments and FunctionalEquivalents)
  • generic (no specifics)

Contents

Each CLDF-dataset (except the metadata-free ones) consists minimally of:

  • a set of tables (usually in csv-sheets)
  • a json-file

The tables are usually in csv-format and contain the data itself. The json file has information about the dataset, for example what the type of dataset is, what the contents are, what the filenames are etc.

Many CLDF-datasets also contain a bibTeX-file with bibliographic references for the data. In such cases, each data-point is tied to a reference by the key in the bibTeX entry. Usually the key is in a column called “Source” in the ValueTable or FormTable. The bibTeX file is usually called “sources.bib”. If it’s called something else, it’ll say so in the meta-data json file.

Tables inside the datasets

There are some tables that occur in most CLDF-datasets, and some that occur only in certain types. For example, there is no table with word forms for Structure data sets - that’s for wordlists and Dictionaries.

The tables have specific names in the CLDF-world and have pre-defined specifics. The names are different from their filenames. You can see which name is tied to which file in the json. “LanguageTable” is usually found in the file languages.csv, “CodeTable” in codes.csv, “ValueTable” in values.csv, “CognateTable” in cognates.csv etc.

  • LanguageTable -> languages.csv (contains minimally ID)
  • FormTable -> forms.csv (contains minimally ID, Form, Language_ID, Parameter_ID)
  • ParameterTable -> parameters.csv (contains minimally ID, Name) etc.

The json meta-data file says which table is in which file. Each table is listed with its url (the file name) and the CLDF standard it conforms to, for example LanguageTable. You can’t always bank on LanguageTable being in languages.csv. pycldf and rcldf can handle this for you, i.e. look up in the json which table is where and set all that up.
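To make the table lookup concrete, here is a small Python sketch that reads a shortened, made-up metadata file and builds a mapping from CLDF table name to file name. The file name "lects.csv" is invented to show that the LanguageTable does not have to live in languages.csv. The function table_files is just an illustration, not part of pycldf or rcldf.

```python
import json

# Hypothetical, heavily shortened metadata for a Wordlist.
metadata_text = """
{
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Wordlist",
  "tables": [
    {"url": "lects.csv",
     "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#LanguageTable"},
    {"url": "forms.csv",
     "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#FormTable"}
  ]
}
"""

def table_files(metadata):
    """Map CLDF table names (e.g. 'LanguageTable') to their file names."""
    files = {}
    for table in metadata.get("tables", []):
        component = table.get("dc:conformsTo", "")
        # Tables without a CLDF standard (custom tables) are skipped here.
        if "#" in component:
            files[component.split("#", 1)[1]] = table["url"]
    return files

files = table_files(json.loads(metadata_text))
print(files["LanguageTable"])
```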

Each table is usually tied to several pre-defined CLDF standards for the content. For example, FormTables need to have the columns “ID”, “Form”, “Language_ID” and “Parameter_ID”, and these in turn need to look a certain way.

Tables can have more columns than the minimal requirement and can have columns that don’t map onto CLDF-standards at all.

For more specifics, see this file for CLDF v1.0 http://cldf.clld.org/v1.0/terms.rdf.

Tables in most CLDF-datasets

Here are CLDF-tables that occur in most CLDF-datasets.

  • LanguageTable - list of all of the languages in the dataset. May also include things classified by Glottolog as dialects or proto-languages. Includes meta-information like longitude, language family etc.
  • ParameterTable - contains a definition of the variables. For lexicon, these are the concepts, for grammar these are the features.

Wordlists also contain

  • FormTable - the forms for each concept for each language
  • CognateTable (not obligatory) - the cognate classification per form per concept per language

Structure data-sets also contain

  • ValueTable - the value for each parameter and language. Usually also Comment and Source.
  • CodeTable - The list of possible values for each parameter. For example, GB020 in Grambank is a binary feature and can take 0, 1 and ? whereas EA016 in the Ethnographic Atlas (D-PLACE) can take 1, 2 or 9. The options are exclusive of each other for each data-point.
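As a rough illustration of how ValueTable and CodeTable fit together, the Python sketch below checks that every value points to a code that belongs to the same parameter. It assumes the default CLDF column names (ID, Parameter_ID, Code_ID). The rows, the language IDs and the code IDs like "GB020-1" are made up for this example and are not copied from Grambank.

```python
# Hypothetical CodeTable rows for one binary feature.
codes = [
    {"ID": "GB020-0", "Parameter_ID": "GB020", "Name": "0"},
    {"ID": "GB020-1", "Parameter_ID": "GB020", "Name": "1"},
    {"ID": "GB020-?", "Parameter_ID": "GB020", "Name": "?"},
]

# Hypothetical ValueTable rows; the second one uses a code that does not exist.
values = [
    {"ID": "v1", "Language_ID": "lang_a", "Parameter_ID": "GB020",
     "Value": "1", "Code_ID": "GB020-1"},
    {"ID": "v2", "Language_ID": "lang_b", "Parameter_ID": "GB020",
     "Value": "2", "Code_ID": "GB020-2"},
]

# Which parameter does each code belong to?
code_parameter = {code["ID"]: code["Parameter_ID"] for code in codes}

def invalid_values(values, code_parameter):
    """Return the IDs of values whose code is unknown or from another parameter."""
    bad = []
    for row in values:
        if code_parameter.get(row["Code_ID"]) != row["Parameter_ID"]:
            bad.append(row["ID"])
    return bad

print(invalid_values(values, code_parameter))
```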

Good to know: for the CLDF-dataset of D-PLACE, the LanguageTable contains a row per society. There is a column for the Glottocode of the language associated with that society.

Columns in tables

Each table consists of a set of columns. These columns are often named, for example, "ID", "Longitude", "Value" etc. However, the names can vary. The meta-data contains information on which column name maps onto which property in the CLDF-universe. For example, there is the property "source", which has the propertyURL http://cldf.clld.org/v1.0/terms.rdf#source and is often mapped onto a column called "Source". However, if one CLDF-creator wanted to name this column "Reference" instead, that's all well and good. The json-metadata-file would tell the users that the column "Reference" corresponds to the standardised property "source" and point to the property-url. As with filenames of tables, you can often get by with assuming that bibliographic references are in a column called "Source" and the LanguageTable is in languages.csv --- but this needn't always be true! All glory to the json-metadata file.
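Here is a small Python sketch of that lookup, using the "Reference" example from above. The table description is a made-up, shortened fragment of a metadata file (the "Coder" column is invented to show a column with no CLDF property), and column_for is a hypothetical helper, not something from pycldf.

```python
import json

# Hypothetical, shortened description of one table from a metadata file.
table_text = """
{
  "url": "values.csv",
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ValueTable",
  "tableSchema": {
    "columns": [
      {"name": "ID", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id"},
      {"name": "Reference", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#source"},
      {"name": "Coder"}
    ]
  }
}
"""

def column_for(table, prop):
    """Return the column name mapped onto a CLDF property, or None."""
    target = "http://cldf.clld.org/v1.0/terms.rdf#" + prop
    for column in table["tableSchema"]["columns"]:
        # Columns without a propertyUrl do not map onto a CLDF standard.
        if column.get("propertyUrl") == target:
            return column["name"]
    return None

table = json.loads(table_text)
print(column_for(table, "source"))
```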

Example: Wordlist

Below is a tiny Wordlist CLDF-dataset. This dataset contains 3 words in 2 languages. The first two tables, LanguageTable and ParameterTable, contain information about the languages and parameters - in this case concepts. The FormTable contains the actual forms. For one of the concepts, one of the languages has two words and both are listed.

The meta-data json is not included here. You can see an example of a Wordlist-metadata json file here: https://github.com/lexibank/abvd/blob/master/cldf/cldf-metadata.json.

LanguageTable

One row = one language (or sometimes a dialect or proto-language, i.e. a node above the languages in a tree). The ID column uniquely identifies each language in the dataset. In other tables, the column that links to the ID column here is called “Language_ID”.

ID  Name      Glottocode
15  Bintulu   bint1246
18  Chamorro  cham1312

Good to know: Sometimes the IDs in LanguageTable are Glottocodes or ISO 639-3 codes, but they don’t have to be. They just have to be unique within that dataset. In Grambank, the IDs are Glottocodes, but WALS has its own unique code-system, different from both Glottocodes and ISO 639-3. If you want Glottocodes, go look for a column called Glottocode in the LanguageTable - don’t use the ID column.

Good to know 2: Glottocodes contain 4 letters or numbers and then 4 numbers. The first 4 characters are not always letters. For example, ww2p1234 and 3adt1234 are existing glottocodes.
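If you want to check whether a string at least has the shape of a Glottocode, a regular expression following the description above could look like this Python sketch. It assumes the letters are lowercase, and it only checks the shape, not whether the code actually exists in Glottolog.

```python
import re

# 4 lowercase letters or digits, followed by 4 digits.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def looks_like_glottocode(text):
    """True if the whole string has the shape of a Glottocode."""
    return GLOTTOCODE.fullmatch(text) is not None

for candidate in ["bint1246", "3adt1234", "Bintulu", "15"]:
    print(candidate, looks_like_glottocode(candidate))
```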

ParameterTable

One row = one parameter. The ID column uniquely identifies each parameter in the dataset. In other tables, the column that links to the ID column here is called “Parameter_ID”.

ID          Name     Concepticon_ID
144_toburn  to burn  2102
2_left      left     244

FormTable

One row = one form. The ID column uniquely identifies each form in the dataset. In other tables, the column that links to the ID column here is called “Form_ID”. Here we also see Parameter_ID, which links to the column ID in the ParameterTable and Language_ID which links to the column ID in the LanguageTable.

ID               Parameter_ID  Language_ID  Form    Source
15-144_toburn-1  144_toburn    15           pegew   Blust-15-2005
15-144_toburn-2  144_toburn    15           tinew   Blust-15-2005
18-2_left-1      2_left        18           akague  38174

The ID column here is a combination of the Language_ID, the Parameter_ID and a final number that distinguishes the forms if there is more than one. For example, because Bintulu has two words for “to burn”, there are two rows with different Forms but the same Parameter_ID (they both mean “to burn”). If there is only one form, the ID ends with “-1”, but as you can see, for “to burn” there is first an ID ending in “-1” and then one ending in “-2”.

Source

Optional file, but often present in the form of a bibTeX-file. One entry = one source. The bibTeX file is usually called “sources.bib”, but this is not required (check the metadata json as usual). The bibTeX key (the first string after @BIBTEXENTRYTYPE{) maps onto the Source column in the FormTable above.

@misc{Blust-15-2005,
    author = {Blust},
    date = {2005},
    howpublished = {personal communication}
}

@book{38174,
    author = {Topping, Donald M. and Ogo, Pedro M. and Dungca, Bernadita C.},
    address = {Honolulu},
    publisher = {The University Press of Hawaii},
    title = {Chamorro-English dictionary},
    year = {1975}
}
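To show how the Source column and the bibTeX keys meet, here is a rough Python sketch. It pulls the keys out of the bibTeX text with a simple regular expression (fine for tidy entries like these, but not a real bibTeX parser). It also assumes that a Source cell can hold several keys separated by ";" and that page information may follow a key in square brackets; as far as I know this is the CLDF convention, but check your dataset. The cell "38174[12-15];Blust-15-2005" is made up.

```python
import re

bibtex = """
@misc{Blust-15-2005,
    author = {Blust},
    date = {2005},
    howpublished = {personal communication}
}

@book{38174,
    author = {Topping, Donald M. and Ogo, Pedro M. and Dungca, Bernadita C.},
    title = {Chamorro-English dictionary},
    year = {1975}
}
"""

# Capture the entry type and the key, e.g. ("misc", "Blust-15-2005").
ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
entries = {key: entry_type for entry_type, key in ENTRY.findall(bibtex)}

def source_keys(source_cell):
    """Split a Source cell into bibTeX keys, dropping page info in [brackets]."""
    found = []
    for ref in source_cell.split(";"):
        ref = ref.strip()
        if ref:
            found.append(ref.split("[", 1)[0])
    return found

for key in source_keys("38174[12-15];Blust-15-2005"):
    print(key, entries.get(key))
```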

Example: Wordlist - linking together

Each of the tables has a column called “ID”. This column allows us to link the tables together. The column “Language_ID” in the FormTable maps onto the column “ID” in the LanguageTable, and so on.

  • Language_ID -> ID column in LanguageTable
  • Parameter_ID -> ID column in ParameterTable
  • Form_ID -> ID column in FormTable.

There is no column “Form_ID” inside the FormTable, it’s just called ID there. Same with Parameter_ID and the ParameterTable and so on.

WARNING Some LanguageTables contain a column called “Language_ID” which is not the same as the ID column. For dialects, this column contains the Glottocode of the language that they are a dialect of. For example, Eastern Low Navarrese is a dialect of Basque. The Glottocode of this dialect is east1470. The Glottocode of the language Basque is basq1248. If a LanguageTable has the column Language_ID, it would contain basq1248 for the dialect. This helps when you want to match at the language-level rather than the dialect-level. The LanguageTable in Glottolog contains a column of this kind called “Language_ID”. In Grambank, there is a similar column, but it is called “Language_level_ID”.
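A small Python sketch of matching at the language-level, using the Basque example above. The two rows are written by hand for illustration. I am assuming that the column may be empty for rows that already are languages, so the function falls back to the Glottocode column in that case; it also works if the column simply repeats the language's own Glottocode.

```python
def language_level_glottocode(row, column="Language_ID"):
    """Return the language-level Glottocode for a LanguageTable row."""
    # Use the language-level column if it is filled in, else the row's own Glottocode.
    return row.get(column) or row["Glottocode"]

# Hand-written rows based on the example in the text.
dialect = {"ID": "east1470", "Name": "Eastern Low Navarrese",
           "Glottocode": "east1470", "Language_ID": "basq1248"}
language = {"ID": "basq1248", "Name": "Basque",
            "Glottocode": "basq1248", "Language_ID": ""}

print(language_level_glottocode(dialect))
print(language_level_glottocode(language))
```

For a table that uses the other column name, you would pass column="Language_level_ID".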

With the above information, we can now combine the tables if we want. For example, we can rename the ID column in each of the tables to “Language_ID”, “Parameter_ID” and “Form_ID” and then join them together into one new table. In the example below, not all columns are shown due to space. Note that both ParameterTable and LanguageTable contain the column “Name”, so these columns would have to be dropped or otherwise handled (for example renamed to “Parameter_name” and “Language_name”), otherwise the joining would not work correctly.

Form_ID          Parameter_ID  Language_ID  Form    Source         Glottocode  Concepticon_ID
15-144_toburn-1  144_toburn    15           pegew   Blust-15-2005  bint1246    2102
15-144_toburn-2  144_toburn    15           tinew   Blust-15-2005  bint1246    2102
18-2_left-1      2_left        18           akague  38174          cham1312    244
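The join described above can be sketched in plain Python with the standard csv module. The three csv strings are the tiny example tables from this document typed out as csv. In real life you would read the files named in the metadata json, and you might prefer pandas, R or pycldf/rcldf for the joining.

```python
import csv
import io

languages_csv = """ID,Name,Glottocode
15,Bintulu,bint1246
18,Chamorro,cham1312
"""
parameters_csv = """ID,Name,Concepticon_ID
144_toburn,to burn,2102
2_left,left,244
"""
forms_csv = """ID,Parameter_ID,Language_ID,Form,Source
15-144_toburn-1,144_toburn,15,pegew,Blust-15-2005
15-144_toburn-2,144_toburn,15,tinew,Blust-15-2005
18-2_left-1,2_left,18,akague,38174
"""

def read_rows(text):
    """Read csv text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

# Index languages and parameters by their ID column for quick lookup.
languages = {row["ID"]: row for row in read_rows(languages_csv)}
parameters = {row["ID"]: row for row in read_rows(parameters_csv)}

joined = []
for form in read_rows(forms_csv):
    language = languages[form["Language_ID"]]
    parameter = parameters[form["Parameter_ID"]]
    joined.append({
        "Form_ID": form["ID"],
        "Parameter_ID": form["Parameter_ID"],
        "Language_ID": form["Language_ID"],
        "Form": form["Form"],
        "Source": form["Source"],
        # The two Name columns are renamed so they do not clash.
        "Language_name": language["Name"],
        "Glottocode": language["Glottocode"],
        "Parameter_name": parameter["Name"],
        "Concepticon_ID": parameter["Concepticon_ID"],
    })

for row in joined:
    print(row["Form_ID"], row["Form"], row["Glottocode"], row["Concepticon_ID"])
```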

CLLD and CLDF

CLDF is a type of data-format, the set of tables etc. CLLD is a larger project and stands for Cross-Linguistic Linked Data. CLDF is a part of CLLD. CLLD also does web applications, for example https://clics.clld.org/. CLDF data interfaces smoothly with CLLD web applications.

Advanced

This document is only a very basic intro. If you want to learn more, go to: https://github.com/cldf/cldf/#readme.

References

[1] Forkel, R., List, J. M., Greenhill, S. J., Rzymski, C., Bank, S., Cysouw, M., Hammarström, H., Haspelmath, M., Kaiping, G. A. and Gray, R. D. (2018). Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data, 5(1), 1-10.
