Thursday, June 1, 2017

World map of language families from Glottolog

World map from Glottolog: each language is one dot, coloured by language family (or other top-level genetic unit).
Language families are the main way we categorise and understand the language diversity of the world. A language family is a group of languages that have been analysed as having one ancestor, one great-great-great-and-yet-greater-grand-mother language. Indo-European is a language family, with the sub-groups of Romance, Germanic, Slavic etc.

Maps are great tools for visualising information, and we're pretty map-nerdy on this blog. Robert Forkel, one of the editors of Glottolog, kindly shared with me a map of the world with languages plotted out and coloured by language family. The map is interactive and is rendered in a web browser from an HTML file and a JSON file.

This map is not available on the Glottolog site, but will later be implemented in the command-line interface. On the website, you can see language families by selecting either a country or a specific family. This tool is the only way to see all language families in all countries at once.

I will let you know when this is implemented and you can play with it yourself. In the meantime, I thought I'd share this screenshot and talk a little bit about language families.


Some notes on language families, and in particular Glottolog language families and this map

When we look at the collected wisdom of linguistic scholars, we actually find a lot of disagreement. For example, Ethnologue counts 135 language families and Glottolog 239!* To read more about this, please go to this post on the "other" languages of Glottolog and Ethnologue, and how the two catalogues define these categories.

Due to lack of data and disagreements, we also have very different estimates for language family depth, i.e. how long ago the greatest-grand-mother language was spoken. Here are some examples:

Language family    Proposed date (years ago)
Afro-Asiatic       9,500-18,000
Algic              7,000
Austronesian       6,000-8,000
Dravidian          6,000
Indo-European      5,500

In this case, we're using the language families (and other top-level genetic units) from Glottolog. Glottolog is a carefully curated catalogue of languages, and for each grouping there is always a reference to where in the academic literature we can find support for exactly how the tree is structured. This is very helpful. That said, it's worth noting that Glottolog tends to be more "splitting" (not lumping languages into very large families) than other similar resources, like Ethnologue. In general, Glottolog often represents a more conservative view of language history.

Glottolog also contains other kinds of groupings besides what we commonly think of as "families", for example: unattested, sign languages, isolates, pidgins, artificial etc. More on this here.

Please remember when you look at these maps that:

  • stacking of dots is not trivial: Nigeria, for example, looks more full of Atlantic-Congo languages than it is (see images below). Zoom in on denser areas
  • the colours on this map were not picked manually, but assigned automatically
  • Creoles are in the family of their lexifier
  • there are other groupings besides traditional language families in the dataset
  • these are dots, not polygons
  • this will be implemented as a command-line tool, so you should get your git and python on in order to make these yourself.
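
To make the "dots, not polygons" and "colours assigned automatically" points a bit more concrete, here is a minimal Python sketch. It is not the code behind Robert's map or the upcoming Glottolog tool. The function names, the record fields, the `colour` property and the sample records are all made up for illustration. It spaces hues evenly across however many families it finds, and writes each language as one GeoJSON point.

```python
import colorsys
import json


def assign_family_colours(families):
    """Give each distinct family a hex colour by spacing hues evenly.

    Nobody picks the colours by hand: they depend only on how many
    families there are and on their alphabetical order.
    """
    unique = sorted(set(families))
    colours = {}
    for i, family in enumerate(unique):
        r, g, b = colorsys.hsv_to_rgb(i / len(unique), 0.8, 0.9)
        colours[family] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colours


def languages_to_geojson(languages):
    """Turn language records into a GeoJSON FeatureCollection of points."""
    colours = assign_family_colours(lang["family"] for lang in languages)
    features = []
    for lang in languages:
        features.append({
            "type": "Feature",
            # One dot per language. GeoJSON wants longitude first.
            "geometry": {
                "type": "Point",
                "coordinates": [lang["longitude"], lang["latitude"]],
            },
            "properties": {
                "name": lang["name"],
                "family": lang["family"],
                "colour": colours[lang["family"]],  # hypothetical property name
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up placeholder records, NOT real Glottolog data.
SAMPLE = [
    {"name": "Language A", "family": "Family X", "latitude": 9.0, "longitude": 8.0},
    {"name": "Language B", "family": "Family X", "latitude": 9.5, "longitude": 8.5},
    {"name": "Language C", "family": "Family Y", "latitude": -5.0, "longitude": 141.0},
]

if __name__ == "__main__":
    print(json.dumps(languages_to_geojson(SAMPLE), indent=2))
```

With hundreds of families, evenly spaced hues end up very close to each other, which is one reason automatically assigned colours can be hard to tell apart on a world map.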

Nigeria in the world map at the top of the post
Nigeria zoomed in
Here are some more zoomed-in areas for your enjoyment:
The island of New Guinea
Mainland South East Asia
Top South America

Language Family Tournament

On a sillier note, the Facebook page Etymology Memes for Reconstructed Phonemes recently ran a tournament where followers could vote for which was their favourite language family from a set of 24. Since this is related to the content of this blog post, I'll share those results as well!
A tournament on Facebook where followers of the page "Etymology Memes for Reconstructed Phonemes" could vote for which was their favourite language family.
The winner of said contest, Basque

Other ways of categorising languages besides language families

There are other ways of categorising languages than into language families, most notably into geographic areas. It seems that languages that are in contact influence each other. Furthermore, it is not necessarily true that all parts of a language (sound system, vocabulary, grammar, syntax, etc) have one and only one shared ancestry: there could be multiple underlying trees for different parts of a language. It may be that the counting system was borrowed from neighbour x and some phonemes imported from neighbour y. Another reason for multiple trees is dialect chains breaking up and coming together again, which is hard to detect given enough time.

Besides these approaches, we can also categorise languages into types (suffixing, tonal, CVCV, VSO, isolating etc). This is what typologists do. Knowing the distribution of various traits in the world's languages, we can not only investigate language history, but also ask questions such as:

  • are certain traits correlated with each other?
  • are there trade-offs between traits, for example to minimize complexity?
  • are there cognitive constraints on combination of traits?
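
As a toy illustration of the first question, here is a small Python sketch that measures how strongly two yes/no traits go together in a sample of languages, using the phi coefficient of a 2x2 table. The function name and the example trait values are made up for illustration and are not real typological data. Keep in mind that related or neighbouring languages are not independent data points, so a raw number like this is only a starting point.

```python
from math import sqrt


def phi_coefficient(pairs):
    """Phi coefficient for two binary traits.

    `pairs` holds one (has_trait_1, has_trait_2) tuple per language.
    Returns a value between -1 and 1, or None when one of the traits
    never varies in the sample (the coefficient is then undefined).
    """
    pairs = list(pairs)
    n11 = sum(1 for a, b in pairs if a and b)          # both traits
    n10 = sum(1 for a, b in pairs if a and not b)      # only trait 1
    n01 = sum(1 for a, b in pairs if not a and b)      # only trait 2
    n00 = sum(1 for a, b in pairs if not a and not b)  # neither trait
    denominator = sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denominator == 0:
        return None
    return (n11 * n00 - n10 * n01) / denominator


# Made-up sample: (is_tonal, is_isolating) for six hypothetical languages.
TOY_SAMPLE = [
    (True, True),
    (True, True),
    (True, False),
    (False, False),
    (False, False),
    (False, True),
]

if __name__ == "__main__":
    print(phi_coefficient(TOY_SAMPLE))
```

A value near 1 means the two traits tend to occur together, a value near -1 means they tend to exclude each other, and a value near 0 means no clear association in the sample.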

Ok, that's it for now. Hope you enjoyed this!



* In order to make a fair comparison, I've excluded some special cases that the two catalogues deal with in very different ways or that we have very little data on. For Ethnologue, I've excluded: constructed languages (1), creoles (88), deaf sign languages (137), language isolates, mixed languages (21), pidgins (13), and unclassified languages (51). For Glottolog, I've excluded: pidgins (79), isolates (198), mixed languages (23), artificial (9), speech registers (6), “unattested” (61), “unclassifiable” (117) and sign languages (166). Creoles in Glottolog are classified under their lexifier family, making them hard to count, but they don’t increase the number of families. There are 37 languages with "creole" or "kriol" in their name in Glottolog, but I didn't subtract these since they belong to families that also contain non-contact languages.
