My ELAN workflow for segmenting and transcription

Hello everyone,

Hedvig here. I'm currently writing up my PhD thesis, hence the lack of writing here. Hopefully I'll be able to pick it up after submission; there are a lot of drafts lying around on Blogger waiting for completion. If you really, really miss me in particular, you could listen to me ramble at Talk the Talk, a weekly show about linguistics.

Now that the shameless plug and excuses are done with, let's get down to it and talk about the transcription challenge.


In this blog post, I will focus on one part of this challenge¹: the workflow for segmenting and transcribing audio material. This is a rough guide and a bit sloppily written in places. If it turns out people appreciate something like this, I'll rewrite it more thoroughly. But trust me, if I do this "properly" right now I will lose days of work time that I should be spending on my thesis. So I'll only do it if people really want it, and I might wait a while before I do. Sorry, but it is what it is.

Anyone who has done fieldwork that involves interviews, be they video or audio, will know how time-consuming it can be to segment and transcribe data.

"Estimates of the factor involved here vary, depending on recording quality, the number of speakers involved, etc. Factors smaller than 10 (i.e. ten minutes are necessary to transcribe and translate one minute of recording) are rarely mentioned, and factors as high as 150 and higher are not unrealistic in the case of complex multiparty conversations." (Himmelmann 2018:34)

That's a lot of time, and oftentimes there is no way around it, in particular if you're dealing with a language that has been little described.
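To get a feel for what those factors mean in practice, here is a tiny back-of-the-envelope calculation in Python. The 45-minute recording is a made-up example, and the function name is just for illustration.

```python
def transcription_hours(recording_minutes: float, factor: float) -> float:
    """Estimated working hours: each recorded minute costs `factor` minutes of work."""
    return recording_minutes * factor / 60


# A hypothetical 45-minute interview at the low and high ends of the quoted range.
for factor in (10, 150):
    print(f"factor {factor}: about {transcription_hours(45, factor):.1f} hours")
```

For that one recording, the estimate runs from 7.5 hours at factor 10 to 112.5 hours at factor 150.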

This challenge isn't only relevant to linguists; it also pertains to anthropologists, historians, journalists and others who need transcription. Journalists and historians often interview people in major languages like English or Spanish, and for those there's a tonne of software out there for automatic transcription. There's so much that Adobe has even developed what they call "Photoshop for Audio" alongside their transcription services.

There even exist initiatives to bring this kind of automatic transcription technology to smaller languages. Check out the Transcription Acceleration Project and their tool Elpis. But even Elpis needs to start with some manually transcribed audio as training data. So, how do we get nicely transcribed data in a timely fashion?

Most linguists who do fieldwork start out using ELAN for transcription. ELAN is free software from The Language Archive that's fairly easy to use and provides a large number of functions for segmenting and transcribing your data, both audio and video. ELAN is great, don't get me wrong, and this guide is based on using ELAN. However, the program has a lot of different options and people use it very differently. This can be overwhelming for beginners, and it can be difficult to figure out how to optimise it for what you need to do.

Different linguists often develop their own "ELAN-style", and since the workflow (and often also the transcription data itself) isn't shared with people outside of your project, there is little dissemination of these different ELAN-styles. Some people have even described learning ELAN as an apprentice-type system, where you learn the ins and outs by working for someone else before you start on your own data. If you're attending a linguistic fieldwork class that teaches ELAN, you'll probably be introduced either to your instructor's personal ELAN-style or to one of the styles that TLA suggests in their manuals. That can be great, and if it's working well for you, awesome! However, it may be that there is some fat to trim off your current ELAN-workflow. I'll share a basic outline of my workflow here, and perhaps you'll find some trick that can improve your workflow too!

My ELAN workflow
Main take-away: you don't need to segment by hand, and you don't need to listen through the recording several times, once for each speaker, in order to get speaker-separated tiers. The fact that you can export (and import) your ELAN transcription as regular tsv-files can save you a LOT of time and energy.

Caveat: This guide will be rather schematic. If it turns out that this is useful for people, I can develop it in more detail later. If you want that to happen, drop a comment on the Blogger blogpost. I have actually basically already described this workflow in two separate blogposts; I'm just bringing them together here for a start-to-finish flow.

Assumptions: you have audio and/or video files of semi-natural conversations where most of the time one person is talking at a time, even if there is some overlap. You want to have it segmented into intonational units, transcribed and translated, and you want to separate out who is speaking when. You have downloaded and installed ELAN and mastered how to create ELAN files and associate them with audio/video-files.

Don't worry about separate speaker/signer tiers: In this workflow, we're going to start out with transcribing all speakers/signers on one tier. If you want them separated out into different tiers, we have an option for that later. Don't worry, it'll be fine. If you have a large amount of overlapping speech or sign utterances and you want them all transcribed separately, you can still use this guide but you'll have to go over the steps for each speaker/signer/articulator. If that is the case, this guide may not be that much more effective than what you're already doing, but let me know if it is.

Caveat 2: I don't make use of "tier types" and their attributes at all in my ELAN-use. I just use the basic time-aligned default tier type. I haven't yet encountered a situation where I really need tier types. It may be that the project you're in cares about tier types; if so, do make sure that you follow those policies. If not, don't worry about it.

The steps
1) Create two tiers, call them:

  • segmentation by utterance
  • larger segments (optional)

2) Make sure you know how to switch between different modes in your version of ELAN on your OS. We're going to be using the annotation, segmentation and transcription modes.

3) Segment the "segmentation by utterance" tier into empty annotations (this is the "empty chunks" tier referred to below). Either:
a) automatic segmenting via Praat (described in a separate blogpost), or

b) manual tapping in ELAN's segmentation mode, using the option "one keystroke per annotation; the end time of one annotation is the begin time of the next. This creates a chain of adjacent annotations".

Tap whenever you think an intonational utterance has reached its end. If there are pauses, just tap it into smaller chunks. Annotations with silences aren't a big problem, they will just have no transcription in them later so we can remove them automatically then if need be. They can be a bit annoying, but they're not a major problem really.

You may want to adjust the playback speed while segmenting or transcribing. If someone is talking very slowly and going through an elicitation task with clear pauses, you may be able to segment at a higher speed.

Trivia: it seems that intonational units are quite easy for humans to detect, so much so that speakers of German were able to segment Papuan Malay fairly successfully despite not knowing any Malay.

4) Larger segments-tier 
If you have several events happening in one recording (say a consent confirmation, a wordlist and a narrative), then you may want to keep track of this during step 3. Either choose to chunk only the events you need, or at least make a separate note on a piece of paper of when each event started and ended if you're using 3b. Use that information to create really long annotations in the larger segments tier for each of the events. Alternatively, use the information in the transcription tier later to generate annotations in the larger segments tier, for example if you know the first and last word of the wordlist you're using.
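The "first and last word" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the rows are taken to be already read from an exported file into dictionaries, and the keys "start", "end" and "transcription" as well as the function name are made up for the example.

```python
def event_span(rows, first_word, last_word):
    """Return (start, end) in seconds, from the first annotation containing
    first_word to the last later annotation containing last_word."""
    start = end = None
    for row in rows:
        words = row["transcription"].split()
        if start is None and first_word in words:
            start = row["start"]
        if start is not None and last_word in words:
            end = row["end"]
    if start is None or end is None:
        raise ValueError("first or last word not found")
    return start, end


# A few utterance-level annotations (times in seconds).
rows = [
    {"start": 1078.76, "end": 1083.703, "transcription": "o se tane lelei"},
    {"start": 1096.263, "end": 1098.695, "transcription": "manu manu manu"},
    {"start": 1098.695, "end": 1104.038, "transcription": "o namu e pepesi"},
]
print(event_span(rows, "tane", "namu"))
```

The resulting pair of times is what you would use for one long annotation on the larger segments tier.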

5) Make copies of the segmentation by utterance-tier with empty annotations and call them
  • Transcription
  • Translation
  • Speaker/signer/articulator²
  • Comment
These will be exactly time aligned with each other, and this is important. Make sure that any obvious goofs in the empty chunks tier are taken care of before you duplicate it.

Keep the empty tier around, you might need it later.

6) Transcription. Switch to transcription-mode. Show only the 4 tiers from step 5.

If you have different people transcribing and translating, select only the tiers that are relevant for each person. Turn on automatic playback and loop mode. Make sure that each person has their own comment tier, and encourage them to write things there while they're transcribing if there is something they want to quickly note.

Make sure you have set clear rules for how to deal with false starts, humming, laughter, backchanneling noises etc. Do you want all of those transcribed? If so, do you have shorthand symbols for them? Make sure you're clear about this early on, especially if you have multiple people working on transcribing the data.

In the speaker/signer/articulator tier put down the appropriate initials of the person/articulator.

Since I don't use tier-types, I can't use the column mode. I don't really mind, but if you prefer using the column set-up then you need to assign the 4 different tiers to different tier types.

If you only want to transcribe a certain event, either chunk only that event in step 3b and not the others, or go to annotation mode, write "blubbi" in the first segment on the transcription tier within that event, go back to transcription mode and scroll down until you see "blubbi". Not the most elegant solution, but hey, it works.

Leave the silence annotations entirely blank.
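Because the silences are left blank, they are easy to drop later from an exported tab-delimited file (the export itself is described in step 8a). A minimal sketch, assuming the export has a header row, one annotation per line, no quoting, and a column called "Transcription"; the function name and the column name are assumptions, so adjust them to your own tiers.

```python
def drop_blank_rows(tsv_text, column="Transcription"):
    """Keep the header and every row whose `column` cell is not blank."""
    lines = tsv_text.splitlines()
    header = lines[0].split("\t")
    idx = header.index(column)
    kept = [lines[0]]
    for line in lines[1:]:
        cells = line.split("\t")
        # Rows that are too short to reach the column count as blank.
        cell = cells[idx] if idx < len(cells) else ""
        if cell.strip():
            kept.append(line)
    return "\n".join(kept) + "\n"


sample = (
    "Begin\tEnd\tTranscription\n"
    "0.0\t1.5\to fea le manu\n"
    "1.5\t2.0\t\n"
    "2.0\t3.1\tmanu manu manu\n"
)
print(drop_blank_rows(sample))
```

The rows are passed through untouched rather than rebuilt, so whatever ELAN wrote for the surviving annotations stays exactly as it was.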

7) Overlaps!
Now, you may have overlapping speech/gesture/sign at times. The first thing you need to do is ask yourself this question: do you really need to have all of the overlaps separately transcribed? For example, if it's very hard to make out what one person is saying in the overlapping speech, how valuable is it to you to attempt to transcribe it? It may very well be that the answer is "yes" and "very valuable", and that's all good. Just make sure that this is indeed the case before you go on.

It is entirely possible that you don't want to transcribe instances of overlapping utterances. If that is the case, you can stop here and just leave your file in the state it is in. You can still tease out who is speaking when. The main reason to separate out speakers into separate tiers is to handle overlap, and if you don't care about that you can actually just stick with having all speakers merged on one tier. It will probably be easier for you in the long run. I don't normally do steps 8 and 9, but I have figured out how to do them so that if I ever wanted to (or was made to) I could separate out speakers.

If you do want to tease them out here's what you do. During step 6 put down the initials of all the people talking at the overlapping annotation in the speaker/signer/articulator tier, write "overlap" in comment tier and leave the other 2 tiers blank. That's it, for now.
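Since this convention is entered by hand, it can be worth checking it mechanically before moving on. Here is a rough Python sketch of such a check. It assumes the rows have already been read into dictionaries with the made-up keys "speaker", "comment", "transcription" and "translation", and that several initials in one cell are separated by spaces or commas.

```python
def check_overlap_rows(rows):
    """Return (row number, message) pairs for rows that break the convention."""
    problems = []
    for number, row in enumerate(rows, start=1):
        initials = row["speaker"].replace(",", " ").split()
        marked = "overlap" in row["comment"].lower()
        if len(initials) > 1 and not marked:
            problems.append((number, "several initials but no 'overlap' comment"))
        if marked and len(initials) < 2:
            problems.append((number, "'overlap' comment but fewer than two initials"))
        if marked and (row["transcription"].strip() or row["translation"].strip()):
            problems.append((number, "overlap row should be left untranscribed for now"))
    return problems


rows = [
    {"speaker": "M", "comment": "", "transcription": "manu", "translation": "bird"},
    {"speaker": "M T", "comment": "overlap", "transcription": "", "translation": ""},
    {"speaker": "M, T", "comment": "", "transcription": "", "translation": ""},
]
for number, message in check_overlap_rows(rows):
    print(number, message)
```

Only the third row is reported here, because it lists two people without the "overlap" note.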

8) Separating out the annotations into separate tiers for each speaker/signer/articulator

Now we should have an .eaf file with transcription and translation for all of the non-overlapping annotations, including information about which person is associated with which annotations and where there is overlap (and who speaks/signs in that overlap).

What we're going to do now is basically a slimmed-down version of what we did in an earlier guide on this blog. In that guide, we did a clever search within ELAN, exported the results of exactly that search only and imported those results as a new tier. The new tier was merged into the old transcription document, and voilà, we've got an extra new tier with only the search results. This is useful, for example, if you want to listen through only words transcribed with [ts] clusters to see if they are indeed realised as [ts] or sometimes as [t]. The same principle also works here, where we want to separate out annotations associated with certain people.

We're going to
a) export all of the tiers and all annotations
b) make copies of the exported files and prune each of them to only the annotations that pertain to a certain speaker
c) import those files as new transcription documents
d) merge those with the original file

a) export transcription
Within ELAN, export the entire transcription document as a tab-delimited text file. You do this under File > Export As > Tab-delimited Text. Tick "separate columns for each tier".

Name your file something sensible, and put it in a good place. The file will have the file extension ".txt", but it is a tab-separated file (".tsv"). Rename the file so that the suffix is ".tsv". Open the file in some spreadsheet program (Excel, Numbers, LibreOffice, Google Sheets, etc). I recommend LibreOffice, because it lets you explicitly set what the delimiters and encoding are, whereas Excel makes a bunch of decisions for you that may not be ideal.
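If you would rather stay out of spreadsheet programs, the same "be explicit about delimiter and encoding" idea can be sketched with Python's standard library. The function names are made up for illustration, and UTF-8 is an assumption: use whatever encoding you chose in ELAN's export dialogue.

```python
import csv
import shutil
from pathlib import Path


def copy_as_tsv(txt_path):
    """Copy the .txt export to a sibling .tsv file and return the new path."""
    txt_path = Path(txt_path)
    tsv_path = txt_path.with_suffix(".tsv")
    shutil.copyfile(txt_path, tsv_path)
    return tsv_path


def read_rows(path, encoding="utf-8"):
    """Read every row as a list of cells, splitting on tabs only."""
    with open(path, newline="", encoding=encoding) as handle:
        # QUOTE_NONE keeps quotation marks in transcriptions as plain text.
        return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))
```

Something like rows = read_rows(copy_as_tsv("my_export.txt")) would then give you the header as rows[0] and the annotations after it, where "my_export.txt" stands in for your own file name. Copying instead of renaming keeps the original ".txt" around.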

Now, since your annotations are time aligned we get them all on the same row. Here's a little example of what it looks like in my data:

Begin Time - hh:mm:ss.ms | Begin Time - ss.msec | End Time - hh:mm:ss.ms | End Time - ss.msec | Duration - hh:mm:ss.ms | Duration - ss.msec | Larger segments | Segmentation by utterance | Speaker | Transcription | Translation | Comments
00:17:56.450 | 1076.45 | 00:20:56.785 | 1256.785 | 00:03:00.335 | 180.335 | Heti's spectial wordlist | | | | |
00:17:58.760 | 1078.76 | 00:18:03.703 | 1083.703 | 00:00:04.943 | 4.943 | Heti's spectial wordlist | | M | o se tane lelei e fa'a fa'aaloalo le a:va | a good husband repect his wife |
00:18:03.703 | 1083.703 | 00:18:06.663 | 1086.663 | 00:00:02.960 | 2.96 | Heti's spectial wordlist | | T | .. tane tane tane | husband |
00:18:06.663 | 1086.663 | 00:18:09.055 | 1089.055 | 00:00:02.392 | 2.392 | Heti's spectial wordlist | | M | o le ga . koe faikau uma ai | is that it . read all of this |
00:18:09.055 | 1089.055 | 00:18:16.263 | 1096.263 | 00:00:07.208 | 7.208 | Heti's spectial wordlist | | M | o fea le le manu .. na ou va'ai ai ananafi | where is the bird i see yesterday |
00:18:16.263 | 1096.263 | 00:18:18.695 | 1098.695 | 00:00:02.432 | 2.432 | Heti's spectial wordlist | | M | manu manu manu | bird |
00:18:18.695 | 1098.695 | 00:18:24.038 | 1104.038 | 00:00:05.343 | 5.343 | Heti's spectial wordlist | | M | ... o namu e pepesi ai fa'ama'i na | mosquito spread disease |

b) filter the rows
Now, by just using the simple filter functions in most spreadsheet programs, we can make new files that only contain the rows with a certain speaker in them. Make a few copies of your tsv file and call them "speaker x", "speaker y" etc. In each of those, filter for all of the rows you want to delete and delete them, leaving only the rows with the relevant speaker. In my example, I'm filtering for all the rows where the speaker isn't "M" and deleting those.
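The same filtering can also be scripted. What follows is a minimal Python sketch, not a tested pipeline: it assumes the export has a header row with a column called "Speaker", no quoting, and that cells with several initials (the overlaps) separate them with spaces or commas. The function name is made up.

```python
def rows_for_speaker(tsv_text, speaker, column="Speaker"):
    """Keep the header plus every row whose speaker cell lists `speaker`."""
    lines = tsv_text.splitlines()
    header = lines[0].split("\t")
    idx = header.index(column)
    kept = [lines[0]]
    for line in lines[1:]:
        cells = line.split("\t")
        cell = cells[idx] if idx < len(cells) else ""
        # "M T" or "M, T" counts for both M and T, so overlap rows are kept.
        if speaker in cell.replace(",", " ").split():
            kept.append(line)
    return "\n".join(kept) + "\n"


sample = (
    "Begin\tEnd\tSpeaker\tTranscription\n"
    "1078.76\t1083.703\tM\to se tane lelei\n"
    "1083.703\t1086.663\tT\t.. tane tane tane\n"
    "1086.663\t1089.055\tM T\t\n"
)
print(rows_for_speaker(sample, "T"))
```

Calling this once per set of initials and writing each result to its own file gives the same "speaker x", "speaker y" files as the spreadsheet route. Save them with the ".txt" suffix, since that is what the import in the next step expects.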

c) import filtered tiers into ELAN
Now we go back to ELAN and import the files as tiers. What will happen here is that an entirely new .eaf-file will be created; the tier will not be imported directly into whichever file you currently have open. This means that it doesn't matter which .eaf-file you have open when you import (or indeed if any is open). Counterintuitive, I know, but don't worry, I've figured it out. It's not that complicated, just stay with me.

For this to work, the file needs to have the ".txt" suffix again.

File > Import > CSV/Tab-delimited Text file

Next up you will get a window asking you questions about the file you're trying to import. Make sure that your answers line up with the little preview you get.
I wish that ELAN had a way of automatically recognising its own txt-output, but it doesn't. No need to specify the other options, just leave them unchecked.

Now you will have a new .eaf-file with the same name as the file with the pruned results.
This file will only contain the annotations that matched your filter. There's no audio file and no other tiers. It's like a ghost tier, haunting the void of empty silence of this lonely .eaf-file.
Save this file and other files currently open in a good place, quit ELAN and then restart ELAN. Sometimes there seems to be a problem for ELAN to accurately see files later on in this process unless you do this. I don't know why this is, but saving, closing and restarting seems to help, so let's just do that :)!
d) merging the speaker-only tier into the original file
Now here's where I slightly lied to you: we're not going to import the tier into your file. We're going to merge the pruned speaker-only file with the other .eaf-file that has all the audio and other tiers, and the result is going to be a new .eaf-file. So you'll have three files by the end of this:
  • a) your original .eaf-file with audio and lotsa annotations
  • b) your .eaf-file with only the speaker-only tier and no audio etc (ghost-tier)
  • c) a new merged file consisting of the two above combined
Don't worry, I've got this.  I'm henceforth going to call these files (a), (b) and (c) as indicated above.

Open file (a). Select "Merge Transcriptions..."

File>Merge >Transcriptions...

Now, select file (a) as the current transcription (this is default anyway), file (b) as the second source and choose a name and location for the new file, file (c), in the "Destination" window. You can think of "Destination" as "Save as.." for file (c) - our new file.

Do not, I repeat, do not append. And no need to worry about linked media, because (b) doesn't have any audio or anything (remember, it's a ghost). Just leave all those boxes unchecked.

Let ELAN chug away with the merging, and then you're done! You've now got an .eaf file with separate tiers for separate speakers.
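An .eaf file is XML under the hood, so a quick sanity check of the merged file (c) is possible without opening ELAN. This is a rough sketch using Python's standard library. It relies on tiers being stored as TIER elements with a TIER_ID attribute and one ANNOTATION child per annotation, and the function name is made up.

```python
import xml.etree.ElementTree as ET


def tier_annotation_counts(source):
    """Map each tier name in an .eaf file to its number of annotations.

    `source` can be a file path or an open file object.
    """
    root = ET.parse(source).getroot()
    return {
        tier.get("TIER_ID"): len(tier.findall("ANNOTATION"))
        for tier in root.iter("TIER")
    }
```

Running tier_annotation_counts on file (c) should list all the tiers from file (a) plus the new speaker tier, and the speaker tier's count should match the number of rows you kept for that speaker.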

9) dealing with the overlap
Now, when you're at step 8b and you're filtering for people, make sure you include the overlapping speech for that person in that file. You're going to have to go back to that tier, search for the instances where you have "overlap" written in the comments and manually sort things out. There's no automatic way of dealing with this, I'm afraid: you're going to have to delete the annotations and make new ones that line up across the tiers for that speaker. Go to annotation mode and hide all the other tiers, keeping only the ones for that speaker. Navigate to the overlap by searching, delete the existing annotations in that region, highlight new appropriate time intervals, then right click each tier and select "new annotation here". This will give you new aligned annotation intervals that you can now associate with just one speaker.
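The fixing itself has to be done by hand, but a to-do list of where the overlaps are can be generated. A small Python sketch, assuming rows read into dictionaries with the made-up keys "start" and "end" (in seconds), "speaker" and "comment":

```python
def hms(seconds):
    """Format seconds as hh:mm:ss.ms."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def overlap_worklist(rows, speaker):
    """List (begin, end) times of overlap-marked annotations involving `speaker`."""
    todo = []
    for row in rows:
        initials = row["speaker"].replace(",", " ").split()
        if speaker in initials and "overlap" in row["comment"].lower():
            todo.append((hms(row["start"]), hms(row["end"])))
    return todo
```

Each pair of times in the list is a region to navigate to in annotation mode for that speaker.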


If you're curious how to use this technique for matching particular searches instead, read the earlier blogpost on that.

If you found this useful and want me to write it up a bit more neatly and with more screenshots etc, let me know in the comments. There should be a way of making this work better with Python, but I haven't figured that out just yet.

Good bye!

¹ Himmelmann wrote a paper about this challenge, and he says that the actual challenge is "reaching a better understanding of the transcription process itself and its relevance for linguistic theory". We're not going to be doing that here, but please read his paper if this challenge is something that interests you.
² Articulators are relevant for sign languages and gesture transcription. This guide can fit transcription of speech as well as sign and gesture, including transcribing different articulators on different tiers.

