```r
library(tidyverse)    # I always load the tidyverse
library(stringr)      # to manipulate strings
library(readtextgrid) # to read TextGrid files
library(dplyr)        # to manipulate data frames
```
Montreal Forced Aligner Workflow for PCIbex Production Experiments
Introduction
We have our data from PCIbex and our zip files from the server. Let's assume we have unzipped them and converted them from .webm to .wav files. Now, it's time to align them using MFA. But before that, we need to prepare our files accordingly. Here are the steps we will follow in this document:
- Load the PCIbex results
- Filter out irrelevant sound files
- Move all of our .wav and .TextGrid files to the same directory
- Rename our files according to MFA guidelines
- Run MFA
- Create a dataframe
Before we start, let me load my favorite packages. The library() function loads the packages we need, assuming they are already installed. If not, use the install.packages() function to install them. While library() does not require quotes, you should use quotes with install.packages(), e.g., install.packages("tidyverse"). If it asks you to select a mirror from a list, choose a location geographically close to you, such as the UK.
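For example, the one-time installation and the per-session loading look like this:

```r
# Run once per machine (quotes required):
# install.packages("tidyverse")

# Run once per session (quotes optional):
library(tidyverse)
```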
PCIbex Results
Read the results
The main reason we are loading the PCIbex results is that we sometimes use the async() function in our PCIbex code. The async() function allows us to send recordings to our server whenever we want, without waiting for the end of the experiment. Even though it is extremely helpful in reducing some of the server-PCIbex connection load at the end of the experiment, it also creates some pesky situations. For example, if a participant decides not to complete the experiment, we will still end up with some of their recordings. We do not want participants who were window-shopping, mainly because we are not sure about the quality of their data. Luckily for us, PCIbex only saves the results of participants who complete the entire experiment, so we can use the results file as a gold list of recordings to keep.
To read the PCIbex results, we are going to use the function provided in the PCIbex documentation. Scroll down on that page, and you will see the words “Click for Base R Version.” The function is provided there as well. Moreover, please be careful whenever you are copying and pasting functions from this file, or any file, as sometimes PDF or HTML files can include unwanted elements, like a page number.
```r
# User-defined function to read in PCIbex Farm results files
read.pcibex <- function(
    filepath, auto.colnames=TRUE,
    fun.col=\(col,cols){cols[cols==col]<-paste(col,"Ibex",sep=".");return(cols)}
) {
  n.cols <- max(count.fields(filepath,sep=",",quote=NULL),na.rm=TRUE)
  if (auto.colnames){
    cols <- c()
    con <- file(filepath, "r")
    while ( TRUE ) {
      line <- readLines(con, n = 1, warn=FALSE)
      if ( length(line) == 0) {
        break
      }
      m <- regmatches(line,regexec("^# (\\d+)\\. (.+)\\.$",line))[[1]]
      if (length(m) == 3) {
        index <- as.numeric(m[2])
        value <- m[3]
        if (is.function(fun.col)){
          cols <- fun.col(value,cols)
        }
        cols[index] <- value
        if (index == n.cols){
          break
        }
      }
    }
    close(con)
    return(read.csv(filepath, comment.char="#", header=FALSE, col.names=cols))
  } else {
    return(read.csv(filepath, comment.char="#", header=FALSE, col.names=seq(1:n.cols)))
  }
}
```
So, now what we have to do is load our file. I also want to check my file using the str() function. Please run ?str to see what this function does. For any function you do not understand, you can use the ? operator to see its help page and some examples.
```r
ibex <- read.pcibex("~/octo-recall-ibex.csv")
str(ibex)
```
Now, what I want to do is get the filenames that are recorded in the PCIbex results. Before doing that, I advise you to go to this documentation and read more about PCIbex under the Basic Concepts header.
| Column | Information |
|---|---|
| 1 | Time results were received (seconds since Jan 1 1970) |
| 2 | MD5 hash identifying the subject. This is based on the subject's IP address and various properties of their browser. Together with the value of the first column, this should uniquely identify each subject. |
| 3 | Name of the controller for the entity (e.g., "DashedSentence") |
| 4 | Item number |
| 5 | Element number |
| 6 | Label. Label of the newTrial() |
| 7 | Latin.Square.Group. The group the subject was assigned to. |
| 8 | PennElementType. Type of the specific element, like "Html" or "MediaRecorder" |
| 9 | PennElementName. Name we have given to the specific Penn element. |
| 10 | Parameter. What type of thing the script is running and saving as a parameter for this element. |
| 11 | Value. Value saved for the parameter in column 10. |
| 12 | EventTime. Time at which that specific element was displayed or acted upon (seconds since Jan 1 1970) |
Filter the results
Since we are dealing with media recordings, I will first filter the file using the PennElementType column and select only rows with "MediaRecorder" values in that column.
```r
ibex <- ibex |> filter(PennElementType == "MediaRecorder")
unique(ibex$Value)[1:3]
```
After checking my Value column, where the file names for MediaRecorder are stored, I realize that this will not be enough, given that we still have other unwanted elements, like the test-recorder.webm file or some practice files. There are multiple ways to get rid of these files, and you have to think about how to get rid of them for your own specific dataframe. For my own data, I will filter my data using the labels I provided in my PCIbex code. They are stored in the Label column. What I want is to get only the MediaRecorder rows that belong to a trial whose label starts with the word trial.

You may have coded your data differently; you may have used a different word; you may not even have any practice or test recordings, so maybe you do not need this second filtering at all. Check your dataframe using the View() function. I am also using a function called str_detect(), which detects a regular expression pattern, in this case ^trial, meaning "starting with the word trial". Now, when I check my dataframe, I will only see experimental trials and recordings related to those trials. Just to make sure, I am also using the unique() function so that I do not have repetitions. And I am assigning my filenames to a vector called ibex_files. You can see that any random sample drawn with the sample() function gives filenames related to experimental trials.
```r
ibex <- ibex |> filter(str_detect(Label, "^trial"))
ibex_files <- ibex$Value |> unique()
sample(ibex_files, 3)
```
Operators used in this section

?
Opens the help page for any function.
example use: ?library

==
Test for equality. Don't confuse it with a single =, which is an assignment operator, not a test.
example use: ibex |> filter(PennElementType == "MediaRecorder")

|>
(Forward) pipe: Use the expression on the left as the first argument of the function on the right.
- Read x |> fn() as "use x as the only argument of function fn".
- Read x |> fn(1, 2) as "use x as the first argument of function fn, before 1 and 2".
- Read x |> fn(1, y = _) as "use x as the y argument of function fn". (The _ placeholder requires R 4.2 or later and must be passed to a named argument.)
example use: ibex$Value |> unique()
Our Task List
- [x] Load the PCIbex results
- [x] Filter irrelevant sound files
- [ ] Move all of our .wav and .TextGrid files to the same directory
- [ ] Rename our files according to MFA guidelines
- [ ] Run MFA
- [ ] Create a dataframe
Filtering Files from the Server
Now that we have a gold list, we can go ahead and filter our files from the server according to that list, which is based on the results from PCIbex. To do this, we will first create a temporary folder called gold. We will strip every file name of its extension, .wav or .TextGrid, and check whether that name exists in our gold list. If it does, we will move the file. To this end, we are going to use for loops and if statements. You can click on the hyperlinks to watch more on them.¹
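If you have never used these before, here is a minimal toy example of a for loop with an if statement inside it:

```r
# Print only the numbers greater than 1
for (x in c(1, 2, 3)) {
  if (x > 1) {
    cat(x, "is greater than 1\n")
  }
}
```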
- First, we need to list all of our files. I have all my .wav and .TextGrid files in the same place, so I just need to list a single directory. You might have them in different places; check the commented-out part. To get only the relevant files, I am using the pattern argument with a regular expression saying: choose only the ones that end ($) with .wav or (|) .TextGrid.
- Second, we create our gold directory with the dir.create() function.
- Third, we iterate over every file in the files list, get its name without its extension using the tools::file_path_sans_ext() function, and check whether it exists in our gold list using the %in% operator.
- If it is present in the gold list, we specify where the file currently exists and assign that to a variable called old_file_path. We also specify where it should be moved and assign that to a variable called new_file_path. We use the file.path() function and the dir/gold_dir variables, along with the file f we are iterating over.
- Lastly, we move the file using the file.rename() function. Even though the name of the function is rename, it can be used to move files. Every file exists with its pointer, such as ~/data/important.R. By changing this pointer, we can change its name, as well as its location, to something like ~/gold_data/really_important.R. We did not only change the file name from important to really_important; we also changed another part of this pointer, namely the folder part. If the folder gold_data exists, the file will be moved there.
- To make sure this for loop works, I also print a message after every file movement using the cat() function.
<- "~/data"
dir <- "~/data/gold"
gold_dir # Use these lines if you have sound and transcriptions in different places.
# wav_dir <- "~/wav_data"
# tg_dir <- "~/tg_data"
<- list.files(data, pattern = "\\.wav$|\\.TextGrid")
files # Use these lines if you have sound and transcriptions in different places.
# tg_files <- list.files(wav_dir, pattern = "\\.wav$|\\.TextGrid")
# wav_files <- list.files(tg_dir, pattern = "\\.wav$|\\.TextGrid")
dir.create(gold_dir)
for (f in files) {
if (tools::file_path_sans_ext(f) %in% tools::file_path_sans_ext(ibex_files)) {
<- file.path(dir, f)
old_file_path <- file.path(gold_dir, f)
new_file_path file.rename(old_file_path, new_file_path)
cat("Moved ", f, " to ", new_file_path, "\n")
} }
One important thing to note here is that if you have your .wav and .TextGrid files in different directories, you can either move them to a single folder manually, using your file management application, or run the commands above using the wav_dir and tg_dir variables, along with the tg_files and wav_files variables. It should look like the following. There are, of course, better ways to solve this problem, and I leave that to your creativity (see the sketch after the next code block for one of them).
<- "~/wav_data"
wav_dir <- "~/tg_data"
tg_dir <- "~/data/gold"
gold_dir
<- list.files(wav_dir, pattern = "\\.wav$|\\.TextGrid")
tg_files <- list.files(tg_dir, pattern = "\\.wav$|\\.TextGrid")
wav_files
dir.create(gold_dir)
for (f in tg_files) {
if (tools::file_path_sans_ext(f) %in% tools::file_path_sans_ext(ibex_files)) {
<- file.path(tg_dir, f)
old_file_path <- file.path(gold_dir, f)
new_file_path file.rename(old_file_path, new_file_path)
cat("Moved", f, "to", new_file_path, "\n")
}
}
for (f in wav_files) {
if (tools::file_path_sans_ext(f) %in% tools::file_path_sans_ext(ibex_files)) {
<- file.path(wav_dir, f)
old_file_path <- file.path(gold_dir, f)
new_file_path file.rename(old_file_path, new_file_path)
cat("Moved", f, "to", new_file_path, "\n")
} }
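Here is one such better way, as a minimal sketch of my own (not part of the original workflow), assuming wav_dir, tg_dir, gold_dir, and ibex_files are defined as above. Since file.rename() is vectorized, the two loops collapse into a few lines:

```r
# Collect the full paths of all candidate files from both directories
all_files <- c(list.files(wav_dir, pattern = "\\.wav$", full.names = TRUE),
               list.files(tg_dir, pattern = "\\.TextGrid$", full.names = TRUE))
# Keep only files whose extension-less name is on the gold list
gold_hits <- all_files[tools::file_path_sans_ext(basename(all_files)) %in%
                         tools::file_path_sans_ext(ibex_files)]
# Move them all at once; file.rename() accepts vectors
file.rename(gold_hits, file.path(gold_dir, basename(gold_hits)))
```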
Operators used in this section

%in%
Test for membership: x %in% y returns TRUE for each element of x that also appears in y.
example use: tools::file_path_sans_ext(f) %in% tools::file_path_sans_ext(ibex_files)
Our Task List
- [x] Load the PCIbex results
- [x] Filter irrelevant sound files
- [x] Move all of our .wav and .TextGrid files to the same directory
- [ ] Rename our files according to MFA guidelines
- [ ] Run MFA
- [ ] Create a dataframe
Renaming Files
This is the section you have to be most careful about. If you mess anything up here, you will have to delete everything, go back, unzip your files, convert them from .webm to .wav, and do everything in this file again. So, before giving you the for loop to rename all files, I want to make sure that we go over one of the files together and do it correctly.

The main reason we rename our files is that we want the Montreal Forced Aligner to understand that we have multiple speakers in our dataset. If we do not do that, it will treat all files as if they came from a single speaker and will probably be very confused by the inter-speaker variance in speech. Since we erroneously specified in our PCIbex files to use the subject_id as a suffix rather than a prefix, we have to fix that here. If we fix that in our PCIbex script in the future, we will not have to go through this part.
One-file example
My files, after unzipping, converting to .wav, and filtering according to the gold list, look like the following. There are some important things to keep in mind here. First of all, now that everything is in our gold directory, we have to use the gold_dir variable to list our files. Secondly, we again need to use the pattern argument to make sure we select only relevant files. The last thing to be aware of in the next code block is that I am using indexing with square brackets to refer to the first element in the list. I will use this element to first ensure that what I am doing is correct.
<- "~/data/gold"
gold_dir <- list.files(gold_dir, pattern = "\\.wav$|\\.TextGrid$")
files <- files[1]
example_file example_file
Get the extension and the name
Now that we have the name of an example file, we can start by extracting its extension. We will use the function file_ext from the package called tools. Sometimes, we do not want to load an entire package but only want to access a single function. In those cases, we use the operator ::. Additionally, we will use paste0 to prefix the extension with a dot, so that we can use it later when we rename our files.
```r
extension <- tools::file_ext(example_file)
extension <- paste0(".", extension)
extension
```
As for the rest of the name, we will use the file_path_sans_ext() function that we used earlier.
```r
rest <- tools::file_path_sans_ext(example_file)
rest
```
Get the subject id
Now, the most important part is getting the subject id. If you look at what my rest variable returned, you can see that the subject id consists of the last 4 characters, which are also the last set of characters after the last underscore. There are multiple ways to extract the subject id; I will show you two so that you can choose and adapt them for your own data. For the underscore version, we will use the function str_split(), and for the character-counting version, we will use str_sub().
Underscore approach
str_split() takes a string and splits it according to the separator you provide. In our case, the separator is the underscore. We are also using an additional argument called simplify to make the resulting elements easier to work with: our function now returns a small table with 1 row and 5 columns. To select the values in the 5th column, we use square brackets again, this time with a comma. When you apply this approach to your own data, remember that you may end up with fewer or more than 5 columns depending on your naming convention. Be sure to adjust the column number accordingly. It might also be the case that your subject id is not stored last, or that your separators are not underscores but simple hyphens ("-"). Modify the code according to your specific needs.
```r
# Using the underscore information
subj <- str_split(rest, "_", simplify = TRUE)
subj
subj <- subj[,5]
subj
```
Lastly, we have to modify the rest variable so that we do not include the subject id twice. I will use the same approach again. After obtaining the table, I will use the paste() function to concatenate the columns back together with the underscore separator. Adjust the number of columns used in this function and the separator according to your own data needs.
```r
nosubj <- str_split(rest, "_", simplify = TRUE)
nosubj <- paste(nosubj[,1], nosubj[,2], nosubj[,3], nosubj[,4], sep = "_")
nosubj
```
Character approach
str_sub() allows you to extract a substring using indices. In my case, the subject id is the last four characters. To refer to characters from the end, you can use the minus symbol -. I specify -4 in the start argument, which means I want to extract the string starting from the fourth character counting back from the end.
```r
subj <- str_sub(rest, start = -4)
subj
```
To get the rest of the filename, I specify the starting point as 1 and the endpoint as -6. Using -5 would include the underscore as well.
```r
nosubj <- str_sub(rest, start = 1, end = -6)
nosubj
```
Put the new name and the path together
At this point, we have everything we need: (i) the subject id prefix, (ii) the rest of our file name, and (iii) the extension. Now, we need to combine all of this together. We are going to use the paste0() function. Remember, this function is different from paste(). The main difference is that with paste0(), we cannot specify separators; we have to provide everything ourselves. This might seem like a disadvantage at first, but it is beneficial for non-pattern cases like this.
```r
new_name <- paste0(subj, "_", nosubj, extension)
new_name
```
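If the difference between the two is still hard to see, here is a quick side-by-side comparison (toy strings of my own, just for illustration):

```r
paste("jtfr", "squid", sep = "_")  # "jtfr_squid": paste() inserts the separator for you
paste0("jtfr", "_", "squid")       # "jtfr_squid": paste0() needs the separator spelled out
paste0("jtfr", "squid")            # "jtfrsquid": nothing is inserted between elements
```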
We also need to create a new path to rename our file.

```r
new_path <- file.path(gold_dir, new_name)
new_path
```
Rename the file
We will once again use the file.rename() function. This time, we are only changing the file name and not the path, so the file will remain in its current location. We also need to obtain the full path of our example_file. We can achieve this easily using the file.path() function again.

```r
example_file_path <- file.path(gold_dir, example_file)
example_file_path

file.rename(example_file_path, new_path)
```
After running this, make sure the naming convention is as we want. Check your folder by searching for the trial. It should look something like subj_rest.wav or subj_rest.TextGrid. In my case, the file name starts with dltc, where dltc is my subject id, i.e., the subj part.
Little treat for you
I know that some of your files look like the following: squid_S_jtfr.wav. Here, I will provide you with the code to rename files like this one. Please check this code before using it. First, let's arbitrarily assign this name to a variable. Remember, in your case, you will obtain it from your files list.
<- "squid_S_jtfr.wav" example_file
Now, I am going to put all the code together in one chunk, except for the moving. Also, be aware that I am using my own gold_dir; please specify yours according to your needs. Additionally, be mindful of your operating system (Windows or Mac). If you are using Windows, your gold_dir variable should look like the second line. I have commented out that part with a hashtag/pound symbol. Uncomment it by deleting the first pound symbol.
<- "~/data/gold"
gold_dir # gold_dir <- "C:/Users/utkuturk/data/gold" # for windows
<- tools::file_ext(example_file)
extension <- paste0(".", extension)
extension <- tools::file_path_sans_ext(example_file)
rest <- str_sub(rest, start = -4)
subj <- str_sub(rest, start = 1, end = -6)
nosubj <- paste0(subj, "_", nosubj, extension)
new_name <- file.path(gold_dir, new_name)
new_path new_path
[1] "~/data/gold/jtfr_squid_S.wav"
This would be your original example file path.

```r
example_file_path <- file.path(gold_dir, example_file)
example_file_path
```
[1] "~/data/gold/squid_S_jtfr.wav"
And this line would handle the renaming from the old example_file_path to the new_path, thereby assigning the new name.

```r
file.rename(example_file_path, new_path)
```
For loop
If you have ensured that the code above works correctly for you, you are now ready to implement the for loop. Within the loop, define a variable like f and use it instead of example_file. This way, you will iterate over every file in your list. To verify that it is functioning correctly, I also added a line that prints a message each time a file is renamed.
<- "~/data/gold"
gold_dir # gold_dir <- "C:/Users/utkuturk/data/gold" # for windows
<- list.files(gold_dir, pattern = "\\.wav$|\\.TextGrid$")
files
for (f in files) {
<- tools::file_ext(f)
extension <- paste0(".", extension)
extension <- tools::file_path_sans_ext(f)
rest <- str_sub(rest, start = -4)
subj <- str_sub(rest, start = 1, end = -6)
nosubj <- paste0(subj, "_", nosubj, extension)
new_name <- file.path(gold_dir, new_name)
new_path <- file.path(gold_dir, f)
file_path file.rename(file_path, new_path)
cat("Renamed", f, "to", new_name, "\n")
}
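Given the warning at the top of this section, a dry run is cheap insurance. This printing-only variant is my own suggestion, not part of the original workflow: run it before the loop above, check that every planned name looks right, and only then do the real renaming.

```r
# Dry run: report what WOULD be renamed, without touching any file
for (f in files) {
  rest <- tools::file_path_sans_ext(f)
  new_name <- paste0(str_sub(rest, start = -4), "_",
                     str_sub(rest, start = 1, end = -6),
                     ".", tools::file_ext(f))
  cat("Would rename", f, "to", new_name, "\n")
}
```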
Operators used in this section

[]
Indexing operator: df[selected_rows, selected_columns] accesses specific rows and/or columns of a data frame. For a list (or vector), it takes a single argument, list[selected_element], to select an element. Remember that in R, indices start at 1, unlike in Python.
- selected_rows: a vector of indices or names.
- selected_columns: a vector of indices or names.
- selected_element: a vector of indices or names.
example use: files[1]

::
Double colon operator: Accesses functions and other objects from packages. Read x::y as "object y from package x".
example use: tools::file_ext()
Our Task List
- [x] Load the PCIbex results
- [x] Filter irrelevant sound files
- [x] Move all of our .wav and .TextGrid files to the same directory
- [x] Rename our files according to MFA guidelines
- [ ] Run MFA
- [ ] Create a dataframe
Running the Montreal Forced Aligner
Now that we have our files in the format we want, we can place all of them in the MFA folder and start running the aligner. We can either move our files with the usual copy-paste in a file manager, or we can use the file.rename() function again. Due to the aligner's limitations, the second option is far better. The main constraint is that MFA starts encountering problems when we feed it more than 2000 files at once. Since I have a lot of data, I will use a function further below to divide my files and move them into smaller subfolders. But before that, I will show you how to move files without dividing them into subfolders.
Moving the Files to the MFA Directory
Without Dividing
Again, we are going to use the file.rename() and dir.create() functions to create the directory we are moving files to, and, of course, to move the files.
```r
# gold directory, where all of our files are
gold_dir <- "~/data/gold"
# MFA directory
mfa_dir <- "~/Documents/MFA/mycorpus"
dir.create(mfa_dir)

# Files
files <- list.files(gold_dir, pattern = "\\.wav$|\\.TextGrid$")

for (f in files) {
  old_file_path <- file.path(gold_dir, f)
  mfa_path <- file.path(mfa_dir, f)
  file.rename(old_file_path, mfa_path)
  cat("Moved", f, "to", mfa_dir, "\n")
}
```
With Dividing into Subfolders
I will introduce the following function that I use. Here, I will not go into details, but it basically performs the following steps:
- Creates a subfolder called s1 and moves files into it.
- Counts up to 2000.
- When it surpasses 2000, it creates another subfolder by incrementing the number, from s1 to s2.
- Continues this process until there are no more files.
```r
divide_and_move <- function(source, target, limit = 2000) {
  files <- list.files(source, pattern = "\\.wav$|\\.TextGrid$", full.names = TRUE)
  base_names <- unique(tools::file_path_sans_ext(basename(files)))
  s_index <- 1
  f_index <- 0
  s_path <- file.path(target, paste0("s", s_index))
  dir.create(s_path)
  for (b in base_names) {
    rel_files <- files[grepl(paste0("^", b, "\\."), basename(files))]
    if (f_index + length(rel_files) > limit) {
      s_index <- s_index + 1
      s_path <- file.path(target, paste0("s", s_index))
      dir.create(s_path)
      f_index <- 0
    }
    for (f in rel_files) {
      file.rename(f, file.path(s_path, basename(f)))
    }
    f_index <- f_index + length(rel_files)
  }
}
```
You can use this function by simply providing the source and target folders.

```r
# gold directory, where all of our files are
gold_dir <- "~/data/gold"
# MFA directory
mfa_main_dir <- "~/Documents/MFA"
dir.create(mfa_main_dir)
divide_and_move(gold_dir, mfa_main_dir)
```
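A quick sanity check I like to run afterwards (optional, my own addition): list the newly created subfolders and count how many files landed in each.

```r
# Count files per s1, s2, ... subfolder
subdirs <- list.dirs(mfa_main_dir, recursive = FALSE)
sapply(subdirs, function(d) length(list.files(d)))
```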
Terminal Codes
After moving the files, either with code or by hand, to a specific MFA folder, ~/Documents/MFA, we can start running the terminal commands. At this point, I assume you have gone through the MFA documentation for installation instructions. I am also assuming that you have used a conda environment. If you haven't, here are the 3 lines to install MFA.
Conda Installation in Terminal

```bash
conda activate base
conda install -c conda-forge mamba
mamba create -n aligner -c conda-forge montreal-forced-aligner
```
There are again two ways to do this. One way is to open your Terminal app or use the Terminal tab in RStudio. The other way, which I prefer, is to execute commands using the R function system(). I will first go over the easier one, using the Terminal app or the Terminal tab. The reason I prefer the system() function is that it lets me loop over multiple folders more easily, so I do not have to run my commands again and again.
Using Terminal
The first command we want to run is the conda environment code. Following the MFA documentation, I named my environment aligner. So, I start by activating that environment.

```bash
conda activate aligner
```
After activating the environment, I need to download three models: (i) an acoustic model to recognize phonemes given the preceding and following acoustic features, (ii) a dictionary to access pretrained phone-word mappings, and (iii) a g2p model to generate sequences of phones based on orthography. For all of these, we are going to use the english_us_arpa models. You can visit this website to explore various languages and models.
```bash
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa
mfa model download g2p english_us_arpa
```
After downloading these models, we are going to validate our corpus. There are many customizable parameters for this step; you can check them here. I am going to use my favorite settings. You can interpret the following command like this: Dear Montreal Forced Aligner (mfa), can you please analyze my files located in ~/Documents/MFA/mycorpus and validate them using the english_us_arpa acoustic model and the english_us_arpa dictionary? Please also consider that I have multiple speakers, identified by the first 4 characters of each file name (-s 4). It would be great to use multiprocessing (--use_mp) for faster execution. Lastly, please clean up previous and new temporary files (--clean --final_clean).
```bash
mfa validate -s 4 --use_mp --clean --final_clean ~/Documents/MFA/mycorpus english_us_arpa english_us_arpa
```
This process will take some time. Afterward, MFA will report any out-of-vocabulary words it found in your TextGrids. You can easily create new pronunciations for them and add them to your model.
The mfa g2p command can take many arguments; here I am using only three. First, the path to the text file that contains the out-of-vocabulary words. This file is automatically created in the folder where your files are located. The path may vary depending on your system and folder naming, but the name of the .txt file will be the same. In my case, it is ~/Documents/MFA/mycorpus/oovs_found_english_us_arpa.txt. The second argument is the name of the g2p model. As you may recall, we downloaded it earlier, and its name is english_us_arpa. Finally, the third argument is the path to a target .txt file to store the new pronunciations. I would like to store them in the same place, so I am using the following path: ~/Documents/MFA/mycorpus/g2pped_oovs.txt.
```bash
mfa g2p ~/Documents/MFA/mycorpus/oovs_found_english_us_arpa.txt english_us_arpa ~/Documents/MFA/mycorpus/g2pped_oovs.txt
```
After creating the pronunciations, you can add them to your model with mfa model add_words. This command takes the name of the dictionary as an argument (english_us_arpa) and the output of the mfa g2p command, which was a .txt file storing the pronunciations: ~/Documents/MFA/mycorpus/g2pped_oovs.txt.
```bash
mfa model add_words english_us_arpa ~/Documents/MFA/mycorpus/g2pped_oovs.txt
```
The last step is the alignment process. It will align (mfa align) the words and the phones inside our TextGrids stored in ~/Documents/MFA/mycorpus using our previously downloaded dictionary (english_us_arpa) and acoustic model (english_us_arpa), and store the newly aligned TextGrids in a new folder called ~/Documents/MFA/output.
```bash
mfa align ~/Documents/MFA/mycorpus english_us_arpa english_us_arpa ~/Documents/MFA/output
```
Using R
We can also accomplish all of this in R. One advantage of this approach is that it allows us to iterate over multiple subfolders more easily, which can be useful if we have more than 2000 files. We will use four components:
- the system() function to execute terminal commands,
- the paste() function to create multiline templates,
- the %s string placeholder to create template commands,
- the sprintf() function to format our templates.
Introduction to sprintf() and %s
Before going further with the MFA codes, let me illustrate with an example. Suppose we have a list of folder names, and we want to create a .txt file in each of these folders. We can use the system() function to perform this action. Below, I define my folder_list, then create paths for my .txt files in each folder, such as ~/data1/mydocument.txt. Afterwards, I generate a list of commands to create these files using touch, which is a command-line tool for creating files. Finally, I execute these commands using the system() function.
<- c("~/data1", "~/data2", "~/data3")
folder_list
<- paste(folder_list, "mydocument.txt", sep="/")
txt_list
txt_list<- paste("touch", txt_list, sep=" ")
command_list
command_list
for (command in command_list) {
system(command)
}
Technically, we didn't need to use a for loop; instead, we could have concatenated all these commands with ; and run a single system command. Bash can execute multiple commands in a single line when they are separated by ;.
```r
concatenated_commands <- paste(command_list[1],
                               command_list[2],
                               command_list[3],
                               sep = ";")
system(concatenated_commands)
```
We could achieve the same without needing a folder list by utilizing the %s placeholder and the sprintf() function.
<- "touch %s/mydocument.txt"
command_template <- paste(sprintf(command_template, "~/data1"),
concatenated_commands sprintf(command_template, "~/data2"),
sprintf(command_template, "~/data3"),
sep=";")
system(concatenated_commands)
This approach becomes particularly useful when dealing with multiple placeholders within the same command. For instance, when formatted using sprintf(), the command template below will replace the first %s with the first argument, such as ~/data1, and the second %s with the second argument, like mydoc1.
<- "touch %s/%s.txt"
command_template <- paste(sprintf(command_template, "~/data1", "mydoc1"),
concatenated_commands sprintf(command_template, "~/data2", "mydoc2"),
sprintf(command_template, "~/data3", "mydoc3"),
sep=";")
system(concatenated_commands)
Running MFA in R
Next, we'll consolidate the previous terminal code by concatenating it using paste() and separating the commands with ;. Where needed, we'll incorporate placeholders. Each line will be assigned to a new variable, and then they'll be combined into a single command string using paste(). Finally, we'll execute the command string using system() with the argument intern = TRUE to capture the output into an R variable, which allows for later inspection.
<- "conda activate aligner"
conda_start <- "mfa model download acoustic english_us_arpa"
get_ac <- "mfa model download dictionary english_us_arpa"
get_dic <- "mfa model download g2p english_us_arpa"
get_g2p
<- paste(conda_start, get_ac, get_dic, get_g2p, sep = ";")
mfa_init
<- system(mfa_init, intern = TRUE) mfa_init_output
After initializing the model, the next step is validation. Again, I'll use the same approach and concatenate the commands together. However, sometimes we may have too many files and need to use subfolders. To accommodate this, I'll use %s placeholders. The validation command has one placeholder for the different subfolders. Similarly, our pronunciation creation for g2p has two placeholders, though they'll be filled with the same value. Lastly, the add_words command will use a single placeholder. Fortunately, all these folders are the same, so we can reuse the same variable repeatedly.
<- "conda activate aligner"
conda_start
<- "mfa validate -s 4 --use_mp --clean --final_clean ~/Documents/MFA/%s english_us_arpa english_us_arpa"
validate <- "mfa g2p ~/Documents/MFA/%s/oovs_found_english_us_arpa.txt english_us_arpa ~/Documents/MFA/%s/g2pped_oovs.txt"
g2p_words <- "mfa model add_words english_us_arpa ~/Documents/MFA/%s/g2pped_oovs.txt"
add_words
<- paste(conda_start, validate, g2p_words, add_words, sep = ";") mfa_val
Since this step takes longer and there's more room for errors, I want to save all my outputs. First, I need to identify which folders exist in my MFA directory. Because my divide_and_move function prefixes every subfolder with s, I'll use ^s to filter for the relevant folders.
```r
output_val <- list()

# Define the base path where folders are located
base_path <- "~/Documents/MFA"
folders <- list.dirs(base_path, recursive = FALSE, full.names = FALSE)
folders <- folders[str_detect(folders, "^s")]
folders
```
Now, we can iterate over this list of folders using a for loop. First, we create the current script using sprintf() with four placeholder values. Next, we execute the current script and save the output in a temp_output variable. Then, we assign this output to a folder-specific output_name variable using the paste0() and assign() functions.
```r
for (f in folders) {
  cur_mfa_val <- sprintf(mfa_val, f, f, f, f)

  temp_output <- system(cur_mfa_val, intern = TRUE)

  output_name <- paste0("output_val_", f)

  assign(output_name, temp_output, envir = .GlobalEnv)
}
```
Now you can check the outputs by calling specific variables like output_val_s1 or output_val_s2. After this step, the only task remaining is to run the aligner. We will create a template again, iterate over the folders, and assign the outputs to their respective names for verification. Meanwhile, the bash code will execute in the background. This time, our placeholders would refer to different inputs and an output folder. Fortunately, we can use the same output folder for every subfolder, so instead of using two placeholders, we'll use a single %s placeholder.
<- "conda activate aligner"
conda_start
<- "mfa align ~/Documents/MFA/%s english_us_arpa english_us_arpa ~/Documents/MFA/output"
align
<- paste(conda_start, align, sep = ";")
mfa_align
for (f in folders) {
<- sprintf(mfa_align, f)
cur_mfa_align <- system(cur_mfa_align, intern=TRUE)
temp_output <- paste0("output_align_", f)
output_name assign(output_name, temp_output, envir = .GlobalEnv)
}
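To spot-check what MFA reported for a given subfolder, you can peek at the captured output (assuming a subfolder named s1 existed):

```r
# First few lines of the captured terminal output for s1
head(output_align_s1)
```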
This for loop completes the MFA alignment. There is one final task remaining: creating a dataframe for further data analysis.
Our Task List
- [x] Load the PCIbex results
- [x] Filter irrelevant sound files
- [x] Move all of our .wav and .TextGrid files to the same directory
- [x] Rename our files according to MFA guidelines
- [x] Run MFA
- [ ] Create a dataframe
Dataframe Creation
Now that we have all our .TextGrid files aligned, we can create a dataframe for subsequent statistical analyses using the readtextgrid package. First, I'll demonstrate the process for a single file. Later, I'll explain how to extend this to an entire directory. Let's begin by specifying our directory and listing the files. You can view the first few elements of a list or dataframe using the head() function.
```r
# Define the directory and list files
tg_dir <- "~/Documents/MFA/output"
file_list <- list.files(path = tg_dir, pattern = "\\.TextGrid$")
head(file_list, n = 5)
```
One-file example
Again, let's work with an example file from our list, starting with the first file ([1]). First, we'll retrieve its full file path. Then, we'll use the read_textgrid() function to create a dataframe for this single file. I'll print the structure of the dataframe to give you a clearer view of its contents.
```r
example_file <- file_list[1]
file_path <- file.path(tg_dir, example_file)
example_df <- readtextgrid::read_textgrid(file_path)
str(example_df)
```
In this project, which involves aligning words, we are interested in only a few of these columns. Specifically, we focus on the file identifier (file) to determine the trial from which the data originates, the tier name (tier_name) to differentiate between word and phone tiers, the start (xmin) and end (xmax) of each interval, and finally, the text. Additionally, I am not interested in retaining the file extension in the file identifier. Therefore, we will first filter to include only annotated words, then select the important columns using select(), remove the .TextGrid extension, and concatenate the words so that we can see the full response for each trial.
```r
example_df <- example_df |>
  # Keep only annotated intervals on the "words" tier
  filter(tier_name == "words" & text != "") |>
  # Select relevant columns
  select(file, xmin, xmax, text, annotation_num) |>
  # Remove .TextGrid and put the response together
  mutate(file = str_remove(file, "\\.TextGrid$"),
         response = paste(text, collapse = " "))
example_df
```
We also need some information about the trial. Luckily, all of that information is encoded in our file name. So, I am going to parse that name to create a dataframe with more information. I am using a set of functions that all start with separate_wider_:
- The delim version uses a delimiter to split a column of a dataframe.
- The regex version uses regular expressions to split the data.
- Finally, the position version uses the number of characters to split the data.

I am doing all of this because of how I initially coded my experiment output in my PCIbex script. You may need to change this code to process your own data.
```r
example_df <- example_df |>
  # Split the `file` column into 5 different columns.
  separate_wider_delim(file, "_",
                       names = c("subj", "exp", "headVerb", "NumNum", "sem_type"),
                       cols_remove = FALSE) |>
  # Split the headVerb column at the "U" character
  separate_wider_regex(headVerb, c(head = ".*", "U", verb_type = ".*")) |>
  # Add the "U" character back to the "Unacc" and "Unerg" values
  mutate(verb_type = paste0("U", verb_type)) |>
  # Split the head and distractor numbers.
  separate_wider_position(NumNum, c(head_num = 2, dist_num = 2))
example_df
```
Another Treat for you
Let's see what you will need to do. You will find your files in the MFA/output folder as well, and each file will look like jtfr_squid_S.TextGrid. Let's arbitrarily put one here. Remember, you will have to use the list.files() function as well. You will not need to change anything in the first part, where we work on the TextGrid. The necessary changes are in the parsing procedure: instead of using the regex or position methods, you will just need the delim version of the function.
```r
# Define the directory and list files
tg_dir <- "~/Documents/MFA/output"
your_file <- "jtfr_squid_S.TextGrid"

file_path <- file.path(tg_dir, your_file)

### LET'S SAY YOU RAN the read_textgrid function.
### I commented out this part, because I do not have your data.
### The final dataframe will be slightly different here, because I do not have the TextGrid data.
# your_df <- readtextgrid::read_textgrid(path = file_path) |>
#   filter(tier_name == "words" & text != "") |>
#   select(file, xmin, xmax, text, annotation_num) |>
#   mutate(file = str_remove(file, "\\.TextGrid$"),
#          response = paste(text, collapse = " "))

your_df <- your_df |>
  separate_wider_delim(file, "_", names = c("subj", "head", "condition"), cols_remove = FALSE)
your_df
```
| subj | head | condition | file | xmin | xmax | text | annotation_num | response |
|---|---|---|---|---|---|---|---|---|
| jtfr | squid | S | jtfr_squid_S | | | squid | 1 | squid jumped over the fence |
| jtfr | squid | S | jtfr_squid_S | | | jumped | 2 | squid jumped over the fence |
| jtfr | squid | S | jtfr_squid_S | | | over | 3 | squid jumped over the fence |
| jtfr | squid | S | jtfr_squid_S | | | the | 4 | squid jumped over the fence |
| jtfr | squid | S | jtfr_squid_S | | | fence | 5 | squid jumped over the fence |

(The xmin and xmax cells are empty here because, as noted above, I do not have your TextGrid data.)
For-loop
Now, we need to apply this process to all files in our output directory. To simplify this, I'll start by creating a function that processes a single file and then apply it to all files. The function takes a file name and its directory as inputs and returns a dataframe. Before reading each file, it prints a "Reading ..." message to indicate progress.
```r
process_textgrid <- function(file, directory) {
  cat("Reading", file, "\n")
  file_path <- file.path(directory, file)
  df <- readtextgrid::read_textgrid(path = file_path) |>
    filter(tier_name == "words" & text != "") |>
    select(file, xmin, xmax, text, annotation_num) |>
    mutate(file = str_remove(file, "\\.TextGrid$"),
           response = paste(text, collapse = " "))
  df <- df |>
    separate_wider_delim(file, "_",
                         names = c("subj", "exp", "headVerb", "NumNum", "sem_type"),
                         cols_remove = FALSE) |>
    separate_wider_regex(headVerb, c(head = ".*", "U", verb_type = ".*")) |>
    mutate(verb_type = paste0("U", verb_type)) |>
    separate_wider_position(NumNum, c(head_num = 2, dist_num = 2))
  return(df)
}
```
This is essentially the same process as before, but encapsulated within a function for easier application. Here’s an example of how I’m redefining my directory and file list:
```r
# Define the directory and list files
tg_dir <- "~/Documents/MFA/output"
file_list <- list.files(path = tg_dir, pattern = "\\.TextGrid$")
example_file <- file_list[1]
process_textgrid(example_file, tg_dir)
```
Your version will look like this.
```r
process_textgrid <- function(file, directory) {
  cat("Reading", file, "\n")
  file_path <- file.path(directory, file)
  df <- readtextgrid::read_textgrid(path = file_path) |>
    filter(tier_name == "words" & text != "") |>
    select(file, xmin, xmax, text, annotation_num) |>
    mutate(file = str_remove(file, "\\.TextGrid$"),
           response = paste(text, collapse = " "))
  df <- df |>
    separate_wider_delim(file, "_", names = c("subj", "head", "condition"), cols_remove = FALSE)
  return(df)
}

# Define the directory and list files
tg_dir <- "~/Documents/MFA/output"
file_list <- list.files(path = tg_dir, pattern = "\\.TextGrid$")
example_file <- file_list[1]
process_textgrid(example_file, tg_dir)
```
Now, we need to integrate this function into our for-loop. Instead of using a single file like file_list[1], we will apply it to an entire directory. Unlike the previous for loops, we will use the map() function from the purrr package. It is faster and easier to use in cases like this. map() will return all of our dataframes embedded in a list. After using map(), we need to combine all these smaller dataframes into a larger one using bind_rows().
```r
# Define the directory and list files
tg_dir <- "~/Documents/MFA/output"
file_list <- list.files(path = tg_dir, pattern = "\\.TextGrid$")

# Define our function. I am using mine; you should use your own.
process_textgrid <- function(file, directory) {
  cat("Reading", file, "\n")
  file_path <- file.path(directory, file)
  df <- readtextgrid::read_textgrid(path = file_path) |>
    filter(tier_name == "words" & text != "") |>
    select(file, xmin, xmax, text, annotation_num) |>
    mutate(file = str_remove(file, "\\.TextGrid$"),
           response = paste(text, collapse = " "))
  df <- df |>
    separate_wider_delim(file, "_",
                         names = c("subj", "exp", "headVerb", "NumNum", "sem_type"),
                         cols_remove = FALSE) |>
    separate_wider_regex(headVerb, c(head = ".*", "U", verb_type = ".*")) |>
    mutate(verb_type = paste0("U", verb_type)) |>
    separate_wider_position(NumNum, c(head_num = 2, dist_num = 2))
  return(df)
}

dfs <- map(file_list, process_textgrid, directory = tg_dir)

final_df <- bind_rows(dfs)
```
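As a side note, if you have purrr 1.0 or later, the map-then-bind step can be written as a single pipeline; the result should be identical:

```r
# Same result in one chain (requires purrr >= 1.0 for list_rbind())
final_df <- map(file_list, process_textgrid, directory = tg_dir) |>
  list_rbind()
```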
This completes our MFA alignment work. We have successfully completed every task on our list, aligned our data, and created a dataframe for analysis. You can check the structure of the final dataframe, the number of rows, and the count of unique trials.
```r
str(final_df)
nrow(final_df)
length(unique(final_df$file))
```
An Example Descriptive Summary
Let me also show a small example of descriptive statistics using basic functions. One thing we might want to do is check whether there is a difference between conditions in how long people take to start uttering the sentence. Let's assume people are still planning the sentence while they utter the first determiner "the," since it is fairly automatic in English and all of our sentences start with this determiner anyway. I am going to filter my dataframe using the text column. Additionally, I want to ensure it is the first occurrence of "the" and not any other "the" in the sentence, so I will also use the annotation_num column in my filtering.
```r
first_thes <- final_df |> filter(text == "the" & annotation_num < 3)
```
Then, I am going to summarize my dataframe. I will group it by the columns verb_type and sem_type, since my experiment had 2 verb types (unergative, unaccusative) and 2 types of semantic distractors (related, unrelated). Then, I am going to use the summarise() function to get the mean, standard error, and 95% confidence interval for each condition. Since we are assuming people are still planning the sentence while they utter the first determiner, I am going to summarize my data using the offset of the "the" (xmax). Before doing this, I am going to convert xmax from seconds to milliseconds.
```r
first_thes |>
  mutate(xmax = xmax * 1000) |>
  group_by(verb_type, sem_type) |>
  summarise(mean = mean(xmax),
            se = sd(xmax) / sqrt(n()),
            ci = se * 1.96)
```
From this point on, I leave the modeling and plotting of the data for another guide.
Footnotes
1. On principle, I am against for loops in R, but it is better to use them here rather than confuse you further.↩︎