My work at the Making and Knowing Project

Roni Kaufman, Ecole Polytechnique

March 30, 2020 - July 31, 2020

I would like to express my deepest gratitude to all the Project members whom I had the chance to work with: Pamela H. Smith, Naomi Rosenkranz, Terry Catapano, Dana Chaillard, Matthew Kumar, Gregory Schare, Clément Godbarge and Tianna Helena Uchacz. I truly thank you for working with me, for your constructive criticism and for trusting me.

All my work can be found here: https://github.com/cu-mkp/manuscript-object/tree/v1.0-ronikaufman (originally, this was found on the branch context of my fork of the GitHub repository manuscript-object).

My first task consisted of extracting the context of the text marked up with semantic tags. I used Python, and in particular the third-party module lxml, to go through the XML manuscript¹. For each semantic tag (animal, body part, currency, environment, material, medical, measurement, music, plant, place, personal name, profession, sensory, tool, temporal, arms and armor) and each version of the manuscript (TC, TCN and TL), I generated a CSV file with the columns “folio”, “text in tag”, “before” and “after”. The context is stored in “before” and “after”, written as Python lists of at most 10 words preceding and following the tagged text. The code to create or update these tables is in the file context.py². The CSV files are saved in the directory context³ (which, like all directories named here, is created when the script is executed if it doesn’t already exist).
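The extraction step described above can be sketched as follows. This is a minimal illustration and not the code in context.py: it uses the standard-library xml.etree.ElementTree in place of lxml so that it runs without extra dependencies, and the function name extract_context, the tag name m and the sample sentence are assumptions made for the example. Folio tracking and CSV writing are left out.

```python
import xml.etree.ElementTree as ET


def extract_context(xml_string, tag, window=10):
    """Return one row per <tag> element with the words before and after it.

    Hypothetical helper: it flattens the document into a list of words in
    reading order and remembers where each tagged element starts and ends.
    """
    root = ET.fromstring(xml_string)
    words = []  # every word of the document, in reading order
    spans = []  # (start, end) word indices of each tagged element

    def walk(elem):
        start = len(words)
        if elem.text:
            words.extend(elem.text.split())
        for child in elem:
            walk(child)
            # Text following a child still belongs to the current element.
            if child.tail:
                words.extend(child.tail.split())
        if elem.tag == tag:
            spans.append((start, len(words)))

    walk(root)
    return [
        {
            "text in tag": " ".join(words[start:end]),
            "before": words[max(0, start - window):start],
            "after": words[end:end + window],
        }
        for start, end in spans
    ]


# Assumed sample: <m> stands in for a semantic tag such as "material".
sample = "<ab>Take some <m>gold</m> and melt it with <m>quicksilver</m> slowly</ab>"
rows = extract_context(sample, "m", window=2)
```

With a window of 2, the first row holds the two words on either side of “gold”; near the end of the text the “after” list is simply shorter, which matches the “at most 10 words” behaviour described above.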

Then, I began visualizing this extracted data. I used the third-party Python module seaborn to create these visualizations, as well as all the following ones. The script is in the file context_viz.py⁴ and the images are saved in the directory context_visualizations⁵. For each version of the manuscript, I created a bar plot showing the total number of distinct words in all the context lists of each tag, and a heatmap showing, for these words, the size of the intersection between every pair of tags. I later designed another heatmap to show the inclusion of contexts instead of their intersection. This second heatmap is referred to as “asymmetrical” in the file names (as opposed to “symmetrical” for the first kind). I also designed a normalized bar plot, in which the bar heights are divided by the total number of times the tag appears in the manuscript. These plots show how similar or varied the author-practitioner’s vocabulary is when talking about different topics.

In order to understand the differences between the translations, I generated new versions of all these visualizations. For the bar plots, the three versions are grouped in one graph, and for the heatmaps, I simply subtracted the matrices of one version from those of another to create new ones. They are inside the subdirectory comparisons of context_visualizations⁶.

Towards the end of my internship, using this code base, I also generated all these heatmaps for the words inside the tags, instead of their contexts. These are saved in the manuscript_visualizations directory⁷, because they concern the whole manuscript rather than the context. Still, it is interesting to compare them with the context heatmaps.
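The matrices behind the two kinds of heatmaps, and the subtraction used for the comparisons, might be computed along these lines. This is a sketch under assumptions rather than the code in context_viz.py: words are lowercased before comparison, and “inclusion” is taken to mean the share of one tag’s vocabulary that also occurs in another tag’s vocabulary. The function names and the sample data are hypothetical.

```python
def vocabularies(contexts):
    """Map each tag to the set of distinct (lowercased) words in its contexts."""
    return {
        tag: {word.lower() for context in tag_contexts for word in context}
        for tag, tag_contexts in contexts.items()
    }


def overlap_matrices(contexts):
    """Return the tag order, the symmetric matrix and the asymmetric matrix."""
    vocab = vocabularies(contexts)
    tags = sorted(vocab)
    # Symmetric: number of words shared by the two tags.
    symmetric = [[len(vocab[a] & vocab[b]) for b in tags] for a in tags]
    # Asymmetric (assumed definition): fraction of a's words also found in b.
    asymmetric = [
        [len(vocab[a] & vocab[b]) / len(vocab[a]) if vocab[a] else 0.0 for b in tags]
        for a in tags
    ]
    return tags, symmetric, asymmetric


def difference(matrix_a, matrix_b):
    """Element-wise subtraction, used to compare two manuscript versions."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(matrix_a, matrix_b)]


# Assumed sample: context word lists for two tags.
contexts = {
    "material": [["melt", "the", "gold"], ["Gold", "and", "fire"]],
    "tool": [["the", "hammer"], ["fire"]],
}
tags, symmetric, asymmetric = overlap_matrices(contexts)
```

Each resulting matrix can then be handed to seaborn.heatmap, with the tag order used for both axes. The diagonal of the symmetric matrix gives the vocabulary size of each tag, which is the quantity shown in the bar plot.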

Following these observations on the differences between translations, I was given the idea of visualizing the occurrences of other languages in the English version throughout the manuscript. As this plot wasn’t related to the context, I wrote the code in a new file, manuscript_visualizations.py, with the intention of adding more visualizations to it later on. One execution generates all the plots inside manuscript_visualizations⁸. As with the context extraction, the data used is the XML versions of the manuscript. I first tried making a scatter plot, with the folios in order on the x-axis, the tag count on the y-axis and the language (French, Greek, Italian, Latin, Occitan and Poitevin) as the color hue of the points. But the data was too noisy and the plot was very hard to read, so I discarded it. I then tried visualizing this data as a bubble plot instead, with one line per language, the entries on the x-axis (which is more relevant than the folios) and the tag count as the bubble size, and the result was much better.

languages bubbles

In a normalized version of this plot, I mapped the bubble size to the tag count divided by the total number of words of the entry.

languages bubbles normalized
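The data behind the two bubble plots, the raw tag count per entry and the count divided by the number of words of the entry, could be gathered as in the sketch below. It is only an illustration: it uses the standard-library xml.etree.ElementTree rather than lxml, and the element name div, the id attribute and the tag names la and it are assumptions about the markup, as is the sample text.

```python
import xml.etree.ElementTree as ET


def language_counts(xml_string, language_tags, entry_tag="div"):
    """Count language tags per entry, raw and normalized by entry length."""
    root = ET.fromstring(xml_string)
    rows = []
    for entry in root.iter(entry_tag):
        # Number of words in the entry, ignoring the markup.
        n_words = len("".join(entry.itertext()).split())
        for lang in language_tags:
            count = sum(1 for _ in entry.iter(lang))
            if count:
                rows.append({
                    "entry": entry.get("id"),
                    "language": lang,
                    "count": count,
                    "normalized": count / n_words if n_words else 0.0,
                })
    return rows


# Assumed sample with three entries, one of which has no language tag.
sample = (
    '<root>'
    '<div id="e1">Take <la>aqua</la> <la>vitae</la> now</div>'
    '<div id="e2">no tags here</div>'
    '<div id="e3"><it>sal</it> gemma</div>'
    '</root>'
)
rows = language_counts(sample, ["la", "it"])
```

Rows of this shape can be plotted with seaborn.scatterplot, mapping “entry” to x, “language” to y and either “count” or “normalized” to the size parameter.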

I also created a swarm plot to visualize this same data, with the entries on the x-axis, the categories on the y-axis and the language as the color hue. In its background, all entries are drawn as lines, even if they don’t contain the tag of interest. Where there are no dots, this makes it possible to tell whether no entries carry this tag or whether no entries belong to this category at that point in the manuscript.

languages swarmplot

I designed bar plots as well, with the categories on the x-axis and the tag counts on the y-axis, one for each language.

Greek tag by category

Italian tag by category

Latin tag by category

Occitan tag by category

Poitevin tag by category

Then, I adapted these plots so that they could easily be created for different tags, and all the code for the following visualizations was written with the same idea in mind. So, as other team members suggested, I generated them for additions and deletions, for margins, for semantic tags and for figures and insertion marks. For additions/deletions and figures/marks, an additional bar plot groups the tags in separate columns inside the same image in order to compare them.

Add+del tag by category

Figures+marks tag by category

Finally, for the semantic tags, I also created bar plots to show, in each manuscript version, how many words there are on average between the opening and closing tags.

TC semantic tags

TCN semantic tags

TL semantic tags
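The average number of words inside each semantic tag could be computed as in the following sketch. It assumes the standard-library xml.etree.ElementTree instead of lxml, and the tag names m, tl and pl as well as the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET


def mean_words_per_tag(xml_string, tags):
    """Average number of words enclosed by each tag (tags never used are skipped)."""
    root = ET.fromstring(xml_string)
    means = {}
    for tag in tags:
        # Word count of every element with this tag, nested markup ignored.
        lengths = [len("".join(elem.itertext()).split()) for elem in root.iter(tag)]
        if lengths:
            means[tag] = sum(lengths) / len(lengths)
    return means


# Assumed sample: two <m> elements and one <tl> element.
sample = "<root><m>gold</m> and <m>fine white sand</m> with <tl>a hammer</tl></root>"
means = mean_words_per_tag(sample, ["m", "tl", "pl"])
```

Running the same function on the TC, TCN and TL files gives one dictionary per version, which is the input needed for one bar plot per version.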

To visualize other properties of the manuscript not related to particular tags, I designed some more plots. One bar plot counts the number of entries tagged with each category.
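Counting the entries per category can be sketched as below. This is an illustration under assumptions: it supposes that each entry is a div element whose categories are listed in a semicolon-separated categories attribute, and it uses the standard-library xml.etree.ElementTree rather than lxml. The function name and the sample are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def entries_per_category(xml_string, entry_tag="div", attribute="categories", separator=";"):
    """Count how many entries carry each category."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for entry in root.iter(entry_tag):
        raw = entry.get(attribute, "")
        # A set ensures an entry is counted once per category.
        counts.update({name.strip() for name in raw.split(separator) if name.strip()})
    return counts


# Assumed sample: two categorized entries and one without categories.
sample = (
    '<root>'
    '<div categories="casting;painting"/>'
    '<div categories="casting"/>'
    '<div/>'
    '</root>'
)
counts = entries_per_category(sample)
```

An entry with several categories contributes to each of them, so the bar heights can sum to more than the total number of entries.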