#534: Extractions from language elements

opened by tcatapano

From @thuchacz

Extract:

1. The contents of all <la></la> tags from either the TC, TCN, or TL (there are 109 instances in each of the 3 versions, so I don't think it matters which you use). I need these because I need to assign the translation of Latin words, phrases, and sentences to TT and CAG, and also check them against existing <comment> tags. Extraction would help me immensely.

2. The contents of all <fr></fr> tags from the TL. What I need here is trickier. Before the merges you just committed, there were 1660 <fr> tags used in the TL, 789 of which were <del><fr></fr></del> nests, which I'm not interested in at all. Is there a way for you to provide me the strings inside <fr></fr> but not those inside <del><fr></fr></del>? If not, I can work around this (provided they are extracted in the order they appear in the folios), but it will simply take me longer.

tcatapano commented:

@thuchacz here are the lists: french.csv.txt latin.csv.txt

FYI, I used these XPath expressions. For the Latin:

for $la in //la return concat(replace($la, '\n', ' '), '|', $la/preceding::page[1])

and for the French, skipping any <fr> whose parent is a <del>:

for $fr in //fr[not(parent::del)] return concat(replace($fr, '\n', ' '), '|', $fr/preceding::page[1])
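For anyone who wants to reproduce the lists without an XPath 2.0 processor, here is a rough Python equivalent using only the standard library. It is a sketch, not what was actually run for the attached files. The `extract_rows` function and the `SAMPLE` document are made up for illustration, and it assumes that `<page>` is a marker element holding the folio id as text and sitting before the content it labels, not wrapping it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real TC/TCN/TL files will look different.
SAMPLE = """<root>
<page>001r</page>
<ab>Use <fr>eau</fr> and <del><fr>vin</fr></del> with <la>aqua
vitae</la>.</ab>
<page>001v</page>
<ab><fr>sel</fr></ab>
</root>"""


def extract_rows(xml_text, tag, skip_parent=None):
    """Return 'text|page' rows for every <tag>, in document order.

    Elements whose direct parent is <skip_parent> are left out,
    mirroring the [not(parent::del)] predicate.
    """
    root = ET.fromstring(xml_text)
    # ElementTree has no parent axis, so build a child -> parent lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    rows = []
    current_page = ""  # text of the most recent <page> seen so far
    for el in root.iter():  # iter() walks the tree in document order
        if el.tag == "page":
            current_page = "".join(el.itertext())
        elif el.tag == tag:
            parent = parent_of.get(el)
            if skip_parent and parent is not None and parent.tag == skip_parent:
                continue
            # Same newline-to-space replacement as in the XPath version.
            text = "".join(el.itertext()).replace("\n", " ")
            rows.append(f"{text}|{current_page}")
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE, "fr", skip_parent="del"):
        print(row)
```

Like the XPath predicate, this only skips an <fr> whose direct parent is <del>; an <fr> nested deeper inside a <del> would still be listed.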