While reviewing values in https://github.com/cu-mkp/m-k-manuscript-data/blob/master/metadata/entry_metadata.csv, we discovered a few issues that have resulted in missing or modified values of tagged terms.
For each semantic tag type (al, bp, m, ms, tl, etc.), populate a cell with every verbatim term per entry. Include every occurrence, NOT just the unique terms (e.g., if a phrase is repeated verbatim, include it as many times as it appears). Create one cell each for tc, tcn, and tl per tag (i.e., `tc_m`, `tcn_m`, `tl_m` for m (material)).
Use div id 005r2 as a test case. Entry-xml here: https://github.com/cu-mkp/m-k-manuscript-data/blob/master/entries/xml/tl/tl005r_2.xml
Current values for bp (body part) are:
| bp_tc | bp_tcn | bp_tl |
| -- | -- | -- |
| visaige; teston; lieulx secrets; poil de la barbe; doigt; petit doigt | visaige; teston; lieulx secrets; poil de la barbe; doigt; petit doigt | finger; face; nipple; hair; secret place |
While there are 7 strings in https://github.com/cu-mkp/m-k-manuscript-data/blob/master/entries/xml/tl/tl005r2.xml, tc and tcn are only picking up 6 and tl only 5.
- `update.py` seems to be displaying unique terms only (e.g., "finger" is tagged twice but only appears once): `representation with your <bp>finger</bp>, another <bp>finger</bp>`
- another problem, either with some form of normalization or with tags wrapped within another tag (e.g., "little finger" is missing, perhaps because it is also within `<ms>` tags or because it is filed under "finger"): `your <ms><bp>little finger</bp></ms>, perfectly round`
- regex problems with the character "s" (everything after an "s" is clipped due to regex white space character):
  - `<bp>hairs of your beard</bp>` --> hair
  - `<bp>secret places</bp>` --> secret place
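The three failure modes above can be pinned down as a small regression check. This is a minimal sketch, not the code in `update.py`: it uses the standard-library `xml.etree.ElementTree` as a stand-in parser, and the `<ab>` wrapper element and the function name `terms_by_tag` are hypothetical. The fragments are the ones quoted above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Fragments quoted in this issue, wrapped in a hypothetical <ab> root
# so that they parse as one well-formed document.
SAMPLE = (
    "<ab>representation with your <bp>finger</bp>, another <bp>finger</bp> "
    "your <ms><bp>little finger</bp></ms>, perfectly round "
    "<bp>hairs of your beard</bp> <bp>secret places</bp></ab>"
)


def terms_by_tag(xml_text):
    """Map each tag name to the verbatim text of every occurrence, in document order."""
    root = ET.fromstring(xml_text)
    found = defaultdict(list)
    for el in root.iter():
        if el is root:
            continue  # skip the wrapper element
        # itertext() also collects text inside child tags; runs of whitespace are collapsed.
        found[el.tag].append(" ".join("".join(el.itertext()).split()))
    return dict(found)


cells = terms_by_tag(SAMPLE)
print("; ".join(cells["bp"]))  # finger; finger; little finger; hairs of your beard; secret places
print("; ".join(cells["ms"]))  # little finger
```

With this approach "finger" is kept twice, "little finger" lands in both the `bp` and the `ms` cell, and nothing is clipped after an "s".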
There are many instances throughout the manuscript where different tags are nested within each other, like the `<ms><bp>` used above in 5r. These should be checked to make sure they are picked up as both ms and bp (i.e., populating both the cell for ms and the cell for bp).
More complicated still, there are nested tags of the same type (i.e., `<bp><bp>`).
Use div id 013r4 as a test case: https://github.com/cu-mkp/m-k-manuscript-data/blob/master/entries/xml/tl/tl013r_4.xml
Current values for m (material) are:
| m_tc | m_tcn | m_tl |
| -- | -- | -- |
| farine; poil de cheval; estamine de soye crue; soye creue; fumee du soufre; soye; soye crue jaulne & naturelle; soye crue | farine; poil de cheval; estamine de soye crue; soye creue; fumee du soufre; soye; soye crue jaulne & naturelle; soye crue | flour; smoke; tammy of raw silk; horsehair; silk; yellow raw & natural silk |
While there are 9 material strings in https://github.com/cu-mkp/m-k-manuscript-data/blob/master/entries/xml/tl/tl005r2.xml, tl is only picking up 7 (tc and tcn seem to be picking up 9).
- Here we have multiple nested m tags which are not being returned as separate values. For example, the string `<m><m>raw silk</m> whitened by <m>sulfur smoke</m></m>` should return:
  1. raw silk whitened by sulfur smoke
  2. raw silk
  3. sulfur smoke
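The expected result for same-type nesting can be written down as a check. A minimal sketch, assuming the standard-library `xml.etree.ElementTree` as the parser (the project may use a different one) and a hypothetical helper name `nested_terms`:

```python
import xml.etree.ElementTree as ET

# The nested string quoted above.
FRAGMENT = "<m><m>raw silk</m> whitened by <m>sulfur smoke</m></m>"


def nested_terms(xml_text, tag):
    """Return the full text of every <tag> element, outer elements before inner ones."""
    root = ET.fromstring(xml_text)
    # iter(tag) yields the root itself when it matches, then matching
    # descendants in document order; itertext() flattens nested children.
    return ["".join(el.itertext()) for el in root.iter(tag)]


materials = nested_terms(FRAGMENT, "m")
for number, term in enumerate(materials, start=1):
    print(number, term)
```

Each element is returned with the text of its children included, so the parent `<m>` yields the whole phrase and each inner `<m>` yields its own term.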
re: nested tags, for a particularly complicated example:
https://github.com/cu-mkp/m-k-manuscript-data/blob/32582b93455546cdefacadd5c7206370902748fa/ms-xml/tc/tcp139vpreTEI.xml#L38-L41
that is, m/tl/m
re: nested tags: another edge case
https://github.com/cu-mkp/m-k-manuscript-data/blob/32582b93455546cdefacadd5c7206370902748fa/ms-xml/tc/tcp164vpreTEI.xml#L109-L111
multiple `<m>`s in an extended parent `<m>`
Apparently solved by setting `apply_corrections=False` when generating the manuscript object in `update.py`. When `True`, this tells the manuscript to use the thesaurus corrections on the TL version, which cut off and ignore many tags. @njr2128, could you peruse the updated `entry-metadata.csv` in my branch to see if there are remaining problems?
As noted, duplicate properties were being ignored because the Recipe object treats tags as a "set" (no duplicates) rather than a list. This was a simple one-line change and the problem went away.
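To illustrate the difference (a sketch only; the variable names are hypothetical and this is not the actual `Recipe` code):

```python
# Terms as they might be collected for one entry; "finger" is tagged twice.
tagged = ["finger", "finger", "little finger"]

as_set = set(tagged)    # duplicates collapse and order is not guaranteed
as_list = list(tagged)  # every occurrence is kept, in document order

cell = "; ".join(as_list)
print(len(as_set), len(as_list))  # 2 3
print(cell)                       # finger; finger; little finger
```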
No change was necessary to handle nested tags of different properties, but nested tags of the same property were ignored because the regular expressions used to find the property tags were XML-unaware. Switching to an XML-aware parser (in this case BeautifulSoup with an `lxml` parser) solves the issue completely. I strongly suggest switching to the `lxml` Python package instead of regex for the rest of the manuscript-object functions. In addition to being XML-aware, it offers a lot of utility and control over encoding, which would solve many other open issues, such as #1683, #1613, and possibly #1623 if we decide to use it to generate derivative `.txt`
files. Indeed, Terry and I noticed while testing that in `entry-metadata.csv`, `&amp;` was converted to `&`.
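Two of the points above can be reproduced in a few lines. This is a hedged sketch: the pattern below is an illustrative XML-unaware regex, not the one actually used in manuscript-object, the `ENTITY` fragment is a made-up example, and the standard-library `xml.etree.ElementTree` stands in for `lxml` or BeautifulSoup.

```python
import re
import xml.etree.ElementTree as ET

NESTED = "<m><m>raw silk</m> whitened by <m>sulfur smoke</m></m>"
ENTITY = "<m>yellow raw &amp; natural silk</m>"  # hypothetical fragment

# An XML-unaware, non-greedy pattern pairs the first opening tag with the
# nearest closing tag, so the outer element is never matched as a whole.
PATTERN = r"<m>(.*?)</m>"
naive_nested = re.findall(PATTERN, NESTED)
print(naive_nested)  # ['<m>raw silk', 'sulfur smoke']

# The same pattern returns the raw source, entity reference included.
naive_entity = re.findall(PATTERN, ENTITY)
print(naive_entity)  # ['yellow raw &amp; natural silk']

# A parser decodes the entity reference when it returns the element text.
decoded = ET.fromstring(ENTITY).text
print(decoded)  # yellow raw & natural silk
```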
Make sure to use the `update.py` in `m-k-manuscript-data` from now on; the version in `manuscript-object` is waiting to be removed in a pull request.
It was kinda painful to wait for the entire manuscript to regenerate every single time; maybe we should look into some form of caching, e.g. with Python's `pickle` or JSON, for cases where we are testing functionality but not changing the original files?
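One possible shape for such a cache, as a rough sketch: rebuild only when a source file is newer than the cached pickle. Everything here is hypothetical (`load_or_build`, the `build` callback, the cache file name); the real manuscript object may not pickle cleanly, and pickle files should only be loaded from locations we control.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(source_files, cache_path, build):
    """Return the cached object unless any source file is newer than the cache."""
    cache_path = Path(cache_path)
    sources = [Path(p) for p in source_files]
    if cache_path.exists():
        cache_mtime = cache_path.stat().st_mtime
        if all(p.stat().st_mtime <= cache_mtime for p in sources):
            with cache_path.open("rb") as fh:
                return pickle.load(fh)
    # Cache is missing or stale: rebuild and store the result.
    obj = build(sources)
    with cache_path.open("wb") as fh:
        pickle.dump(obj, fh)
    return obj


# Demonstration with a stand-in for the expensive manuscript generation.
calls = []


def build(paths):
    calls.append(len(paths))  # record each (expensive) rebuild
    return {p.name: p.read_text(encoding="utf-8") for p in paths}


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "tl005r_2.xml"
    src.write_text("<bp>finger</bp>", encoding="utf-8")
    cache = Path(tmp) / "manuscript.pickle"
    first = load_or_build([src], cache, build)
    second = load_or_build([src], cache, build)  # served from the cache

print(calls)  # [1]
```

Comparing modification times keeps the check cheap; hashing file contents would be more robust if timestamps are unreliable (for example after a fresh `git checkout`).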