#1726: entry-metadata missing values

opened by njr2128

While reviewing values in https://github.com/cu-mkp/m-k-manuscript-data/blob/master/metadata/entry_metadata.csv, we discovered a few issues that have resulted in missing or modified values of tagged terms.

To summarize the desired spec:

For each semantic tag type (al, bp, m, ms, tl, etc.), populate a cell with every verbatim term per entry. This is NOT unique terms, but every term (e.g., if a phrase is repeated verbatim, include it as many times as it appears). Create one cell each for tc, tcn, and tl per tag (i.e., tc_m, tcn_m, tl_m for m (material)).

missing and clipped values

Use div id 005r2 as a test case. Entry-xml here: https://github.com/cu-mkp/m-k-manuscript-data/blob/master/entries/xml/tl/tl005r_2.xml

Current values for bp (body part) are:

bp_tc | bp_tcn | bp_tl
-- | -- | --
visaige; teston; lieulx secrets; poil de la barbe; doigt; petit doigt | visaige; teston; lieulx secrets; poil de la barbe; doigt; petit doigt | finger; face; nipple; hair; secret place

While there are 7 strings in https://github.com/cu-mkp/m-k-manuscript-data/blob/master/entries/xml/tl/tl005r2.xml, tc and tcn are only picking up 6 and tl is 5.

- update.py seems to be displaying unique terms only (e.g., "finger" is tagged twice but only appears once):
  representation with your <bp>finger</bp>, another <bp>finger</bp>
- another problem either with some form of normalization or tags wrapped within another tag (e.g., "little finger" is missing, perhaps because it is also within <ms> tags or because it is filed under "finger"):
  your <ms><bp>little finger</bp></ms>, perfectly round
- regex problems with the character "s" (everything after an "s" is clipped due to regex white space character):
  <bp>hairs of your beard</bp> --> hair
  <bp>secret places</bp> --> secret place
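A check along these lines could make the gaps easier to spot. This is only a sketch: `missing_terms` is a hypothetical helper, the excerpt is stitched together from the snippets quoted in this issue rather than the full entry, and the standard-library ElementTree parser stands in for whatever update.py actually uses.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Excerpt assembled from the snippets quoted in this issue, NOT the full entry.
excerpt = (
    "<ab>representation with your <bp>finger</bp>, another <bp>finger</bp>, "
    "your <ms><bp>little finger</bp></ms>, perfectly round, "
    "<bp>hairs of your beard</bp>, <bp>secret places</bp></ab>"
)

# The tl cell for bp as it currently appears in the CSV.
cell = "finger; face; nipple; hair; secret place"


def missing_terms(xml_text, tag, cell_value):
    """Count tagged terms in the XML that the CSV cell does not account for."""
    root = ET.fromstring(xml_text)
    tagged = Counter("".join(el.itertext()) for el in root.iter(tag))
    listed = Counter(cell_value.split("; "))
    # Counter subtraction keeps only terms with a positive remaining count.
    return tagged - listed


missing = missing_terms(excerpt, "bp", cell)
print(sorted(missing.elements()))
# ['finger', 'hairs of your beard', 'little finger', 'secret places']
```

Because the comparison uses counts rather than sets, the second "finger" shows up as missing even though one "finger" is present in the cell.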

nested tags issues

Use div id 013r4 as a test case: https://github.com/cu-mkp/m-k-manuscript-data/blob/master/entries/xml/tl/tl013r_4.xml

Current values for m (material) are:

m_tc | m_tcn | m_tl
-- | -- | --
farine; poil de cheval; estamine de soye crue; soye creue; fumee du soufre; soye; soye crue jaulne & naturelle; soye crue | farine; poil de cheval; estamine de soye crue; soye creue; fumee du soufre; soye; soye crue jaulne & naturelle; soye crue | flour; smoke; tammy of raw silk; horsehair; silk; yellow raw & natural silk

While there are 9 material strings in https://github.com/cu-mkp/m-k-manuscript-data/blob/master/entries/xml/tl/tl005r2.xml, tl is only picking up 7 (tc and tcn seem to be picking up 9).

- Here we have multiple nested m tags which are not being returned as separate values. For example, the string <m><m>raw silk</m> whitened by <m>sulfur smoke</m></m> should return:
  1. raw silk whitened by sulfur smoke
  2. raw silk
  3. sulfur smoke
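To pin down the expected behavior, here is a rough sketch of the extraction that would produce those three values. `tagged_terms` is a hypothetical helper and the standard-library ElementTree parser is an assumption made for illustration; it is not how update.py currently works.

```python
import xml.etree.ElementTree as ET


def tagged_terms(xml_fragment: str, tag: str) -> list[str]:
    """Return the text of every <tag> element, parents before their children."""
    # Wrap the fragment in a dummy root so bare text around tags still parses.
    root = ET.fromstring(f"<fragment>{xml_fragment}</fragment>")
    terms = []
    for element in root.iter(tag):
        # itertext() descends into nested children, flattening inner markup.
        text = "".join(element.itertext())
        # Collapse runs of whitespace (e.g. line breaks inside the XML).
        terms.append(" ".join(text.split()))
    return terms


sample = "<m><m>raw silk</m> whitened by <m>sulfur smoke</m></m>"
print(tagged_terms(sample, "m"))
# ['raw silk whitened by sulfur smoke', 'raw silk', 'sulfur smoke']
```

The same helper returns a tag nested inside a different tag, so "little finger" inside <ms> in the 005r2 case would come back as its own bp value.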


tcatapano commented:

re: nested tags, for a particularly complicated example:

https://github.com/cu-mkp/m-k-manuscript-data/blob/32582b93455546cdefacadd5c7206370902748fa/ms-xml/tc/tcp139vpreTEI.xml#L38-L41

that is, m/tl/m


tcatapano commented:

re: nested tags: another edge case

https://github.com/cu-mkp/m-k-manuscript-data/blob/32582b93455546cdefacadd5c7206370902748fa/ms-xml/tc/tcp164vpreTEI.xml#L109-L111

multiple <m>'s in an extended parent <m>


gschare commented:

Missing and clipped values:

Apparently solved by setting apply_corrections=False when generating the manuscript object in update.py. When True, this tells the manuscript to use the thesaurus corrections on the TL version, which cut off and ignore many tags. @njr2128, could you peruse the updated entry-metadata.csv in my branch to see if there are remaining problems?

As noted, duplicate properties were being ignored because the Recipe object treats tags as a "set" (no duplicates) rather than a list. This was a simple one-line change and the problem went away.
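For anyone following along, the difference looks roughly like this. It is a toy sketch, not the actual Recipe code: `occurrences` and `format_cell` are made-up names, and the "; " separator is taken from the cells quoted in this issue.

```python
# Hypothetical <bp> occurrences for one entry, in document order.
occurrences = ["finger", "face", "finger"]


def format_cell(terms):
    """Join terms into a single CSV cell using the "; " separator."""
    return "; ".join(terms)


# A set drops the repeated term (and has no inherent order, hence sorted()).
deduplicated = format_cell(sorted(set(occurrences)))

# A list keeps every occurrence in the order it was tagged.
verbatim = format_cell(list(occurrences))

print(deduplicated)  # face; finger
print(verbatim)      # finger; face; finger
```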

Nested tags

No change was necessary to handle nested tags of different properties, but nesting the same properties got ignored because the regular expressions used to find the property tags were XML-unaware. Switching to an XML-aware parser (in this case BeautifulSoup with an lxml parser) solves the issue completely. I strongly suggest switching to the lxml Python package instead of regex for the rest of the manuscript-object functions. In addition to being aware of the XML, it includes lots of utility and control over encoding which would solve many other open issues, such as #1683, #1613, and possibly #1623 if we decide to use it to generate derivative .txt files. Indeed, Terry and I noticed while testing that in entry-metadata.csv, &amp; was converted to &.
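A small demonstration of both points, using the sample string from the issue. The pattern here is an illustrative guess at the kind of XML-unaware regex involved, not the one from manuscript-object, and the standard-library ElementTree parser stands in for BeautifulSoup/lxml so the snippet runs without extra installs.

```python
import re
import xml.etree.ElementTree as ET

nested = "<m><m>raw silk</m> whitened by <m>sulfur smoke</m></m>"

# A non-greedy pattern pairs each <m> with the nearest </m> and has no notion
# of depth, so the outer element is lost and a stray tag leaks into a value.
regex_terms = re.findall(r"<m>(.*?)</m>", nested)
print(regex_terms)
# ['<m>raw silk', 'sulfur smoke']

# An XML parser decodes entities while reading, which is why &amp; in the
# source file appears as & in the generated CSV.
decoded = ET.fromstring("<m>soye crue jaulne &amp; naturelle</m>").text
print(decoded)
# soye crue jaulne & naturelle
```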

Other

Make sure to use the update.py in m-k-manuscript-data from now on; the version in manuscript-object is waiting to be removed in a pull request.

It was kinda painful to wait for the entire manuscript to regenerate every single time; maybe we should look into some form of caching, e.g. with Python Pickle or JSON for instances where we are testing functionality but not changing the original files?
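Something like the following might be enough to start with. It is a sketch under assumptions: `load_or_build` and its parameters are hypothetical, and it presumes the manuscript object can be pickled at all, which I have not checked.

```python
import os
import pickle


def load_or_build(cache_path, source_paths, build):
    """Return the cached object unless a source file is newer than the cache.

    cache_path:   where the pickle is stored
    source_paths: files the object is derived from (e.g. the entry XML files)
    build:        zero-argument callable that regenerates the object
    """
    if os.path.exists(cache_path):
        cache_time = os.path.getmtime(cache_path)
        # Reuse the cache only if no source changed after it was written.
        if all(os.path.getmtime(path) <= cache_time for path in source_paths):
            with open(cache_path, "rb") as handle:
                return pickle.load(handle)
    result = build()
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result


# Hypothetical usage; build_manuscript and xml_paths are placeholders:
# manuscript = load_or_build("manuscript.pickle", xml_paths, build_manuscript)
```

Since pickle files can execute code when loaded, the cache should stay local and untracked rather than being committed to the repo.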