I was very interested in, and a little surprised by, the recent announcement from the Gaelic Algorithm Research Group of a Gaelic Linguistic Analyser which performs Part-of-Speech tagging, lemmatisation, and syntactic parsing. Surprised, because I knew of the work that had been done previously on automatic PoS tagging, but had not realised that things had developed considerably since then.
For quite a few years, I used Foreign Language Text Reader (FLTR) for my Gaelic reading, allowing me to tag and store glosses and other data for individual words and multiword expressions. When migrating to a new computer recently, I sadly lost all my stored FLTR data.
In the work I have been contributing to the Greek Learner Texts Project, and in the many discussions I’ve had with James Tauber, a lot of our shared interest comes back around to a few questions:
- How do you help learners read more text, more easily?
- How do you select appropriate texts for readers?
- How do you build a platform that overcomes the difficulties of reading?
Those discussions often involve a cyclical movement from pedagogy to interfaces to data. In an ideal world, I think you would have a reading interface/platform that:

- (a) gave pop-up information on all the words and phrases you needed help with;
- (b) had accurate tagging and data for all the words in a large number of texts;
- (c) tracked the words (and structures, syntax, etc.) you were exposed to: not just a binary know/don’t know, but the number of exposures, how often you’d needed to click for help, the time since your last exposure, and so on;
- (d) could suggest new texts that require only minimal steps of new vocab (or structures), ideally keeping you reading at a 98% recognition level.
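To make (c) and (d) a little more concrete, here is a minimal sketch in Python of what exposure tracking and text suggestion might look like. Everything here is my own assumption for illustration (the `Exposure` record, texts represented as lists of lemmas, the `min_seen` threshold); none of it comes from the tools mentioned above.

```python
from dataclasses import dataclass

# Hypothetical per-lemma exposure record: richer than a binary known/unknown flag.
@dataclass
class Exposure:
    seen: int = 0          # total times the learner has met this lemma
    lookups: int = 0       # times they clicked for help on it
    last_seen: float = 0.0 # timestamp of the most recent exposure

def recognition_level(lemmas, history, min_seen=3):
    """Fraction of tokens whose lemma the learner has met at least
    `min_seen` times (a crude proxy for 'known')."""
    if not lemmas:
        return 0.0
    known = sum(1 for lemma in lemmas
                if history.get(lemma, Exposure()).seen >= min_seen)
    return known / len(lemmas)

def suggest_next(texts, history, target=0.98):
    """Among texts at or above the target recognition level, pick the one
    with the most new material (the lowest level still >= target); if no
    text is readable enough yet, fall back to the easiest one."""
    scored = {title: recognition_level(lemmas, history)
              for title, lemmas in texts.items()}
    readable = {title: r for title, r in scored.items() if r >= target}
    if readable:
        return min(readable, key=readable.get)
    return max(scored, key=scored.get)
```

The fallback at the end is a pedagogical choice: if nothing reaches the 98% threshold yet, the learner is better served by the easiest available text than by no suggestion at all.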
This requires both tools, such as those being developed for the Greek Learner Texts (which are generally language-independent), and a platform, such as the one being developed for Hedera; and it requires a corpus with relevant data, and/or the ability for learners to import their own texts. There already exists a digital corpus of Gaelic texts, DASG, though it does not appear to be open access, nor is it clear what data is associated with it.
All of which is to say, I think we’re at the point where there is enough of a convergence of tools and resources that creating something like a learner-oriented Gaelic reading platform, and a database of texts, is more within reach than ever before. However, two particular obstacles remain. First, the PoS tagger is 91% accurate on the full tagset and 95% on the simplified one. Hand-curating the tagging would improve individual texts, and feeding those manual corrections back to the GARG would probably improve the tagger itself over time; for the meantime, though, starting with computer-tagged texts and correcting them remains necessary (I sketch a possible correction workflow below). I had previously made a small start on hand-tagging some texts, but it is very laborious; correcting computer-tagged texts should be a lot faster. Secondly, the copyright status of texts is an issue. For Ancient Greek, our great advantage is that the texts were authored millennia ago, and many print editions are out of copyright. Providing contemporary Gaelic texts will require specific permissions. It would be great to see producers of publicly available material (e.g. LearnGaelic.scot) include licensing permission for reuse of texts in a project like this.
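On the first obstacle, the correction pass itself can be kept lightweight. Here is a minimal sketch of the kind of workflow I have in mind, assuming (purely hypothetically) tagger output as one tab-separated token/tag pair per line, with corrections kept in a separate file; the Analyser’s actual output format may well differ.

```python
import csv

def apply_corrections(tagged_path, corrections_path, out_path):
    """Overlay hand-made tag corrections onto computer-tagged text.

    Assumed formats (hypothetical): the tagged file has one
    'token<TAB>tag' pair per line; the corrections file has
    'line_number<TAB>corrected_tag' rows.
    """
    with open(corrections_path, newline="", encoding="utf-8") as f:
        fixes = {int(row[0]): row[1] for row in csv.reader(f, delimiter="\t")}

    with open(tagged_path, newline="", encoding="utf-8") as f_in, \
         open(out_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out, delimiter="\t")
        for i, row in enumerate(csv.reader(f_in, delimiter="\t"), start=1):
            token, tag = row[0], row[1]
            writer.writerow([token, fixes.get(i, tag)])
```

Keeping the corrections separate from the tagger output, rather than editing it in place, preserves the diff between machine output and human judgement, which is exactly the data that would be worth feeding back to the GARG.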
For my part, I plan to make use of the new Linguistic Analyser to start analysing some texts and producing some curated datasets of my own, which I can then test and integrate with tools from the Greek Learner Texts Project.
If you’d be interested in collaborating on any of this from the Gaelic side, please do get in touch: thepatrologist @ gmail.com