A treasure trove of Welsh vocabulary
The developers' first step was to collect almost 500 Welsh children's books, ranging in difficulty from the elementary pages of Sali Mali to key stage 3 history textbooks, although the majority were key stage 1 and 2 fiction. A panel of teachers and language experts tagged the books at national curriculum levels 1-5, and they were then scanned into a database, or corpus, until the number of words exceeded 2 million, making it the largest corpus of Welsh vocabulary ever produced. After a lengthy period of correcting spellings that had not scanned properly, the corpus was ready for analysis.
Its most basic use is the production of word frequency lists. One complication in counting word frequencies in Welsh is that a word can change form when its initial consonant is "mutated", as when "cath" (cat) becomes "ei gath" (his cat) or "fy nghath" (my cat). This poses no problem for the corpus, as mutated forms can be counted separately or recorded under the basic "cath". Different verb forms, such as "aeth" (he/she went) and "af" (I shall go), can likewise be counted separately or gathered under the verb-noun "mynd" (to go).
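The two ways of counting can be illustrated with a short Python sketch. It is not the corpus software itself: the lookup table `HEADWORDS` and the function `count_words` are hypothetical names, and the table holds only the forms quoted above, where a real tool would need a far fuller list or a set of mutation rules.

```python
from collections import Counter

# Hypothetical lookup table: surface form -> basic form.
# Only the examples quoted in the article are included.
HEADWORDS = {
    "gath": "cath",    # soft mutation, as in "ei gath"
    "nghath": "cath",  # nasal mutation, as in "fy nghath"
    "aeth": "mynd",    # he/she went
    "af": "mynd",      # I shall go
}


def count_words(tokens, group=False):
    """Count tokens as they appear, or grouped under their basic form."""
    if group:
        tokens = [HEADWORDS.get(token, token) for token in tokens]
    return Counter(tokens)


# Invented token list for illustration only.
tokens = "cath ei gath fy nghath aeth af".split()

separate = count_words(tokens)             # each mutated form counted on its own
grouped = count_words(tokens, group=True)  # mutated forms folded into "cath" and "mynd"
```

With this invented input, `separate` lists "cath", "gath" and "nghath" as three different words, while `grouped` records all three under "cath".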
The frequency lists are often revealing. In presenting a new language to learners, the emphasis is often placed on nouns, yet of the 200 most frequent words in this large sample, only 20 are nouns. The two commonest are "mam" and "dad", virtually proper nouns, while others can be used as prepositions or adverbially, such as "lawer tro" (many times). Boys' names are much more common than girls', suggesting that boys are more often the main characters in children's books. Incidentally, the most popular names for boys are Tomos and Huw, and for girls Catrin and Llio.
Each word was tagged with the book it came from, so word frequencies can be established for particular national curriculum levels. The frequency of word combinations is also available, so we see that the most frequent combinations involving the adjective "bach" (small) are "yn ddistaw bach" (quietly), "y tŷ bach" (the little house, or toilet) and "y mochyn bach" (the little pig), which probably comes in threes.
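As a rough illustration of both ideas, the Python sketch below counts words per level and tallies the three-word combinations that end in a chosen word. The sample data, the fixed combination length of three and the function names are all assumptions made for the example, not a description of how the corpus is actually stored or queried.

```python
from collections import Counter, defaultdict

# Invented sample: (national curriculum level, text) pairs.
SAMPLE = [
    (1, "y mochyn bach"),
    (1, "y mochyn bach"),
    (2, "aeth yn ddistaw bach"),
]


def frequencies_by_level(books):
    """Return one word-frequency Counter per curriculum level."""
    by_level = defaultdict(Counter)
    for level, text in books:
        by_level[level].update(text.split())
    return by_level


def combinations_ending_in(books, word, size=3):
    """Count every run of `size` consecutive words that ends in `word`."""
    combos = Counter()
    for _, text in books:
        tokens = text.split()
        for i in range(size - 1, len(tokens)):
            if tokens[i] == word:
                combos[" ".join(tokens[i - size + 1 : i + 1])] += 1
    return combos


by_level = frequencies_by_level(SAMPLE)
bach_combos = combinations_ending_in(SAMPLE, "bach")
```

Calling `bach_combos.most_common()` then ranks the combinations by frequency, in the same spirit as the ranking of "bach" phrases described above.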
But Egni is more than just a plaything for linguists. Amongst its practical uses are:
* Identifying a basic first 100 words and first 200 words list for first steps in reading
* Helping authors and translators of children's books to target their work at particular age groups
* Indicating the reading age and difficulty level of fictional material
* Helping to establish appropriate readability levels for Welsh-medium textbooks in different subjects
* Providing guidelines for the creation of Welsh reading tests which reflect the use of the language in real books.
Teachers will surely discover other uses for this remarkable resource, to which further books will be added as they are published. For more information visit www.egni.org.
Robat Powell is head of NFER's Welsh Unit