Croatian Corpora as Open Educational Resources for Language Research and Learning





Aim: Change began with the concept of open source software, which significantly altered the field of software production. For some time the term was reserved for the information technology (IT) industry, but the idea of openness has since spread to other fields of human interest: education, science, publishing, etc. In education, and in language learning especially, there are open systems and tools that offer free access to large language databases and materials called corpora. The main aim of this abstract is to present the three largest open-access Croatian computer corpora, which can be used as open educational resources by language teachers, lecturers, researchers, lexicographers, linguists and students. All conclusions and statements presented in this abstract are based on PhD research and a thesis on the influence of computer corpora on learning Croatian as a second and foreign language.

Methods: One definition states that a corpus “is a collection of texts, written or spoken, which is stored on a computer” (1), while another author describes a corpus as “a systematic collection of naturally occurring texts” (2). For the Croatian language there are three corpora that are available online and free of charge. The largest is the Croatian Web Corpus (hrWaC), with more than 1.9 billion tokens, built from the .hr top-level domain. The corpus is lemmatized and morphosyntactically annotated (3), so it can be used for different types of linguistic research. The Croatian National Corpus (HNK) is an example of a balanced corpus: it consists of filtered content organized into several categories, including Croatian literary works, newspapers and journals. It is also divided into two main parts by time period: content created from 1990 to the present, and content created before 1990. The third corpus available online is the Croatian Language Corpus, which consists of newspaper texts and literary works from the second half of the 19th century to the present. It is significantly smaller than hrWaC and HNK, but it is interesting because it offers an overview of language evolution over that period. hrWaC and HNK are available to users through the NoSketchEngine software. It is important to emphasise that these are not the only existing Croatian corpora, but they are among the largest that offer unlimited and free access. NoSketchEngine is an open-source project, a free version of the commercial software SketchEngine, whose use must be paid for. One of the main differences between the two lies in the number of functionalities and tools that the paid version (SketchEngine) offers. The main goal and purpose of both is the same: to provide users with a large language database for language learning, linguistic research and language analysis.
NoSketchEngine is “combining Manatee and Bonito into a powerful and free corpus management system where Manatee is a corpus management tool including corpus building and indexing, fast querying and providing basic statistical measures and Bonito is a graphical user interface to corpora maintained by Manatee. It is a web interface written in Python which can be running under any webserver supporting the CGI.” (4). It is an example of free technology that can be used to create an open language database for any language in the world. The main aim of this kind of open technology and its tools is to give all interested users, mainly teachers, lecturers, students and researchers, free access to large language materials for educational and research purposes. If a user wants to create their own corpus and put it online, NoSketchEngine must be downloaded, installed, hosted and administered (4). To do so, the user must have well-developed technical skills and knowledge (e.g. programming skills). The software and its tools are provided under the GNU GPL license, version 2. This license guarantees users the freedom to share and change free software, and its main aim is to make sure the software is free for all its users. The preamble of the GNU General Public License states that it is “designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.” (5). hrWaC and HNK are available through the NoSketchEngine interface, which gives free access to their materials and data.
hrWaC is distributed under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) license, which states that users can share and adapt the material for any purpose, even commercially; this means that users can download the entire hrWaC corpus and use it. The HNK corpus, on the other hand, restricts downloading of its materials: its licence is proprietary, and use is restricted to academic, non-commercial purposes. The distribution rights holder is the University of Zagreb, Faculty of Humanities and Social Sciences. The Croatian Language Corpus is similar to HNK in that users cannot download the entire corpus, but they can freely search it and use the data they collect from their searches. All rights are reserved by the Institute of Croatian Language and Linguistics. All three corpora give users free access to search, collect and use data from their language databases, but only hrWaC can be downloaded in full. The research was conducted with lecturers from the Centre for Croatian as Foreign and Second Language (Croaticum) at the Faculty of Humanities and Social Sciences, University of Zagreb. Selected lecturers prepared their teaching materials using the hrWaC corpus and its tools. After this preparation they introduced their students to the corpus and presented them with corpus-based materials. The research used two corpus teaching approaches: direct and indirect use of the corpus in the classroom (6).
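
Corpus-based teaching materials are often built from concordance lines, in which a search word is shown together with its immediate context. The following Python sketch illustrates the idea with a simple keyword-in-context (KWIC) listing. It is only a minimal sketch under stated assumptions: the sample sentences are invented and are not taken from hrWaC, and the names `tokenize`, `kwic` and `SAMPLE_TEXT` and the window size are hypothetical. It does not reproduce how NoSketchEngine computes concordances or how the lecturers in this research prepared their materials.

```python
import re

# Invented toy sentences standing in for corpus text (NOT taken from hrWaC).
SAMPLE_TEXT = "Idem u grad. Grad je velik. Volim ovaj grad jer je grad uz more."


def tokenize(text):
    """Split text into word tokens; \\w is Unicode-aware in Python 3."""
    return re.findall(r"\w+", text)


def kwic(tokens, keyword, window=2):
    """Return (left context, node, right context) for each case-insensitive hit."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


if __name__ == "__main__":
    # Print one aligned concordance line per occurrence of the search word.
    for left, node, right in kwic(tokenize(SAMPLE_TEXT), "grad"):
        print(f"{left:>20}  [{node}]  {right}")
```

Because this sketch matches surface forms only, inflected forms of the search word are not found. A lemmatized corpus such as hrWaC makes it possible to search by lemma instead.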

Results and Discussion: The research conducted among lecturers at Croaticum showed that hrWaC, HNK and the Croatian Language Corpus can be used as language databases for learning Croatian. Lecturers can use them as open educational resources for preparing teaching materials, creating language exercises and exams for their students, exploring the different meanings of words, etc. The openness of a corpus as a language database can provide useful insights into authentic language use, the frequency of vocabulary and collocations, the different meanings of a particular word depending on its surrounding context, explanations of local idioms, and insight into local customs, habits and culture.
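
The frequency of vocabulary and collocations mentioned above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the sample text is invented rather than drawn from any of the three corpora, the names `tokenize`, `word_frequencies` and `collocates` are hypothetical, and collocates are ranked by raw co-occurrence counts only. Corpus tools typically rank collocates with statistical association measures, which this sketch does not attempt.

```python
import re
from collections import Counter

# Invented toy text standing in for corpus data (NOT taken from any real corpus).
SAMPLE_TEXT = "Dobar dan svima. Dobar dan i laku noć. Dan je dug."


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def collocates(tokens, node, window=1):
    """Count tokens that occur within `window` positions of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


if __name__ == "__main__":
    tokens = tokenize(SAMPLE_TEXT)
    print(word_frequencies(tokens).most_common(3))
    print(collocates(tokens, "dan").most_common(3))
```

Counting over a window of neighbouring tokens keeps the example short. It ignores sentence boundaries, so a real exercise would probably restrict the window to a single sentence.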

Conclusion: Teachers and lecturers who use corpora have more freedom in selecting educational and teaching materials, while students have open access to a corpus and its data and can use it for language learning and for improving their language skills whenever they want. Information and materials are always accessible to both teachers and students without paying a fee for using the corpus. Because of their free use and access, these corpora are an excellent example of open educational resources for language learning, preparing teaching materials, conducting linguistic research and more. With this in mind, we can conclude that one of the main goals of computer corpora has been achieved: free, remote and unlimited access for searching language content (7) that is available to everyone.

Location:
Date: September 21, 2018
Time: 11:40 - 11:55
Kristina Posavec, University of Zagreb
Creative Commons Attribution 4.0 International License