Sara, the definitive tool for using the BNC

The British National Corpus (BNC) holds a wide collection of texts, which users can easily consult with its companion tool, Sara. Sara has a distinctive feature: it can search not only for words or phrases but also for other information encoded in the compiled texts. It also lets you restrict a search to particular categories or domains. But how can we get Sara and use it?

Given the large number of texts in the BNC and the system of SGML tags it uses, we would need huge hard disks to store the whole corpus on our computers, plus a program to decode SGML. The creators of the BNC therefore had to deal with two major problems: the impossibility of storing all the texts on a PC, and the difficulty of obtaining SGML software, which is normally either free but poor, or good but expensive.

Their solution to the first problem was to install the corpus on a server and allow access to it from there. For the second, they included Sara on the website. You only have to install it on your computer and, when you run it, it connects to the server over the Internet.

The website includes practical instructions for getting started with Sara, which make it easier to find whatever information you are looking for. It also offers examples and help with building word, phrase or part-of-speech queries. After that, it is ready to use!

 

Bibliography

 

British National Corpus Website, http://www.natcorp.ox.ac.uk/tools/index.xml (accessed 13/05/08)

Pérez Guerra, Javier, “British National Corpus & Sara”, http://webs.uvigo.es/h04/jperez/bnc/ (accessed 13/05/08)

Wikipedia contributors, “British National Corpus,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=British_National_Corpus&oldid=203756426 (accessed May 13, 2008)

Posted on May 13, 2008 at 11:25 am. Comments (0)

Del.icio.us

We have recently been introduced to del.icio.us as a new tool. It is basically a collection of favourites that offers some advantages over traditional favourites lists.

First of all, del.icio.us lets you store, classify and share links to your preferred websites. Since it is extremely difficult to find the information you need among the huge amount of data on the Internet, it is very useful to have something that takes you wherever you want to go in a matter of seconds. It also lets you filter the information to avoid useless websites.

Its social bookmark manager lets you add the sites you visit to your del.icio.us account with a single click. The sites are then classified and ready to use. Tags make it quick and easy to classify as many things as you want. In addition, its complete and practical tools help you use and manage the resources.
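To make the idea of tags more concrete, here is a minimal sketch in Python of how a tagged list of favourites could work. It is only an illustration: the class name `BookmarkStore`, its methods and the example tags are my own assumptions, not the way del.icio.us is actually built.

```python
from collections import defaultdict


class BookmarkStore:
    """Hypothetical tagged bookmark list (not the real del.icio.us service)."""

    def __init__(self):
        # Maps each tag to the set of URLs saved under it.
        self._by_tag = defaultdict(set)

    def add(self, url, tags):
        """Save a URL under one or more tags (tags are case-insensitive)."""
        for tag in tags:
            self._by_tag[tag.lower()].add(url)

    def find(self, *tags):
        """Return the URLs that carry all of the given tags, sorted."""
        if not tags:
            return []
        sets = [self._by_tag.get(tag.lower(), set()) for tag in tags]
        return sorted(set.intersection(*sets))


store = BookmarkStore()
store.add("http://www.natcorp.ox.ac.uk/", ["corpus", "english"])
store.add("http://www.elra.info/", ["corpus", "europe"])

print(store.find("corpus"))            # both sites
print(store.find("corpus", "europe"))  # only the ELRA site
```

Asking for several tags at once narrows the list, which is the kind of filtering described above.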

 

All this makes del.icio.us a website with even more visits than Wikipedia and with a potential that is difficult to quantify.

 

Bibliography

Consumer contributors, “Del.icio.us: Los favoritos de todos”, Consumer.es Eroski, 22/06/05, http://www.consumer.es/web/es/tecnologia/internet/2005/06/22/143141.php (accessed 29/04/08)

Wikipedia contributors, “Del.icio.us,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Del.icio.us&oldid=208689635 (accessed April 29, 2008 )

Posted on April 29, 2008 at 10:12 am. Comments (0)

Will Machine Translation Replace Standard Translation?

Several years ago, technology advanced enough to produce systems that could take words in one language and replace them with their equivalents in another. This was the beginning of machine translation, and many feared it would replace human translators. So far, programs have improved, but translators are still necessary. Will this change in the more or less distant future?

 

At first, machines could only substitute words without interpreting them. The next step was to attempt more complex texts and sequences of words. Corpora helped machines recognise phrases, translate idioms and identify types of words. At present, translation software limits the scope of permitted substitutions, which makes systems much more effective. If the language is formulaic, as in legal documents for instance, the results are astonishingly good, but problems arise with more literary texts.
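The difference between plain word substitution and phrase recognition can be shown with a small sketch in Python. Everything in it is an assumption made for illustration: the tiny English-to-Spanish `LEXICON`, the one-entry `PHRASES` table and the function names are invented, and real systems are far more elaborate.

```python
# Hypothetical word-for-word dictionary (English -> Spanish).
LEXICON = {
    "the": "el",
    "cat": "gato",
    "sleeps": "duerme",
    "kick": "patear",
    "bucket": "cubo",
}

# Hypothetical table of multi-word expressions, as might be learned from a corpus.
PHRASES = {
    ("kick", "the", "bucket"): "estirar la pata",
}


def word_by_word(sentence):
    """Replace each word with its dictionary entry; unknown words are kept."""
    return " ".join(LEXICON.get(w, w) for w in sentence.lower().split())


def with_phrases(sentence):
    """Try the longest known phrase first, then fall back to single words."""
    words = sentence.lower().split()
    longest = max(len(p) for p in PHRASES)
    out, i = [], 0
    while i < len(words):
        for n in range(min(longest, len(words) - i), 1, -1):
            chunk = tuple(words[i:i + n])
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase starts here, so translate the single word.
            out.append(LEXICON.get(words[i], words[i]))
            i += 1
    return " ".join(out)


print(word_by_word("kick the bucket"))  # literal: patear el cubo
print(with_phrases("kick the bucket"))  # idiom:   estirar la pata
print(with_phrases("the cat sleeps"))   # el gato duerme
```

The first function shows why early systems produced literal nonsense for idioms. The second shows how a list of known phrases improves the result without the machine understanding anything.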

 

However, translators are still necessary. Machines cannot fully understand and translate some expressions, puns, idioms or simply the intention of the author. Besides, it is not always possible to find an exact equivalent from one language to another. Nowadays, it is common for human translators to start with a machine translation and then correct it. But the final touch, the human touch, is only understood by other humans. So, unless software that allows machines to think for themselves is invented, machine translation will never replace people.

 

Bibliography

Lenssen, Philipp, “Google Translator: The Universal Language,” Google Blogoscoped, 2005, http://blogoscoped.com/archive/2005-05-22-n83.html (accessed April 15, 2008)

Wikipedia contributors, “Dictionary-based machine translation,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Dictionary-based_machine_translation&oldid=183145791 (accessed April 15, 2008 )

Wikipedia contributors, “Machine translation,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Machine_translation&oldid=205089471 (accessed April 15, 2008 )

Wikipedia contributors, “Statistical machine translation,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Statistical_machine_translation&oldid=202036737 (accessed April 15, 2008 )

Posted on April 15, 2008 at 10:05 am. Comments (2)

English Corpora

There are a great number of English-language corpora. I have had a look at some of the most important, and these are my conclusions:

  • British National Corpus (BNC): Covers British written and spoken English of the twentieth century. It contains a 100-million-word corpus of samples. It is marked up following the TEI guidelines and distributed in XML format together with the XAIRA software. When using it I found it a little complex, but you can also use it in a much simpler way through the Davies website.
  • American National Corpus: It covers American English. When completed, it is intended to be comparable to the BNC in number of texts and uses, but this task is not finished yet. It contains a lot of information, but it is extremely easy to get lost in it. A tip: go to “resources”.
  • Oxford English Corpus: Created by the makers of the Oxford University Press language research programme and the Oxford English Dictionary. It contains two billion words from all types of literary sources. Each document includes interesting information about the author, gender, etc. It includes very good explanations of how to use it.
  • Bank of English: It was created by HarperCollins Publishers and the University of Birmingham. It is very useful because it explains what a corpus is.
  • Brown Corpus: It is the Brown University Standard Corpus of Present-Day American English. It is more limited than the others but includes a user-friendly manual.
  • Scottish Corpus of Texts and Speech: It contains texts in Scottish English and varieties of Scots. The search is quick and easy, but its interest is limited to Scottish studies.

Bibliography

Wikipedia contributors, “Text corpus,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Text_corpus&oldid=203756862 (Accessed April 8, 2008 )

Wynne, M (editor), “Developing Linguistic Corpora: a Guide to Good Practice”, Oxford, 2005: Oxbow Books. Available online from http://ahds.ac.uk/linguistic-corpora/ (accessed April 8, 2008)

Posted on April 8, 2008 at 10:57 am. Comments (0)

What the Hell Is the Corpus?

In our Digital Resources class we have recently been asked to write a long report about the corpus. No problem so far, except a small one: what is the corpus? I asked several classmates, who were as lost as I was, so I finally decided to look seriously into the matter, and here I present what I have found out.

A corpus is basically a set of texts. They are gathered so that people can consult them to resolve doubts, analyse different situations, or get statistics on particular cases or structures.

Corpora can be monolingual or multilingual. New technologies allow electronic storage so, nowadays, the easiest way to access a corpus is over the Internet. Corpora include a system that supports searching, known as annotation: entries are classified with tags, which make it easier to find a particular usage or topic. Tags include useful information such as the type of word or the root it comes from.
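To picture what annotation looks like, here is a minimal sketch in Python. The five-word corpus, the tag names (`DET`, `NOUN`, `VERB`, `ADV`) and the `search` function are all assumptions invented for this example; real corpora such as the BNC use their own tag sets and tools.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # the word as it appears in the text
    pos: str    # type of word (part of speech), hypothetical labels
    lemma: str  # the root the word comes from


# A tiny hand-annotated "corpus", purely for illustration.
CORPUS = [
    Token("The", "DET", "the"),
    Token("dogs", "NOUN", "dog"),
    Token("were", "VERB", "be"),
    Token("barking", "VERB", "bark"),
    Token("loudly", "ADV", "loudly"),
]


def search(corpus, pos=None, lemma=None):
    """Return the words whose tags match every criterion that was given."""
    return [
        t.word
        for t in corpus
        if (pos is None or t.pos == pos) and (lemma is None or t.lemma == lemma)
    ]


print(search(CORPUS, pos="VERB"))  # every verb in the corpus
print(search(CORPUS, lemma="be"))  # every form of "to be"
```

Because each word carries its tags, a search can ask for "all verbs" or "all forms of to be" instead of a fixed spelling.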

Fields such as computational linguistics, speech recognition and even machine translation rely on the analysis of various types of corpora. There are several corpora of interest to linguistics students and researchers, and some websites, like this one by AHDS that I find quite practical, help you use them. Corpora are growing in use and importance and will soon become an indispensable tool for the analysis of language.

I hope this information helps you in the arduous task of finally understanding what the hell the corpus is.

Bibliography

Wikipedia contributors, “Text corpus,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Text_corpus&oldid=203756862 (Accessed April 8, 2008 )

Wynne, M (editor), “Developing Linguistic Corpora: a Guide to Good Practice”, Oxford, 2005: Oxbow Books. Available online from http://ahds.ac.uk/linguistic-corpora/ (accessed April 8, 2008)

Posted at 10:05 am. Comments (0)

European Language Resources Association (ELRA)

ELRA is an association whose aim is to provide language resources for language engineering and to evaluate language technologies. In this way it supplies resources for Human Language Technology (HLT). To do so, ELRA promotes and supports the development of language resources as a scientific field. Its operational body is ELDA (Evaluation & Language resources Distribution Agency).

ELRA was created when the need to provide language resources on a large scale became evident. The European Commission then decided to start a programme called RELATOR, a consortium of researchers working with the nine working languages of the European Union. Its greatest achievement was the creation of ELRA.

ELRA works on the identification, production, promotion, validation, evaluation, distribution and standardisation of language resources and related products. It catalogues these language resources, offers legal assistance, organises conferences and evaluation campaigns, and even creates new resources on demand.

Any European or non-European organisation can join ELRA by filling in a form and paying an annual fee. The advantages of membership include access to databases and legal assistance. Members can also buy from the catalogue and get discounts. If you do not find what you need in this catalogue, you are invited to visit a universal catalogue or, if that fails too, you can ask them to design it for you.

 

Bibliography

“ELRA Home Page”, http://www.elra.info/ (accessed 25/02/08)

Posted on February 22, 2008 at 10:40 am. Comments (0)