Sara, the Definitive Tool for Using the BNC

The British National Corpus (BNC) holds a wide collection of texts, which users can easily consult with its companion tool, Sara. Sara has a distinctive feature: it can search not only for words or phrases but also for other information contained in the compiled texts. It also lets you limit a search to particular categories or domains. But how can we get and use Sara?

Given the large number of texts in the BNC and the system of SGML tags it uses, we would need huge hard disks to store the whole corpus on our computers, plus a program to decode SGML. The creators of the BNC therefore had to deal with two major problems: the impossibility of storing all the texts on a PC, and the difficulty of obtaining SGML programs, which are normally either free but poor, or good but expensive.

Their solution to the first problem was to install the corpus on a server and let users access it there. For the second, they included Sara on the website. You only have to install it on your computer and, when you run it, it connects to the server over the Internet.

The website includes practical instructions for getting started with Sara, which make it easier to find whatever information you are looking for. It also contains examples and help with building word, phrase or part-of-speech queries. And then it is ready to use!

 

Bibliography

 

British National Corpus Website, http://www.natcorp.ox.ac.uk/tools/index.xml (accessed 13/05/08)

Pérez Guerra, Javier, “British National Corpus & Sara”, http://webs.uvigo.es/h04/jperez/bnc/ (accessed 13/05/08)

Wikipedia contributors, “British National Corpus,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=British_National_Corpus&oldid=203756426 (Accessed May 13, 2008 )

Published on May 13, 2008 at 11:25 am. Comments (0)

Del.icio.us

We have recently been introduced to del.icio.us as a new tool. It is basically a collection of favourites that offers some advantages over traditional favourites lists.

First of all, del.icio.us lets you store, classify and share links to your preferred websites. Since it is extremely difficult to find the information you need among the huge amount of data on the Internet, it is very useful to have something that takes you where you want to go in virtually seconds. It also lets you filter information to avoid useless websites.

Its social bookmark manager lets you add the sites you visit to your del.icio.us account with a single mouse click. The sites are then classified and ready to use. Tags make it quick and easy to classify as many things as you want. Besides, its complete and practical tools help you use and manage your resources.
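To make the idea of classifying links with tags more concrete, here is a minimal Python sketch. It is not the del.icio.us API: the `BookmarkStore` class and its methods are hypothetical names invented for this illustration, and everything lives in memory.

```python
from collections import defaultdict


class BookmarkStore:
    """Hypothetical in-memory bookmark collection indexed by tag."""

    def __init__(self):
        # Maps each lower-cased tag to the set of URLs that carry it.
        self._by_tag = defaultdict(set)

    def add(self, url, tags):
        """Store a URL under every tag given for it."""
        for tag in tags:
            self._by_tag[tag.lower()].add(url)

    def find(self, *tags):
        """Return, in sorted order, the URLs that carry ALL the given tags."""
        if not tags:
            return []
        url_sets = [self._by_tag.get(tag.lower(), set()) for tag in tags]
        return sorted(set.intersection(*url_sets))


store = BookmarkStore()
store.add("http://www.natcorp.ox.ac.uk/", ["corpus", "english"])
store.add("http://www.elra.info/", ["corpus", "resources"])

print(store.find("corpus"))             # both links
print(store.find("corpus", "english"))  # only the BNC link
```

Combining several tags narrows the result, which is roughly what filtering a large collection of favourites amounts to.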

 

All this makes del.icio.us a website with even more visits than Wikipedia and with a potential that is difficult to quantify.

 

Bibliography

Consumer contributors, “Del.icio.us: Los favoritos de todos”, Consumer.es Eroski, 22/06/05, http://www.consumer.es/web/es/tecnologia/internet/2005/06/22/143141.php (accessed 29/04/08)

Wikipedia contributors, “Del.icio.us,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Del.icio.us&oldid=208689635 (accessed April 29, 2008 )

Published on April 29, 2008 at 10:12 am. Comments (0)

Will Machine Translation Replace Standard Translation?

Several years ago, technology advanced enough to produce systems that could take words in one language and replace them with their equivalents in another. It was the beginning of Machine Translation, and many feared it would replace human translators. So far, the programs have improved, but translators are still necessary. Will this change in the more or less distant future?

 

At first, machines could only substitute words without interpreting them. The next step was to attempt more complex texts and sequences of words. Corpora helped machines recognise phrases, translate idioms or identify types of words. At present, translation software limits the scope of permitted substitutions, which makes the systems much more effective. If the language is formulaic, as in legal documents for instance, the results are astonishingly good, but problems arise in more literary texts.
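The difference between substituting single words and recognising whole phrases can be sketched in Python. This is only a toy illustration, not how any real system works: the small English-to-Spanish dictionaries and both function names are hypothetical, and the phrase table stands in very loosely for the phrases a corpus could supply.

```python
# Hypothetical toy dictionary (English -> Spanish), invented for illustration.
WORD_DICT = {
    "it": "ello",
    "is": "es",
    "a": "un",
    "piece": "pieza",
    "of": "de",
    "cake": "pastel",
}

# Hypothetical phrase table: multi-word units translated as a whole.
PHRASE_DICT = {
    ("a", "piece", "of", "cake"): "pan comido",
}


def translate_word_by_word(sentence):
    """Replace each word independently; unknown words are left unchanged."""
    words = sentence.lower().split()
    return " ".join(WORD_DICT.get(word, word) for word in words)


def translate_with_phrases(sentence):
    """Try the longest known phrase at each position, then fall back to single words."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        matched = False
        # Check the longest candidate phrase first, down to two words.
        for length in range(len(words) - i, 1, -1):
            chunk = tuple(words[i:i + length])
            if chunk in PHRASE_DICT:
                output.append(PHRASE_DICT[chunk])
                i += length
                matched = True
                break
        if not matched:
            output.append(WORD_DICT.get(words[i], words[i]))
            i += 1
    return " ".join(output)


print(translate_word_by_word("It is a piece of cake"))
print(translate_with_phrases("It is a piece of cake"))
```

The first call renders the idiom literally, while the second replaces it as a unit. Neither output is good Spanish, which is the point: substitution alone does not interpret the text.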

 

However, translators are still necessary. Machines cannot fully understand and translate some expressions, puns, idioms or simply the intention of the author. Besides, it is not always possible to find an exact equivalent from one language to another. Nowadays, it is common for human translators to start with a machine translation and then correct it. But the final touch, the human touch, is only understood by other humans. So, unless software that allows machines to think for themselves is invented, machine translation will never replace people.

 

Bibliography

Lenssen, Philipp, “Google Translator: The Universal Language,” Google Blogoscoped, http://blogoscoped.com/archive/2005-05-22-n83.html , 2005, (accessed April 15, 2008 )

Wikipedia contributors, “Dictionary-based machine translation,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Dictionary-based_machine_translation&oldid=183145791 (accessed April 15, 2008 )

Wikipedia contributors, “Machine translation,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Machine_translation&oldid=205089471 (accessed April 15, 2008 )

Wikipedia contributors, “Statistical machine translation,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Statistical_machine_translation&oldid=202036737 (accessed April 15, 2008 )

Published on April 15, 2008 at 10:05 am. Comments (2)

English Corpora

There are a great number of corpora in the English language. I have had a look at some of the most important, and these are my conclusions:

  • British National Corpus (BNC): Covers written and spoken British English of the twentieth century. It is a 100-million-word corpus of samples, marked up following the TEI guidelines and distributed in XML format with the XAIRA software. I found it a little complex to use, but you can also consult it in a much simpler way through the Davies website.
  • American National Corpus: Covers American English. When completed, it is intended to be comparable to the BNC in number of texts and uses, but this task is not finished yet. It contains lots of information, but it is extremely easy to get lost in it. A tip: go to “resources”.
  • Oxford English Corpus: Created by the makers of the Oxford University Press language research programme and the Oxford English Dictionary. It contains two billion words from all types of literary sources. Each document includes interesting information about the author, gender, etc. It includes very good how-to-use explanations.
  • Bank of English: Created by HarperCollins Publishers and the University of Birmingham. It is very useful because it explains what a corpus is.
  • Brown Corpus: The Brown University Standard Corpus of Present-Day American English. It is more limited than the others but includes a user-friendly manual.
  • Scottish Corpus of Texts and Speech: Contains texts in Scottish English and varieties of Scots. The search is quick and easy, but its interest is limited to Scottish studies.

Bibliography

Wikipedia contributors, “Text corpus,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Text_corpus&oldid=203756862 (Accessed April 8, 2008 )

Wynne, M (editor), “Developing Linguistic Corpora: a Guide to Good Practice” Oxford, 2005: Oxbow Books. Available online from http://ahds.ac.uk/linguistic-corpora/ (Accessed April 8, 2008 )

Published on April 8, 2008 at 10:57 am. Comments (0)

What the Hell Is a Corpus?

Corpus

In our Digital Resources class we have recently been asked to write a long report about the corpus. No problem so far, except a small one: what is a corpus? I asked several classmates, who were as lost as I was, so I finally decided to look into the matter seriously, and here is what I found out.

A corpus is basically a set of texts. They are gathered so that people can consult them to resolve doubts, analyse different situations or get statistics on particular cases or structures.

Corpora can be monolingual or multilingual. New technologies allow them to be stored electronically, so nowadays the easiest way to access a corpus is through the Internet. Corpora often include a system known as annotation, which supports searching: entries are marked with tags that make it easier to find a particular use or topic. Tags carry information as useful as the part of speech of a word or the root it comes from.

Fields such as computational linguistics, speech recognition and even machine translation rely on the analysis of various types of corpora. There are several corpora of interest for linguistics students and researchers, and some websites, like this one by AHDS that I find quite practical, help you use them. They are growing in use and importance and will soon become an indispensable tool for analysing language.

I hope this information helps you in the arduous task of finally understanding what the hell a corpus is.
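A small Python sketch may help to picture what annotation makes possible. Everything in it is an assumption made for illustration: the sample sentence, the simplified tag names (which are not the tagset of any real corpus) and the helper functions are all invented.

```python
from collections import Counter

# Hypothetical annotated sample: (word, part-of-speech tag, root form).
# The tag names are simplified for illustration only.
ANNOTATED = [
    ("The", "DET", "the"),
    ("translators", "NOUN", "translator"),
    ("were", "VERB", "be"),
    ("translating", "VERB", "translate"),
    ("the", "DET", "the"),
    ("texts", "NOUN", "text"),
    ("and", "CONJ", "and"),
    ("a", "DET", "a"),
    ("translator", "NOUN", "translator"),
    ("is", "VERB", "be"),
    ("reading", "VERB", "read"),
]


def find_by_tag(tokens, tag):
    """Return every word whose part-of-speech tag matches."""
    return [word for word, pos, _ in tokens if pos == tag]


def find_by_root(tokens, root):
    """Return every word form that shares the given root."""
    return [word for word, _, lemma in tokens if lemma == root]


def tag_counts(tokens):
    """Count how many words carry each tag (a very simple statistic)."""
    return Counter(pos for _, pos, _ in tokens)


print(find_by_tag(ANNOTATED, "NOUN"))
print(find_by_root(ANNOTATED, "be"))
print(tag_counts(ANNOTATED))
```

Searching by root finds "were" and "is" together, something a plain word search could not do. That is the kind of query the tags are there to support.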

Bibliography

Wikipedia contributors, “Text corpus,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Text_corpus&oldid=203756862 (Accessed April 8, 2008 )

Wynne, M (editor), “Developing Linguistic Corpora: a Guide to Good Practice“, Oxford, 2005: Oxbow Books, Available online from http://ahds.ac.uk/linguistic-corpora/ (Accessed April 8, 2008 )

Published at 10:05 am. Comments (0)

European Language Resources Association (ELRA)

ELRA is an association whose aim is to make language resources available for language engineering and to evaluate language technologies. In this way it provides resources for Human Language Technology (HLT). To do so, ELRA promotes and supports the development of a scientific field of language resources. It has an operational body, ELDA (Evaluation & Language resources Distribution Agency).

ELRA was created when the need to provide language resources on a large scale became evident. The European Commission then decided to start a programme called RELATOR, a consortium of researchers working with the nine working languages of the European Union. Its greatest achievement was the creation of ELRA.

ELRA works on the identification, production, promotion, validation, evaluation, distribution and standardisation of language resources and related products. It catalogues these language resources, offers legal assistance, organises conferences and evaluation campaigns, and even creates new resources on demand.

Any European or non-European organisation can join ELRA by filling in a form and paying an annual fee. The advantages of membership include access to databases and legal assistance. Members can then buy from the catalogue and get discounts. If you do not find what you need in this catalogue, you are invited to visit a universal catalogue or, if it is not there either, you can ask them to create it for you.

 

Bibliography

“ELRA Home Page”, http://www.elra.info/ (accessed 25/02/08 )

Published on February 22, 2008 at 10:40 am. Comments (0)

Wikilengua

Wikilengua has been launched as a tool that, following in the wake of Wikipedia, will gather all knowledge about our language, Spanish, through its users' contributions. What is new is that on this site corrections will be made under the supervision of the RAE, which will give it a more official character. More than a million people have already visited it in its first week alone.

Welcome, Wikilengua!

Published on January 17, 2008 at 10:05 am. Comments (0)

Creative Commons

Creative Commons is a non-profit organisation created to expand the range of creative works available for others to share legally. To do so, it provides licences that let their holders grant part of their copyright while retaining some rights. It covers metadata, portals, archives, blogs and many other works. One of the projects involved in it is the omnipresent Wikipedia.

It was launched in 2001 and its headquarters are located in San Francisco. Although at the beginning there was no serious criticism of its work, it has since been accused of not fulfilling all its objectives. It has been said that it is an unconcerned corporate filter, that it simply takes away users' rights and that it undermines copyright.

It is, in any case, a very practical and simple tool for the user. Just go and check it out!

Note: This summary was taken from the Wikipedia article, but if you want to learn more about it you can go to the bibliography.

Bibliography

Berry, David, “Is the Creative Commons missing something?”, Free Software Magazine, http://www.freesoftwaremagazine.com/articles/commons_without_commonality/ (accessed January 8, 2008)

Conhaim, Wallys W., “Creative Commons Nurtures the Public Domain”, Computers in Libraries, http://newsbreaks.infotoday.com/nbreader.asp?ArticleID=17167 (accessed January 8, 2008)

Lessig, Lawrence, “Creative Commons and the Remix Culture”, Talking with Talis, http://talk.talis.com/archives/2006/01/lawrence_lessig.html (accessed January 8, 2008)

Wikipedia contributors, “Creative Commons,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Creative_Commons&oldid=182909754 (accessed January 8, 2008)

Published on January 8, 2008 at 9:38 am. Comments (0)

The Biblioteca Foral de Bizkaia

The Biblioteca Foral, recently remodelled and reopened, is located in the centre of Bilbao. Its website describes both the facilities and the holdings, and also offers a guided virtual tour and the option of consulting the catalogue online.

The reserve collection stands out in particular; it consists of incunabula, early printed works and manuscripts. The oldest incunable dates from circa 1472, and the Basque incunable, a missal printed in Pamplona, from 1500. The early printed works comprise monographs, pamphlets, loose sheets and periodicals printed between 1501 and 1800.

The building has five floors housing the consultation, study, reading and research rooms, and it offers Internet and new-technology access, a newspaper library, a media library, a publications unit and a Wi-Fi connection. Enquiries on bibliography, heraldry, cartography and engravings can also be made.
Guided tours are available to discover everything this new space offers citizens.
Published on December 19, 2007 at 11:32 am. Comments (0)

Historical Archives

Historical archives hold documents that give scholars and the general public access to information crucial for their studies and for preserving a people's historical memory. Traditionally, users had to visit an archive in person to consult it, but new technologies have greatly widened the possibilities and, in most cases, allow the holdings to be consulted conveniently on the Internet. Nevertheless, each archive follows its own policy on access to documents. In some cases not all the information is available online, and some restricted-access documents must be consulted in person.

Below we look at some of these archives, both in our own region and at the national level.

Archivo Histórico Eclesiástico de Vizcaya

It holds all kinds of documentation on ecclesiastical matters in the diocese of Vizcaya. Its staff advise on-site users and facilitate access to copies. Only a small part of the archive's holdings is available for online consultation. For the rest, you must go in person to the institution's premises in the Seminario Mayor building in Derio. The archive also offers a reprographics service that provides extracts and literal copies of documents.

Archivo General de Simancas

Its construction was begun by Carlos V and completed by Felipe II. The archive keeps all the documentation produced by the governing bodies of the Hispanic monarchy from the time of the Catholic Monarchs (1475) until the arrival of the Liberal Regime (1834). Documents can only be consulted on site, on simple presentation of a DNI (national identity card). It also has a reprographics service. It provides access to the Portal de Archivos Españoles (PARES).

Archivo General de Indias

It safeguards the documents relating to the institutions that governed and administered the Spanish colonies in America and Asia. These institutions are the Consejo de Indias and the Secretarías de Despacho, the Casa de la Contratación and the Consulados of Seville and Cádiz. It also preserves holdings from smaller institutions and even from private individuals connected with the Spanish colonies. It allows consultation in the reading room and has a reprographics service.

Archivo Histórico Nacional

It was created to collect documentation from State Administration bodies that no longer has administrative value but does have historical value. It also allows consultation in the reading room, has a reprographics service and issues certificates, but it does not allow online consultation.

Cámara de Comptos de Navarra

The Cámara de Comptos de Navarra is the oldest court of auditors in Spain and one of the oldest in Europe. It was the basis of the Archivo General de Navarra, and today it audits Navarre's public funds and advises its Parliament on economic and financial matters. It offers a virtual tour.

Published on December 4, 2007 at 9:36 am. Comments (0)