PTicola

Increasing Computational Language Resources for Portuguese

Motivation

Building new datasets for Portuguese NLP tasks with low resources

Using the Google Cloud Platform (GCP) Translation products as our main resource, we aim to translate several datasets from English into Portuguese to boost low-resource NLP tasks such as Temporal Information Extraction, Semantic Role Labelling, and Relation Extraction.

Improving the current state of the art for Portuguese across different NLP tasks

Based on the newly developed datasets, we aim to build new models that establish new benchmarks for Portuguese NLP tasks where out-of-the-box (OOB) models are lacking or are significantly less effective than their English-language counterparts.

Outcomes

Datasets

10 Datasets Translated into Portuguese

Synthetic patient journey dataset in European Portuguese

Deceptive Content Detection in European Portuguese

Portuguese Language Variety Identification

English-European Portuguese Translation

Models

English-to-European Portuguese Translator

Portuguese Language Variety Identifier

MedLink: Clinical Case Retrieval and Ranking System

Sentiment Analysis for Portuguese

Named Entity Recognition for Portuguese

Fine-tuned Biomedical Translator

Scientific Publications

Enhancing Portuguese Varieties Identification with Cross-Domain Approaches

Hugo Sousa, Rúben Almeida, Purificação Silvano, Inês Cantante, Ricardo Campos, Alípio Jorge

Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence

Tradutor: Building a Variety Specific Translation Model

Hugo Sousa, Satya Almasian, Ricardo Campos, Alípio Jorge

Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence

MedLink: Retrieval and Ranking of Case Reports to assist Clinical Decision Making

Filipe Cunha, Nuno Guimarães, Alexandra Mendes, Ricardo Campos, Alípio Jorge

Proceedings of the 47th European Conference on Information Retrieval, 2025

Building annotation schemes for clinical narratives: human expert vs. LLM

Ana Luísa Fernandes, Purificação Silvano, Rita Rb-Silva, Nuno Guimarães, António Leal, Tahsir Ahmed Munna, Filipe Cunha, Ricardo Campos, Alípio Jorge

To be submitted to the Text2Story 25 workshop, co-located with the 47th European Conference on Information Retrieval, 2025

Can LLMs Generate a European Portuguese Patient Journey?

Tahsir Ahmed Munna, Ana Luísa Fernandes, Nuno Guimarães, Purificação Silvano, Alípio Jorge

To be submitted to the Text2Story 25 workshop, co-located with the 47th European Conference on Information Retrieval, 2025

Translated Datasets

TempEval-3.0 Platinum

Temporal Information Extraction

93 documents

TempEval-3.0 Aquaint

Temporal Information Extraction

SOTA Dataset

CoNLL-2003

NER + POS Tagging

1,393 documents

Multi-News

Multi-Document Abstractive Text Summarization

9,000 documents

German-LER

Legal-NER

66,723 documents

OLID

Hate Speech Detection

Benchmarking Dataset

Team

Alípio Jorge

Professor and Researcher

Ricardo Campos

Co-Investigator

Purificação Silvano

PhD Researcher

Sérgio Nunes

PhD Researcher

António Leal

PhD Researcher

Evelin Amorim

PhD Researcher

Nuno Guimarães

PhD Researcher

Hugo Sousa

PhD Student

Nana Yu

PhD Student

Ana Filipa Pacheco

MSc Student

Luís Filipe Cunha

PhD Student

Rúben Almeida

MSc Researcher

Orlando Soares

BSc Student