Santorini Tweets July-August 2021
This dataset contains 225,501 tweets written by 141,277 users. The tweets are either geolocated in Santorini or contain the word or hashtag "santorini" in their text. They...
ZIP
FANCY Dataset
(NLI) FANCY (FActivity, Negation, Common-sense, hYpernimy) is a new dataset with 4000 sentence pairs concerning complex linguistic phenomena such as factivity, negation,...
SWH Filenames
A 69 GB dataset with ~2.3 billion strings representing deduplicated names of source code files collected by Software Heritage, the great library of source code...
ZIP
Italian Tourism Dataset
A set of user comments crawled and scraped from two major tourism websites (Booking.com and Tripadvisor.com), related to the main tourist points of interest in Italy and, in...
HTML
geo-annotated tweets.zip
ZIP
Wyscout soccer-logs dataset
A dataset of soccer-logs for all the main soccer leagues in the world, from season 2014/2015 to the current one.
Soccer Events
This dataset contains data on one full season of soccer games. For each player, it records the locations visited (positions on the pitch) and all the events they generated...
ZIP
Introduction to Data Curation
This course is an introduction to data collection, data preparation and transformation, and data analysis. It contains the essential concepts for a researcher in order to...
PDF
Python library for direct and indirect discrimination prevention in data mining
This Python library implements the discrimination discovery and prevention method proposed in the paper: “A methodology for direct and indirect discrimination prevention in...
GitHub
Ephemerality metric
https://github.com/HPAI-BSC/ephemerality Code for calculating the ephemerality metrics, which can be used to estimate how "ephemeral" discussion topics are based on their...
ZIP
GSP - Geo-Semantic-Parsing
GSP receives a text document as input and returns an enriched document in which all mentions of places or locations are associated with the corresponding geographic coordinates. To...
SMAPH Query Entity Linker
The SMAPH system links queries to the entities they mention, disambiguating mentions if needed. Entities are Wikipedia pages. This problem is known as "entity recognition and...
HTML
Quantum Distance-Based Classifier
The Quantum Distance-Based Classifier is a technique inspired by the classical k-Nearest Neighbors that leverages quantum properties to perform prediction.
The PGM-index: a fully-dynamic compressed learned index with provable worst-ca...
We present the first learned index that supports predecessor queries, range queries and updates within provably efficient time and space bounds in the worst case. In the (static)...
PDF
CLiQS
CLiQS is a Python software package for summarizing social media texts with a diversified approach.
Conversational search dataset with labels
CAsT 2019 data is split into two files, one for training and one for testing. - Training set: CAsT 2019 conversations from the training set and from the test set without...
Dataset for Evaluating Abstractive Summaries of Crisis-Related Social Media
This dataset was created to evaluate summaries generated from social media content posted during five natural disasters. The dataset contains: ground truth reports created by human...
Cross-Lingual Dataset of Crisis-Related Social Media
If you use this dataset, please cite the following paper: Fedor Vitiugin, Carlos Castillo: Cross-Lingual Query-Based Summarization of Crisis-Related Social Media: An Abstractive...
Dictionary creator
This tool creates a dictionary with inverse document frequency (idf) values from the Google NGrams dataset.
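As a rough illustration of what an idf dictionary looks like, the sketch below builds one from a tiny in-memory corpus. It is not the tool's code: the function name `build_idf`, the toy documents, and the unsmoothed formula log(N / df) are all assumptions made for the example, and the actual tool reads its counts from the Google NGrams dataset rather than from tokenized documents.

```python
import math
from collections import Counter


def build_idf(documents):
    """Return {term: log(N / df)} for a list of tokenized documents.

    Hypothetical helper: N is the number of documents and df is the
    number of documents that contain the term at least once.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document, however often it occurs.
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Toy corpus (assumed data, for illustration only).
docs = [
    ["fire", "rescue"],
    ["fire", "flood"],
    ["flood", "warning"],
    ["fire"],
]
idf = build_idf(docs)
```

Terms that occur in fewer documents receive higher values, so here `idf["warning"]` is larger than `idf["fire"]`.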
