Prof. Daniel Aloise


An efficient implementation of a VNS heuristic for the weighted fair sequences problem

Caroline Rocha, Bruno J.S. Pessoa, Daniel Aloise, Lucidio A. Cabral
Journal Paper International Transactions in Operational Research, August 2022.

Abstract

In the weighted fair sequences problem (WFSP), one aims to schedule a set of tasks or activities such that the maximum product between the largest temporal distance between two consecutive executions of a task and its priority is minimized. The WFSP covers a large number of applications in different areas, ranging from automobile production on a mixed-model assembly line to the sequencing of interactive applications to be aired in a digital TV environment. This paper proposes an iterative heuristic method for the WFSP centered on an efficient implementation of a variable neighborhood search heuristic. Computational experiments on benchmark instances show that the proposed metaheuristic outperforms the state-of-the-art method proposed for the problem, obtaining comparable solution values in much less computational time.
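To make the min-max objective concrete, here is a minimal Python sketch that evaluates it for a candidate schedule. It assumes a simple linear reading of the gaps between consecutive executions; the paper's exact boundary conventions (for instance, wrap-around gaps) may differ.

```python
def wfsp_objective(sequence, priorities):
    """Evaluate a candidate WFSP schedule: the maximum, over all tasks,
    of (task priority) * (largest gap between two consecutive executions
    of that task). Boundary handling here is a simplifying assumption."""
    positions = {}
    for t, task in enumerate(sequence):
        positions.setdefault(task, []).append(t)

    worst = 0
    for task, pos in positions.items():
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        if gaps:  # a task executed only once contributes no internal gap
            worst = max(worst, priorities[task] * max(gaps))
    return worst

# Task 'a' (priority 3) has a largest gap of 3 positions: objective = 9.
print(wfsp_objective(['a', 'b', 'a', 'c', 'b', 'a'], {'a': 3, 'b': 1, 'c': 2}))
```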

Data-driven Prioritization Strategies for Inventory Rebalancing in Bike-sharing systems

Maria Clara Martins Silva, Daniel Aloise, Sanjay D. Jena
Preprint Paper CIRRELT-2022-16

Abstract

The popularity of bike-sharing systems has constantly increased throughout the last years, given their convenience for users, low usage costs, health benefits, and their contribution to environmental relief. However, satisfying all user demands remains a challenge, given that the inventory of bike-sharing stations tends to be unbalanced over time. Bike-sharing system operators therefore explicitly rebalance station inventories in order to provide both available bikes and empty docks to the commuters. In most systems, the operator manually selects the stations and amounts of bikes to be rebalanced among those that are considered unbalanced. In practice, such manual planning is likely to result in suboptimal system performance. In this paper, we propose three variants of a machine learning-based algorithm to select the stations that should be prioritized for rebalancing, using features such as the predicted trip demand, as well as the inventory levels at the stations themselves and their surrounding stations. We evaluate the performance of these prioritization strategies by simulating real-world trips using data from 2019 and 2020, each of which exhibits distinct travel patterns given the restrictive measures implemented in 2020 to prevent the spread of COVID-19. One of the strategies significantly improves the system's performance, reducing lost demand by up to 22% and the required rebalancing operations by up to 12% when compared to the prioritization scheme currently used in practice. Finally, another strategy encourages the selection of stations that are geographically clustered, which may facilitate subsequent rebalancing operations.

FaST: A linear time stack trace alignment heuristic for crash report deduplication

Irving Muller Rodrigues, Daniel Aloise, Eraldo Rezende Fernandes
Conference Paper 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR)

Abstract

In software projects, applications are often monitored by systems that automatically identify crashes, collect their information into reports, and submit them to developers. Especially in popular applications, such systems tend to generate a large number of crash reports, a significant portion of which are duplicates. Due to this high submission volume, crash report deduplication must in practice be supported by automatic systems whose efficiency is a critical constraint. In this paper, we focus on improving deduplication system throughput by speeding up the stack trace comparison. In contrast to the state-of-the-art techniques, we propose FaST, a novel sequence alignment method that computes the similarity score between two stack traces in linear time. Our method independently aligns identical frames in two stack traces by means of a simple alignment heuristic. We evaluate FaST and five competing methods on four datasets from open-source projects using ranking and binary metrics. Despite its simplicity, FaST consistently achieves state-of-the-art performance regarding all metrics considered. Moreover, our experiments confirm that FaST is substantially more efficient than methods based on optimal sequence alignment.
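Since the abstract describes the core idea (aligning identical frames independently, in linear time), the toy sketch below illustrates that spirit; the greedy matching rule and position weighting here are placeholder assumptions, not FaST's actual scoring formula.

```python
from collections import defaultdict

def fast_like_similarity(trace_a, trace_b):
    """Toy linear-time similarity between two stack traces (lists of
    frame identifiers, top frame first). Identical frames are aligned
    via a hash map and scored by how close their positions are; the
    real FaST alignment rule and weights differ."""
    pos_b = defaultdict(list)
    for j, frame in enumerate(trace_b):
        pos_b[frame].append(j)

    score = 0.0
    for i, frame in enumerate(trace_a):
        if pos_b[frame]:
            j = pos_b[frame].pop(0)            # match earliest occurrence
            score += 1.0 / (1.0 + abs(i - j))  # closer positions score more
    return score / max(len(trace_a), len(trace_b), 1)  # normalize to [0, 1]

print(fast_like_similarity(['f', 'g', 'h'], ['f', 'x', 'h']))  # ~0.67
```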

On Transfer Learning for Building Damage Assessment from Satellite Imagery in Emergency Contexts

Isabelle Bouchard, Marie-Ève Rancourt, Daniel Aloise, Freddie Kalaitzis
Journal Paper Remote Sensing, Volume 14, Issue 11, Article 2532

Abstract

When a natural disaster occurs, humanitarian organizations need to be prompt, effective, and efficient to support people whose security is threatened. Satellite imagery offers rich and reliable information to support expert decision-making, yet its annotation remains labour-intensive and tedious. In this work, we evaluate the applicability of convolutional neural networks (CNN) in supporting building damage assessment in an emergency context. Despite data scarcity, we develop a deep learning workflow to support humanitarians in time-constrained emergency situations. To expedite decision-making and take advantage of the inevitable delay to receive post-disaster satellite images, we decouple building localization and damage classification tasks into two isolated models. Our contribution is to show the complexity of the damage classification task and use established transfer learning techniques to fine-tune the model learning and estimate the minimal number of annotated samples required for the model to be functional in operational situations.

TraceSim: An Alignment Method for Computing Stack Trace Similarity

Irving Muller Rodrigues, Aleksandr Khvorov, Daniel Aloise, Roman Vasiliev, Dmitrij Koznov, Eraldo Rezende Fernandes, George Chernishev, Dmitry Luciv, Nikita Povarov
Journal Paper Empirical Software Engineering, Volume 27, Issue 2, March 2022, Pages 1–41

Abstract

Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicates, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic grouping of duplicate crash reports can significantly ease and speed up the analysis of software failures. This task is known as crash report deduplication. Given a huge volume of incoming reports, increasing the quality of deduplication is an important task. The majority of studies address it via information retrieval or sequence matching methods based on the similarity of stack traces from two crash reports. While information retrieval methods disregard the position of a frame in a stack trace, the existing works based on sequence matching algorithms do not fully consider subroutine global frequency and unmatched frames. Besides, due to data distribution differences among software projects, parameters learned using machine learning algorithms are necessary to provide more flexibility to the methods. In this paper, we propose TraceSim, an approach for crash report deduplication which combines TF-IDF, optimum global alignment, and machine learning (ML) in a novel way. Moreover, we propose a new evaluation methodology for this task that is more comprehensive and robust than previously used evaluation approaches. TraceSim significantly outperforms seven baselines and state-of-the-art methods in the majority of the scenarios. It is the only approach that achieves competitive results on all datasets regarding all considered metrics. Moreover, we conduct an extensive ablation study that demonstrates the importance of each of TraceSim's elements to its final performance and robustness. Finally, we provide the source code for all considered methods and the evaluation methodology, as well as the created datasets.
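Of the three ingredients, the TF-IDF part is easy to illustrate in isolation: frames that occur in almost every report (such as entry points) should carry little weight. The sketch below shows only this weighting step, not TraceSim's alignment or learned parameters.

```python
import math
from collections import Counter

def frame_idf(corpus):
    """Inverse document frequency of each frame over a corpus of stack
    traces (one 'document' per trace): rare frames get high weight,
    ubiquitous frames such as entry points get weight close to zero."""
    n = len(corpus)
    df = Counter()
    for trace in corpus:
        df.update(set(trace))          # count each frame once per trace
    return {frame: math.log(n / df[frame]) for frame in df}

corpus = [['main', 'parse', 'read'],
          ['main', 'render'],
          ['main', 'parse', 'alloc']]
print(frame_idf(corpus))  # 'main' -> 0.0; 'read'/'render'/'alloc' weigh most
```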

A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering

Rodrigo Randel, Daniel Aloise, Alain Hertz, Simon Blanchard
Journal Paper Data Mining and Knowledge Discovery, Volume 35, Issue 6, November 2021, Pages 2341–2368

Abstract

Clustering algorithms help identify homogeneous subgroups from data. In some cases, additional information about the relationship among some subsets of the data exists. When using a semi-supervised clustering algorithm, an expert may provide additional information to constrain the solution based on that knowledge and, in doing so, guide the algorithm to a more useful and meaningful solution. Such additional information often takes the form of a cannot-link constraint (i.e., two data points cannot be part of the same cluster) or a must-link constraint (i.e., two data points must be part of the same cluster). A key challenge for users of such constraints in semi-supervised learning algorithms, however, is that adding inaccurate or conflicting constraints can decrease accuracy, and little is known about how to detect whether expert-imposed constraints are likely incorrect. In the present work, we introduce a method that scores how likely each must-link and cannot-link pairwise constraint is to be incorrect. Using synthetic experimental examples and real data, we show that the resulting impact score can successfully identify individual constraints that should be removed or revised.

The Covering-Assignment Problem for Swarm-Powered Ad Hoc Clouds: A Distributed 3-D Mapping Usecase

Leandro R. Costa, Daniel Aloise, Luca G. Gianoli, Andrea Lodi
Journal Paper IEEE Internet of Things Journal, Volume 8, Issue 9, May 2021, Pages 7316–7332

Abstract

The popularity of drones is rapidly increasing across the different sectors of the economy. Aerial capabilities and relatively low costs make drones the perfect solution to improve the efficiency of operations that are typically carried out by humans. Besides automating field operations, drones acting de facto as a swarm can serve as an ad hoc cloud infrastructure built on top of computing and storage resources available across the swarm members and other elements. Even in the absence of Internet connectivity, this cloud can serve the workloads generated by the swarm members and the field agents. By considering the practical example of a swarm-powered 3-D reconstruction application on top of such cloud infrastructure, we present a new optimization problem for the efficient generation and execution of multinode computing workloads subject to data geolocation and clustering constraints. The objective is the minimization of the overall computing times, including both networking delays caused by the interdrone data transmission and computation delays. We prove that the problem is NP-hard and present two combinatorial formulations to model it. Computational results on the solution of the formulations show that one of them can be used to solve, within the configured time-limit, more than 50% of the considered real-world instances involving up to two hundred images and six drones.

Heuristics for optimizing 3D mapping missions over swarm-powered ad hoc clouds

Leandro R. Costa, Daniel Aloise, Luca G. Gianoli, Andrea Lodi
Journal Paper Journal of Heuristics, 2022

Abstract

Drones have been getting more and more popular in many sectors of the economy. Both scientific and industrial communities aim at making the impact of drones even more disruptive by empowering collaborative autonomous behaviors (also known as swarming behaviors) within fleets of multiple drones. In swarm-powered 3D mapping missions, unmanned aerial vehicles typically collect the aerial pictures of the target area whereas the 3D reconstruction process is performed in a centralized manner. However, such approaches do not leverage computational and storage resources from the swarm members. We address the optimization of a swarm-powered distributed 3D mapping mission for a real-life humanitarian emergency response application through the exploitation of a swarm-powered ad hoc cloud. Producing the relevant 3D maps in a timely manner, even when cloud connectivity is not available, is crucial to increase the chances of success of the operation. In this work, we present a mathematical programming heuristic based on decomposition and a variable neighborhood search heuristic to minimize the completion time of the 3D reconstruction process necessary in such missions. Our computational results reveal that the proposed heuristics either quickly reach optimality or improve the best known solutions for almost all tested realistic instances comprising up to 1000 images and fifteen drones.

A Practical Survey on Faster and Lighter Transformers

Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise
Preprint Paper arXiv preprint arXiv:2103.14636, March 2021

Abstract

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model solely based on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrarily long dependencies. The Transformer has improved the state-of-the-art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models' efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer's limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for the deep learning community to determine which methods to apply in practice to meet the desired trade-off between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make the Transformer faster and lighter and by providing a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions.
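The quadratic bottleneck mentioned above is visible directly in the vanilla attention computation, where an n by n score matrix is materialized; this NumPy sketch is a standard illustration, not code from the survey.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Vanilla attention: the (n, n) score matrix makes time and memory
    grow quadratically with the sequence length n, which is exactly the
    bottleneck that lower-complexity variants attack."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v

n, d = 512, 64
q, k, v = (np.random.randn(n, d) for _ in range(3))
print(scaled_dot_product_attention(q, k, v).shape)  # (512, 64), via a 512x512 matrix
```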

Exploring dual information in distance metric learning for clustering

Rodrigo Randel, Daniel Aloise, Alain Hertz
Preprint Paper arXiv preprint arXiv:2105.12703, May 2021

Abstract

Distance metric learning algorithms aim to appropriately measure similarities and distances between data points. In the context of clustering, metric learning is typically applied with the assistance of side-information provided by experts, most commonly expressed in the form of cannot-link and must-link constraints. In this setting, distance metric learning algorithms move closer pairs of data points involved in must-link constraints, while pairs of points involved in cannot-link constraints are moved away from each other. For these algorithms to be effective, it is important to use a distance metric that matches the expert knowledge, beliefs, and expectations, and the transformations made to satisfy the side-information should preserve the geometrical properties of the dataset. It is also desirable to filter the constraints provided by the experts, keeping only the most useful ones and rejecting those that can harm the clustering process. To address these issues, we propose to exploit the dual information associated with the pairwise constraints of the semi-supervised clustering problem. Experiments clearly show that distance metric learning algorithms benefit from integrating this dual information.

Soft Attention: Does it Actually Help to Learn Social Interactions in Pedestrian Trajectory Prediction?

Laurent Boucaud, Daniel Aloise, Nicolas Saunier
Preprint Paper arXiv preprint arXiv:2106.15321, June 2021

Abstract

We consider the problem of predicting the future path of a pedestrian using its motion history and the motion history of the surrounding pedestrians, called social information. Since the seminal paper on Social-LSTM, deep learning has become the main tool used to model the impact of social interactions on a pedestrian's motion. The demonstration that these models can learn social interactions relies on an ablative study: the models are compared with and without their social interaction module on two standard metrics, the Average Displacement Error and the Final Displacement Error. Yet, these complex models were recently outperformed by a simple constant-velocity approach, which calls into question whether they actually model social interactions, as well as the validity of the ablative proof. In this paper, we focus on deep learning models with a soft-attention mechanism for social interaction modeling and study whether they use social information at prediction time. We conduct two experiments across four state-of-the-art approaches on the ETH and UCY datasets, which were also used in previous work. First, the models are trained by replacing the social information with random noise and compared to models trained with actual social information. Second, we use a gating mechanism along with an L0 penalty, allowing models to shut down their inner components. The models consistently learn to prune their soft-attention mechanism. In both experiments, neither the course of convergence nor the prediction performance was altered. This demonstrates that the soft-attention mechanism, and therefore the social information, is ignored by the models.

Visual attractiveness in vehicle routing via bi-objective optimization

Diego Rocha, Daniel Aloise, Dario J. Aloise, Claudio Contardo
Journal Paper Computers & Operations Research, Volume 137, January 2022, Article 105507

Abstract

We consider the problem of designing vehicle routes in a distribution system that are at the same time cost-effective and visually attractive. In this paper we argue that clustering, a popular data mining task, provides a good proxy for visual attractiveness. Our claim is supported by the proposal of a bi-objective capacitated vehicle routing problem which, in addition to seeking travel-cost minimization, optimizes clustering criteria defined over the customers partitioned into the different routes. The model is solved by a multi-objective evolutionary algorithm to approximate its Pareto frontier. We show, by means of computational experiments, that our model is able to characterize vehicle routing solutions with low routing costs which are, at the same time, attractive according to the visual metrics proposed in the literature.

A Framework for Detecting System Performance Anomalies Using Tracing Data Analysis

Iman Kohyarnejadfard, Daniel Aloise, Michel R. Dagenais, Mahsa Shakeri
Journal Paper Entropy, Volume 23, Issue 8, August 2021, Article 1011

Abstract

Advances in technology and computing power have led to the emergence of complex and large-scale software architectures in recent years. However, they are prone to performance anomalies due to various reasons, including software bugs, hardware failures, and resource contentions. Performance metrics represent the average load on the system and do not help discover the cause of the problem if abnormal behavior occurs during software execution. Consequently, system experts have to examine a massive amount of low-level tracing data to determine the cause of a performance issue. In this work, we propose an anomaly detection framework that reduces troubleshooting time and guides developers to discover performance problems by highlighting anomalous parts in trace data. Our framework works by collecting streams of system calls during the execution of a process using the Linux Trace Toolkit Next Generation (LTTng), and sending them to a machine learning module that reveals anomalous subsequences of system calls based on their execution times and frequency. Extensive experiments on real datasets from two different applications (MySQL and Chrome), for varying scenarios in terms of available labeled data, demonstrate the effectiveness of our approach at distinguishing normal sequences from abnormal ones.
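As a rough illustration of the feature extraction step (window splitting with per-call frequency and execution time), here is a hedged sketch; the framework's actual features and windowing policy are more elaborate, and the window and stride values below are illustrative.

```python
from collections import Counter

def window_features(events, window=5, stride=5):
    """Split a stream of (syscall_name, duration) events into fixed-size
    windows and summarize each window by per-call frequency and total
    execution time: the two signal types the abstract mentions."""
    features = []
    for start in range(0, len(events) - window + 1, stride):
        chunk = events[start:start + window]
        freq = Counter(name for name, _ in chunk)
        total_time = sum(duration for _, duration in chunk)
        features.append({'freq': dict(freq), 'total_time': total_time})
    return features

events = [('read', 0.2), ('write', 0.1), ('read', 0.3),
          ('open', 0.5), ('read', 0.2), ('close', 0.1)]
print(window_features(events, window=3, stride=3))
```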

Heuristics for the dynamic facility location problem with modular capacities

Allyson Silva, Daniel Aloise, Leandro C. Coelho, Caroline Rocha
Journal Paper European Journal of Operational Research, Volume 290, Issue 2, April 2021, Pages 435–452

Abstract

This paper studies the Dynamic Facility Location Problem with Modular Capacities (DFLPM). It generalizes several facility location problems and consists in determining the locations and sizes of facilities so as to minimize location and demand allocation costs, with decisions taken periodically over a planning horizon. The DFLPM is solved using heuristics tailored for different scenarios and cost structures. We propose three linear-relaxation-based heuristics (LRH) and an evolutionary heuristic that hybridizes a genetic algorithm with a variable neighborhood descent (GA+VND). We adapt benchmark instances from the literature to yield several representations of scenarios and parameter structures. Experiments are reported comparing the heuristics to a state-of-the-art mixed integer programming (MIP) formulation for the problem. We show that the performance of the methods depends on the characteristics of the instance solved. For the benchmark instances, the LRH improved by VND finds solutions within 0.02% of the optimal ones in less than half of the time of the MIP. For the scenarios where construction costs are higher and module sizes are lower, the GA+VND proved to be effective at solving the problem, outperforming the LRH and the MIP. We also discuss the results from a practitioner's point of view to identify situations where each method is preferable.

On Improving Deep Learning Trace Analysis with System Call Arguments

Quentin Fournier, Daniel Aloise, Seyed Vahid Azhari, François Tetreault
Conference Paper 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), March 2021, Pages 120–130

Abstract

Kernel traces are sequences of low-level events comprising a name and multiple arguments, including a timestamp, a process id, and a return value, depending on the event. Their analysis helps uncover intrusions, identify bugs, and find latency causes. However, the effectiveness of such analyses is hindered when the event arguments are omitted. To remedy this limitation, we introduce a general approach to learning a representation of the event names along with their arguments using both embedding and encoding. The proposed method is readily applicable to most neural networks and is task-agnostic. The benefit is quantified by conducting an ablation study on three groups of arguments: call-related, process-related, and time-related. Experiments were conducted on a novel web request dataset and validated on a second dataset collected on pre-production servers by Ciena, our partnering company. By leveraging additional information, we were able to increase the performance of two widely-used neural networks, an LSTM and a Transformer, by up to 11.3% on two unsupervised language modelling tasks. Such tasks may be used to detect anomalies, pre-train neural networks to improve their performance, and extract a contextual representation of the events.
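A minimal PyTorch sketch of the general idea (embedding the categorical event name, encoding the numeric arguments, concatenating both) follows; the dimensions, vocabulary size, and argument encoder are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EventEncoder(nn.Module):
    """Represent each kernel event by the embedding of its name
    concatenated with an encoding of its numeric arguments. The
    resulting sequence can be fed to an LSTM or a Transformer."""
    def __init__(self, n_names=512, name_dim=32, n_args=3, arg_dim=16):
        super().__init__()
        self.name_embedding = nn.Embedding(n_names, name_dim)
        self.arg_encoder = nn.Sequential(
            nn.Linear(n_args, arg_dim), nn.ReLU(), nn.Linear(arg_dim, arg_dim))

    def forward(self, name_ids, args):
        # name_ids: (batch, seq) ints; args: (batch, seq, n_args) floats
        names = self.name_embedding(name_ids)    # (batch, seq, name_dim)
        encoded = self.arg_encoder(args)         # (batch, seq, arg_dim)
        return torch.cat([names, encoded], dim=-1)

enc = EventEncoder()
out = enc(torch.randint(0, 512, (2, 10)), torch.randn(2, 10, 3))
print(out.shape)  # torch.Size([2, 10, 48])
```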

A data-driven model for safety risk identification from flight data analysis

Rey Mickael, Daniel Aloise, François Soumis, Romanic Pieugueu
Journal Paper Transportation Engineering, Volume 5, September 2021, Article 100087

Abstract

Most aviation accidents take place in the final phase of a flight. One possible accident is the runway overrun, in which an aircraft unexpectedly leaves the runway on landing. Even though such accidents are well documented and studied in the aviation industry, this paper aims at identifying less direct links between data recorded by planes and the risk of runway overrun, or linked events. Indeed, a better understanding of these events using available flight data helps to reduce their number. Nonetheless, such analysis is not straightforward given the massive volume of data collected during flights. For that purpose, we propose a data-driven approach using data analysis methods and machine learning tools. After a quick correlation analysis, a boosted tree classifier was trained to classify flights as safe or at risk. The classifications were accurate enough to extract contributing factors, and a more extensive analysis was conducted on multiple airports. That analysis revealed the importance of particular factors, leading to new insights about potential approaches to aviation safety.

An Exact CP Approach for the Cardinality-Constrained Euclidean Minimum Sum-of-Squares Clustering Problem

Mohammed Najib Haouas, Daniel Aloise, Gilles Pesant
Conference Paper International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, September 2020, Pages 256–272

Abstract

Clustering consists in finding hidden groups from unlabeled data which are as homogeneous and well-separated as possible. Some contexts impose constraints on the clustering solutions, such as restrictions on the size of each cluster, known as cardinality-constrained clustering. In this work we present an exact approach to solve the Cardinality-Constrained Euclidean Minimum Sum-of-Squares Clustering Problem. We take advantage of the structure of the problem to improve several aspects of previous constraint programming approaches: lower bounds, domain filtering, and branching. Computational experiments on benchmark instances taken from the literature confirm that our approach improves solving capability over previously proposed exact methods for this problem.

Scaling Optimizations for Large-Scale Distributed Data with Lightweight Coresets

Daniel N. Pinheiro, Samuel Xavier-de-Souza, Daniel Aloise
Conference Paper 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2020, Pages 426–429

Abstract

Lightweight coresets are compact representations of data sets such that clustering methods present competitive results in relation to the complete data set. They are constructed by sampling important points from the complete set. We propose a fast method to approximate the sampling of lightweight coresets from very large data sets which are distributed among multiple machines. We show that the proposed method is much faster and more scalable, reaching results 48 times faster than the original lightweight coresets method, while preserving similar properties.
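For context, the basic single-machine lightweight coreset construction that the paper accelerates samples points with a probability mixing a uniform term and a distance-to-mean term; the sketch below follows that published construction, while the paper's distributed approximation is not shown.

```python
import numpy as np

def lightweight_coreset(X, m, seed=0):
    """Sample a lightweight coreset of m points: q(x) mixes a uniform
    term with a term proportional to the squared distance to the data
    mean, and each sampled point carries weight 1 / (m * q(x))."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist_sq = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    q = 0.5 / n + 0.5 * dist_sq / dist_sq.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    return X[idx], 1.0 / (m * q[idx])   # points and their weights

X = np.random.default_rng(1).normal(size=(10000, 2))
coreset, weights = lightweight_coreset(X, 200)
print(coreset.shape, round(weights.sum()))  # weights sum to roughly n
```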

RecSeats: A Hybrid Convolutional Neural Network Choice Model for Seat Recommendations at Reserved Seating Venues

Théo Moins, Daniel Aloise, Simon J. Blanchard
Conference Paper Fourteenth ACM Conference on Recommender Systems, September 2020, Pages 309–317

Abstract

Predicting locational choices (i.e., where one chooses to sit) is a challenging task because preferences are highly heterogeneous and depend not only on the location of the seats in the environment but also on the location of others. In the present research, we propose RecSeats, a framework to predict locational choices. The framework augments individual-level discrete choice models with a convolutional neural network (CNN) which can capture higher-order interactions between features of available seats. The framework is flexible and can accommodate complexity in real-world locational choice data, such as variability in the number of tickets purchased and the number and locations of past purchases. Applied to both locational choice experiment data and to ticketing data from a large North-American concert hall, we show that augmenting individual-level discrete choice models with a CNN consistently provides strong predictive accuracy.

A Soft Alignment Model for Bug Deduplication

Irving Muller Rodrigues, Daniel Aloise, Eraldo Rezende Fernandes, Michel Dagenais
Conference Paper Proceedings of the 17th International Conference on Mining Software Repositories, June 2020, Pages 43–53

Abstract

Bug tracking systems (BTS) are widely used in software projects. An important task in such systems consists of identifying duplicate bug reports, i.e., distinct reports related to the same software issue. For several reasons, reporting bugs that have already been reported is quite frequent, making their manual triage impractical in large BTSs. In this paper, we present a novel deep learning network based on soft-attention alignment to improve duplicate bug report detection. For a given pair of possibly duplicate reports, the attention mechanism computes interdependent representations for each report, which is more powerful than previous approaches. We evaluate our model on four well-known datasets derived from the BTSs of four popular open-source projects. Our evaluation is based on a ranking-based metric, which is more realistic than the decision-making metrics used in many previous works. The achieved results demonstrate that our model outperforms state-of-the-art systems and strong baselines in different scenarios. Finally, an ablation study is performed to confirm that the proposed architecture improves duplicate bug report detection.

Multi-level host-based intrusion detection system for Internet of things

Robin Gassais, Naser Ezzati-Jivan, Jose M. Fernandez, Daniel Aloise, Michel R. Dagenais
Journal Paper Journal of Cloud Computing, Volume 9, Issue 1, November 2020, Pages 1–16

Abstract

The growth of the Internet of things (IoT) has ushered in a new era of inter-connectivity and innovation in the home. Many devices, once separate, can now be interacted with remotely, improving efficiency and organization. This, however, comes at the cost of rising security vulnerabilities. Vendors are competing to create and quickly release innovative connected objects, without focusing on the security issues. As a consequence, attacks involving smart devices, or targeting them, are proliferating, creating threats to users' privacy and even their physical security. Additionally, the heterogeneous technologies involved in IoT make attempts to develop protection on smart devices much harder. Most of the intrusion detection systems developed for those platforms are based on network activity. However, on many systems, intrusions cannot easily or reliably be detected from network traces. We propose a novel host-based automated framework for intrusion detection. Our work combines user space and kernel space information and machine learning techniques to detect various kinds of intrusions in smart devices. Our solution uses tracing techniques to automatically capture device behavior, processes this data into numeric arrays to train several machine learning algorithms, and raises alerts whenever an intrusion is found. We implemented several machine learning algorithms, including deep learning ones, to achieve high detection capabilities, while adding little overhead on the monitored devices. We tested our solution within a realistic home automation system with actual threats.

Convex fuzzy k-medoids clustering

Daniel N. Pinheiro, Daniel Aloise, Simon J. Blanchard
Journal Paper Fuzzy Sets and Systems, Volume 389, Issue 1, June 2020, Pages 66–92

Abstract

K-medoids clustering is among the most popular methods for cluster analysis despite its use requiring several assumptions about the nature of the latent clusters. In this paper, we introduce the Convex Fuzzy k-Medoids (CFKM) model, which not only relaxes the assumption that objects must be assigned entirely to one and only one medoid, but also the assumption that medoids must be assigned entirely to one and only one cluster. The resulting model is convex, and thus its resolution is completely robust to initialization. To illustrate the usefulness of the CFKM model, we compare it with two fuzzy k-medoids clustering models: the Fuzzy k-Medoids (FKM) and the Fuzzy Clustering with Multi-Medoids (FMMdd), both solved approximately by heuristics because of their computational hardness. Our experiments with both synthetic and real-world data, as well as a user survey, reveal that the model is not only more robust to the choice of hyperparameters of the fuzzy clustering task, but also that it can uniquely discover important aspects of data that are inherently fuzzy in nature.

Automatic cause detection of performance problems in web applications

Quentin Fournier, Naser Ezzati-Jivan, Daniel Aloise, Michel R. Dagenais
Conference Paper IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), October 2019, Pages 398–405

Abstract

The execution of similar units can be compared by their internal behaviors to determine the causes of their potential performance issues. For instance, by examining the internal behaviors of different fast or slow web requests more closely, and by clustering and comparing their internal executions, one can determine what causes some requests to run slowly or behave in unexpected ways. In this paper, we propose a method of extracting the internal behavior of web requests as well as introduce a pipeline that detects performance issues in web requests and provides insights into their root causes. First, low-level and fine-grained information regarding each request is gathered by tracing both the user space and the kernel space. Second, further information is extracted and fed into an outlier detector. Finally, these outliers are then clustered by their behavior, and each group is analyzed separately. Experiments revealed that this pipeline is indeed able to detect slow web requests and provide additional insights into their true root causes. Notably, we were able to identify a real PHP cache contention issue using the proposed approach.

System performance anomaly detection using tracing data analysis

Iman Kohyarnejadfard, Mahsa Shakeri, Daniel Aloise
Conference Paper Proceedings of the 5th International Conference on Computer and Technology Applications, April 2019, Pages 169–173

Abstract

In recent years, distributed systems have become increasingly complex as they grow in both scale and functionality. Such complexity makes these systems prone to performance anomalies. Efficient anomaly detection frameworks enable rapid recovery mechanisms to increase the system's reliability. In this paper, we present an anomaly detection approach for practical monitoring of processes running on a system to detect anomalous vectors of system calls. Our proposed methodology employs a Linux tracing toolkit (LTTng) to monitor the processes running on a system and extracts the streams of system calls. The system call streams are split into short sequences using a sliding window strategy. Unlike previous studies, our proposed approach computes the execution time of system calls in addition to the frequency of each individual call in a window. Finally, a multi-class support vector machine approach is applied to evaluate the performance of the system and detect the anomalous sequences. A comprehensive experimental study on a real dataset collected using LTTng demonstrates that our proposed method is able to distinguish normal sequences from anomalous ones with CPU- or memory-related problems.

Empirical comparison between autoencoders and traditional dimensionality reduction methods

Quentin Fournier, Daniel Aloise
Conference Paper IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), June 2019, Pages 211–214

Abstract

In order to efficiently process ever-higher-dimensional data such as images, sentences, or audio recordings, one needs to find a proper way to reduce the dimensionality of such data. In this regard, SVD-based methods including PCA and Isomap have been extensively used. Recently, a neural network alternative called the autoencoder has been proposed and is often preferred for its higher flexibility. This work aims to show that PCA is still a relevant technique for dimensionality reduction in the context of classification. To this purpose, we evaluated the performance of PCA compared to Isomap, a deep autoencoder, and a variational autoencoder. Experiments were conducted on three commonly used image datasets: MNIST, Fashion-MNIST, and CIFAR-10. The four dimensionality reduction techniques were separately employed on each dataset to project data into a low-dimensional space. Then a k-NN classifier was trained on each projection with a cross-validated random search over the number of neighbours. Interestingly, our experiments revealed that k-NN achieved comparable accuracy on the PCA and both autoencoder projections, provided the projection dimension was large enough. However, PCA computation time was two orders of magnitude faster than its neural network counterparts.
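The protocol is straightforward to reproduce in miniature with scikit-learn; the sketch below uses the small digits dataset as a stand-in and a fixed number of neighbours instead of the cross-validated random search.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Project with PCA, then score a k-NN classifier on the projection.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pca = PCA(n_components=16).fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(pca.transform(X_tr), y_tr)
print('k-NN accuracy on the PCA projection:',
      knn.score(pca.transform(X_te), y_te))
```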

A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming

Demetrios Coutinho, Felipe O. Silva, Daniel Aloise, Samuel Xavier-de-Souza
Preprint Paper arXiv preprint arXiv:1804.04737, May 2019

Abstract

The Simplex tableau has been broadly used and investigated in industry and academia. With the advent of the big data era, ever larger problems are posed to be solved on ever larger machines whose architecture type did not exist at the conception of this algorithm. In this paper, we present a shared-memory parallel implementation of the Simplex tableau algorithm for dense large-scale Linear Programming (LP) problems for use in modern multi-core architectures. We present the general scheme and explain the strategies taken to parallelize each step of the standard simplex algorithm, emphasizing the solutions found to resolve performance bottlenecks. We analyzed the speedup and the parallel efficiency of the proposed implementation relative to the standard Simplex algorithm using a shared-memory system with 64 processing cores. The experiments were performed for several different problems, with up to 8192 variables and constraints, in their primal and dual formulations. The results show that performance is generally much better when using the formulation with more variables than inequality constraints. They also show that the parallelization strategies applied to avoid bottlenecks allow the implementation to scale well with the problem size and the core count, up to a certain limit of problem size. Further analysis showed that this scaling limit was an effect of resource limitation. Even so, our implementation was able to reach speedups on the order of 19x.

Ascent–descent variable neighborhood decomposition search for community detection by modularity maximization

Dušan Džamić, Daniel Aloise, Nenad Mladenović
Journal Paper Annals of Operations Research, Volume 272, Issue 1-2, June 2019, Pages 273–287

Abstract

In this paper we propose a new variant of the Variable Neighborhood Decomposition Search (VNDS) heuristic for solving global optimization problems and apply it to detecting clusters in complex community networks. The most widely used criterion for detecting communities in networks, despite some recent criticism, is modularity maximization. Experimental results show that the proposed new heuristic outperforms the best algorithms found in the literature.
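For reference, the modularity criterion being maximized can be computed directly from the adjacency matrix; this small NumPy function implements the standard Newman-Girvan definition, independent of the heuristic itself.

```python
import numpy as np

def modularity(A, communities):
    """Newman-Girvan modularity of a partition of an undirected graph.
    A is a symmetric adjacency matrix; communities[i] is the community
    label of node i."""
    k = A.sum(axis=1)                    # node degrees
    two_m = A.sum()                      # 2m: each edge counted twice
    c = np.asarray(communities)
    same = c[:, None] == c[None, :]      # same-community indicator
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# Two triangles joined by one edge, split into their natural communities.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, [0, 0, 0, 1, 1, 1]))  # about 0.357
```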

The Weighted Fair Sequences Problem

Bruno Jefferson de S. Pessoa, Daniel Aloise, Lucidio A. F. Cabral
Journal Paper Computers & Operations Research, Volume 91, March 2018, Pages 121–131

Abstract

Scheduling problems in which constraints are imposed on the temporal distances between successive executions of the same task have numerous applications, ranging from task scheduling in real-time systems to automobile production on a mixed-model assembly line. This paper introduces a new NP-hard optimization problem belonging to this class of problems, namely the Weighted Fair Sequences Problem (WFSP). We present a mathematical formulation for the WFSP based on mixed-integer linear programming (MILP) as well as a series of cuts to improve its resolution via exact methods. Finally, we propose a heuristic solution method that works with far fewer variables than the full WFSP formulation. The reported computational experiments show that, for a given time horizon, the proposed MILP-based heuristic increases the size of WFSP instances that can be tackled in practice. Moreover, its results are optimal if a conjecture presented for the WFSP is proved true in the future.

Efficient heuristics for the minimum labeling global cut problem

Thiago Gouveia da Silva, Gilberto F. de Sousa Filho, Igor A. M. Barbosa, Nenad Mladenović, Lucidio A. F. Cabral, Luiz Satoru Ochi, Daniel Aloise
Journal Paper Electronic Notes in Discrete Mathematics, Volume 66, April 2018, Pages 23–30

Abstract

Let G = (V, E, L) be an edge-labeled graph, where V is the set of vertices, E the set of edges, and L the set of labels (colors), such that each edge e in E has an associated label l(e) in L. The goal of the minimum labeling global cut problem (MLGCP) is to find a subset L' of L such that the graph obtained from G by removing every edge labeled with a label in L' is not connected and |L'| is minimized. In this work, we generate random instances for the MLGCP to perform empirical tests. We also propose a set of heuristics based on concepts from Genetic Algorithms and the VNS metaheuristic, including some of their procedures, such as two local search moves, and an auxiliary data structure to speed up the local search. Computational experiments show that the VNS metaheuristic outperforms the other methods with respect to solution quality.
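The feasibility condition in this definition is easy to check programmatically: remove every edge carrying a removed label and test connectivity. The sketch below does exactly that with networkx; the instance is a made-up example.

```python
import networkx as nx

def is_global_label_cut(edges, removed_labels):
    """Check the MLGCP feasibility condition: removing every edge whose
    label is in removed_labels must disconnect the graph.
    edges is a list of (u, v, label) triples describing G."""
    g = nx.Graph()
    for u, v, label in edges:
        g.add_node(u)
        g.add_node(v)
        if label not in removed_labels:
            g.add_edge(u, v)
    return g.number_of_nodes() > 1 and not nx.is_connected(g)

edges = [(1, 2, 'red'), (2, 3, 'blue'), (1, 3, 'red'), (3, 4, 'green')]
print(is_global_label_cut(edges, {'green'}))  # True: node 4 is cut off
```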

A review of dynamic data envelopment analysis: State of the art and applications

Fernanda B. A. R. Mariz, Mariana R. Almeida, Daniel Aloise
Journal Paper International Transactions in Operational Research, Volume 25, Issue 2, November 2018, Pages 469–505

Abstract

This article reports the evolution of the literature on Dynamic Data Envelopment Analysis (DDEA) models from 1996 to 2016. Systematic searches in the databases Scopus and Web of Science were performed to outline the state of the art. The results enabled the establishment of DDEA studies as the scope of this article, analyzing the transition elements to represent temporal interdependence. The categorization of these studies enabled the mapping of the evolution of the DDEA literature and identification of the relationships between models. The three most widely adopted studies to conduct DDEA research were classified as structuring models. Mapping elucidated the literature behavior through three phases and showed an increase in publications with applications in recent years. The analysis of applications indicated that most studies address evaluations in the agriculture and farming, banking and energy sectors and consider the facilities as transition elements between analysis periods.

A sampling-based exact algorithm for the solution of the minimax diameter clustering problem

Daniel Aloise, Claudio Contardo
Journal Paper Journal of Global Optimization, Volume 71, Issue 3, March 2018, Pages 613–630

Abstract

We consider the problem of clustering a set of points so as to minimize the maximum intra-cluster dissimilarity, which is strongly NP-hard. Exact algorithms for this problem can handle datasets containing up to a few thousand observations, largely insufficient for today's needs. The most popular heuristic for this problem, the complete-linkage hierarchical algorithm, provides feasible solutions that are usually far from optimal. We introduce a sampling-based exact algorithm aimed at solving large-sized datasets. The algorithm alternates between the solution of an exact procedure on a small sample of points and a heuristic procedure to prove the optimality of the current solution. Our computational experience shows that our algorithm is capable of solving to optimality problems containing more than 500,000 observations within moderate time limits, which is two orders of magnitude larger than the limits of previous exact methods.
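The alternation described above can be rendered as a toy loop: solve the sample exactly (here by brute force, so only tiny samples are viable), then look for an outside point that cannot join any cluster without exceeding the sample's optimal diameter. This is a conceptual sketch only; checking outside points one at a time against the sample does not reproduce the paper's actual optimality proof.

```python
import numpy as np
from itertools import product

def diameter(P, labels, k):
    """Largest intra-cluster pairwise distance of a labeled point set."""
    worst = 0.0
    for c in range(k):
        Q = P[labels == c]
        for i in range(len(Q) - 1):
            worst = max(worst, np.linalg.norm(Q[i] - Q[i + 1:], axis=1).max())
    return worst

def brute_force_minimax(P, k):
    """Placeholder exact solver, exponential in the sample size."""
    best, best_labels = np.inf, None
    for labels in product(range(k), repeat=len(P)):
        labels = np.array(labels)
        d = diameter(P, labels, k)
        if d < best:
            best, best_labels = d, labels
    return best, best_labels

def sampling_minimax(points, k, init=4, seed=0):
    sample = list(np.random.default_rng(seed).choice(len(points), init, replace=False))
    while True:
        bound, labels = brute_force_minimax(points[sample], k)  # lower bound
        for i in range(len(points)):                            # try to extend
            if i in sample:
                continue
            fits = any(all(np.linalg.norm(points[i] - points[s]) <= bound
                           for s, l in zip(sample, labels) if l == c)
                       for c in range(k))
            if not fits:
                sample.append(i)   # a violating point enters the sample
                break
        else:
            return bound           # every point fits: the bound is attained

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(8, 1, (10, 2))])
print(sampling_minimax(pts, k=2))
```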

Parallel synchronous and asynchronous coupled simulated annealing

Kayo Gonçalves-e-Silva, Daniel Aloise, Samuel Xavier-de-Souza
Journal Paper The Journal of Supercomputing, Volume 74, Issue 6, March 2018, Pages 2841–2869

Abstract

We propose parallel synchronous and asynchronous implementations of the coupled simulated annealing (CSA) algorithm in a shared-memory architecture. The original CSA was implemented synchronously in a distributed-memory architecture. It synchronizes at each temperature update, which leads to idling and loss of efficiency as the number of processors increases. The proposed synchronous CSA (SCSA) is implemented as the original, but in a shared-memory architecture. The proposed asynchronous CSA (ACSA) does not synchronize, allowing larger parallel efficiency for larger numbers of processors. Results from extensive experiments show that the proposed ACSA presents much better solution quality when compared to the serial version and to the SCSA. The experiments also show that the performance of the proposed ACSA is better than that of the SCSA for less computationally intensive problems or when a larger number of processing cores is available. Moreover, the parallel efficiency of the ACSA improves as the size of the problem increases. With the advent of the multi-core era, the use of the proposed algorithm becomes more attractive than the original synchronous CSA.

Less is more: simplified Nelder-Mead method for large unconstrained optimization

Daniel Aloise, Samuel Xavier-de-Souza, Nenad Mladenović, et al.
Journal Paper Yugoslav Journal of Operations Research, Volume 28, Issue 2, March 2018, Pages 153–169

Abstract

The Nelder-Mead method (NM) for solving continuous non-linear optimization problems is probably the most cited and most used method in the optimization literature and in practical applications. It belongs to the direct search methods, those which do not use first- and second-order derivatives. The popularity of NM is based on its simplicity. In this paper we propose an even simpler algorithm for larger instances that follows the NM idea. We call it Simplified NM (SNM): instead of generating all n + 1 simplex points in R^n, we perform the search using just q + 1 vertices, where q is usually much smaller than n. Though the results cannot be better than after performing calculations in n + 1 points as in NM, the significant speed-up allows SNM to be run many times from different starting solutions, usually yielding better results than those obtained by NM within the same CPU time. Computational analysis is performed on 10 classical convex and non-convex instances, where the number of variables n can be arbitrarily large. The obtained results show that SNM is more effective than the original NM, confirming that the Less Is More Approach (LIMA) yields good results when solving a continuous optimization problem.

Towards Station-Level Demand Prediction for Effective Rebalancing in Bike-Sharing Systems

Pierre Hulot, Daniel Aloise, Sanjay Dominik Jena
Conference Paper Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, July 2018, Pages 378–386

Abstract

Bike sharing systems continue gaining worldwide popularity as they offer benefits on various levels, from society to the environment. Given that those systems tend to become unbalanced over time, bikes are typically redistributed throughout the day to better meet the demand. Reasonably accurate demand prediction is key to effective redistribution; however, it has received only little attention in the literature. In this paper, we focus on predicting the hourly demand for bike rentals and returns at each station of the system. The proposed model uses temporal and weather features to predict demand mean and variance. It first extracts the main traffic behaviors from the stations. These simplified behaviors are then predicted and used to perform station-level predictions based on machine learning and statistical inference techniques. We then focus on determining decision intervals, which are often used by bike sharing companies for their online rebalancing operations. Our models are validated on a two-year period of real data from BIXI Montréal. A worst-case analysis suggests that the intervals generated by our models may decrease unsatisfied demand by 30% when compared to the current methodology employed in practice.
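The decision-interval step can be illustrated with a toy rule: given a predicted mean and variance for the hourly demand, form a symmetric interval under a normality assumption. This construction is only illustrative; the paper's actual intervals are tuned to rebalancing practice.

```python
import math

def decision_interval(mean, var):
    """Toy decision interval from a predicted demand mean and variance,
    assuming normally distributed demand (an illustrative rule, not the
    one used by the paper or by the operator)."""
    z = 1.6449  # standard normal quantile for a ~95% one-sided level
    half_width = z * math.sqrt(var)
    return mean - half_width, mean + half_width

print(decision_interval(mean=12.0, var=9.0))  # about (7.07, 16.93)
```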

Review of basic local searches for solving the minimum sum-of-squares clustering problem

Thiago Pereira, Daniel Aloise, Jack Brimberg, Nenad Mladenović
Book Chapter Open Problems in Optimization and Data Analysis, Volume 141, December 2018, Pages 249–270

Abstract

This paper presents a review of the well-known K-means, H-means, and J-means heuristics, and their variants, that are used to solve the minimum sum-of-squares clustering problem. We then develop two new local searches that combine these heuristics in a nested and sequential structure, also referred to as variable neighborhood descent. In order to show how these local searches can be implemented within a metaheuristic framework, we apply the new heuristics in the local improvement step of two variable neighborhood search (VNS) procedures. Computational experiments are carried out which suggest that this new and simple application of VNS is comparable to the state of the art. In addition, a very significant improvement (over 30%) in solution quality is obtained for the largest problem instance investigated containing 85,900 entities.

On the k-medoids model for semi-supervised clustering

Rodrigo Randel, Daniel Aloise, Nenad Mladenović, Pierre Hansen
Conference Paper International Conference on Variable Neighborhood Search, Volume 11328, March 2018, Pages 13–27

Abstract

Clustering is an automated and powerful technique for data analysis. It aims to divide a given set of data points into clusters which are homogeneous and/or well separated. A major challenge with clustering is to define an appropriate clustering criterion that can express a good separation of data into homogeneous groups such that the obtained clustering solution is meaningful and useful to the user. To circumvent this issue, it is suggested that the domain expert could provide background information about the dataset, which can be incorporated by a clustering algorithm in order to improve the solution. Performing clustering under this assumption is known as semi-supervised clustering. This work explores semi-supervised clustering through the k-medoids model. Results obtained by a Variable Neighborhood Search (VNS) heuristic show that the k-medoids model presents classification accuracy comparable to that of the traditional k-means approach. Furthermore, the model demonstrates high flexibility and performance by combining kernel projections with pairwise constraints.

Extracting summary piles from sorting task data

Simon J. Blanchard, Daniel Aloise, Wayne S. DeSarbo
Journal Paper Journal of Marketing Research, Volume 54, Issue 3, June 2017, Pages 398–414

Abstract

In a sorting task, consumers receive a set of representational items (e.g., products, brands) and sort them into piles such that the items in each pile “go together.” The sorting task is flexible in accommodating different instructions and has been used for decades in exploratory marketing research in brand positioning and categorization. However, no general analytic procedures yet exist for analyzing sorting task data without performing arbitrary transformations to the data that influence the results and insights obtained. This manuscript introduces a flexible framework for analyzing sorting task data, as well as a new optimization approach to identify summary piles, which provide an easy way to explore associations consumers make among a set of items. Using two Monte Carlo simulations and an empirical application of single-serving snacks from a local retailer, the authors demonstrate that the resulting procedure is scalable, can provide additional insights beyond those offered by existing procedures, and requires mere minutes of computational time.

NP-Hardness of balanced minimum sum-of-squares clustering

Artem Pyatkin, Daniel Aloise, Nenad Mladenović
Journal Paper Pattern Recognition Letters, Volume 97, October 2017, Pages 44–45

Abstract

The balanced clustering problem consists of partitioning a set of n objects into K equal-sized clusters, as long as n is a multiple of K. A popular clustering criterion when the objects are points of a q-dimensional space is the minimum sum of squared distances from each point to the centroid of the cluster to which it belongs. We show in this paper that this problem is NP-hard in general dimension already for triplets, i.e., when n/K = 3.

A polynomial-time algorithm for the discrete facility location problem with limited distances and capacity constraints

Isaac F. Fernandes, Daniel Aloise, Dario J. Aloise, Thiago P. Jeronimo
Journal Paper Brazilian Journal of Operations & Production Management, Volume 14, Issue 1, July 2017, Pages 136–144

Abstract

The objective of the facility location problem with limited distances is to minimize the sum of distance functions from the facility to its clients, but with a limit on each of these distances, beyond which the corresponding function becomes constant. The problem is applicable in situations where the service provided by the facility is insensitive beyond given threshold distances. In this paper, we propose a polynomial-time algorithm for the discrete version of the problem with capacity constraints regarding the number of served clients. These constraints are relevant for introducing quality measures in facility location decision processes as well as for justifying the creation of the facility.

On strategies to fix degenerate k-means solutions

Daniel Aloise, Nielsen Castelo Damasceno, Nenad Mladenović, Daniel Nobre Pinheiro
Journal Paper Journal of Classification, Volume 34, Issue 2, June 2017, Pages 165–190

Abstract

k-means is a benchmark algorithm used in cluster analysis. It belongs to the large category of heuristics based on location-allocation steps that alternately locate cluster centers and allocate data points to them until no further improvement is possible. Such heuristics are known to suffer from a phenomenon called degeneracy, in which some of the clusters are empty. In this paper, we compare and propose a series of strategies to circumvent degenerate solutions during a k-means execution. Our computational experiments show that these strategies are effective, leading to better clustering solutions in the vast majority of the cases in which degeneracy appears in k-means. Moreover, we compare the use of our fixing strategies within k-means against the use of two initialization methods found in the literature. These results demonstrate how useful the proposed strategies can be, especially inside memory-based clustering algorithms.
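One of the simplest repair strategies can be sketched in a few lines: whenever a cluster comes out empty, re-seed it with the point currently farthest from its assigned center. The exact strategies compared in the paper differ in how this replacement point is chosen, so treat the rule below as illustrative.

```python
import numpy as np

def fix_degenerate_clusters(X, labels, centers):
    """Repair a degenerate k-means iterate: every empty cluster steals
    the point farthest from its currently assigned center (one simple
    illustrative strategy among those the paper compares)."""
    for c in range(len(centers)):
        if np.any(labels == c):
            continue                          # cluster c is not empty
        dist = np.linalg.norm(X - centers[labels], axis=1)
        worst = int(np.argmax(dist))          # farthest from its own center
        labels[worst] = c                     # re-seed the empty cluster
        centers[c] = X[worst]
    return labels, centers

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 0, 0])                  # cluster 1 is empty (degenerate)
centers = np.array([[0.5, 0.0], [9.0, 9.0]])
print(fix_degenerate_clusters(X, labels, centers))
```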

Less is more: basic variable neighborhood search heuristic for balanced minimum sum-of-squares clustering

Leandro R. Costa, Daniel Aloise, Nenad Mladenović
Journal Paper Information Sciences, Volume 415, November 2017, Pages 247–253

Abstract

Clustering addresses the problem of finding homogeneous and well-separated subsets, called clusters, from a set of given data points. In addition to the points themselves, in many applications there may exist constraints regarding the size of the clusters to be found. Particularly in balanced clustering, these constraints impose that the entities be equally spread among the different clusters. In this work, we present a basic variable neighborhood search heuristic for balanced minimum sum-of-squares clustering, following the recently proposed “Less Is More Approach”. Computational experiments and statistical tests show that the proposed algorithm outperforms the current state-of-the-art algorithm for the problem, indicating that unsophisticated and easy-to-implement metaheuristic methods can be sufficient to produce successful results in practice.

A model for clustering data from heterogeneous dissimilarities

Éverton Santi, Daniel Aloise, Simon J. Blanchard
Journal Paper European Journal of Operational Research, Volume 253, Issue 3, September 2016, Pages 659–672

Abstract

Clustering algorithms partition a set of n objects into p groups (called clusters), such that objects assigned to the same group are homogeneous according to some criteria. To derive these clusters, the required data input is often a single n × n dissimilarity matrix. Yet for many applications, more than one instance of the dissimilarity matrix is available, and so, to conform to model requirements, it is common practice to aggregate (e.g., sum up, average) the matrices. This aggregation practice results in clustering solutions that mask the true nature of the original data. In this paper we introduce a clustering model which, to handle the heterogeneity, uses all available dissimilarity matrices and identifies groups of individuals who cluster the objects in a similar way. The model is a nonconvex problem and difficult to solve exactly, and we thus introduce a Variable Neighborhood Search heuristic to provide solutions efficiently. Computational experiments and an empirical application to the perception of chocolate candy show that the heuristic algorithm is efficient and that the proposed model is suited for recovering heterogeneous data. Implications for clustering researchers are discussed.

A New Global Optimization Algorithm for Diameter Minimization Clustering

Daniel Aloise, Claudio Contardo
Conference Paper XIII GLOBAL OPTIMIZATION WORKSHOP, GOW’16, Volume 16, September 2016, Pages 171--174

Abstract

We introduce an iterative algorithm for the solution of the diameter minimization clustering problem. Our algorithm is based upon two observations: (i) subsets induce lower bounds on the value of the optimal solution of the original problem; and (ii) there exists a subset whose optimal clustering has the same value as that of the original problem. Computational experiments show that our algorithm takes two seconds or less to solve all tested problems, including one with 5000 points which was previously solved in almost a minute by the current state-of-the-art constraint programming method found in the literature.
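
A hedged sketch of the subset-growing idea behind observations (i) and (ii): solve the subproblem exactly on a small subset (which yields a lower bound), try to extend its clustering to all points without exceeding that bound, and otherwise enlarge the subset. The oracle solve_exact and the greedy extension rule below are our own illustrative placeholders, not the paper's implementation.

def iterative_diameter_clustering(X, k, solve_exact, dist, seed=5):
    # solve_exact(points, k) -> (optimal max diameter, label per point);
    # a hypothetical exact oracle (e.g., a constraint programming model).
    S = list(range(min(seed, len(X))))
    while True:
        value, labels = solve_exact([X[i] for i in S], k)   # lower bound (obs. i)
        clusters = {c: [i for i, c2 in zip(S, labels) if c2 == c]
                    for c in range(k)}
        violator = None
        for i in range(len(X)):
            if i in S:
                continue
            fits = [c for c in range(k)
                    if all(dist(X[i], X[m]) <= value for m in clusters[c])]
            if fits:
                clusters[fits[0]].append(i)   # extension keeps diameter <= value
            else:
                violator = i
                break
        if violator is None:
            return value, clusters            # matches the lower bound: optimal
        S.append(violator)                    # grow the subset and re-solve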

A derivative-free algorithm for refining numerical microaggregation solutions

Daniel Aloise, Arthur Araújo
Journal Paper International Transactions in Operational Research, Volume 22, Issue 4, November 2015, Pages 693--712

Abstract

The biggest challenge in disclosing private data is to share information contained in databases while protecting people from being individually identified. Microaggregation is a family of methods for statistical disclosure control. The principle of microaggregation states that confidentiality rules permit the publication of individual records if they are divided into groups of g or more records, where none of the records is more representative than the others in the same group. The application of such rules leads to replacing individual values by those computed from small groups, denoted as microaggregates, before data publication. The microaggregation procedure should reduce the information loss caused by this replacement so that valuable and accurate information can still be extracted from the microaggregated data. This work proposes a derivative-free optimization algorithm for the numerical microaggregation problem based on the evaluation of a simplified surrogate function. Computational experiments show that the proposed method can considerably improve the results obtained by other heuristics found in the literature, improving several of their best-known results.
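
In standard notation (ours, not necessarily the paper's), the underlying optimization problem is to choose groups minimizing the sum-of-squares information loss subject to the minimum group size g; the upper bound 2g-1 reflects the classical observation that any larger group can be split without increasing the loss:

\min_{G_1,\dots,G_m} \; \sum_{r=1}^{m} \sum_{x_i \in G_r} \lVert x_i - \bar{x}_r \rVert^2
\quad \text{s.t.} \quad g \le |G_r| \le 2g-1, \;\; r = 1,\dots,m,

where \bar{x}_r denotes the centroid of group G_r.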

Improved Variable Neighborhood Decomposition Search for Community Detection by Modularity Maximization

Nenad Mladenović, Daniel Aloise, Dragan Urošević, Dušan Džamić
Conference Paper XLII International Symposium on Operational Research, September 2015

Abstract

The analysis of complex networks plays an important role in computer science, biology and the social sciences, among other fields. Many real-life networks, such as communication, biological and social networks, display a community structure that identifies groups of nodes within which connections are denser than between them. Community detection is an interdisciplinary subject with a vast spectrum of applications that has attracted the interest of many researchers in various fields in the past few years. The most used criterion for that purpose, despite some recent criticism, is modularity maximization, proposed by Newman and Girvan. In this paper we present a Variable Neighborhood Decomposition Search (VNDST) for solving the modularity maximization problem, with an additional mechanism to escape local maxima. We evaluate the performance of our VNDST algorithm on the well-known set of clustering instances from the 10th DIMACS Implementation Challenge. Experimental results show that the proposed VNDST algorithm outperforms other algorithms from the literature.

A simple and effective genetic algorithm for the two-stage capacitated facility location problem

Diogo R. M. Fernandes, Caroline Rocha, Daniel Aloise, Glaydston M. Ribeiro, Enilson M. Santos, Allyson Silva
Journal Paper Computers & Industrial Engineering, Volume 75, September 2014, Pages 200--208

Abstract

This paper presents a simple and effective Genetic Algorithm (GA) for the two-stage capacitated facility location problem (TSCFLP). The TSCFLP is a typical location problem arising in freight transportation. In this problem, a single product must be transported from a set of plants to meet customer demands, passing through intermediate depots. The objective is to minimize the operating costs of the underlying two-stage transportation system while satisfying the demand and capacity constraints of its agents. For this purpose, a GA is proposed, and computational results are reported comparing the heuristic results with those obtained by two state-of-the-art Lagrangian heuristics proposed in the literature for the problem.

Reactive search strategies using reinforcement learning, local search algorithms and variable neighborhood search

João Paulo Queiroz dos Santos, Jorge Dantas de Melo, Adrião Dória Duarte Neto, Daniel Aloise
Journal Paper Expert Systems with Applications, Volume 41, Issue 10, August 2014, Pages 4939--4949

Abstract

Optimization techniques known as metaheuristics have been applied successfully to many different problems, but their development hinges on an appropriate selection of parameter values for their execution. Whenever a parameter must be adjusted, it is tested until viable results are obtained, normally by the developer deploying the metaheuristic. The quality of the results on a tested instance [the term instance refers to an assignment of values to the input variables of a problem] does not transfer to instances not yet tested, and this feedback may require a slow process of “trial and error” in which the algorithm is adjusted for each specific application. In this context, Reactive Search emerged, advocating the integration of machine learning into heuristic search for solving complex optimization problems. Building on the integration that Reactive Search proposes between machine learning and metaheuristics, we use Reinforcement Learning, more specifically the Q-learning algorithm with a reactive behavior, to select which local search is the most appropriate at a given moment of the search, succeeding another local search that can no longer improve the current solution within the VNS metaheuristic. We propose a reactive implementation using Reinforcement Learning for the self-tuning of the implemented algorithm, applied to the Symmetric Travelling Salesman Problem.
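
The operator-selection idea can be sketched as a stateless, epsilon-greedy Q-learning rule over a pool of local searches; the reward, step size and pool below are illustrative assumptions, not the paper's exact design.

import random

def q_select_and_apply(local_searches, cost, x, q, alpha=0.1, eps=0.2):
    # Stateless Q-learning over local-search operators: choose one
    # epsilon-greedily, apply it, and reward it by the cost improvement.
    if random.random() < eps:
        a = random.randrange(len(local_searches))                  # explore
    else:
        a = max(range(len(local_searches)), key=lambda i: q[i])    # exploit
    y = local_searches[a](x)
    reward = cost(x) - cost(y)          # positive when the operator improves x
    q[a] += alpha * (reward - q[a])     # incremental Q-value update
    return (y if reward > 0 else x), q

Within a VNS loop, such a selector would replace a stalled local search by the operator with the best learned value, rather than following a fixed order.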

Column generation bounds for numerical microaggregation

Daniel Aloise, Pierre Hansen, Caroline Rocha, Éverton Santi
Journal Paper Journal of Global Optimization, Volume 60, Issue 2, February 2014, Pages 165--182

Abstract

The biggest challenge when disclosing private data is to share information contained in databases while protecting people from being individually identified. Microaggregation is a family of methods for statistical disclosure control. The principle of microaggregation is that confidentiality rules permit the publication of individual records if they are partitioned into groups of size greater than or equal to a fixed threshold value, where none is more representative than the others in the same group. The application of such rules leads to replacing individual values by those computed from small groups (microaggregates), before data publication. This work proposes a column generation algorithm for numerical microaggregation whose pricing problem is solved by a specialized branch-and-bound. The algorithm is able to find, for the first time, lower bounds for instances of three real-world datasets commonly used in the literature. Furthermore, new best known solutions are obtained for these instances by means of a simple heuristic method using the columns generated.

A Discrete-Event Simulation Model for Sizing Hospital Beds According to the Demand per Medical Specialty

Hérica Débora Praxedes de Freitas, Daniel Aloise, Werner Kleyson da Silva Soares, Woldermacdowell Alves Paquerote
Conference Paper XLVI SBPO, Salvador, BA, Brazil, September 2014, Pages 2806--2817

Abstract

The hospital is a place where complex actions are performed in order to serve the population. These actions involve medical consultations, exams, surgeries, emergency care services and hospitalizations, which are mingled with the impatience, despair and distress of patients and their families. The number of people requiring hospital services increases steadily, while the supply of hospital beds does not grow in the same proportion. This results in hospital overcrowding, a decrease in the quality of care services, dissatisfaction among health professionals and difficulty in managing common hospital activities. This paper presents a model based on discrete-event simulation for the management of hospital beds in accordance with the demand per medical specialty. The results demonstrated that the reallocation of resources could improve the occupancy rate of the existing beds in the hospital under study.

A Variable Neighborhood Search Matheuristic for the Heterogeneous P-Median Problem

Éverton Santi, Daniel Aloise, Simon J. Blanchard
Conference Paper XII Global Optimization Workshop, Mathematical and Applied Global Optimization, MAGO, September 2014, Pages 37--40

Abstract

The p-median problem is a model that partitions n objects into p clusters by minimizing the sum of distances from each point to the center of its cluster using, as input, a single dissimilarity matrix between the objects. This involves simultaneously selecting the p cluster centers (necessarily selected among the n objects) and assigning objects to clusters. In some settings, multiple dissimilarity matrices are available. Given that the p-median model uses only one matrix as input, researchers typically aggregate their data into a single matrix, resulting in p-median results that mask the true nature of the data. The Heterogeneous P-Median Problem was recently proposed as an alternative model to the classical p-median model for clustering. It consists of a three-way partitioning problem that identifies groups of individuals with similar category structures. In this work, we present a Variable Neighborhood Search heuristic for the problem based on the exact exploration of large neighborhoods modeled as mixed-integer programs. Preliminary computational experiments show that the heuristic is very efficient for a set of synthetic instances.
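
For reference, the classical p-median model that HPM generalizes reads as follows, with d_{ij} the dissimilarity between objects i and j, y_j indicating that object j is selected as a center, and x_{ij} assigning object i to center j (standard textbook notation, not the paper's):

\min \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij} x_{ij}
\quad \text{s.t.} \quad \sum_{j=1}^{n} x_{ij} = 1 \;\; \forall i, \qquad
x_{ij} \le y_j \;\; \forall i,j, \qquad
\sum_{j=1}^{n} y_j = p, \qquad x_{ij}, y_j \in \{0,1\}.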

Branch-and-prune algorithm for multidimensional scaling preserving cluster partition

Jorge Alencar, Tibérius Bonates, Guilherme Liberali, Daniel Aloise
Conference Paper WORKSHOP ON DISTANCE GEOMETRY AND APPLICATIONS, 2013, Pages 41

Abstract

In standard Multidimensional Scaling (MDS) one is concerned with finding a low-dimensional representation of a set of n objects, so that pairwise dissimilarities among the original objects are represented as distances in the embedded space with minimum error. We propose an MDS algorithm that simultaneously optimizes the distance error and the cluster membership discrepancy between a given cluster structure in the original data and the resulting cluster structure in the low-dimensional representation. We report on preliminary computational experience, which shows that the algorithm is able to find MDS representations that preserve the original cluster structure while incurring a relatively small increase in the distance error, as compared to standard MDS.
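
The standard MDS objective being extended is the raw stress below; the proposed algorithm adds to it a cluster-membership discrepancy term, whose precise form we do not reproduce here:

\min_{x_1,\dots,x_n \in \mathbb{R}^d} \; \sum_{i<j} \bigl( \lVert x_i - x_j \rVert - \delta_{ij} \bigr)^2,

where \delta_{ij} is the given dissimilarity between objects i and j.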

A column generation heuristic for microdata protection

Éverton Santi, Daniel Aloise, Pierre Hansen, Caroline Rocha
Conference Paper GLOBAL OPTIMIZATION WORKSHOP, GOW' 12, 2012, Pages 121--124

Abstract

The biggest challenge when disclosing private data is to share information contained in databases while protecting people from being individually identified. Microaggregation is a family of methods for statistical disclosure control. The principle of microaggregation is that confidentiality rules permit the publication of individual records if they are partitioned into groups of g or more data, where none is more representative than the others in the same group. The application of such rules leads to replacing individual values by those computed from small groups (microaggregates), before data publication. This work proposes a column generation heuristic for numerical microdata. Computational experiments show that the proposed method finds the best results for a set of benchmark instances in the literature.

Scalability Analysis of a Parallel Coupled Simulated Annealing Implementation

Kayo Gonçalves e Silva, Samuel Xavier de Souza, Daniel Aloise
Conference Paper GLOBAL OPTIMIZATION WORKSHOP, GOW' 12, 2012, Pages 129--132

Abstract

We evaluated the parallel performance of Coupled Simulated Annealing (CSA) for unconstrained optimization. CSA is characterized by a set of SA processes coupled via their acceptance probabilities in order to exploit information from each SA process. This paper seeks to identify the particularities of CSA that determine its parallel scalability. We evaluate the scalability of CSA using performance metrics for parallel systems. Experiments were performed with the problem dimension fixed and, separately, with the number of processing cores fixed. On a synthetic data set, the results show that CSA behaves as a scalable algorithm, allowing a significant improvement in efficiency as the dimension of the problem increases.

A column generation algorithm for semi-supervised minimum sum-of-squares clustering

Daniel Aloise, Pierre Hansen, Caroline Rocha
Conference Paper GLOBAL OPTIMIZATION WORKSHOP, GOW' 12, 2012, Pages 19--22

Abstract

Clustering is a powerful tool for automated analysis of data. It addresses the following problem: given a set of entities, find subsets, called clusters, which are homogeneous and/or well separated. In addition to the entities themselves, in many applications, information is also available regarding their relations in the space. Once such a priori knowledge is incorporated into the clustering learning process, we have a semi-supervised classification task. This work addresses a basic problem in semi-supervised clustering. It presents a column generation algorithm for minimum sum-of-squares clustering in the presence of must-link and cannot-link pairwise constraints. Computational experiments, including a comparison with Xia's algorithm [10], are reported.
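
A minimal illustration of what the pairwise constraints require of any feasible assignment (the paper enforces them inside the column generation itself; this post-hoc check is ours):

def satisfies_pairwise_constraints(labels, must_link, cannot_link):
    # labels[i] is the cluster index of entity i; constraints are index pairs.
    return (all(labels[i] == labels[j] for i, j in must_link) and
            all(labels[i] != labels[j] for i, j in cannot_link))

# e.g. satisfies_pairwise_constraints([0, 0, 1], [(0, 1)], [(0, 2)]) -> True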

On the Weber facility location problem with limited distances and side constraints

Isaac F. Fernandes, Daniel Aloise, Dario J. Aloise, Pierre Hansen, Leo Liberti
Journal Paper Optimization Letters, Volume 8, Issue 2, August 2012, Pages 407--424

Abstract

The objective in the continuous facility location problem with limited distances is to minimize the sum of distance functions from the facility to the customers, but with a limit on each of the distances, after which the corresponding function becomes constant. The problem has applications in situations where the service provided by the facility is insensitive after a given threshold distance. In this paper, we propose a global optimization algorithm for the case in which there are in addition lower and upper bounds on the numbers of customers served.
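
In symbols (notation ours), with customer locations a_i, weights w_i and distance limits \lambda_i, the problem is:

\min_{x \in \mathbb{R}^2} \; \sum_{i=1}^{n} w_i \, \min\{ d(x, a_i), \lambda_i \},

subject to side constraints bounding, from below and above, the number of customers i with d(x, a_i) \le \lambda_i, i.e., the number of customers actually served.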

The Heterogeneous P-Median Problem for Categorization Based Clustering

Simon J. Blanchard, Daniel Aloise, Wayne S. DeSarbo
Journal Paper Psychometrika, Volume 77, Issue 4, September 2012, Pages 741--762

Abstract

The p-median problem offers an alternative to centroid-based clustering algorithms for identifying unobserved categories. However, existing p-median formulations typically require data aggregation into a single proximity matrix, resulting in masked respondent heterogeneity. A proposed three-way formulation of the p-median problem explicitly considers heterogeneity by identifying groups of individual respondents that perceive similar category structures. Three proposed heuristics for the heterogeneous p-median (HPM) are developed and then illustrated in a consumer psychology context using a sample of undergraduate students who performed a sorting task of major U.S. retailers, as well as through a Monte Carlo analysis.

A VNS heuristic for escaping local extrema entrapment in normalized cut clustering

Pierre Hansen, Manuel Ruiz, Daniel Aloise
Journal Paper Pattern recognition, Volume 45, Issue 12, December 2012, Pages 4337--4345

Abstract

Normalized cut is one of the most popular graph clustering criteria. The main approaches proposed for its resolution are spectral clustering methods and a multilevel approach of Dhillon et al. (TPAMI 29:1944–1957, 2007), called graclus. Their aim is to obtain good solutions in a small amount of time for large instances. Metaheuristics are general frameworks for stochastic searches often employed in global optimization to improve the solutions obtained by other heuristics. Variable neighborhood search (VNS) is a metaheuristic which exploits systematically the idea of neighborhood change during the search. In this paper, we propose a VNS heuristic for normalized cut segmentation. Computational experiments show that in most cases this VNS heuristic improves significantly, and in moderate time, the solutions obtained by the current state-of-the-art algorithms, i.e., graclus and a spectral method proposed by Yu and Shi (ICCV, 2003).
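
For k clusters V_1,\dots,V_k of a weighted graph G = (V, E), the criterion minimized here is:

\mathrm{Ncut}(V_1,\dots,V_k) \;=\; \sum_{r=1}^{k} \frac{\operatorname{cut}(V_r,\, V \setminus V_r)}{\operatorname{vol}(V_r)},

where cut(A, B) is the total weight of edges between A and B, and vol(A) is the total weight of edges incident to the vertices of A.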

An improved column generation algorithm for minimum sum-of-squares clustering

Daniel Aloise, Pierre Hansen, Leo Liberti
Journal Paper Mathematical Programming, Volume 131, Issue 1, April 2012, Pages 195--220

Abstract

Given a set of entities associated with points in Euclidean space, minimum sum-of-squares clustering (MSSC) consists in partitioning this set into clusters such that the sum of squared distances from each point to the centroid of its cluster is minimized. A column generation algorithm for MSSC was given by du Merle et al. in SIAM Journal on Scientific Computing 21:1485–1505. The bottleneck of that algorithm is the resolution of the auxiliary problem of finding a column with negative reduced cost. We propose a new way to solve this auxiliary problem based on geometric arguments. This greatly improves the efficiency of the whole algorithm and leads to the exact solution of instances with over 2,300 entities, i.e., more than ten times larger than previously solved.
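
For reference, MSSC seeks a partition of the points x_1,\dots,x_n into clusters C_1,\dots,C_k minimizing:

\min_{C_1,\dots,C_k} \; \sum_{r=1}^{k} \sum_{x_i \in C_r} \lVert x_i - \mu_r \rVert^2,
\qquad \mu_r = \frac{1}{|C_r|} \sum_{x_i \in C_r} x_i .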

Modularity maximization in networks by variable neighborhood search.

Daniel Aloise, Gilles Caporossi, Pierre Hansen, Leo Liberti, Sylvain Perron, Manuel Ruiz
Book Chapter Graph Partitioning and Graph Clustering, Volume 588, January 2012, Pages 113--127

Abstract

Finding communities, or clusters, in networks, or graphs, has been the subject of intense studies in the last ten years. The most used criterion for that purpose, despite some recent criticism, is modularity maximization, proposed by Newman and Girvan. It consists in maximizing the sum over all clusters of the number of inner edges minus the expected number of inner edges assuming the same distribution of degrees. Numerous heuristics, as well as a few exact algorithms, have been proposed to maximize modularity. We apply the Variable Neighborhood Search metaheuristic to that problem. Computational results are reported for the instances of the 10th DIMACS Implementation Challenge. The algorithm presented in this paper obtained the second prize in the quality modularity (sub)challenge of the referred competition, finding the best known solutions for 11 out of 30 instances.
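
In formula form, with m the number of edges of the graph, e_r the number of edges inside cluster r and d_r the sum of the degrees of its vertices, the modularity being maximized is:

Q \;=\; \sum_{r=1}^{k} \left[ \frac{e_r}{m} - \left( \frac{d_r}{2m} \right)^{2} \right].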

Evaluating a branch-and-bound RLT-based algorithm for minimum sum-of-squares clustering

Daniel Aloise, Pierre Hansen
Journal Paper Journal of Global Optimization, Volume 49, Issue 3, July 2011, Pages 449--465

Abstract

Minimum sum-of-squares clustering consists in partitioning a given set of n points into c clusters in order to minimize the sum of squared distances from the points to the centroid of their cluster. Recently, Sherali and Desai (JOGO, 2005) proposed a reformulation-linearization based branch-and-bound algorithm for this problem, claiming to solve instances with up to 1,000 points. In this paper, their algorithm is investigated in further detail, reproducing some of their computational experiments. However, our computational times turn out to be drastically larger. Indeed, for two data sets from the literature only instances with up to 20 points could be solved in less than 10 h of computer time. Possible reasons for this discrepancy are discussed. The effect of a symmetry breaking rule due to Plastria (EJOR, 2002) and of the introduction of valid inequalities of the convex hull of points in two dimensions which may belong to each cluster is also explored.

Adaptive memory in multistart heuristics for multicommodity network design

Daniel Aloise, Celso C. Ribeiro
Journal Paper Journal of Heuristics, Volume 17, Issue 2, March 2011, Pages 153--179

Abstract

This paper focuses on the use of different memory strategies to improve multistart methods. A network design problem in which the costs are given by discrete stepwise increasing cost functions of the capacities installed in the edges is used to illustrate the contributions of adaptive memory and vocabulary building strategies. Heuristics based on shortest path and maximum flow algorithms are combined with adaptive memory in order to obtain an approximate solution to the problem in the framework of a multistart algorithm. Furthermore, a vocabulary building intensification mechanism supported by the resolution of a linear program is also explored. Numerical experiments have shown that the proposed algorithm obtained the best known solutions for some instances in the literature. These results show the contribution of each memory component and the effectiveness of their combination.

A hybrid supplier selection model for supply chains

Éverton Santi, Luciano Ferreira, Caroline Thennecy de Medeiros Rocha, Daniel Aloise
Conference Paper XLIII SBPO: Simpósio Brasileiro de Pesquisa Operacional, Ubatuba, August 2011

Abstract

This work presents a hybrid approach for the supplier selection problem in Supply Chain Management. We combine the decision-making perspectives of business school researchers and engineering researchers in order to address the problem more comprehensively. We use traditional multicriteria decision-making methods, namely AHP and TOPSIS, to evaluate alternatives according to the decision maker's preferences; both techniques are modeled using definitions from Fuzzy Set Theory to deal with imprecise data. Additionally, we propose a multiobjective GRASP algorithm to perform order allocation among the pre-selected alternatives, which must be pre-qualified on the basis of the AHP and TOPSIS methods before entering the LCR. Our allocation procedure presented low CPU times on five pseudorandom instances containing up to 1000 alternatives, as well as good values for all considered objectives. We therefore consider the proposed model appropriate for solving the supplier selection problem in the SCM context. It can help decision makers reduce lead times, costs and risks in their supply chains, and, unlike classical models in the purchasing literature, it can improve a firm's efficiency with respect to business strategies even when a large number of alternatives must be considered.

Column generation algorithms for exact modularity maximization in networks

Daniel Aloise, Sonia Cafieri, Gilles Caporossi, Pierre Hansen, Sylvain Perron, Leo Liberti
Journal Paper Physical Review E, Volume 82, Issue 4, October 2010, Article 046112

Abstract

Finding modules, or clusters, in networks currently attracts much attention in several domains. The most studied criterion for doing so, due to Newman and Girvan [Phys. Rev. E 69, 026113 (2004)], is modularity maximization. Many heuristics have been proposed for maximizing modularity; they rapidly yield near-optimal solutions, or sometimes optimal ones, but without a guarantee of optimality. There are few exact algorithms, prominent among which is that of Xu et al. [Eur. Phys. J. B 60, 231 (2007)]. Modularity maximization can also be expressed as a clique partitioning problem, to which the row generation algorithm of Grötschel and Wakabayashi [Math. Program. 45, 59 (1989)] can be applied. We propose to extend both of these algorithms using powerful column generation methods for linear and nonlinear integer programming. The performance of the four resulting algorithms is compared on problems from the literature. Instances with up to 512 entities are solved exactly. Moreover, the computing times for previously solved problems are reduced substantially.

Algorithms for network modularity maximization

Daniel Aloise, Sonia Cafieri, Gilles Caporossi, Pierre Hansen, Leo Liberti, Sylvain Perron
Conference Paper ROADEF 2010, 11ème congrès de la Société Française de Recherche Opérationnelle et d'Aide à la Décision, July 2010

Abstract

Networks are often used to represent complex systems arising in a variety of fields. Social networks model interactions among people. Telecommunication networks model communications between them, such as in the World Wide Web. Transportation networks model movements of goods and passengers. Biological networks model interactions between organisms, such as in food webs. A network (or graph) G = (V,E) is composed of a set of vertices, representing the entities of the system under study, and a set of edges joining pairs of vertices and representing a relation holding for such pairs. Identifying communities, or clusters, in complex networks is a topic of particular interest and is currently a very active research domain.

A branch-and-cut SDP-based algorithm for minimum sum-of-squares clustering

Daniel Aloise, Pierre Hansen
Journal Paper Pesquisa Operacional, Volume 29, Issue 3, December 2009, Pages 503--516

Abstract

Minimum sum-of-squares clustering (MSSC) consists in partitioning a given set of n points into k clusters in order to minimize the sum of squared distances from the points to the centroid of their cluster. Recently, Peng & Xia (2005) established the equivalence between 0-1 semidefinite programming (SDP) and MSSC. In this paper, we propose a branch-and-cut algorithm for the underlying 0-1 SDP model. The algorithm obtains exact solutions for fairly large data sets with computing times comparable with those of the best exact method found in the literature.

An efficient implementation of a VNS/ILS heuristic for a real-life car sequencing problem

Celso C. Ribeiro, Daniel Aloise, Thiago F. Noronha, Caroline Rocha, Sebastián Urrutia
Journal Paper European Journal of Operational Research, Volume 191, Issue 3, December 2008, Pages 596--611

Abstract

The car sequencing problem consists in sequencing a given set of cars to be produced in a single day. We address one of the variants of this problem, in which the objective of minimizing the number of violations of assembly constraints has a stronger weight than the minimization of the number of paint color changes. We present and describe in detail a VNS/ILS approach for approximately solving this problem. Computational results on real-life test instances are reported. The work presented in this paper obtained the second prize in the ROADEF'2005 challenge sponsored by Renault.

Scheduling workover rigs for onshore oil production

Dario J Aloise, Daniel Aloise, Caroline TM Rocha, Celso C Ribeiro, José C Ribeiro Filho, Luiz SS Moura
Journal Paper Discrete Applied Mathematics, Volume 154, Issue 5, April 2006, Pages 695--702

Abstract

Many oil wells in Brazilian onshore fields rely on artificial lift methods. Maintenance services such as cleaning, reinstatement, stimulation and others are essential to these wells. These services are performed by workover rigs, which are available in limited number relative to the number of wells demanding service. The decision of which workover rig should be sent to perform a maintenance service is based on factors such as the well's production, the current location of the workover rig relative to the demanding well, and the type of service to be performed. The problem of scheduling workover rigs consists in finding the best schedule for the available workover rigs so as to minimize the production loss associated with the wells awaiting service. We propose a variable neighborhood search (VNS) heuristic for this problem. Computational results on real-life problems are reported and their economic impacts are evaluated.