My long-term goal is to build systems that address unsolved problems and improve life for current and future generations.
When I was a kid, I found in computers the most promising approach to solving the problems I perceived as important, and that eventually led me to study computer engineering, complete a Ph.D., and start working in the tech industry. I have been fortunate to work with many people brighter than me, and to challenge myself often, learning a lot along the way.
Currently, I lead the Intelligent Software Systems (ISS) group at NEC Laboratories Europe. In ISS and at NEC Labs, I am lucky to work with a formidable and diverse group of colleagues who cover a broad spectrum of topics, including computer architecture and systems, machine learning, computational methods for physics and chemistry, mathematics, compiler design, high-performance computing systems, and cybersecurity. Working together, we are researching and developing:
Beyond my professional work, I am passionate about exploring new AI and software technologies through various side projects. One of my longest-running endeavors is AlternateDayZ.com, a multiplayer game set in a post-apocalyptic world, with automatic generation of narrative, story, and illustrations from the players’ game sessions.
If you find any of the above interesting, feel free to reach out to me. I am always open to new collaborations and opportunities to learn and grow together.
The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key cornerstones of efficient and reliable progress. However, existing benchmarks are often narrow and simply compute overall task success. To address these issues, we propose AgentQuest, a framework where (i) benchmarks and metrics are modular and easily extensible through well-documented and easy-to-use APIs, and (ii) we offer two new evaluation metrics that can reliably track LLM agent progress while solving a task. We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase. Together with the research community, we hope to extend AgentQuest further, and therefore we make it available at https://github.com/nec-research/agentquest.
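To give a feel for what a modular benchmark and metric interface of this kind could look like, here is a minimal sketch in Python. The class and method names (Benchmark, ProgressMetric, RepetitionRate) are illustrative assumptions, not the actual AgentQuest API; the repetition metric is only one example of a signal that tracks agent progress beyond final task success.

```python
# Minimal sketch of a modular benchmark/metric interface in the spirit of
# AgentQuest. All class and method names are illustrative, not the actual API.

class Benchmark:
    """A benchmark exposes observations and applies candidate agent actions."""
    def reset(self):
        raise NotImplementedError

    def step(self, action):
        """Apply an agent action, return (observation, done)."""
        raise NotImplementedError


class ProgressMetric:
    """A metric inspects the whole interaction trace, not just final success."""
    def update(self, observation, action):
        raise NotImplementedError

    def value(self):
        raise NotImplementedError


class RepetitionRate(ProgressMetric):
    """Fraction of actions the agent has already tried (one illustrative progress signal)."""
    def __init__(self):
        self.seen, self.total, self.repeated = set(), 0, 0

    def update(self, observation, action):
        self.total += 1
        if action in self.seen:
            self.repeated += 1
        self.seen.add(action)

    def value(self):
        return self.repeated / self.total if self.total else 0.0
```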
Cyber Threat Intelligence (CTI) plays a crucial role in assessing risks and enhancing security for organizations. However, extracting relevant information from unstructured text sources can be expensive and time-consuming. Our empirical experience shows that existing tools for automated structured CTI extraction have performance limitations. Furthermore, the community lacks a common benchmark to quantitatively assess their performance. We fill these gaps by providing a new large open benchmark dataset and aCTIon, a structured CTI information extraction tool. The dataset includes 204 real-world publicly available reports and their corresponding structured CTI information in STIX format. Our team curated the dataset with three independent groups of CTI analysts working over the course of several months. To the best of our knowledge, this dataset is two orders of magnitude larger than previously released open-source datasets. We then design aCTIon, leveraging recently introduced large language models (GPT-3.5) in two custom information extraction pipelines. We compare our method with 10 solutions presented in previous work, developing our own implementations where open-source ones were lacking. Our results show that aCTIon outperforms previous work on structured CTI extraction, improving the F1 score by 10 to 50 percentage points across all tasks.
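As a rough illustration of what an LLM-based extraction pipeline of this kind involves, the skeleton below selects report passages for a given entity type and asks a model to emit structured fields that can later be wrapped into STIX objects. It is a sketch under stated assumptions: `call_llm`, the prompt, and the passage-selection heuristic are placeholders, not the actual aCTIon pipelines.

```python
# Illustrative two-stage extraction skeleton: (1) select report passages
# relevant to a CTI entity type, (2) ask an LLM to emit structured fields.
# `call_llm` is a placeholder, not an actual aCTIon or OpenAI API.

import json

def call_llm(prompt: str) -> str:
    """Placeholder for a GPT-3.5-style completion call."""
    raise NotImplementedError

def select_passages(report: str, entity_type: str, window: int = 500):
    """Naive relevance filter: keep text windows mentioning the entity type."""
    return [report[i:i + window] for i in range(0, len(report), window)
            if entity_type.lower() in report[i:i + window].lower()]

def extract_entities(report: str, entity_type: str):
    """Extract structured CTI entities of one type from a report."""
    entities = []
    for passage in select_passages(report, entity_type):
        answer = call_llm(
            f"List every {entity_type} mentioned in the text below as a "
            f"JSON array of objects with a 'name' field.\n\n{passage}")
        for item in json.loads(answer):
            entities.append({"type": entity_type, "name": item["name"]})
    return entities
```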
Scaling network packet processing performance to meet the increasing speed of network ports requires software programs to carefully leverage the network devices’ hardware features. This is a complex task for network programmers, who need to learn and deal with heterogeneous device architectures and rethink their software to leverage them. In this paper we take first steps to reverse this design process, enabling the automatic generation of tailored hardware designs starting from a network packet processing program. We introduce eHDL, a high-level synthesis tool that automatically generates hardware pipelines from unmodified Linux eBPF/XDP programs. eHDL is designed to enable software developers to directly define and implement the hardware functions they need in the NIC. We prototype eHDL targeting a Xilinx Alveo U50 FPGA NIC and evaluate it with a set of 5 eBPF/XDP programs. Our results show that the generated pipelines are efficient in terms of required hardware resources, using only 6.5%-13.3% of the FPGA, and always achieve line-rate forwarding throughput with about 1 microsecond of per-packet forwarding latency. Compared to other network-specific high-level synthesis tools, eHDL enables software programmers with no hardware expertise to describe stateful functions that operate on the entire packet data. Compared to alternative processor-based solutions that offload eBPF/XDP to a NIC, eHDL provides 10-100x higher throughput.
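To convey the core intuition of generating a hardware pipeline from a packet-processing program, the toy sketch below maps a straight-line instruction sequence to one pipeline stage per instruction, so a new packet could enter the pipeline every cycle. The instruction format and the stage model are simplifications made up for illustration; they are not eHDL's intermediate representation.

```python
# Toy sketch of the pipeline-generation idea: map each instruction of a
# straight-line packet-processing program to its own hardware-like stage.
# The instruction and stage models are illustrative, not eHDL internals.

from dataclasses import dataclass

@dataclass
class Insn:
    op: str          # e.g. "load", "add", "store", "drop_if_eq"
    args: tuple

@dataclass
class Stage:
    name: str
    insn: Insn

def build_pipeline(program):
    """One pipeline stage per instruction, preserving program order."""
    return [Stage(name=f"stage_{i}", insn=insn) for i, insn in enumerate(program)]

program = [
    Insn("load", ("r1", "pkt[12:14]")),   # read the EtherType field
    Insn("drop_if_eq", ("r1", 0x0806)),   # drop ARP packets
    Insn("add", ("counter", 1)),
    Insn("store", ("meta[0]", "r1")),
]

for stage in build_pipeline(program):
    print(stage.name, stage.insn.op, stage.insn.args)
```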
We present an approach to improve the scalability of online machine learning-based network traffic analysis. We first make the case to replace widely used supervised machine learning models for network traffic analysis with binary neural networks. We then introduce Neural Networks on the NIC (N3IC), a system that compiles binary neural network models into implementations that can be directly integrated in the data plane of SmartNICs. N3IC supports different hardware targets, and it generates data plane descriptions using both micro-C and P4 languages. We implement and evaluate our solution using two use cases related to traffic identification and to anomaly detection. In both cases, N3IC provides up to a 100x lower classification latency, and 1.5–7x higher throughput than state-of-the-art software-based machine learning classification systems. This is achieved by running the entire traffic analysis pipeline within the data plane of the SmartNIC, thereby completely freeing the system’s CPU from any related tasks, while forwarding traffic at line rate (40Gbps) on the target NICs. Encouraged by these results, we finally present the design and FPGA-based prototype of a hardware primitive that adds binary neural network support to a NIC data plane. Our new primitive requires only 1–2% of the logic and memory resources of a Virtex-7 FPGA. We show through experimental evaluation that extending the NIC data plane enables more challenging use cases that require online traffic analysis to be performed in a few microseconds.
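What makes binary neural networks a good fit for data planes is that a layer reduces to XNOR and popcount operations. The sketch below shows that arithmetic with plain Python integers; it is only an illustration of the computation, not the micro-C or P4 code that N3IC emits.

```python
# Minimal binary-neural-network layer using only bitwise ops and popcount,
# the arithmetic that makes BNNs attractive for NIC data planes.
# Pure-Python illustration; not the code N3IC generates.

def bnn_layer(x_bits: int, weight_rows, n_bits: int):
    """x_bits: input as an n_bits-wide integer; weight_rows: one integer per neuron."""
    out = 0
    for i, w_bits in enumerate(weight_rows):
        xnor = ~(x_bits ^ w_bits) & ((1 << n_bits) - 1)   # matching bits
        popcnt = bin(xnor).count("1")
        # Sign activation: the neuron fires if at least half of the bits agree.
        if popcnt * 2 >= n_bits:
            out |= (1 << i)
    return out

# Example: 8-bit input, two neurons.
print(bin(bnn_layer(0b10110010, [0b10110011, 0b01001101], 8)))  # -> 0b1
```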
FPGA NICs can improve packet processing performance; however, programming them is difficult, and existing solutions to enable software packet processing on FPGA either provide limited packet processing speed, or require changes to programs and to their development/deployment life cycle. We address the issue with program warping, a new technique that improves throughput by replacing several instructions of a packet processing program with an equivalent runtime-programmable hardware implementation. Program warping performs static analysis of a packet processing program, described with Linux’s eBPF, to identify subsets of the program that can be implemented by an optimized FPGA pipeline, the warp engine. Packets handled by the warp engine are eventually delivered to a regular eBPF program executor, along with their program context (registers, stack), to complete execution of those program parts that cannot be efficiently pipelined. We prototype program warping on a 100Gbps FPGA NIC, extending hXDP, a state-of-the-art eBPF processor for FPGA, and measure its performance running 6 unmodified real-world eBPF programs, including deployed applications such as Katran and Suricata. Our prototype runs at 250MHz, uses less than 15% of the FPGA resources, and improves hXDP throughput by 1.2–3x in most cases, and up to 18x.
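The split that program warping performs can be illustrated with a toy static pass: find the longest prefix of instructions that a fixed pipeline could implement, and hand the remainder, together with the carried register context, to the regular eBPF executor. The "pipelineable" predicate and the instruction tuples below are assumptions for illustration, not the analysis implemented in the paper.

```python
# Toy sketch of the program-warping split: a static pass picks the longest
# prefix of instructions a fixed pipeline can implement, leaving the rest
# (plus the register context) to the regular eBPF executor.
# The PIPELINEABLE set is an illustrative assumption.

PIPELINEABLE = {"mov", "add", "and", "load_hdr"}

def split_for_warp(program):
    """Return (warp_prefix, executor_suffix)."""
    for i, (op, *_) in enumerate(program):
        if op not in PIPELINEABLE:
            return program[:i], program[i:]
    return program, []

program = [
    ("load_hdr", "r1", "eth.type"),
    ("and", "r1", 0xFFFF),
    ("call_helper", "bpf_map_lookup_elem"),  # not pipelineable
    ("mov", "r0", 2),
]
prefix, suffix = split_for_warp(program)
print(len(prefix), "instructions warped,", len(suffix), "left to the executor")
```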
The network interface cards (NICs) of modern computers are changing to adapt to faster data rates and to help with the scaling issues of general-purpose CPU technologies. Among the ongoing innovations, the inclusion of programmable accelerators on the NIC’s data path is particularly interesting, since it provides the opportunity to offload some of the CPU’s network packet processing tasks to the accelerator. Given the strict latency constraints of packet processing tasks, accelerators are often implemented leveraging platforms such as Field-Programmable Gate Arrays (FPGAs). FPGAs can be re-programmed after deployment, to adapt to changing application requirements, and can achieve both high throughput and low latency when implementing packet processing tasks. However, they have limited resources that may need to be shared among diverse applications, and programming them is difficult and requires hardware design expertise. We present hXDP, a solution for running on FPGAs the software packet processing tasks described with the eBPF technology and targeting Linux’s eXpress Data Path. hXDP uses only a fraction of the available FPGA resources, while matching the performance of high-end CPUs. The iterative execution model of eBPF is not a good fit for FPGA accelerators. Nonetheless, we show that many of the instructions of an eBPF program can be compressed, parallelized, or completely removed, when targeting a purpose-built FPGA design, thereby significantly improving performance. We implement hXDP on an FPGA NIC and evaluate it running real-world unmodified eBPF programs. Our implementation runs at 156.25MHz and uses about 15% of the FPGA resources. Despite these modest requirements, it can run dynamically loaded programs, achieves the packet processing throughput of a high-end CPU core, and provides a 10x lower packet forwarding latency.
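The parallelization step mentioned above can be pictured as packing independent instructions into VLIW-style bundles that execute in the same cycle. The greedy pass below does that for a simplified three-address instruction format; it is a sketch of the general idea, not the hXDP compiler.

```python
# Toy sketch of instruction parallelization: greedily pack instructions into
# VLIW-style bundles, keeping two instructions in the same bundle only if
# neither reads or overwrites a register the other writes.
# The (op, dst, srcs) format is a simplification of eBPF bytecode.

def conflicts(a, b):
    (_, dst_a, srcs_a), (_, dst_b, srcs_b) = a, b
    return dst_a in srcs_b or dst_b in srcs_a or dst_a == dst_b

def bundle(program, width=4):
    bundles, current = [], []
    for insn in program:
        if len(current) < width and not any(conflicts(insn, other) for other in current):
            current.append(insn)
        else:
            bundles.append(current)
            current = [insn]
    if current:
        bundles.append(current)
    return bundles

program = [
    ("mov", "r1", ()),            # r1 = const
    ("mov", "r2", ()),            # independent of r1: same bundle
    ("add", "r3", ("r1", "r2")),  # depends on r1 and r2: new bundle
    ("mov", "r4", ()),            # independent of the add: joins its bundle
]
for i, b in enumerate(bundle(program)):
    print("bundle", i, [op for op, _, _ in b])
```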
While monitoring system behavior to detect anomalies and failures is important, existing methods based on log-analysis can only be as good as the information contained in the logs, and other approaches that look at the OS-level software state introduce high overheads. We tackle the problem with syslrn, a system that first builds an understanding of a target system offline, and then tailors the online monitoring instrumentation based on the learned identifiers of normal behavior. While our syslrn prototype is still preliminary and lacks many features, we show in a case study for the monitoring of OpenStack failures that it can outperform state-of-the-art log-analysis systems with little overhead.
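The offline/online split can be illustrated with a small sketch: learn which event identifiers (and typical counts) appear during normal runs, then monitor only those identifiers online and flag deviations. The count-range baseline below is an illustrative assumption, not syslrn's learned system model.

```python
# Sketch of the offline/online split: learn event identifiers and their
# typical counts from normal runs, then flag online runs that deviate.
# The count-range baseline is an illustrative assumption.

from collections import Counter

def learn_baseline(normal_runs):
    """normal_runs: lists of event identifiers from healthy executions."""
    lo, hi = {}, {}
    for run in normal_runs:
        for event, n in Counter(run).items():
            lo[event] = min(lo.get(event, n), n)
            hi[event] = max(hi.get(event, n), n)
    return lo, hi

def check_run(run, baseline):
    lo, hi = baseline
    counts = Counter(run)
    return [e for e in set(lo) | set(counts)
            if not lo.get(e, 0) <= counts.get(e, 0) <= hi.get(e, 0)]

baseline = learn_baseline([["spawn", "rpc_ok", "rpc_ok"], ["spawn", "rpc_ok"]])
print(check_run(["spawn"], baseline))   # missing rpc_ok events -> ['rpc_ok']
```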
Programmable data plane technologies enable the systematic reconfiguration of the low-level processing steps applied to network packets and are key drivers toward realizing the next generation of network services and applications. This survey presents recent trends and issues in the design and implementation of programmable network devices, focusing on prominent abstractions, architectures, algorithms, and applications proposed, debated, and realized over the past years. We elaborate on the trends that led to the emergence of this technology, highlight the most important pointers from the literature, cast different taxonomies for the field, and identify avenues for future research.
A large portion of the recent performance increase in the High Performance Computing (HPC) and Machine Learning (ML) domains is fueled by accelerator cards. Many popular ML frameworks support accelerators by organizing computations as a computational graph over a set of highly optimized, batched general-purpose kernels. While this approach simplifies the kernels’ implementation for each individual accelerator, the increasing heterogeneity among accelerator architectures for HPC complicates the creation of portable and extensible libraries of such kernels. Therefore, using a generalization of the CUDA community’s warp register cache programming idiom, we propose a new programming idiom (CoRe) and a virtual architecture model (PIRCH), abstracting over SIMD and SIMT paradigms. We define and automate the mapping process from a single source to PIRCH’s intermediate representation and develop backends that issue code for three different architectures: Intel AVX512, NVIDIA GPUs, and NEC SX-Aurora. Code generated by our source-to-source compiler for batched kernels, borG, competes favorably with vendor-tuned libraries and is up to 2× faster than hand-tuned kernels across architectures.
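As a concrete example of the batched-kernel pattern the abstract starts from, the snippet below expresses many small, independent problem instances as one operation over a leading batch dimension, which a backend can then map onto SIMD lanes or GPU threads. NumPy stands in for such a kernel here; this is not borG-generated code and does not illustrate the CoRe idiom itself.

```python
# Example of the batched-kernel pattern: many small independent instances
# processed as one operation over a leading batch dimension.
# NumPy stands in for the generated kernels; this is not borG output.

import numpy as np

def batched_matvec(mats, vecs):
    """mats: (batch, n, n), vecs: (batch, n) -> (batch, n)."""
    return np.einsum("bij,bj->bi", mats, vecs)

batch, n = 1024, 8
mats = np.random.rand(batch, n, n)
vecs = np.random.rand(batch, n)
print(batched_matvec(mats, vecs).shape)  # (1024, 8)
```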
FPGA accelerators on the NIC enable the offloading of expensive packet processing tasks from the CPU. However, FPGAs have limited resources that may need to be shared among diverse applications, and programming them is difficult. We present a solution to run Linux’s eXpress Data Path programs written in eBPF on FPGAs, using only a fraction of the available hardware resources while matching the performance of high-end CPUs. The iterative execution model of eBPF is not a good fit for FPGA accelerators. Nonetheless, we show that many of the instructions of an eBPF program can be compressed, parallelized, or completely removed, when targeting a purpose-built FPGA executor, thereby significantly improving performance. We leverage that to design hXDP, which includes (i) an optimizing compiler that parallelizes and translates eBPF bytecode to an extended eBPF instruction set architecture that we define; (ii) a soft-processor to execute such instructions on FPGA; and (iii) an FPGA-based infrastructure to provide XDP’s maps and helper functions as defined within the Linux kernel. We implement hXDP on an FPGA NIC and evaluate it running real-world unmodified eBPF programs. Our implementation is clocked at 156.25MHz, uses about 15% of the FPGA resources, and can run dynamically loaded programs. Despite these modest requirements, it achieves the packet processing throughput of a high-end CPU core and provides a 10x lower packet forwarding latency.
Programmable NICs allow for better scalability to handle growing network workloads, however, providing an expressive, yet simple, abstraction to program stateful network functions in hardware remains a research challenge. We address the problem with FlowBlaze, an open abstraction for building stateful packet processing functions in hardware. The abstraction is based on Extended Finite State Machines and introduces the explicit definition of flow state, allowing FlowBlaze to leverage flow-level parallelism. FlowBlaze is expressive, supporting a wide range of complex network functions, and easy to use, hiding low-level hardware implementation issues from the programmer. Our implementation of FlowBlaze on a NetFPGA SmartNIC achieves very low latency (on the order of a few microseconds), consumes relatively little power, can hold per-flow state for hundreds of thousands of flows and yields speeds of 40 Gb/s, allowing for even higher speeds on newer FPGA models. Both hardware and software implementations of FlowBlaze are publicly available.
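The Extended Finite State Machine abstraction with explicit flow state can be pictured as two tables: a flow table holding the current state of each flow, and a transition table keyed by (state, event) that selects the next state and the forwarding action. The toy machine below tracks a TCP handshake per flow; it illustrates the programming model only, not FlowBlaze's hardware tables, and the states and events are made up for the example.

```python
# Sketch of the EFSM-with-flow-state abstraction: per-flow state plus a
# transition table keyed by (state, event). Toy TCP-handshake tracker;
# illustrative of the programming model, not FlowBlaze's implementation.

TRANSITIONS = {
    ("CLOSED",       "SYN"):    ("SYN_SEEN",     "forward"),
    ("SYN_SEEN",     "SYNACK"): ("ESTABLISHING", "forward"),
    ("ESTABLISHING", "ACK"):    ("ESTABLISHED",  "forward"),
}

flow_state = {}  # flow key -> current EFSM state

def handle_packet(flow, event):
    state = flow_state.get(flow, "CLOSED")
    next_state, action = TRANSITIONS.get((state, event), (state, "drop"))
    flow_state[flow] = next_state
    return action

flow = ("10.0.0.1", "10.0.0.2", 1234, 80)
print(handle_packet(flow, "SYN"))   # forward
print(handle_packet(flow, "ACK"))   # drop (out-of-order handshake)
```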
We present N2Net, a system that implements binary neural networks using commodity switching chips deployed in network switches and routers. Our system shows that these devices can run simple neural network models, whose input is encoded in the network packets’ header, at packet processing speeds (billions of packets per second). Furthermore, our experience highlights that switching chips could support even more complex models, provided that some minor and cheap modifications to the chip’s design are applied. We believe N2Net provides an interesting building block for future end-to-end networked systems.
Artificial Neural Networks (NNs) play an increasingly important role in many services and applications, contributing significantly to compute infrastructures’ workloads. When used in latency-sensitive services, NNs are usually processed by CPUs, since using an external dedicated hardware accelerator would be inefficient. However, with growing workload sizes and complexity, CPUs are hitting their computation limits, requiring the introduction of new specialized hardware accelerators tailored to the task. In this paper we analyze the option to use programmable network devices, such as network cards and switches, as NN accelerators in place of purpose-built dedicated hardware. To this end, in this preliminary work we analyze in depth the properties of NN processing on CPUs, derive options to efficiently split such processing, and show that programmable network devices may be a suitable engine for implementing a CPU’s NN co-processor.
Supporting programmable stateful packet forwarding functions in hardware requires a tight balance between functionality and performance. Current state-of-the-art solutions are based on a very conservative model that assumes worst-case workloads. This ultimately limits the programmability of the system, even though actual deployment conditions may be very different from the worst-case scenario. We use trace-based simulations to highlight the benefits of accounting for specific workload characteristics. Furthermore, we show that relatively simple additions to a switching chip design can take advantage of such characteristics. In particular, we argue that introducing stalls in the switching chip pipeline enables stateful functions to be executed in a larger but bounded time without harming the overall forwarding performance. Our results show that, in some cases, the stateful processing of a packet could use 30x the time budget provided by state-of-the-art solutions.
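A back-of-the-envelope version of this argument can be run as a tiny trace-based simulation: a pipeline that normally forwards one packet per cycle but stalls for extra cycles whenever a packet triggers a longer stateful operation. The workload mix and cost model below are illustrative assumptions, not the paper's traces, but they show why occasional long stateful operations need not harm overall forwarding performance much.

```python
# Toy trace-based simulation: a pipeline forwards one packet per cycle and
# stalls for extra cycles on packets that need longer stateful processing.
# Workload and cost model are illustrative assumptions.

def simulate(trace, stall_cycles):
    """trace: list of booleans, True if the packet needs stateful processing."""
    cycles = 0
    for needs_state in trace:
        cycles += 1 + (stall_cycles if needs_state else 0)
    return len(trace) / cycles  # packets per cycle

trace = [i % 100 == 0 for i in range(100_000)]   # 1% of packets are stateful
for stalls in (0, 10, 30):
    print(f"stalls={stalls:2d} -> {simulate(trace, stalls):.3f} packets/cycle")
```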
In SDN, complex protocol interactions that require forging network packets are handled on the controller side. While this ensures flexibility, both performance and scalability are impacted, introducing serious concerns about the applicability of SDN at scale. To improve on these issues, without infringing on the SDN principle of control and data plane separation, we propose an API for programming the generation of packets in SDN switches. Our InSP API allows a programmer to define in-switch packet generation operations, which include the specification of triggering conditions, packet content, and forwarding actions. To validate our design, we implemented the InSP API in an OpenFlow software switch and in a controller, requiring only minor modifications. Finally, we demonstrate that applying the InSP API to implement a typical ARP-handling use case is beneficial for the scalability of both switches and controller.
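To make the three ingredients of an in-switch packet generation operation concrete (trigger, generated packet content, forwarding action), here is a sketch of what a rule for the ARP-handling use case could express. The rule structure and field names are hypothetical, chosen for illustration; they are not the actual InSP API.

```python
# Sketch of an in-switch packet-generation rule for ARP handling:
# a trigger, the generated packet's content, and a forwarding action.
# Rule structure and field names are hypothetical, not the InSP API.

arp_responder_rule = {
    "trigger": {                       # fire when a matching ARP request arrives
        "eth.type": 0x0806,
        "arp.op": 1,                   # request
        "arp.target_ip": "10.0.0.1",
    },
    "generate": {                      # content of the in-switch reply
        "eth.type": 0x0806,
        "arp.op": 2,                   # reply
        "arp.sender_mac": "aa:bb:cc:dd:ee:ff",
        "arp.sender_ip": "10.0.0.1",
    },
    "actions": ["send_to_ingress_port"],
}

def matches(trigger, packet):
    return all(packet.get(field) == value for field, value in trigger.items())

packet = {"eth.type": 0x0806, "arp.op": 1, "arp.target_ip": "10.0.0.1"}
if matches(arp_responder_rule["trigger"], packet):
    print("switch generates ARP reply:", arp_responder_rule["generate"])
```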
Software-defined networking (SDN) eases network management by centralizing the control plane and separating it from the data plane. The separation of planes in SDN, however, introduces new vulnerabilities in SDN networks, since the difference in processing packets at each plane allows an adversary to fingerprint the network’s packet-forwarding logic. In this paper, we study the feasibility of fingerprinting the controller-switch interactions by a remote adversary, whose aim is to acquire knowledge about specific flow rules that are installed at the switches. This knowledge empowers the adversary with a better understanding of the network’s packet-forwarding logic and exposes the network to a number of threats. In this paper, we collect measurements from hosts located across the globe using a realistic SDN network comprising OpenFlow hardware and software switches. We show that, by leveraging information from the RTT and packet-pair dispersion of the exchanged packets, fingerprinting attacks on SDN networks succeed with overwhelming probability. We additionally show that these attacks are not restricted to active adversaries, but can also be mounted by passive adversaries that only monitor traffic exchanged with the SDN network. Finally, we discuss the implications of these attacks on the security of SDN networks, and we present and evaluate an efficient countermeasure to strengthen SDN networks against fingerprinting. Our results demonstrate the effectiveness of our countermeasure in deterring fingerprinting attacks on SDN networks.
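The two features the attack relies on are easy to state: the round-trip time of a probe and the packet-pair dispersion, i.e. the spacing between two back-to-back probes on their return. Packets that trigger a controller round trip tend to show larger values of both. The sketch below computes the two features and applies a simple threshold rule; the thresholds and sample numbers are illustrative only, not measurements from the paper.

```python
# Sketch of the two fingerprinting features: RTT and packet-pair dispersion.
# Probes whose flow rule must first be installed via the controller tend to
# show larger values of both. Thresholds and numbers are illustrative.

def packet_pair_dispersion(t_arrival_first, t_arrival_second):
    """Spacing (ms) between two back-to-back probes on return."""
    return t_arrival_second - t_arrival_first

def classify(rtt_ms, dispersion_ms, rtt_thresh=5.0, disp_thresh=1.0):
    """Guess whether the probed flow required a controller round trip."""
    if rtt_ms > rtt_thresh or dispersion_ms > disp_thresh:
        return "controller path (rule installed on demand)"
    return "data plane (rule already installed)"

print(classify(rtt_ms=1.2, dispersion_ms=0.1))   # data plane
print(classify(rtt_ms=12.0, dispersion_ms=2.5))  # controller path
```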
Network Function Virtualization is pushing network operators to deploy commodity hardware that will be used to run middlebox functionality and processing on behalf of third parties: in effect, network operators are slowly but surely becoming in-network cloud providers. The market for in-network clouds is large, ranging from content providers, to mobile applications, and even end-users. We show in this paper that blindly adopting cloud technologies in the context of in-network clouds is not feasible from either a security or a scalability point of view. Instead we propose In-Net, an architecture that allows untrusted endpoints as well as content providers to deploy custom in-network processing to be run on platforms owned by network operators. In-Net relies on static analysis to allow platforms to check whether the requested processing is safe, and whether it contradicts the operator’s policies. We have implemented In-Net and tested it in the wide area, supporting a range of use cases that are difficult to deploy today. Our experience shows that In-Net is secure, scales to many users (thousands of clients on a single inexpensive server), allows for a wide range of functionality, and offers benefits to end-users, network operators and content providers alike.
Over the years, middleboxes have become a fundamental part of today’s networks. Despite their usefulness, they come with a number of problems, many of which arise from the fact that they are hardware-based: they are costly, difficult to manage, and their functionality is hard or impossible to change, to name a few. To address these issues, there is a recent trend towards network function virtualization (NFV), in essence proposing to turn these middleboxes into software-based, virtualized entities. Towards this goal we introduce ClickOS, a high-performance, virtualized software middlebox platform. ClickOS virtual machines are small (5MB), boot quickly (about 30 milliseconds), add little delay (45 microseconds), and over one hundred of them can run concurrently while saturating a 10Gb pipe on a commodity server. We further implement a wide range of middleboxes, including a firewall, a carrier-grade NAT, and a load balancer, and show that ClickOS can handle millions of packets per second.