This CIFRE thesis, carried out with Orange, studies the socio-transactional data of the Orange Money service. It aims to apply machine learning techniques for graph embedding, with the end goal of enumerating groups of interest that are similar to a set of known examples.
With the growing number of smartphones and the low banking penetration in Africa, mobile payment services are expanding rapidly, and traditional fraud detection methods are no longer sufficient. Graph analysis of transactional data can identify suspicious patterns of behavior, which both prevents financial losses and saves considerable resources.
My thesis focuses on the enumeration of sub-graphs of interest (SGIs), which can reveal specific patterns of behavior. The main objective is to help Orange use this notion of SGI to identify groups of users involved in fraudulent activities. Several constraints must be taken into account in this context:
- Limited examples: We have very few examples of user groups involved in fraudulent activities, which makes it difficult to learn and detect these structures;
- Data volume: Orange manages millions of users and billions of transactions, which requires efficient methods for handling such large volumes of data;
- Speed of execution: The detection process must be rapid in order to minimize potential financial losses due to fraud.
To detect these SGIs, we use community detection algorithms, which identify groups of highly interconnected nodes. The detected communities are then ranked so that only those most likely to correspond to an SGI are retained. This selection relies on a characterization of each community as a feature vector and on a cosine distance between these vectors: the smaller the distance, the more similar the communities' characteristics.
Confidentiality is a further constraint, particularly with regard to the General Data Protection Regulation (GDPR). Because of this regulation, we do not have access to actual user data. To work around this, we generated synthetic datasets that mimic human behavior and simulate transactions in a banking service. Because these datasets come with a known ground truth, we can evaluate the effectiveness of our SGI enumeration methods.
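The selection step based on community characterization and cosine distance can be illustrated with a minimal sketch. It assumes that communities have already been produced by some community detection algorithm, and that each one is summarized by a hypothetical feature vector (node count, internal edge count, density). The actual features used in the thesis are not specified here, and the function names are illustrative only.

```python
import math

def characterize(nodes, edges):
    """Hypothetical feature vector: [node count, internal edge count, density].

    `edges` is assumed to be a list of undirected (u, v) pairs without duplicates.
    """
    nodes = set(nodes)
    internal = [(u, v) for u, v in edges if u in nodes and v in nodes]
    n = len(nodes)
    max_edges = n * (n - 1) / 2
    density = len(internal) / max_edges if max_edges else 0.0
    return [float(n), float(len(internal)), density]

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def rank_communities(communities, edges, example_vector):
    """Sort communities by increasing cosine distance to the example vector."""
    scored = [
        (cosine_distance(characterize(c, edges), example_vector), c)
        for c in communities
    ]
    return sorted(scored, key=lambda pair: pair[0])

# Toy graph: a triangle (a, b, c) and a path (d, e, f, g).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("f", "g")]
communities = [["d", "e", "f", "g"], ["a", "b", "c"]]

# The SGI example is assumed here to look like a triangle.
example_vector = characterize(["x", "y", "z"],
                              [("x", "y"), ("y", "z"), ("x", "z")])
ranked = rank_communities(communities, edges, example_vector)
```

In this toy case the triangle community is ranked first because its feature vector coincides with that of the example. Note that cosine distance compares only the direction of the vectors, so features with very different scales may need normalization before they are compared.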
Results are evaluated by comparing two sub-graphs and deciding whether a correspondence can be established between them. In an industrial context, an exact match between sub-graphs is not required. What we are looking for is a minimal degree of proximity to our SGI examples, so that a fraud expert can examine suspicious cases. To this end, a margin of error is defined in the form of three thresholds:
- Missing nodes: The detected SGI must not be missing an excessive number of nodes compared to the SGI examples;
- Additional nodes: The detected SGI must not contain an excessive number of additional nodes that do not correspond to the SGI examples;
- Appropriate size: The size of the detected SGI should be comparable to that of the examples.
These thresholds will make it possible to filter results and provide useful indications to fraud experts, while respecting confidentiality constraints and taking into account the limitations of available data.
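The three thresholds can be sketched as a simple predicate over node sets. This is an illustration under stated assumptions, not the thesis implementation: the missing and additional nodes are expressed here as ratios, the size constraint as a symmetric ratio bound, and the parameter names and default values (`max_missing`, `max_extra`, `max_size_ratio`) are hypothetical.

```python
def matches_example(detected, example,
                    max_missing=0.2, max_extra=0.2, max_size_ratio=1.5):
    """Return True if `detected` is close enough to `example`.

    All three thresholds are hypothetical defaults:
    - max_missing: maximum share of example nodes absent from the detected SGI
    - max_extra: maximum share of detected nodes absent from the example
    - max_size_ratio: maximum size ratio between the two sets, in either direction
    """
    detected, example = set(detected), set(example)
    if not example or not detected:
        return False
    missing = len(example - detected) / len(example)
    extra = len(detected - example) / len(detected)
    size_ratio = len(detected) / len(example)
    return (
        missing <= max_missing
        and extra <= max_extra
        and 1 / max_size_ratio <= size_ratio <= max_size_ratio
    )

example = set(range(1, 11))                 # ground-truth SGI with 10 nodes
close = set(range(1, 10)) | {11}            # 1 node missing, 1 extra node
too_small = set(range(1, 6))                # half of the nodes missing
too_large = set(range(1, 21))               # 10 extra nodes

results = [matches_example(s, example) for s in (close, too_small, too_large)]
```

With these assumed defaults, only the first candidate is accepted. In practice the thresholds would be tuned with fraud experts, since they control how many candidate groups are passed on for manual review.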