Machine learning algorithms for fraud detection in automobile insurance (original title: Algoritmos de machine learning para la detección del fraude en el seguro de automóviles)

Elena Badal Valero; Andrés Sanjuán Díaz; Jorge Segura Gisbert

doi:10.26360/2020_2

Authors

Elena Badal Valero, Universidad de Valencia (Spain)
Andrés Sanjuán Díaz, Universidad de Valencia (Spain)
Jorge Segura Gisbert, Universidad de Valencia (Spain)

Keywords:

fraud, machine learning, car insurance, risk

Abstract

Automobile insurance fraud has increased considerably in recent years, undoubtedly driven by the economic crisis. This rise in fraudulent claims, together with the new regulatory requirements associated with Solvency II, is leading insurers to tighten controls and allocate more resources to fighting fraud. The use of predictive techniques to detect suspicious claims is therefore more than justified. In this paper, we present several statistically grounded methodologies and machine learning algorithms for analysing and detecting such claims.
