Similarity computation in modern applications

Computing the similarity between data items is a fundamental operation in data management. The first part of the thesis proposes a technique for distributed applications that computes the similarity between data descriptions streamed from remote sources. The proposed technique uses prior knowledge of the data distribution to produce, within the same space, more accurate descriptions than the Random Hyperplane Projection (RHP) method, a locality-sensitive hashing method. We present algorithms that materialize the RHP dimensions that better describe the underlying data, while additional dimensions are derived in real time from simple stored statistics that we maintain. Our method is resilient to changes in the data distribution thanks to a fail-safe mechanism that detects, in real time, cases where particular data instances cannot be accurately described using the selected dimensions. In such cases the method automatically falls back to the basic RHP scheme.
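As background for the basic RHP scheme named above, the following is a minimal sketch of plain random hyperplane projection, not the distribution-aware technique the thesis proposes. Each vector is reduced to one bit per random hyperplane, and the cosine similarity is estimated from the fraction of bits on which two signatures differ. The function names, the Gaussian hyperplanes, and the demo vectors are all illustrative assumptions.

```python
import math
import random


def random_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian coordinates."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def rhp_signature(vector, hyperplanes):
    """One bit per hyperplane: True if the vector is on its non-negative side."""
    return [sum(h * v for h, v in zip(plane, vector)) >= 0 for plane in hyperplanes]


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing bits.

    Two vectors at angle theta disagree on a random hyperplane with
    probability theta / pi, so theta is approximately pi * (differing / bits).
    """
    differing = sum(x != y for x, y in zip(sig_a, sig_b))
    return math.cos(math.pi * differing / len(sig_a))


def exact_cosine(a, b):
    """Exact cosine similarity, used only to check the estimate."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical demo data: a random vector and a noisy copy of it.
rng = random.Random(1)
a = [rng.gauss(0.0, 1.0) for _ in range(20)]
b = [x + 0.5 * rng.gauss(0.0, 1.0) for x in a]

planes = random_hyperplanes(n_bits=2048, dim=20)
sig_a = rhp_signature(a, planes)
sig_b = rhp_signature(b, planes)

exact = exact_cosine(a, b)
approx = estimated_cosine(sig_a, sig_b)
print(f"exact={exact:.3f} estimated={approx:.3f}")
```

In this baseline, more bits give a tighter estimate at the cost of more space, which is the trade-off that the abstract says the thesis improves on for a fixed space budget.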


The second part of this work introduces a novel user-centric approach for similarity computation between products. Unlike techniques that focus on product features, our method takes users' preferences into account by utilizing the recent concept of reverse top-k queries. In contrast to a top-k query, which returns the k products with the best score for a specific customer, the result of a reverse top-k query is the set of customers (or their rankings) for whom a given product belongs to their top-k set. Two novel query types for user-centric similarity search (θ-similarity and m-nearest neighbor) are defined, and algorithms for the efficient processing of these queries are discussed. The presented algorithms prune the search space by identifying effective similarity bounds and exploit conventional multi-dimensional indexes for efficient query processing.

In the third part of this thesis we address the problem of outlier detection in hierarchically organized data domains. Outlier detection is critical for modern applications such as decision support (OLAP) and network management. However, existing techniques do not take into consideration the hierarchical nature of the data domains that is inherent in such applications. The natural aggregation of atomic values along a domain hierarchy is an intuitive summarization technique that can be used to detect different grades of abnormal behavior by looking across all levels of the hierarchy. In this work, we first formally define the notion of a hierarchical outlier and describe an intuitive monotonicity property that permits us to rate an outlier with a single grade value. We then present a technique that combines locality-sensitive hashing and clustering in order to index the data descriptions. Combining the two techniques permits us to reduce both the storage space for the index and the execution time of an outlier computation query.
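The reverse top-k definition in the second part of the abstract can be illustrated with a brute-force sketch. It assumes that each customer is a weight vector, that a product's score is the weighted sum of its features, and that higher scores are better; none of these modeling choices is stated in the abstract. The Jaccard measure over reverse top-k sets is likewise only one plausible way to compare two products by the customers they attract, and the product and customer data are hypothetical. The pruning bounds and multi-dimensional indexes mentioned in the abstract are not shown.

```python
def top_k(products, weights, k):
    """Names of the k products with the highest weighted-sum score."""
    scores = {
        name: sum(w * x for w, x in zip(weights, features))
        for name, features in products.items()
    }
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def reverse_top_k(products, customers, product, k):
    """Customers whose top-k set contains the given product (brute force)."""
    return {
        customer
        for customer, weights in customers.items()
        if product in top_k(products, weights, k)
    }


def jaccard(set_a, set_b):
    """Jaccard similarity of two sets; assumed here as the similarity measure."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical products (two features each) and customer preference weights.
products = {"p1": (0.9, 0.2), "p2": (0.8, 0.4), "p3": (0.1, 0.9)}
customers = {"alice": (1.0, 0.0), "bob": (0.0, 1.0), "carol": (0.5, 0.5)}

rtk_p1 = reverse_top_k(products, customers, "p1", k=2)
rtk_p2 = reverse_top_k(products, customers, "p2", k=2)
rtk_p3 = reverse_top_k(products, customers, "p3", k=2)

print(sorted(rtk_p1), sorted(rtk_p2), sorted(rtk_p3))
print(jaccard(rtk_p1, rtk_p2), jaccard(rtk_p1, rtk_p3))
```

Under this sketch, two products count as similar when largely the same customers rank them in their top-k, regardless of how close their feature vectors are.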


All items in the National Archive of PhD Theses (EADD) are protected by copyright.


DOI	10.12681/eadd/41418
Handle address	http://hdl.handle.net/10442/hedi/41418
ND	41418


Alternative title

Similarity computation in modern applications

Author

Γεωργούλας, Κωνσταντίνος (Father's name: Θωμάς)

Date

2017

Institution

Athens University of Economics and Business. School of Information Sciences and Technology. Department of Informatics

Examining committee

Κωτίδης Ιωάννης
Βασσάλος Βασίλειος
Δεληγιαννάκης Αντώνιος
Κωνσταντόπουλος Πάνος
Καλαμπούκης Θεόδωρος
Δουλκερίδης Χρήστος
Τσουμάκος Δημήτριος

Scientific field

Natural Sciences ➨ Computer and Information Sciences

Keywords

Similarity

Country

Greece

Language

English

Other details

xxi, 119 pp., tables, figures, charts
