Desktop Search Hackfest/Day one notes

lightning talks

beagle

gmail
compatibility between extractors and similar components should be discussed
avahi, web services provided by the daemon itself
online results shown in firefox
external devices/sources working
feature wise beagle feels they have reached a complete state
more interested in dashboard now
a masters student working on dashboard at the moment
a million command line tools to prove the concept
how to tie documents and objects together (associations) without consuming lots of processing power all the time to create the associations

Strigi

ported to mac and windows
Basic requirement of kde4
modularized into 4 c++ libraries
libstreams
efficient access to files as streams
reading nested streams without overhead: no copying of intermediate data
tar, bz2, zip, rpm, deb, pdf, ole2, mail, .a
jstream://
Efficient for actually getting inner contents of objects without the need to create intermediate copy of the objects
libstreamanalyzer
api for accessing metadata from streams
uses libstreams
uses 5 plugins for analyzers, each optimized for speed of specific data type
most analyzers can be used in parallel
all analyzers always run, with little overhead
index object is getting the data from each content item
analyzers also run on the content of embedded files
plugins for the indexes, so in theory different indexes can be used
most important clucene and soprano
clucene very fast
soprano used for nepomuk for semantic storage
could use tracker as the index as well
libsearchclient
socket access to strigidaemon
libstrigihtmlgui
libstrigiqtdbusclient
strigidaemon
c++
relatively little code in the daemon
abstract interface for communications
thread pool
initial xesam support, partially shared with pinot
POWER OF STRIGI IS IN LIBRARIES THAT CAN BE SHARED
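The libstreams idea of reading nested containers as chained streams, without writing intermediate copies, can be illustrated outside Strigi. The following is only a sketch using the Python standard library, not the libstreams API; the archive and member names are made up for the example.

```python
import io
import tarfile
import zipfile


def build_sample() -> bytes:
    """Build a zip that contains a tar that contains one text file (demo data)."""
    payload = b"hello from the inner file\n"
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        info = tarfile.TarInfo(name="notes/inner.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, mode="w") as zf:
        zf.writestr("backup.tar", tar_buf.getvalue())
    return zip_buf.getvalue()


def read_nested(zip_bytes: bytes, tar_name: str, member_name: str) -> bytes:
    """Read one tar member stored inside a zip by chaining the two streams."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(tar_name) as tar_stream:
            # Mode "r|" reads the tar sequentially, so the inner archive is
            # never extracted to disk or fully buffered before being parsed.
            with tarfile.open(fileobj=tar_stream, mode="r|") as tar:
                for member in tar:
                    if member.name == member_name and member.isfile():
                        return tar.extractfile(member).read()
    raise KeyError(member_name)


if __name__ == "__main__":
    print(read_nested(build_sample(), "backup.tar", "notes/inner.txt"))
```

An analyzer in this style would receive the innermost stream and never needs to know how many container layers sit above it.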

Xesam

sd
Using mirror
beginning - dashbar used in the beginning
seemed poor to make compatibility layer for all engines (strigi, beagle, ...)
unite existing apis
hide implementation details of backends
allow simpler indexes such as grep
support for web services such as google
keeping it simple was the leading thought
does not support everything, but instead mostly desktop and mobile devices
keep low barrier of entry
proactive openness
community effort
history
2 apis
simple query and 'live'
simple api was too simple - so, live was the way to go
xesam
dbus search api
ontology
xml query language
user search language
full draft online - devil in the details
summarizing criticism:
ontology too complex
feature creep (it's getting too complex)
not extensible enough
vulnerability to extend and embrace tactics

tracker

last half year
work has revolved around splitting daemon and indexer into separate entities
extractor process spawns and pipes data to the indexer
filters handle the files
weighting data items as 'hot' based on the headingness or boldness
daemon has proper introspection
dbus api has had only a few changes
planning to change that quite a lot to support xesam
throttling - pausing indexing with monitoring events, nice, dynamic blacklisting of files, (blacklisting of dirs?)
q: wrapper of streamanalyzer?
kde4 metadata extraction using streamanalyzers - all metadata provided with those
suggestion for tracker to consider streamanalyzers
Suggestion from kubasik, jos, jamie: use a common extraction piping model
jos: suggestion not to use dbus between the extractor and daemon process
daemon only there to service the data
indexer there to handle the writing of the actual data
suggestion to keep this in tips and tricks session
last issue before merge - fix jamie's test on his laptop

ontologies:

evgeny giving introduction
basic design decisions
why rdf triples
inheritance needed
gives example of hierarchy (xesam:contributor in core, a client extends this as xesam:majorContributor and xesam:minorContributor)
this lets applications that do not know of the client extensions still interoperate with the data
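The inheritance example above can be sketched in a few lines. This is an illustration only: the parent table and the helper names are assumptions, and only the three xesam property names come from the notes.

```python
# Hypothetical parent table: each extension property names the core
# property it inherits from.
PARENT = {
    "xesam:majorContributor": "xesam:contributor",
    "xesam:minorContributor": "xesam:contributor",
}


def is_a(prop, ancestor):
    """True if prop is ancestor itself or inherits from it."""
    while prop is not None:
        if prop == ancestor:
            return True
        prop = PARENT.get(prop)
    return False


def values_for(item, prop):
    """Collect the values of prop and of every property derived from it."""
    return [value for name, value in item if is_a(name, prop)]


if __name__ == "__main__":
    paper = [
        ("xesam:title", "Some paper"),
        ("xesam:majorContributor", "Alice"),
        ("xesam:minorContributor", "Bob"),
    ]
    # A client that only knows the core ontology still sees both people.
    print(values_for(paper, "xesam:contributor"))
```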
How many features we want to implement. 95% (but not more?)
idea to drop corner cases
not only what features we implement but also to implement those well, this will lead to change in user habits (example: file navigation used to cover 95% of file location needs. after indexers, this can be diminished to 60-70% - but only if indexers work well, are reliable and easy to use)
thus, we need simple enough fw that is powerful enough, but easy enough to implement and start to use
content and source split
source is what the file is and how it is stored
content deals with documents, messages and such
for the content tree, it does not matter how it is stored
source is the place for acl
document has the created date - it is the actual created date of the document
source created date is the filesystem level creation date
also modified on source could be when e.g. acl was changed, on document tree, changed means when someone actually edited the document
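The content and source split can be pictured as two records attached to one item. The sketch below is an assumption about how that might look; the class and field names are invented for illustration and are not taken from the xesam ontology.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class Source:
    """Where and how the item is stored (filesystem level)."""
    uri: str
    created: datetime
    modified: datetime
    acl: tuple = ()


@dataclass(frozen=True)
class Content:
    """What the item is, regardless of storage."""
    title: str
    created: datetime
    modified: datetime


@dataclass(frozen=True)
class Item:
    source: Source
    content: Content


def change_acl(item, acl, when):
    """An ACL change touches the source dates only, never the content dates."""
    return replace(item, source=replace(item.source, acl=tuple(acl), modified=when))


written = datetime(2007, 1, 10)
copied = datetime(2008, 3, 1)
item = Item(
    Source("file:///home/u/paper.odt", created=copied, modified=copied),
    Content("Draft", created=written, modified=written),
)
after_acl = change_acl(item, ["group:staff"], datetime(2008, 4, 1))
```

After the ACL change the source reports a new modification date while the document still reports the date it was last edited.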
evgeny gives example of a scientific paper
xesam:author links the item to xesam:contact
the xesam:contact - country could be e.g. UK
now, suppose we want to list documents by people from the UK
the multiple query solution for this use case can be very slow if the contacts contain tens of thousands of people
so, the dbus traffic is unnecessarily large
and there are many other use cases of this type of relationships that will be too difficult to implement with the current xesam model efficiently
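The cost evgeny describes can be made concrete by counting round trips. The sketch below uses a toy in-memory backend; every class, method and identifier in it is hypothetical and stands in for calls that would cross dbus.

```python
class Backend:
    """Toy backend; each public method stands for one dbus round trip."""

    def __init__(self, contacts, documents):
        self.contacts = contacts      # contact id -> country
        self.documents = documents    # list of (title, author contact id)
        self.round_trips = 0

    def contacts_in(self, country):
        self.round_trips += 1
        return [cid for cid, c in self.contacts.items() if c == country]

    def documents_by(self, author_id):
        self.round_trips += 1
        return [title for title, author in self.documents if author == author_id]

    def documents_by_author_country(self, country):
        # The same join, done once on the backend side.
        self.round_trips += 1
        wanted = {cid for cid, c in self.contacts.items() if c == country}
        return [title for title, author in self.documents if author in wanted]


def client_side_join(backend, country):
    """Multiple query solution: one query for contacts, then one per contact."""
    titles = []
    for contact_id in backend.contacts_in(country):
        titles.extend(backend.documents_by(contact_id))
    return titles


if __name__ == "__main__":
    contacts = {"c1": "UK", "c2": "UK", "c3": "FI"}
    documents = [("A", "c1"), ("B", "c3"), ("C", "c2")]
    backend = Backend(contacts, documents)
    print(client_side_join(backend, "UK"), backend.round_trips)
```

With N matching contacts the client-side version needs N + 1 round trips, while the backend-side join needs one.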
photo annotations
defining region of a photo and who is there
thus, the annotation is complex
example - photos of old people
move toward semantic representation of data. FOAF example of the photo annotations
semantic web activities going on in this area at the moment
another example: papers referring to other papers via identifiers without directly pointing to a file
point being that the client will have to be doing the work that really makes sense to be done on the backend side
suggestion: use xesam (tm) if it correctly implements the core set of the functionality
we will need to be future proof and not stagnate and have people making hacks and workarounds
philip van hoof: standard will be successful only if the client developers take it into use
martyn: it's an iterative process
client developers are our users
using xesam needs to be easy and it needs to be able to provide easily the needed information
evgeny: how do we solve this conflict of interest: we need to define xesam in flexible enough way
evgeny: it is not simple enough for the users as it is
mikkel: two camps: semantic camp and field based camp, my opinion is more on the field based camp - thus, I don't see the needs of this type very relevant for our needs
kubasik: I agree that multidimensional graphs are not so interesting - maybe we could provide a convenience library to do this kind of deep relationship. So, perhaps xesam could just provide the field based stuff and let the wrappers worry about the more extended multidimensional use cases
evgeny: my opinion would be to make it possible for the backend to do this and provide the wrapper for the simple backends to do the field to graph use cases.
maybe we could roadmap that only some parts would be available of the query language and ontology so that
kubasik: if we do paged browsing the deeper relationships will be completely awful to implement efficiently - so most often the deeper levels should not need to be done
evgeny: this need will either be implemented by the client or the server. use cases are there
jos: can you provide a list of items to change
evgeny: we need to decide whether we want to support these structures or not
evgeny: can provide list of drawbacks
current situation undecided what is the type of the e.g. author (reference as url or string of the author) - needs to be decided
sebastian: <missed>
jos: my suggestion - if I have author email address - then we go to the next version, where we have contacts - I will store the contact url to the database
urho: suggesting inner queries to the query api
evgeny: let me give more examples
sparql - and current query language
we can replace the current query language later on
jamie: the mapping really doesn't exist on the actual contents, so the documents usually don't map the actual authors, but instead they just contain the freetext of the authors
so, as long as the applications do not contain the possibilities, the graphs won't happen
evgeny: we need to make it possible for the applications to do this, otherwise the applications will never support it
sebastian: shouldn't we make the inner queries optional as well as the graph queries
intermediate solution - flat view would map queries to text and on the graph capable implementations to the object url instead
possibility to create dummy items for each url and containing the string in that object
kubasik: this will need a lot more support from the backend and it will be much more complicated to do - would be more future compatible to be on the xesam
sebastian: internal stuff could create the internal representations in hackish way and just expose the data in graph way
evgeny: proposing a far reaching goal to make it easy for users in mid term future to transition to this - maybe roadmap for applications to be able to produce more complex queries
Urho: so, do you propose SPARQL?
evgeny: nope - we should be able to extend current XML
sebastian: we could create related to inner queries
mikkel: why not just allow the results of one query to be source for the next one
urho: so inner queries - yeah
evgeny: makes an example of the paper question in sparql
evgeny: we are not looking for using sparql, just similar functionality
evgeny: mapping sparql to sql can be done
jos: I propose that we won't implement this to 1.0
just make the ontology as future proof
make a temporary solution and future path
evgeny: list of use cases
we have no way to query - semantic desktop guys will not be able to use xesam in the way they can use nepomuk
maybe we could at least make a wrapper library?
JOS: making the proposal
usually we store the name of author in a field on document
in graph we would store it on separate object
so, propose that we do it like this now and if engine supports the denormalized model, it will do the conversion from the object to the name
and we extend the ontology the denormalized way
sebastian: we have the proper ontology in nepomuk, we could just translate that to xesam
kubasik: suggest that we make a small team to discuss this in real detail
agreed: 1st item: FAIL -

Search API (d-bus api)

Notes taking by urho
latest changes for sorting and some session properties
we simply list all the problems that people have with it
inner queries agreed to be handled separately
jamie raising issue of ranged search
philip: why don't we remove the session completely - it's completely unused
jos: i'll explain the session - it's for live queries
you can have multiple searches per session
search thus should have a separate id so that that can be closed, there is no need for the session at all
kubasik:
rob: you need to know what searches are running
mikkel: if we create the session at the time of the start
martyn: we need to have a separate id for searches
rob: we have a separate id for searches
can't we add the properties to the query itself?
go for large and infrequent messages
kubasik: we should decide whether to use the properties on the xml or on the dbus
mikkel: properties can go on the search object instead (written)
philip: I agree with sebastian (sebastian) hit fields should be in the query
philip: we can go instead for the session changes to be for the xesam 2.0 and not yet to the 1.0
philip: if we change the query language, that'll break everything anyway
rob: seems like you are using ids for the objects on the server side - perhaps they actually should be d-bus objects
jos: so, current model is imperfect, but enough for 1.0
jos: let's keep it like it is for now as it works
versioning discussion - should dbus have the version number in the dbus namespace?
rob checks- yes you can
kubasik: we need to have a bit of better way to decide which engine to use
but we don't need to specify the way to do it
philip: we should have announcement of capabilities to allow applications to choose the right engine with the right capabilities
kubasik: distributions need to be taken into account here
mikkel: we need to have the documentation about the capabilities and make the apis to discover
mikkel: can be considered for 1.0 (at least guidelines)
let's make a group to specify this
jos: gethitdata will be capable of handling the ranges quite easily as it's capable
urho: we have the case where a hitsadded hit is on a sorted live query and lands within the visible resultset. without the id and uri of the added hit, you will have to re-query your entire set
kubasik: most of the time the issue is really that single item has been modified
mikkel: so, we would emit: livehitsadded
jos: live search never sends searchdone
mikkel: yes it sends once the indexing is done
mikkel: let's add livesearch added that would contain also the hit metadata
jos: we need to minimize the changes needed
everyone can live with even the paging
jamie: so, no changes for 1.0 for hitsadded, but for 2.0 yes
jos: idea for ranged search will mean that we supply the ids of the hits
michael: d-bus api
sometimes several search engines in parallel is an issue
one engine refuses to start if d-bus api not available
we should make search engines queue on d-bus
no way to decide which engine is started
mikkel: we have also xesam interface on tracker, so that can be used org.freedesktop.tracker.xesam?
philip: we don't
mikkel: perhaps that would be a recommendation
urho: how about org.freedesktop.xesam.tracker and org.freedesktop.beagle
mikkel: good idea
michael raising issues about the specialist engines existing on the same system
philip: wants to raise something
this suggestion is unimplementable on sqlite as it is
ottela: we get hits 1 - 1000 - we change the value of item 1 to be last on the list
ottela: if one file changes - hitsmodified tells that one changed to 1000 and 2 changed to 1, 3 to 2,...
jos: why don't we say that gethitdata needs to be sequential
philip: the issues with 1, 10000, 500 mean the complexity of the corner cases will become very difficult
and this will make it quite complex to do properly on the server side
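ottela's example can be checked with a small sketch. The helper name and hit ids are made up; the point is only to show how many positions change when the first hit of a sorted result set moves to the end.

```python
def moved_positions(before, after):
    """Map each hit whose 1-based position changed to (old, new)."""
    old = {hit: pos for pos, hit in enumerate(before, start=1)}
    return {
        hit: (old[hit], pos)
        for pos, hit in enumerate(after, start=1)
        if old[hit] != pos
    }


# Hits 1 - 1000 in sorted order; the first hit changes so it now sorts last.
before = [f"hit{n}" for n in range(1, 1001)]
after = before[1:] + before[:1]
moved = moved_positions(before, after)

if __name__ == "__main__":
    print(len(moved), moved["hit1"], moved["hit2"])
```

A single changed file shifts every hit in the set, which is the cost a position-based hitsmodified would have to report.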
<discussion that has been too hard to type on the report>
urho: how about adding getrangedhits
jos: if we don't remove the gethitdata that already exists
agreed by everyone
martyn: I'd rather have a new hitsmoved rather than having the hitsremoved and hitsadded
jos: currently the position should not be changing?
kubasik explains beagle issues
martyn: sounds nice, but the issue is that the new api would make it more clear for the client applications to use
kubasik: it's going to be difficult to implement the live queries
martyn: surely we should add the possibility to handle the things without the need to re-create the live query
sebastian: how about another working group?
mikkel agreed:
so, group to discuss the live sorting and hits added / moved issue
kubasik: we'll have the group and I can tell how beagle is doing this
mikkel: one for this and another group for the accessibility
sebastian proposes to use floating points for the ordering
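One possible reading of the floating point proposal, which the notes do not spell out, is that every hit carries a float sort key and a moved hit gets a key between its new neighbours. The sketch below assumes that reading; the function name and keys are hypothetical.

```python
def key_between(lower, upper):
    """Return a sort key strictly between two neighbour keys (None = no neighbour)."""
    if lower is None and upper is None:
        return 0.0
    if lower is None:
        return upper - 1.0
    if upper is None:
        return lower + 1.0
    return (lower + upper) / 2.0


def ordered(keys):
    """List hit ids by ascending sort key."""
    return sorted(keys, key=keys.get)


keys = {"a": 1.0, "b": 2.0, "c": 3.0}

# Move "a" to the end: only its own key changes.
keys["a"] = key_between(keys["c"], None)
order_after_first_move = ordered(keys)

# Move "a" between "b" and "c": again a single key changes.
keys["a"] = key_between(keys["b"], keys["c"])
order_after_second_move = ordered(keys)
```

Under this reading a move updates one hit instead of renumbering the rest. Repeated halving between the same two neighbours eventually exhausts double precision, so keys would need occasional renumbering.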
ben raises the issue of how hard it will be to sort by strings, the text
Mikkel: live query sorting who is the group?
Jos: I don't think live queries will be efficiently possible on the server side
group: half an hour - jos, martyn, kubasik, mikael, mikkel (?)
LUNCH

Ontology discussion: Link by id - mikkel presenting on postits

evgenys idea originally as a middle way between semantic graph and the current xesam
problem is mainly linking related objects (like linking document and author) or document contained in an email
referring e.g. two documents together with isbn
there are multiple ids (not just uris)
they will need to be 'sufficiently unique'
Solving the problem
xesam id field (any child of it must have a value that is unique over all xesam ids)
all of the links must always be to ids
we make a new special kind of field that is child of xesam:related or could be a special data type
should always hold xesam id
with this we can find all objects that point to X (or any that X points to)
so, e.g. pdf file reference isbn - query xesam id field with that isbn
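The link-by-id idea can be sketched with plain dictionaries. The field names ("ids", "related"), the uris and the identifier values below are invented for the example; the notes only fix the rule that relation fields always hold xesam ids.

```python
# Each object lists its own xesam ids and the ids it points to.
OBJECTS = [
    {"uri": "file:///books/book.pdf",
     "ids": {"isbn:EXAMPLE-0001"}, "related": set()},
    {"uri": "file:///papers/review.pdf",
     "ids": {"doi:EXAMPLE-42"}, "related": {"isbn:EXAMPLE-0001"}},
    {"uri": "file:///papers/survey.pdf",
     "ids": {"doi:EXAMPLE-43"}, "related": {"isbn:EXAMPLE-0001", "doi:EXAMPLE-42"}},
]


def with_id(objects, xesam_id):
    """Objects that carry the given id in one of their xesam id fields."""
    return [obj for obj in objects if xesam_id in obj["ids"]]


def pointing_to(objects, xesam_id):
    """Uris of all objects whose relation fields hold the given id."""
    return [obj["uri"] for obj in objects if xesam_id in obj["related"]]


def pointed_to_by(objects, obj):
    """Uris of all objects that obj links to, resolved through their ids."""
    return [
        target["uri"]
        for related_id in sorted(obj["related"])
        for target in with_id(objects, related_id)
    ]


if __name__ == "__main__":
    print(pointing_to(OBJECTS, "isbn:EXAMPLE-0001"))
    print(pointed_to_by(OBJECTS, OBJECTS[2]))
```

Both directions of the relation reduce to a lookup on id values, which is why the ids only need to be 'sufficiently unique'.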
sebastian: subqueries should be enough to allow good usage of this on the search api
urho: agreed
Everybody agrees to investigate the idea
mikkel: suggest using a new field in xesam api id-field
kubasik: we would need to do nested queries most probably and not probably an optimized index for this
sebastian: normally you would say give me all emails that have sender of contact object that has name like this
now you could do it sender as a range, of person object, it would check all xesam id values in the relation
mikkel: it should be very easy to do these queries
and it should be very easy to show the values of these fields
jos: we should do it so that we have two fields: readable value and pointer/relation property. Also, the stronger engines could say that the value is just an alias of the actual object's relation property
sebastian: why not use the uri for data representation plugins that would then be able to create the visible widgets of the data items
sebastian: i see that there are some issues with people not wanting to do two queries
evgeny: you would need inference for it and it's easier so that we make it so that xesam:strings: would contain the string representation of the url / link values.
so, this way, we use the string representation only for full text search.
evgeny: I feel that we have a bit of problem on the specification of what should be in the FTI
Anders: we need to also be able to show how/why the relation was created and to be able to do that for all the related content
evgeny: we cannot have the stuff linking to everything on the database
sebastian: we don't need to present on how a relation was found and it's a GUI thing
jamie: two fields would be more clean
jamie draws example below evgenys example
jos agrees on the two field model
evgeny isn't completely convinced
the point is that the UI will likely match the actual contact
jamie: this way the application would have the choice
jamie: if we have just a name on the author text, then we don't want to try to create the dummy object
evgeny: compromise solution in my opinion is that we have to use the subqueries
discussion on that we want a value of properties on a single query on xesam
jamie: we could always deprecate the fields
jos: i liked the suggestion of author text and author link and group the fields on a string text
sebastian: why not make all dummy objects to have the same data layout
at least the data is syntactically correct if we use dummy objects
jamie: it's not worth it if we don't have the semantic info
sebastian: this way we handle the data in same way for the 'proper' objects as well as the dummy ones
<lots of discussion>
Conclusion:
We make some older strings as URIs instead of the previous string values
All new URI properties also have a .label that allows to get the textual representation of the property, which in essence is the old xesam ontology value of the property. The ideal is that this is a pointer to a specific property of the linked object
We will draft nested queries and they will most probably go to 1.1
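The conclusion on URI properties with a .label can be sketched as a lookup with a fallback. Everything here is an assumption for illustration: the property names, the uris, the "name" field and the store layout are not defined by the notes.

```python
# Hypothetical store of linked objects, keyed by uri.
LINKED = {
    "contact://example/alice": {"name": "Alice Example"},
}


def label(item, prop, linked=LINKED):
    """Textual representation of a URI property.

    If the engine knows the linked object, the label acts as a pointer to
    that object's own property; otherwise the stored string is returned,
    which matches the old string-valued field.
    """
    target = linked.get(item.get(prop))
    if target is not None and "name" in target:
        return target["name"]
    return item.get(prop + ".label")


doc_linked = {
    "author": "contact://example/alice",
    "author.label": "A. Example",
}
doc_text_only = {
    "author": "urn:example:dummy-1",   # no real contact object behind it
    "author.label": "Bob Sample",
}

if __name__ == "__main__":
    print(label(doc_linked, "author"))
    print(label(doc_text_only, "author"))
```

A simple engine could always return the stored string, while a graph-capable engine resolves the label through the link.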


This page was last modified on 9 November 2009, at 13:51.